Wiki Home

Star Schema


Namespace: Wiki
A basic structure used in Data Warehouse databases is a Star Schema. The star schema has a single object (table) in the middle radially connected to other surrounding objects (tables) like a star. The star schema mirrors the end user's view of a business query such as a sales fact that is qualified by one or more dimensions (e.g., product, store, time, region, etc.).

For example
Figure 1

The object in the center of the star is called the Fact Table. This fact table contains the basic business measurements and can consist of millions of rows. The objects surrounding the fact table (which appear as the points of the star) are called the Dimension Tables. These dimension tables contain business attributes that can be used as SQL criteria, and they are typically small. The star schema itself can be simple or complex. A simple star schema consists of one fact table and several dimension tables. A complex star schema can have more than one fact table and hundreds of dimension tables.

Keep in mind that any Star with more than 25 Dimensions should be a red flag. Many theorists (Dr. Ralph Kimball is one) suggest that anything larger than this defeats the purpose of the schema's simplicity and efficiency. It is very likely that several dimensions may be combined, or separate star schemas can be created to accommodate the vast number of dimensions. -- Tod McKenna

An advantage of this type of schema is its simplicity; it's understandable by end users. Moreover queries based on such schemas are quick (since the data in the Dimension Tables is highly denormalized). Other advantages are low maintenance (since the diagram is simple), it is relatively easy to define new hierarchies and the number of connections is low.

Some content moved to __ Star Schema Denormalization Discussion
Speed tip: the Fact Table Primary Key should be a compound index of the Dimension Tables keys and it should basically contain every combination of them.

Please elaborate on this. To me it doesn't make sense. It sounds like: concatenate all of the foreign keys to make the primary key. That would restrict having two fact records for the same set of dimensions, which doesn't sound like a good thing. -- ?CFK

The Fact Table indexing strategy has a huge impact on performance since there can easily be several million rows in this table. The Fact Table is summary data -- maybe one record summarizes a week's worth of transactions in a store for a particular SkuId. So maybe you stick the RegionId and TerritoryId and SKUCategoryId in that table, to avoid excessive joins when querying the data. Add the WeekId and make a single clustered index comprising all these keys and now you're cooking with gas.

It is very important to define a grain when designing a Fact Table. The finer the grain, the more slicing and dicing you can do with the data. I think it is inaccurate to say that the Fact Table is summary data. In fact, a single row in a fact table may represent a single line on an invoice. -- Tod McKenna

Note that the Foreign Key s thus added need not be proximate. In other words the Fact Table may be many times removed. For example, in a classic Outlet-Territory-Zone-Area location hierarchy, we could put the AreaID in the fact table and eliminate lookups through TerritoryId and ZoneId. This is a huge savings.
Key difficulty of any Data Warehouse design: the Data Warehouse database may have records describing products we don't have anymore, in SKU sizes that may have changed, sold in stores that may not exist anymore, regions and territories whose boundaries change occasionally, and data that came from software systems that don't exist anymore. It's summary data shoe-horned into summary and conceptual structures for decision support purposes. Moreover, tuning the database can be an ongoing activity depending on what's needed this week by some high muckety-muck somewhere...-- Steven Black
Some problems appear when you want to have aggregated data (summaries) in the Fact Table - which you usually do, since the Fact Table tends to get very big.

When this happens you have to store a level indicator in the dimensions. For example, in Figure 1, the Stores dimension may hold stores which are independent but also stores which are part of a chain, or we may have regions for the stores etc. The problem is this inhibits the flexibility of the model - especially since we force a physical structure that can change later. Another problem is that the structure is less obvious, and users are more prone to construct queries that will result in bad answers (by forgetting to separate levels they'll get summary and details together)

There are 2 common solutions which are detailed in Snowflake Schema and Constellation Schema
Contributors: Arnon Gal - Oz, Steven Black, Carl Karsten
Category Data Warehousing, Category Modeling, Category Data
( Topic last updated: 2006.10.11 03:31:33 PM )