on target
ROLAP: Starring your schema
In previous post we have seen the many problems when trying to use operational databases for strategic purposes. Now we look at a practical and simple solution based upon proven and scalable technology.
ROLAP is a data model that specifically built to support strategic, business-driven multidimensional analysis.
The term ROLAP comes from R (Relational) + OLAP and (as you probably guess) is based on relational databases. Using relational technology has many advantages:
Is supported by hundreds of different database engines.
Comfortably handles huge amount of data
Seamlessly integrates into current IT shop.
Supports innovative state-of-the art engines like column databases
No new languages to learn.
Star schema
The so called “star schema” is the key in ROLAP models. It is so called because its representation remembers a star, like this one that can hold exactly same data of normalized model shown in previous post:
Each box of such model becomes a SQL table. The center box is called fact table, usually very large, containing million or even billion of rows. It holds business measures plus foreign keys that relate each measurement to appropriate rows of other tables.

Dimensions
The other tables are called dimension tables and contain properties of different business dimension or view perspective. So we have client dimension holding properties associated with a client and whatever relates to it (Name, branch office, ...). For example:
|
ID |
CLIENT_NAME |
CORPORATION |
ACCOUNT_REP_NAME |
BRANCH_OFFICE |
|
1 |
Bell Inc. |
ACME |
Jeff |
LA |
|
2 |
Lilly Inc. |
ACME |
Bill |
NY |
|
3 |
Bubble Ltd. |
(None) |
Bill |
NY |
|
... |
... |
... |
... |
... |
Please notice that:
Each dimension table uses as primary key a surrogate key (ID) through which it is linked to fact table. ID column shall be a surrogate key and not any attribute of the table we will see later why.
Dimensions are not normalized, Corporation, for example, repeats into the table as well as Account Responsible name and most other properties. This is intentional! You shall not try to link a dimension table to other tables (for example linking client dimension to branch office). Doing that is called snowflaking because instead of a star the schema would take the shape of snow crystal. Snowflaking is a bad practice since we would be back to problems of normalized schema.
Dimension tables are relatively very small if compared with the fact table. In our case client fact table would initially have one row per client.
A dimension deserves special attention and is usually present in any star schema: time dimension. Such table is usually absent from operational database but is very important to overcome the problems of date math.
Time dimension usually contains one row per day (thus a very little table), for example:
|
|
|
|
|
|
|
|
|
1 |
20010 |
1 |
January |
Friday |
2009 |
1 |
|
2 |
20010 |
1 |
January |
Saturday |
2009 |
1 |
|
3 |
20010 |
1 |
January |
Sunday |
2009 |
1 |
|
4 |
20010 |
1 |
January |
Monday |
2009 |
2 |
|
5 |
20010 |
1 |
January |
Tuesday |
2009 |
3 |
|
... |
... |
... |
... |
... |
... |
... |
|
... |
20010 |
1 |
March |
Monday |
20010 |
... |
Notice the columns FISCAL_YEAR
that tells us which fiscal year each day refers to and WORK_DAYS
that tells us how many work days passed from an arbitrary initial
date. Saturday and Sunday and holidays do not count as work days,
thus column value is not incremented.
Fact Table
Now let's look in more detail fact table:
|
|
|
|
|
|
|
1 |
1 |
1 |
100 |
12400 |
|
12 |
13 |
9 |
70 |
99450 |
|
39 |
6 |
3 |
9 |
10887 |
|
... |
... |
... |
... |
... |
As you can see the fact table is solely composed of;
References to dimension tables (
CLIENT_ID, PRODUCT_IDandTIME_ID).Additive and semi-additive business measures
QTY_SOLD, AMOUNT_SOLD).
A measure is called additive
when aggregating it along all dimensions make sense. This is the case
of AMOUNT_SOLD. You can aggregate AMOUNT_SOLD
along time to get, for example, yearly sales, or along product to get
sales by brand, or along client to compute sales by corporation.
A semi-additive measure can
usually be aggregated along some (but not all) dimension.
Aggregating QTY_SOLD, for example, would not make much
sense along different products.
A non-additive measure is a
measure that is neither additive nor semi-additive, like, for example
unit price. You can't aggregate unit price along any dimension, but
you can always compute unit price by dividing AMOUNT_SOLD
by QTY_SOLD.
When designing fact table, you shall look at measures specification and replace any non-additive measure into (semi)additive measures used to compute it.
The “grain” of a fact table is its detail level. In this case the grain is client × product × day. Should we have two invoices lines for the same client, same product, same day, they would be aggregated in a single row of the fact table.
In case of a finer grain, for example invoice level, two different lines would be used. If, on the other side, we choose a larger grain like month instead of day, our fact table would be considerably smaller (up to 1/30) but we would loose daily analytics.
Note: fact table can also contain the so called “degenerate dimensions”. Those are simple attributes that would not make sense to store on separate table. Suppose you need to know if payment was cash or credit card. Creating a dimension for such a simple flag would be excessive overhead and you can simply store it in fact table. There is nothing wrong in using degenerate dimensions. you can always change your mind later and refactor them to normal dimension if you need.
Now let's appreciate some advantages of star schema over normalized alternative of previous post.
Simplified queries
First advantage is query simplicity. Any query, in this case, would involve no more than 4 tables. Given the aggregate fields you need you can easily automate the constraints to be injected into your query. So it is extremely simple to write a query generator given retrieved fields, and SQL nightmare is gone away forever. In addition there are plenty of query tools that can write that code for you with minimal configuration.
Remember the problem of fiscal year? It
is elegantly solved through an extra column all what we need is
aggregate by fiscal year. Time dimension also let us easily do
very complex date computations. Say you would like to know
average sales by worked day in a period. We can simply divide
AMOUNT_SOLD by the difference of WORK_DAYS
between end and start of that period.
And what if you need add geographic information to client? Simply add needed columns and new analysis will be immediately available including history data!
Dynamic dimensions
But main issue has still to be solved. What happens when Jeff is replaced from LA to NY. Slowly changing dimensions elegantly solve that problem.
Whenever we detect such change we insert a new rows into client dimension table:
|
ID |
CLIENT_NAME |
CORPORATION |
ACCOUNT_REP_NAME |
BRANCH_OFFICE |
|
1 |
Bell Inc. |
ACME |
Jeff |
LA |
|
2 |
Lilly Inc. |
ACME |
Bill |
NY |
|
3 |
Bubble Ltd. |
(None) |
Bill |
NY |
|
... |
... |
... |
... |
... |
|
137 |
Lilly Inc. |
ACME |
Jeff |
NY |
Now that Jeff has now taken Lilly account. New invoices will reference row 137 and all data and history wold be kept consistent and correct at client, branch and account rep level. Should Lilly be sold to another corporation a similar duplication will take place for the corporation column.
Systematically applying such changes can seem tricky but can be done through a simple algorithm described in references (see books below).
Now you understand why all ID shall be surrogate keys and why a normalized approach would not allow this implementation.
Aggregate tables
Now let's face performance issue. Fact table can quickly reach billions of rows in certain cases (just think in terms of telecom traffic). How can we speed up retrieval?
Aggregate tables solve, again elegantly, this issue. An aggregate table is a fact table that has been pre-aggregated, but that has exactly the same structure of fact table.
If, for example, we decide to use a yearly aggregated fact table we shall make two things:
Add rows to time dimension, leaving null all fields but year and insert one row for each year.
Create a table with same structure (columns) of fact table call it say SALES_FACT_2 and aggregate fact table at year level. Thus TIME_ID entries will point to new entries created in pass 1.
Note that this table would be about 260 times smaller that fact table (supposing there are about 260 work days per year).
The same way there can be a second fact table aggregated by branch office, and a third table aggregated by year and branch office. We can build how many aggregate tables as we wish depending on usual query patterns executed by users.
Now performing retrieval is very simple. The query is always the same the only thing that changes is the name of fact table. Which fact table to use is determined by a simple algorithm (see books) that picks the table with smaller number of rows that has the detail level we need. Here is a typical scenario:
Suppose we request total sales this year. Table aggregated by branch and office would be used. It will probably be a small table and, if we have 30 branches our query will retrieve only 30 rows that will be added to accrue total sales this year.
User drills down by branch (requires branch detail). Same table will still be used because it contains branch detail.
Now user drill down by Sales Rep. The only table that contain sales rep detail is the non-aggregated fact table, thus that table will be used. But at this point we have drill down constraints in our query that reduce the number of retrieved columns.
All that is transparent and automatic. The only thing that changes is the name of possibly aggregated fact table. Again, no SQL to write and no more performance problems!
Keep in mind than using bitmap indexes
or column-oriented
DBMS can further greatly speed up retrieval.
Again, you can't do the same with a normalized database.
References
For more detail on ROLAP and star schema approaches I suggest you the following books:
Ralph Kimball – The Data Warehouse Toolkit (2nd ed.) : best reference on ROLAP and star schema. Also available from Amazon.
Ralph Kimball – The Data Warehouse ETL Toolkit: Practical strategies and algorithms for extracting clean data and feed fact and dimension tables. Also available from Amazon.
Christopher Adamson – Mastering Data Warehouse Aggregates: how to choose and use aggregate tables. Also available from Amazon.
Looking Ahead
Congratulations! You are now able to build what is usually called a data mart: this is a piece of a data warehouse specifically designed to provide strategic information about a particular aspect of business: sales, in our case. As you have seen building a data mart is very simple and can easily fit agile approach.
But what about production? And human resources? And the rest? You shall keep a consistent and integrated approach in order to provide a long term solution. Can you adopt a dynamic style creating data marts as needed) or do you need extra planning time that will possibly delay immediate tangible results?
In next post I will describe an agile approach that will cause no project overhead and will lead to a scalable and consistent data warehouse.
Interested in partnership?
Would like to try or coach this technique with your clients, do you need any additional or technical detail? Please let me know! Component Bases Solutions has great interest in partnership with consultants. We can help you automate your proposed solutions in a very short time. We can also help to increase your visibility through links from many management tools we make freely available on the Web. Please contact for more detail.
Did you like this post?
Please, spend a few seconds and click on button below to socially bookmark it or tweet it. It will help to promote this blog and improve its content, Thank you!
|
Posted at 12:00AM Oct 20, 2009 by admin in Business Intelligence | Comments[4]







Posted by Doug on October 21, 2009 at 07:16 PM UTC #
Posted by Colin Canfield on January 08, 2010 at 01:31 AM UTC #
Posted by 93.34.214.181 on January 21, 2010 at 08:34 AM UTC #
Colin,
I fully agree with you. This is the reason we developed PISA. It is common to see a simple PISA query being mapped to half a page of MDX statements scattered with a lot of WITH's. On the other side, as you said, most MDX generation tools were too limited, so we needed something that could map generic user constructs to a query language and obviously SQL is too far away. So MDX was best alternative.
Posted by Franco Graziosi on January 21, 2010 at 12:53 PM UTC #