on target

Tuesday Oct 20, 2009

ROLAP: Starring your schema

In previous post we have seen the many problems when trying to use operational databases for strategic purposes. Now we look at a practical and simple solution based upon proven and scalable technology.

ROLAP is a data model that specifically built to support strategic, business-driven multidimensional analysis.

The term ROLAP comes from R (Relational) + OLAP and (as you probably guess) is based on relational databases. Using relational technology has many advantages:

  • Is supported by hundreds of different database engines.

  • Comfortably handles huge amount of data

  • Seamlessly integrates into current IT shop.

  • Supports innovative state-of-the art engines like column databases

  • No new languages to learn.

Star schema

The so called “star schema” is the key in ROLAP models. It is so called because its representation remembers a star, like this one that can hold exactly same data of normalized model shown in previous post:


Each box of such model becomes a SQL table. The center box is called fact table, usually very large, containing million or even billion of rows. It holds business measures plus foreign keys that relate each measurement to appropriate rows of other tables.


Dimensions

The other tables are called dimension tables and contain properties of different business dimension or view perspective. So we have client dimension holding properties associated with a client and whatever relates to it (Name, branch office, ...). For example:


ID

CLIENT_NAME

CORPORATION

ACCOUNT_REP_NAME

BRANCH_OFFICE

1

Bell Inc.

ACME

Jeff

LA

2

Lilly Inc.

ACME

Bill

NY

3

Bubble Ltd.

(None)

Bill

NY

...

...

...

...

...


Please notice that:

  1. Each dimension table uses as primary key a surrogate key (ID) through which it is linked to fact table. ID column shall be a surrogate key and not any attribute of the table we will see later why.

  1. Dimensions are not normalized, Corporation, for example, repeats into the table as well as Account Responsible name and most other properties. This is intentional! You shall not try to link a dimension table to other tables (for example linking client dimension to branch office). Doing that is called snowflaking because instead of a star the schema would take the shape of snow crystal. Snowflaking is a bad practice since we would be back to problems of normalized schema.

  1. Dimension tables are relatively very small if compared with the fact table. In our case client fact table would initially have one row per client.


A dimension deserves special attention and is usually present in any star schema: time dimension. Such table is usually absent from operational database but is very important to overcome the problems of date math.

 Time dimension usually contains one row per day (thus a very little table), for example:


ID

YEAR

QUARTER

MONTH

DAY

FISCAL_YEAR

WORK_DAYS

1

20010

1

January

Friday

2009

1

2

20010

1

January

Saturday

2009

1

3

20010

1

January

Sunday

2009

1

4

20010

1

January

Monday

2009

2

5

20010

1

January

Tuesday

2009

3

...

...

...

...

...

...

...

...

20010

1

March

Monday

20010

...


Notice the columns FISCAL_YEAR that tells us which fiscal year each day refers to and WORK_DAYS that tells us how many work days passed from an arbitrary initial date. Saturday and Sunday and holidays do not count as work days, thus column value is not incremented.

 

Fact Table

Now let's look in more detail fact table:


CLIENT_ID

PRODUCT_ID

TIME_ID

QTY_SOLD

AMOUNT_SOLD

1

1

1

100

12400

12

13

9

70

99450

39

6

3

9

10887

...

...

...

...

...


As you can see the fact table is solely composed of;

  • References to dimension tables (CLIENT_ID, PRODUCT_ID and TIME_ID).

  • Additive and semi-additive business measures QTY_SOLD, AMOUNT_SOLD).


A measure is called additive when aggregating it along all dimensions make sense. This is the case of AMOUNT_SOLD. You can aggregate AMOUNT_SOLD along time to get, for example, yearly sales, or along product to get sales by brand, or along client to compute sales by corporation.

A semi-additive measure can usually be aggregated along some (but not all) dimension. Aggregating QTY_SOLD, for example, would not make much sense along different products.

A non-additive measure is a measure that is neither additive nor semi-additive, like, for example unit price. You can't aggregate unit price along any dimension, but you can always compute unit price by dividing AMOUNT_SOLD by QTY_SOLD.

When designing fact table, you shall look at measures specification and replace any non-additive measure into (semi)additive measures used to compute it.

The “grain” of a fact table is its detail level. In this case the grain is client × product × day. Should we have two invoices lines for the same client, same product, same day, they would be aggregated in a single row of the fact table.

In case of a finer grain, for example invoice level, two different lines would be used. If, on the other side, we choose a larger grain like month instead of day, our fact table would be considerably smaller (up to 1/30) but we would loose daily analytics.

Note: fact table can also contain the so called “degenerate dimensions”. Those are simple attributes that would not make sense to store on separate table. Suppose you need to know if payment was cash or credit card. Creating a dimension for such a simple flag would be excessive overhead and you can simply store it in fact table. There is nothing wrong in using degenerate dimensions. you can always change your mind later and refactor them to normal dimension if you need.

Now let's appreciate some advantages of star schema over normalized alternative of previous post.

 

Simplified queries

First advantage is query simplicity. Any query, in this case, would involve no more than 4 tables. Given the aggregate fields you need you can easily automate the constraints to be injected into your query. So it is extremely simple to write a query generator given retrieved fields, and SQL nightmare is gone away forever. In addition there are plenty of query tools that can write that code for you with minimal configuration.

Remember the problem of fiscal year? It is elegantly solved through an extra column all what we need is aggregate by fiscal year. Time dimension also let us easily do very complex date computations. Say you would like to know average sales by worked day in a period. We can simply divide AMOUNT_SOLD by the difference of WORK_DAYS between end and start of that period.

And what if you need add geographic information to client? Simply add needed columns and new analysis will be immediately available including history data!


Dynamic dimensions

But main issue has still to be solved. What happens when Jeff is replaced from LA to NY. Slowly changing dimensions elegantly solve that problem.

Whenever we detect such change we insert a new rows into client dimension table:


ID

CLIENT_NAME

CORPORATION

ACCOUNT_REP_NAME

BRANCH_OFFICE

1

Bell Inc.

ACME

Jeff

LA

2

Lilly Inc.

ACME

Bill

NY

3

Bubble Ltd.

(None)

Bill

NY

...

...

...

...

...

137

Lilly Inc.

ACME

Jeff

NY


Now that Jeff has now taken Lilly account. New invoices will reference row 137 and all data and history wold be kept consistent and correct at client, branch and account rep level. Should Lilly be sold to another corporation a similar duplication will take place for the corporation column.

Systematically applying such changes can seem tricky but can be done through a simple algorithm described in references (see books below).

Now you understand why all ID shall be surrogate keys and why a normalized approach would not allow this implementation.

 

Aggregate tables

Now let's face performance issue. Fact table can quickly reach billions of rows in certain cases (just think in terms of telecom traffic). How can we speed up retrieval?

Aggregate tables solve, again elegantly, this issue. An aggregate table is a fact table that has been pre-aggregated, but that has exactly the same structure of fact table.

If, for example, we decide to use a yearly aggregated fact table we shall make two things:

  1. Add rows to time dimension, leaving null all fields but year and insert one row for each year.

  2. Create a table with same structure (columns) of fact table call it say SALES_FACT_2 and aggregate fact table at year level. Thus TIME_ID entries will point to new entries created in pass 1.


Note that this table would be about 260 times smaller that fact table (supposing there are about 260 work days per year).

The same way there can be a second fact table aggregated by branch office, and a third table aggregated by year and branch office. We can build how many aggregate tables as we wish depending on usual query patterns executed by users.

Now performing retrieval is very simple. The query is always the same the only thing that changes is the name of fact table. Which fact table to use is determined by a simple algorithm (see books) that picks the table with smaller number of rows that has the detail level we need. Here is a typical scenario:

  1. Suppose we request total sales this year. Table aggregated by branch and office would be used. It will probably be a small table and, if we have 30 branches our query will retrieve only 30 rows that will be added to accrue total sales this year.

  2. User drills down by branch (requires branch detail). Same table will still be used because it contains branch detail.

  3. Now user drill down by Sales Rep. The only table that contain sales rep detail is the non-aggregated fact table, thus that table will be used. But at this point we have drill down constraints in our query that reduce the number of retrieved columns.


All that is transparent and automatic. The only thing that changes is the name of possibly aggregated fact table. Again, no SQL to write and no more performance problems!

Keep in mind than using bitmap indexes or column-oriented DBMS can further greatly speed up retrieval.

Again, you can't do the same with a normalized database.

References

For more detail on ROLAP and star schema approaches I suggest you the following books:


Looking Ahead

Congratulations! You are now able to build what is usually called a data mart: this is a piece of a data warehouse specifically designed to provide strategic information about a particular aspect of business: sales, in our case. As you have seen building a data mart is very simple and can easily fit agile approach.

But what about production? And human resources? And the rest? You shall keep a consistent and integrated approach in order to provide a long term solution. Can you adopt a dynamic style creating data marts as needed) or do you need extra planning time that will possibly delay immediate tangible results?

In next post I will describe an agile approach that will cause no project overhead and will lead to a scalable and consistent data warehouse.

 

Interested in partnership?

Would like to try or coach this technique with your clients, do you need any additional or technical detail? Please let me know! Component Bases Solutions has great interest in partnership with consultants. We can help you automate your proposed solutions in a very short time. We can also help to increase your visibility through links from many management tools we make freely available on the Web. Please contact for more detail.

Did you like this post?

Please, spend a few seconds and click on button below to socially bookmark it or tweet it. It will help to promote this blog and improve its content, Thank you!

Print

Comments:

Thanks for your post. I am doing the similar thing for our data mart. But in my case the aggregated tables are all with different names and column structures. What's the benefit of having the aggregated tables' names the same with the fact table? I guess own scenario forces us to do differently. Doug

Posted by Doug on October 21, 2009 at 07:16 PM UTC #

Great article, You describe well the problems of doing warehouse type queries using relational technology but you haven't really touched on what I think is one the major obstacles for wider OLAP adoption. SQL is simple and straightforward, you pretty much can see exactly what you will get with time and thought. MDX is horrific; you can loose days trying to interpret small vendor quirks, ambiguities in syntax, result just not returning etc. While there are visual explorer tools, they are always limited. Until MDX has somewhat the same level of maturity as SQL it is hard to imagine OLAP will proliferate to the degree it should for the business problems it can solve.

Posted by Colin Canfield on January 08, 2010 at 01:31 AM UTC #

Sorry for late reply (I didn't noticed the comments queue). Doug: the advantage of using same column names and modifying table name is that the query is always the same so it can be generated automatically from an algorithm that only chooses best table depending on size. A simple algorithm to do that is described in Kimbal's book Data Warehouse Toolkit.

Posted by 93.34.214.181 on January 21, 2010 at 08:34 AM UTC #

Colin,

I fully agree with you. This is the reason we developed PISA. It is common to see a simple PISA query being mapped to half a page of MDX statements scattered with a lot of WITH's. On the other side, as you said, most MDX generation tools were too limited, so we needed something that could map generic user constructs to a query language and obviously SQL is too far away. So MDX was best alternative.

Posted by Franco Graziosi on January 21, 2010 at 12:53 PM UTC #

Post a Comment:
Comments are closed for this entry.

Subscribe & Feeds

join our mailing list

Recent Posts

Search

Visitors

Links