MapReduce vs Data Warehouse

Have you heard about MapReduce recently? Do you know what it is? If you work in business intelligence, this technology will play a key role in your field over the next few years. Let me show you how it can help you right now.

What is MapReduce

In recent years we have witnessed a real “explosion” of data. IDC forecast the size of the “digital universe” in 2011 at 1.8 zettabytes. A zettabyte is one billion terabytes! To handle such huge amounts of data, Google devised a new approach. It is called MapReduce and, although a patent is pending, open-source implementations are already available. The most widely used is probably Hadoop.

MapReduce can crunch huge amounts of data by splitting the task over multiple computers that operate in parallel. This way, no matter how large the problem is, you can always add more processors (which today are relatively cheap).

Suppose you would like to know a client's exact balance today. Instead of querying a database, you can quickly fetch all of that client's transactions from plain text files (a kind of “log” file) and accumulate the balance.
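
To make the idea concrete, here is a minimal sketch in plain Python (the file name "transactions.log", the tab-separated layout and the client id are assumptions for the example, not part of any specific MapReduce framework): a map step turns each log line into a (client, amount) pair, and a reduce step sums the pairs per client.

    from collections import defaultdict

    # Map step: turn a log line "client_id<TAB>amount" into a (key, value) pair.
    # The tab-separated layout is an assumption of this sketch.
    def map_line(line):
        client_id, amount = line.rstrip("\n").split("\t")
        return client_id, float(amount)

    # Reduce step: sum every amount emitted for the same client.
    def reduce_balances(pairs):
        balances = defaultdict(float)
        for client_id, amount in pairs:
            balances[client_id] += amount
        return balances

    if __name__ == "__main__":
        # "transactions.log" is a hypothetical dump of all transactions.
        with open("transactions.log") as f:
            pairs = (map_line(line) for line in f if line.strip())
            balances = reduce_balances(pairs)
        print(balances.get("C-1042", 0.0))  # balance of one (hypothetical) client

In a real framework such as Hadoop, the map step runs on many machines over chunks of the files and the framework groups the pairs by key before the reduce step, but the two functions keep essentially this shape.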

MapReduce has proven to be an effective and practical solution, but, as with any new technology, the circle of early adopters is still limited to a few large corporations (Google, Yahoo...) for which data is a critical issue.

On the other hand, from a Business Intelligence viewpoint, MapReduce can play a key role in any company.

How MapReduce changes things

In a sense, MapReduce can be a substitute for a certain kind of “classic” Data Warehouse. Not an OLAP DW (tuned for multidimensional queries), but the typical normalized one built to store all corporate history, just in case somebody might need it.

Instead of using a normalized database that requires great design and tuning effort, simply dump all information into plain text files and retrieve it on demand. For analytic applications, for example, MapReduce can digest the data into a reduced set that is easily manageable in memory.

Let's take an example. Instead of keeping a database with all issued invoices, you can simply dump them all into a text file. If at any moment you need a sales analysis of the last few years, you can digest the required information through MapReduce and feed the result to R for even the most exotic analysis.
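
For illustration, here is a Hadoop-Streaming-style sketch in Python (the semicolon-separated invoice layout, the field positions and the file names are assumptions): the map step extracts a (year, amount) pair from each invoice line and the reduce step totals sales per year, producing a small result that R can read directly.

    import sys
    from itertools import groupby

    # Mapper: for each invoice line "invoice_id;date;customer;amount"
    # emit a (year, amount) pair. The layout is assumed for this sketch.
    def mapper(lines):
        for line in lines:
            fields = line.rstrip("\n").split(";")
            if len(fields) < 4:
                continue                      # skip malformed lines
            year = fields[1][:4]              # date assumed as YYYY-MM-DD
            yield year, float(fields[3])

    # Reducer: sum the amounts per year. Hadoop would hand the pairs to the
    # reducer already grouped by key; locally we sort and group ourselves.
    def reducer(pairs):
        for year, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield year, sum(amount for _, amount in group)

    if __name__ == "__main__":
        for year, total in reducer(mapper(sys.stdin)):
            print(f"{year}\t{total:.2f}")

Run locally (the script name sales_by_year.py is also just for the example), "python sales_by_year.py < invoices.txt > sales_by_year.tsv" gives a small file you can load into R with read.delim(); on a cluster the same two functions would run as streaming map and reduce jobs.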

Take action now!

Even if you do not have an immediate application for MapReduce, you can prepare your company for tomorrow by saving important data in simple text files. Just systematically dump and store all transaction data. Storage is quite cheap today, and transaction data is always available.
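
As a sketch of what "systematically dump" could look like, here is a minimal nightly job in Python (the SQLite file "erp.db", the "transactions" table and its columns are invented for the example; a real ERP or CRM would need its own query): it extracts the day's transactions and writes them to a dated, tab-separated text file.

    import csv
    import sqlite3
    from datetime import date

    # Minimal daily dump job. Adapt the query to your own schema.
    def dump_todays_transactions(db_path="erp.db", out_dir="."):
        day = date.today().isoformat()
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT id, customer_id, amount, posted_at FROM transactions "
            "WHERE date(posted_at) = ?", (day,)
        )
        out_path = f"{out_dir}/transactions-{day}.txt"
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f, delimiter="\t")
            # Write the column names as a header, so the file stays readable
            # even when the schema is long forgotten.
            writer.writerow(["id", "customer_id", "amount", "posted_at"])
            writer.writerows(rows)
        conn.close()
        return out_path

    if __name__ == "__main__":
        print(dump_todays_transactions())

Scheduled once a day (cron, Task Scheduler...), a job like this produces one plain file per day that MapReduce jobs can later read in bulk.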

This strategy offers an additional advantage if you use an ERP, CRM or another kind of packaged application: your data is safely stored in a neutral format, even if you decide tomorrow to change applications.

Guidelines

Let me give you only two very important guidelines:

  1. Focus on transactions and all data about transactions. Do not store master data or computed values; you can always rebuild them through MapReduce. Take CRM as an example: you do not need to store snapshots of prospect data, just the whole history of collected information, from which you can rebuild the current prospect data at any time.

  2. Store metadata with the data. This is very important: one of the major issues with historical data is decoding a format that has become outdated. If you store your data in text files as field-value pairs or XML, you will never have this problem (see the sketch after this list). You may run into missing fields, but you will always be able to read old data with minimal effort.
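
Here is a minimal sketch of the field-value idea in Python (the file name "contacts.txt" and the field names are illustrative): every record carries its own field names, so a reader written years later can still parse it and simply treats missing fields as absent.

    # Write each record as "field=value" pairs, one record per line,
    # so the field names (the metadata) travel with the data.
    def write_record(f, record):
        f.write("\t".join(f"{k}={v}" for k, v in record.items()) + "\n")

    # Read records back without assuming a fixed schema:
    # unknown fields are kept, missing fields are simply absent.
    def read_records(path):
        with open(path) as f:
            for line in f:
                pairs = (item.split("=", 1) for item in line.rstrip("\n").split("\t") if item)
                yield dict(pairs)

    if __name__ == "__main__":
        # "contacts.txt" and the field names below are hypothetical.
        with open("contacts.txt", "w") as f:
            write_record(f, {"date": "2011-05-24", "prospect": "ACME", "channel": "email"})
            write_record(f, {"date": "2011-05-25", "prospect": "Foo Ltd"})  # no channel field
        for rec in read_records("contacts.txt"):
            print(rec.get("prospect"), rec.get("channel", "(unknown)"))

XML would give the same self-describing property; the point is that every record explains itself.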
