Databases: relational vs object vs graph vs document
Is relational database dying? Why a model affirmed and finally consolidated over years is now object of discussion? Which alternatives exist and how they can match your needs in practical situations?
Follow me, in this post, to understand how databases evolved and how to choose between new technological alternatives that have already produced valuable results.
Note: this post focus operational (OLTP) databases. I have already addressed OLAP approach for strategic applications in many precedent posts.
A little history
Let's go back to 70s. At that time everybody was discussing which one (out of three) database model should adopt: hierarchical, network or relational. At that time everybody realized that a database system was a must, but there was still no winner.
The hierarchical model organized data into records of different types, each having a predefined set of fields. Records were arranged in a tree structure navigable through parent-child relationships. Despite the “hierarchical” label data could be organized exactly like the network model. A typical implementation was IMS that I think had (at that time) the largest database market share, at least in mainframe world.
The network model was a flexible alternative natively supporting many-to-many relationships. It was based on CODASYL consortium and implemented in products like IDMS. Data were also organized into records navigable through sets.
As you probably know the relational model organizes data into tables. Each table contains a set of records with common attributes or fields represented by columns. There is no predefined navigation and data can be accessed through relational algebra usually expressed in SQL language. The relational model was also born inside IBM which heavily promoted it, but today, in addition to DB2, many other implementations are available like Oracle (de facto market leader).
No matter the approach any database system is built to guarantee data consistency under concurrent usage by mean of ACID transaction support.
Relational won the battle
In the 80s the discussion was practically over. Relational model won the battle. Why? Was the relational model really superior? Did IBM itself kill his other alternative (IMS)? Did SQL standard play a key role?
Maybe all those reasons contributed to heavy adoption of relational model, but frankly I believe that the real killer factor is that it was extremely intuitive. It is quite intuitive for humans to organize data into tables: just consider how much spreadsheets are used and overused. The table paradigm was still more intuitive at that time with all those COBOL minded programmers.
On the other side information is by nature interrelated and not “contained in tables” thus better represented as a graph, exactly as network and hierarchical model foresight! An invoice and its items, for example, are naturally related independently on any shared key exists (or you are forced to invent) in relational implementation.
But, like in history, winner is always right and the rest are villain, so everybody started porting their applications to SQL (I still miss an excellent network product from Raima called DbVista).
The relational mismatch
Relational adoption was almost 100% at the end of millennium, but it is still used more like an
access method than as a database. Tables were so easy to create that
produced a chaotic proliferation with consequent data redundancy. Keys incompatibility and meaning conflicts between departments of the
same organization are common situations.
Maybe for this reason a new buzzword was introduced: the Data Warehouse whose goal was exactly the same for which database were originally conceived (organize data), but with a more pompous name everybody was hoping a more serious approach.
SQL itself soon left to be a standard, giving rise to a multitude of incompatible variations and dialects.
In addition to all such problem a new one was quickly taking shape: the heavy adoption of object oriented development. Objects reflect information and are naturally organized in networks or graphs and not tables. New developers with better abstraction that old COBOL school, realized that there was an Object Relational Impedance Mismatch.
On the other side most application are now stacked over mature and reliable relational engines. So what could be done? First (and still most widely used approach) was to automate conversion through what is usually called ORM (Object Relational Mapping).
ORM takes care of translating object-oriented navigation to specific SQL (and SQL dialect) you use. A well known example of ORM (possibly the widely adopted) is Hibernate. Far from perfect, we will surely inhabit with ORM for many years on.
But from now on, shall we continue using relational databases for new applications? Here we are facing a new crucial choice: shall we continue using what we bet on 30 years ago?
A cleaner alternative to ORM are object-oriented databases that directly add database functionality to the object oriented model and language itself. Object persistence becomes almost native and application code is tremendously simplified.
So why aren't object databases more widely used? Here are main reasons. If none of them is applicable to your case object database is probably a good choice:
Object oriented models are usually better bound to a specific environment. If you use exclusively Java, for example you have a lot of choices. But if your environment is hybrid (Java, Cobol, C++...) there is no common denominator.
Relational models are still more efficient in batch processing, and are backed by an infinity of tools for reporting and data exploration. Object databases are still limited in ad hoc reporting tools. On the other side, do you really need ad hoc reporting on operational data? Operational applications (OLTP) usually provide a predefined set of possibly customizable reports.
Your client or his policies can constrain or suggest a specific relational technology (shall run on …).
Relational models are usually backed by larger suppliers (what is considered more reliable at corporate level).
With massive advent of Internet, storing large amount of documents became a must. Such documents range from images to more or less structured text, including large chunks of information encoded in XML. Once again relational technology was not natively prepared to support such kind of data. BLOGs and CLOBs were later introduced in SQL but at the same time document oriented databases appeared.
Documents have other peculiarities. The first one is life cycle. While most business records are frequently updated (for example client's balance), documents are almost never updated (you don't update an image, you replace it). In case of text documents, replacement implies in versioning for both storage and auditability reasons. In addition, document transactions are extremely granular involving a single record: a perfect fit for REST services.
But what makes document databases really different, is the fact that documents are usually retrieved through dunamic and unpredictable queries. Thus document databases can usually associate any number of fields of any length to a document. This way you can store, together with a medical image, patient name and birth data. If you late decide to add also sex and profession, you can do it even if it wasn't originally conceived.
Document databases like eXist are usually schema-less: there is no predefined data model (the schema defines how database is organized). We will go back to schema-less databases about graph databases. Many document database also parse and index its textual content to allow quick retrieval under ad hoc classification and selection criteria.
In practice, you shall consider a document database only if your application is document based.
During last 3 years IT offering has dramatically changed its shape. Many applications are now offered as SAAS (software as a service), possibly supported by some kind of cloud computing. The big challenge is: different users, hosted on same database, have different data needs.
Consider a typical CRM application. Information collected about clients depends on product sold, market and many other factors. How can you handle that in a relational model? If you provide columns for every possibility you end up with an unmanageable system. Some application partially solve this issue through a small collection of extra fields, like CUSTOM1, CUSTOM2 … It is usually an unhappy and limited choice.
Conventional schema where columns or fields are predefined do not provide enough flexibility. Here graph databases come in hand. Graph databases are usually schema-less and allow a set of nodes (object instances) with dynamic properties (corresponding to columns or attributes) to be arbitrary linked to other nodes through edges (associations).
At this point you will surely ask: what about database integrity and consistency rules in an arbitrary network where you can stuff everything? Such rules can be conveniently implemented and enforced through API, quite better than using relational model.
So graph database like InfoGrid and Noe4J can be your choice if you need model flexibility. This is not limited to SAAS application, but to any application that can evolve over time and, if applications shall follow business, this more the rule than the exception.
Last modified on 2011-05-28 by Administrator