NoSQL, as a buzzword, is raging hot nowadays. As with any hype, it gets hard for the layperson to separate the wheat from the chaff. The term means different things to different people, and prognostications announcing the early demise of relational database vendors abound, even though those vendors could easily acquire some of the companies or technology, or build it in-house as a new product line.
I am going to open the hood and attempt to explain some of the business and technical reasons why a new set of requirements has emerged. Quite a few of these requirements can, in theory, be met equally effectively by relational vendors or by the newer vendors who have all assembled under the marketing term NoSQL. Here are some technical reasons why traditional relational database vendors are being challenged.
i) Scale: The biggest driver of the new generation of data management systems was unprecedented scale. That scale forced companies to rely on clusters of commodity machines as their main workhorse, which was far more cost effective than special purpose hardware capable of handling the same load. A direct corollary of using a large number of commodity machines is that the probability of some part of the system failing at any given time rises quickly with the size of the cluster and approaches certainty for large clusters. Thus, systems built at such scale treat failure as the norm rather than the exception. In the context of a distributed system with replicas of a data item, this meant that such systems chose “Availability” over “Consistency” in the trade-off that the CAP theorem says one has to make when parts of the system cannot talk to each other. In contrast, traditional database systems are supposed to provide ACID support, which gives every customer the impression of interacting with the DBMS in isolation: their effects on the database will be seen in their entirety by those who follow them, and they will see the effects of those who precede them (and there is usually a crisp definition of who follows and who precedes, based on the actual algorithms used). This is a very important and non-negotiable feature when dealing with real Rupees and Paisa (or dollars and cents, or pounds and pence, etc.). For most web and Internet applications, however, customers as well as applications can live with some loss of consistency (different replicas holding different values at the same time), as long as there is a promise that the system will eventually become consistent.
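To put a number on why failure becomes the norm, here is a minimal sketch. It assumes that machines fail independently and that each one has the same chance of being down at a given moment; both the function name `prob_any_failure` and the 0.1% figure are illustrative assumptions, not measurements.

```python
def prob_any_failure(p_node: float, n_nodes: int) -> float:
    """Chance that at least one of n_nodes independent machines is down."""
    # P(all machines up) = (1 - p)^n, so P(at least one down) = 1 - (1 - p)^n.
    return 1.0 - (1.0 - p_node) ** n_nodes


# 0.001 is an assumed per-machine failure probability, chosen for illustration.
for n in (1, 10, 100, 1000, 10000):
    print(f"{n:>6} machines -> P(at least one failure) = {prob_any_failure(0.001, n):.4f}")
```

Under these assumptions a single machine is almost never down, while a cluster of a few thousand machines almost always has at least one failed member, which is why such systems are designed to keep serving requests through failures.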
ii) Disk Capacities have exploded and Random Disk Accesses have gotten cheaper: Back in the 1970s and 1980s, a terabyte of data was HUGE (hence the company Teradata, the benchmark TeraSort, etc.). Further, random access to data on a disk was (and is) so slow that Usain Bolt and traffic in Namma Bengaluru (a city in South India) appear to be in the same league from the vantage point of a disk. Fortunately, thanks to all the tourists with their mandatory Canon and Nikon cameras in tow (and the flash memory cards inside them), the cost of solid state disks has plummeted so much that they now offer an attractive point on the efficient frontier of the price-performance curve, one that was hitherto unavailable to practitioners. As a result, quite a few of the nimbler start-ups are optimizing themselves for use with solid state disks, and although there is “nothing NoSQL” about this, they are capturing the marketing imagination. As an aside, a veteran relational database professor and good friend of mine remarked, tongue in cheek as is his usual practice, that if there is NoSQL, there is NoRel and NoData and NoLife. To keep this essay balanced, I should hurry to point out that NoSQL may not stand for No SQL but for Not Only SQL!
iii) Fixed Schema is too rigid: Internet companies love to experiment rapidly and “fail fast”. As a result they want their applications to evolve fast, and rigid schemas that are difficult to evolve get in the way of such rapid experimentation. Schema evolution on a large data store is very time consuming, and ensuring no downtime during that period may be non-trivial. The notion of column families with the ability to add columns within them at will, as introduced by BigTable, as well as the notion of imposing a schema at query time, as is done in Pig Latin, reduces this pain to a great extent.
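The two ideas in this point can be sketched in a few lines. This is a toy, in-memory illustration only, not the BigTable or Pig API: `ColumnFamilyTable`, `read_with_schema`, and all the sample data are hypothetical names and values made up for this example.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy table: column families are fixed up front, columns are not."""

    def __init__(self, families):
        self.families = set(families)
        # row key -> family -> column -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Any column name is accepted inside a known family.
        self.rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        return self.rows.get(row_key, {}).get(family, {}).get(column, default)


def read_with_schema(lines, schema):
    """Apply a list of (name, type) pairs to tab-separated text at read time."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield {name: cast(raw) for (name, cast), raw in zip(schema, fields)}


table = ColumnFamilyTable(families=["profile", "activity"])
table.put("user:1", "profile", "name", "Asha")
# A brand new column shows up without any schema migration step.
table.put("user:1", "activity", "last_login", "2011-03-01")

# The stored data is plain text; the schema is supplied only by the reader.
raw = ["user:1\t3\t19.5", "user:2\t7\t4.25"]
records = list(read_with_schema(raw, [("user", str), ("clicks", int), ("spend", float)]))
```

In the first half, only the set of families is declared ahead of time. In the second half nothing is declared ahead of time, so two readers could interpret the same stored lines with two different schemas.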
iv) First Normal Form is too restrictive/rich: Yes, you read that right. There is a class of applications that just seem to require a persistent key-value store, for which the full-fledged relational model is overkill. Quite a few of the NoSQL data stores belong to this genre. On the other hand, there is a class of applications that want to model the real world more faithfully and want complex structure in the objects they store. Traditional databases treat the value in a column as atomic and do not ascribe any structure to it. In practice, the real world hardly fits a two dimensional tabular structure, and thus programmers love the ability to store sets and other complex containers as values for a column.
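A small sketch of the contrast, using made-up data: the same user is held once as a nested value under a single key, and once flattened into atomic rows as first normal form would require. The names `user_doc`, `kv_store`, and `to_first_normal_form` are hypothetical and exist only for this illustration.

```python
# A nested record: the "interests" column holds a set, not an atomic value.
user_doc = {
    "user_id": 42,
    "name": "Asha",
    "interests": {"cricket", "databases"},
}

# A key-value store in its simplest form: one opaque value per key.
kv_store = {}
kv_store[f"user:{user_doc['user_id']}"] = user_doc


def to_first_normal_form(doc):
    """Flatten the nested record into two tables of atomic values."""
    user_row = (doc["user_id"], doc["name"])
    # One row per (user, interest) pair; sorted only to make output stable.
    interest_rows = sorted((doc["user_id"], item) for item in doc["interests"])
    return user_row, interest_rows


user_row, interest_rows = to_first_normal_form(user_doc)
```

The flattened form needs a second table and a join to recover what the nested form returns with a single lookup by key. In exchange, the flattened form makes it easy to ask questions across users, such as who shares a given interest.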
v) Columnar Stores: Traditionally, relational database vendors have chosen to store entire rows together on disk, since that made a lot of sense for well normalized data stores. It may sometimes be desirable instead to store together related columns that are accessed together. Such a set of columns embodies one related entity in the real world. Storing it together leads to better compression of data as well as more efficient access. Relational databases have always allowed this by letting users vertically partition the tables in their schema. In the new regime, where first normal form is not a must, we essentially have a situation where different entities can be stored in one logical table and yet be physically separate. Bringing together the corresponding related columns of a row thus corresponds to a join in a traditional relational data store. Thus, it is not entirely fair to say that joins do not exist in NoSQL stores.
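The claim that reassembling a row amounts to a join can be made concrete with a toy sketch. It assumes every row carries a `key` field and that column groups are chosen by hand; `split_into_column_groups`, `reassemble`, and the sample rows are hypothetical and do not reflect any particular product's storage format.

```python
def split_into_column_groups(rows, groups):
    """Store each named group of columns separately, indexed by row key.

    rows:   list of dicts, each with a 'key' field
    groups: dict mapping a group name to the list of columns it holds
    """
    stored = {name: {} for name in groups}
    for row in rows:
        for name, columns in groups.items():
            stored[name][row["key"]] = {c: row[c] for c in columns if c in row}
    return stored


def reassemble(stored, group_names):
    """Rebuild rows by joining the named column groups on the row key."""
    keys = set.intersection(*(set(stored[g]) for g in group_names))
    result = {}
    for key in sorted(keys):
        merged = {"key": key}
        for g in group_names:
            merged.update(stored[g][key])
        result[key] = merged
    return result


rows = [
    {"key": "u1", "name": "Asha", "city": "Bengaluru", "clicks": 3},
    {"key": "u2", "name": "Ravi", "city": "Chennai", "clicks": 7},
]
stored = split_into_column_groups(
    rows, {"profile": ["name", "city"], "activity": ["clicks"]}
)
# A query touching only clicks reads stored["activity"] and nothing else.
full_rows = reassemble(stored, ["profile", "activity"])
```

A query that needs only one group never touches the others, which is where the access efficiency comes from. A query that needs the whole row pays for a key-based join in `reassemble`.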
vi) Programmatic Access is also important: A declarative or high level language such as SQL is imperative for non-technical users to be able to access the data in a data store. However, when the data is not in first normal form, or the schema is not known a priori, SQL is not the most appropriate choice. Vast legions of programmers who want to access large data stores sometimes prefer programmatic APIs, such as those provided by MapReduce, so that they can write their own code. Newer data stores that offer both a programmatic API and a high level API thus find favor with different classes of users.
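To show what the programmatic style looks like next to the declarative one, here is a single-process sketch of the map and reduce pattern. It is not the Hadoop or Google MapReduce API; `map_reduce`, `page_views_mapper`, `sum_reducer`, and the log lines are hypothetical and exist only for this example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def page_views_mapper(log_line):
    # The programmer decides how to parse the raw line; no schema is declared.
    user, url = log_line.split()
    yield url, 1


def sum_reducer(key, values):
    return sum(values)


logs = ["alice /home", "bob /home", "alice /cart"]
counts = map_reduce(logs, page_views_mapper, sum_reducer)

# Declarative equivalent, if the logs were already in a table with a url column:
#   SELECT url, COUNT(*) FROM logs GROUP BY url;
```

The SQL version is shorter but presumes a table with a known `url` column. The programmatic version works directly on raw lines, at the price of the programmer writing the parsing and aggregation logic by hand.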
In summary, although the word NoSQL is used to represent newer generation data stores that may have only a subset of the features mentioned above, not all of these features represent a radical departure from the SQL language. Of course, if SQL is used as a proxy for current-generation relational database vendors, then those vendors should be busy reinventing themselves to address this important set of requirements.