Thursday, June 07, 2012

From the ashes of the database revolution...

With NoSQL and Hadoop, the database world has undergone a revolution. The fighting reached its peak a couple of years ago, but things have calmed down since, and now is a good time to take stock of old and new style data management technologies.

From this revolution, we can learn a lot about what databases should, and should not be. At the end of this post, I propose a system, called Optiq, that would restore to NoSQL/Hadoop systems some of the good features of databases.

Learning from history

Revolutions tend to follow patterns. George Orwell allegorized the progress of the Russian Revolution in his novel Animal Farm. He described the injustices that were the trigger for the revolution, the new egalitarian value system established after the revolution, and the eventual corruption of those values. Revolutions are an opportunity to introduce new ideas, not all of them good ones. For example, the French revolution put in place a decimal system, and though they kept the kilogramme and the metre, they were forced to quickly relinquish the 10 hour day and the 10 day week when the workers discovered that they'd been conned out of 30% of their weekend time.

We see all of these forces at play in the database revolution. The triggers for the revolution were the requirements that traditional RDBMSs could not meet (or were not meeting at the time). The revolution adopted a new paradigm, introduced some new ideas, and threw out some old ideas. I am interested in which of those old ideas should be reinstated under the new regime.

I am a database guy. I was initially skeptical about the need for a revolution, but a couple of years ago I saw that Hadoop and NoSQL were gaining traction, had some good ideas, growing momentum, and were here to stay. My prediction is that traditional and new data management systems will grow more and more similar in appearance over the next 5-10 years. Traditional RDBMSs will adopt some of the new ideas, and the new systems will support features that make them palatable to customers accustomed to using traditional databases.

But first, some terminology. (As George Orwell would agree, what you call something is almost as important as what it is.)

  • I call the new breed of systems "data management systems", not "databases". The shift implies something less centralized, more distributed, and about processing as well as just storing and querying data. Or maybe I'm confusing terminology with substance.
  • I distinguish NoSQL systems from Hadoop, because Hadoop is not a data management system. Hadoop is a substrate upon which many awesome things can, and will, be built, including ETL and data management systems.
  • NoSQL systems are indeed databases, but they throw out several of the key assumptions of traditional databases.
  • I'm steering clear of the term "Big data" for reasons I've already made clear.

The good stuff

In the spirit of post-revolutionary goodwill, let's steer clear of our pet gripes and list out what is best about the old and new systems.

Good features from databases

  • SQL language allows integration with other components, especially components that generate queries and need to work on multiple back-ends.
  • Management of redundant data (such as indexes and materialized views), and physically advantageous data layout (sorted data, clustered data, partitioned tables)
  • ACID transactions
  • High-performance implementations of relational operators
  • Explicit schema, leading to concise, efficient queries

Good features from Hadoop/NoSQL systems

  • Easy scale-out on commodity hardware
  • Non-relational data
  • User-defined and non-relational operators
  • Data is not constrained by schema

Scale and transactions

Scale-out is key. The new data systems all run at immense scale. If traditional databases scaled easily and cheaply, the revolution would probably not have happened.

There are strong arguments for and against supporting ACID transactions. Everyone agrees that transactions have high value: without them, it is more difficult to write bug-free applications. But the revolutionaries assert that ACID transactions have to go, because it is impossible to implement them efficiently. Newer research suggests that there are ways to implement transactions at acceptable cost.

In my opinion, transactions are not the main issue, but are being scapegoated because of the underlying problem of scalability. We would not be having the debate — indeed, the whole NoSQL movement may not have occurred — if conventional databases had been able to scale as their users wanted.

To be honest, I don't have a lot of skin in this game. As an analytic database technology, Optiq is concerned more with scalability than transactions. But it's interesting that transactions, like the SQL language, were at first declared to be enemies of the revolution, and are now being rehabilitated.

Schema

Relational databases require a fixed schema. If your data has a schema, and the schema does not change over time, this is a good thing. Your queries can be more concise because you are not defining the same fields, types, and relationships every time you write a query.

Hadoop data does not have a schema (although you can impose one, after the event, using tools such as Pig and Hive).

The ideal would seem to be that you can provide a schema if the data conforms to a fixed format, provide a loose schema if, say, records have variable numbers of fields, or operate without one. In Hadoop, as in ETL tools, data is schema-less in early stages of a pipeline, stronger typing is applied in later stages of the pipeline, as fields are parsed and assigned names, and records that do not conform to the required schema are eliminated.

Location, control and organization of data

Traditional databases own their storage. Their data resides in files, sometimes entire file systems, that can only be accessed by the database. This allows the database to tightly control the access to and organization of the data. But it means that the data cannot be shared between systems, even between databases made by the same vendor.

Hadoop clusters are expensive, but if several applications share the same cluster, the utilization is kept high, and the cost is spread across more departments' budgets. Applications may share data sets, not just processing resources, but they access the data in place. (That place may or may not be HDFS.) Compared to copying data into an RDBMS, sharing data reduces the saves both time and money.

Lastly, the assumption that data is shared encourages applications to use straightforward formats for their data. A wide variety of applications can read the data, even those not envisioned when the data format was designed.

SQL and query planning

SQL is the hallmark of an RDBMS (at for those of us too young to remember QUEL). SQL is complicated to implement, so upstart open source projects, in their quest to implement the simplest thing that could possibly work have been inclined to make do with less powerful "SQL-like" languages. Those languages tend to disappoint, when it comes to interoperability and predictability.

But I contend that SQL support is a consequence of a solid data management architecture, not the end in itself. A data management system needs to accept new data structures and organizations, and apply them without rewriting application code. It therefore needs a query planner. A query planner, in turn, requires a metadata catalog and a theoretically well-behaved logical language, usually based on relational algebra, for representing queries. Once you have built these pieces, it is not a great leap to add SQL support.

The one area that SQL support is essential is tool integration. Tools, unless written for that specific database, want to generate SQL as close to the SQL standard as possible. (I speak from personal experience, having written Mondrian dialects for more than a dozen "standards compliant" databases.) Computer-generated SQL is not very smart — for example, you will often see trivial conditions like "WHERE 1 = 1" and duplicate expressions in the SELECT clause — and therefore needs to be optimized.

Flat relational data

There is no question that (so-called "flat") relational data is easier for the database to manage. And, we are told, Ted Codd decreed forty years ago that relational data is all we should ever want. Yet I think that database users deserve better.

Codd's rules about normalization have been used to justify a religious war, but I think his point was this. If you maintain multiple copies of the same information, you'll get into trouble when you try to update it. One particular, and insidious, form of redundant information is the implicit information in the ordered or nested data.

That said, we're grown ups. We know that there are risks to redundancy, but there are significant benefits. The risks are reduced if the DBMS helps you manage that redundancy (what are indexes, anyway?), and the benefits are greater if your database is read much more often than it is updated. Why should the database not return record sets with line-items nested inside their parent orders, if that's what the application wants? No reason that I can think of.

In summary, a data management system should allow "non-flat" data, and operations on that data, while keeping a semantics based, as far as possible, on the relational algebra.

Introducing Optiq

Optiq aims to add the "good ideas" from traditional databases onto a new-style Hadoop or NoSQL architecture.

To a client application, Optiq appears to be a database that speaks SQL and JDBC, but Optiq is not a database. Whereas a database controls storage, processing, resource allocation and scheduling, Optiq cedes these powers to the back-end systems, which we call data providers.

Optiq is not a whole data management system. It is a framework that can mediate with one or more data management systems. (Optiq could be configured and distributed with a scheduler, metadata layer, data structures, and algorithms, so that it comes out of the box looking like a database. In fact, we hope and expect that some people will use it that way. But that is not the only way it is intended to be used.)

The core of the framework is the extensible query planner. It allows providers to specify their own type systems, operators, and optimizations (for example, switching to a materialized view, or eliminating a sort if the underlying file is already sorted). It also allows applications to define their own functions and operators, so that their application logic can run in the query-processing fabric.

An example

You might describe Optiq as a database with the hood open, accessible to anyone who wants to tinker with the engine. Here is a simple example:

Class.forName("net.hydromatic.optiq.jdbc.Driver");
Connection connection =
  DriverManager.getConnection("jdbc:optiq:");
OptiqConnection optiqConnection =
  connection.unwrap(OptiqConnection.class);
JavaTypeFactory typeFactory = optiqConnection.getTypeFactory();
optiqConnection.getRootSchema().add(
  "HR",
  new CsvSchema("/var/flatfiles/hr", typeFactory));
ResultSet resultSet =
  connection.createStatement().executeQuery(
    "SELECT e.name, e.sal, d.name AS department\n" +
    "FROM hr.emps AS e, hr.depts AS d\n" +
    "WHERE e.deptno = d.deptno\n" +
    "ORDER BY e.empno");
while (resultSet.next()) {
  System.out.println(
    "emp=" + resultSet.getString(1) +
    ", sal=" + resultSet.getInt(2) +
    ", department=" + resultSet.getString(3));
}
resultSet.close();

The program requires a directory, /var/flatfiles/hr, containing the files EMPS.csv and DEPTS.csv. Each file has a header record describing the fields, followed by several records of data.

There is no other data or metadata, and in fact CsvSchema is an extension, not a built-in part of the system.

When the connection is opened, the virtual database is empty. There are no tables, nor even any schemas. The getRootSchema().add( ... ) call registers a schema with a given name. It is like mounting a file-system.

Once the CsvSchema is registered with the connection with the name "HR", Optiq can retrieve the table and column metadata to parse and optimize the query. When the query is executed, Optiq calls CsvSchema's implementations of linq4j's Enumerable interface to get the contents of each table, applies built-in Java operators to join and sort the records, and returns the results through the usual JDBC ResultSet interface.

This example shows that Optiq contains a full SQL parser, planner and implementations of query operators, but it makes so few assumptions about the form of data and location of metadata that you can drop in a new storage plugin in a few lines of code.

Design principles

The design of the Optiq framework is guided by the following principles.

  • Do not try to control the data, but if you know about the data organization, leverage it.
  • Do not require a schema, but if you know about the shape of the data, leverage it.
  • Provide the SQL query language and JDBC interface, but allow other languages/interfaces.
  • Support linq4j as a backend, but allow other protocols.
  • Delegate policy to the data providers.

Let's see how Optiq brings the "good ideas" of databases to a NoSQL/Hadoop provider.

Applying these principles to schemas, Optiq can operate with no, partial, or full schema. Data providers can determine their own type system, but are generally expected to be able to operate on records of any type: that may be a single string or binary field, and may contain nested collections of records. Since Optiq does not control the data, if operating on a schema-less provider like Hadoop, Optiq would apply its schema to already loaded data, as Pig and Hive do. If Optiq is assured that the data is clean (for example, a particular field is always an integer) then it may be able to optimize.

Optiq's type system allows records to contain nested records, and provides operators to construct and destruct nested collections. Whereas SQL/JDBC queries do not stretch the type system, linq4j gives Optiq a workout: it needs to support the Java type system and operations such as selectMany and groupBy that operate on collection types.

Lastly, on breaking down the rigid boundary between database and application code.

My goal in data-oriented programming is to allow applications, queries, and extension functions and operators to be written in the same language — and if possible using the same programming model, and on the same page of code — and distributed to where query processing is taking place.

The paradigms should be the same, as far as possible. (MapReduce fails this test. Even though MapReduce is Java, one would not choose to write algorithms in this way if there was not the payoff of a massively scalable, fault-tolerant execution infrastructure. Scalding is an example of a DSL that succeeds in making queries fairly similar to "ordinary programming".)

That said, Optiq is not going to fully solve this problem. It will be a research area for years to come. LINQ made a good start. Optiq has a query planner, and is open and extensible for front-end query languages, user-defined operators, and user-defined rules. Those tools should allow us to efficiently and intelligently push user code into the fabric of the query-processing system.

Conclusion

Optiq attempts to create a high-level abstraction on top of Hadoop/NoSQL systems that behaves like a database but does not dilute the strengths of the data provider. But it brings in only those features of databases necessary to create that abstraction; it is a framework, not a database.

Watch this space for further blog posts and code. Or catch me at Hadoop Summit next week and ask me for a demo.

7 comments:

egwada said...

Hi Julian !

So your proposition is like a sort of abstraction of database, as a virtual DB.

If we look Teiid of JBoss, I think that Optiq shares some ideas with this project.

Teiid project

http://www.jboss.org/teiid

Teiid presentation

https://github.com/derrickhackman/Presentations/raw/master/CDJDN_Teiid.pdf

regards

Julian Hyde said...

Charly,

I'm very familiar with Teiid (even back when it was Metamatrix). It's a nice project, and does a lot of cool things. Optiq is similar to Teiid in quite a few ways, especially in that it can be used to create a virtual database.

However, I'm broadening the scope with Optiq, and coming at the problem with a different design philosophy.

First, Optiq will work with non-relational and schema-free data.

Second, Optiq goes beyond mere query execution to data-oriented processing. That means accepting user code and pushing it into the query execution fabric (just as Hadoop does).

Third, When the operators are non-relational it would be difficult for a fixed query planner to know all of the optimization tricks. Therefore Optiq's planner has to be easily extensible. You can simply drop in extra planning rules, and they work with the existing rules.

Last, I would like Optiq to be used as a building block for a large variety of projects and products, both inside and outside the Hadoop ecosystem. It is therefore designed to be a framework with a small footprint, not a server. Since the architecture is embeddable and pluggable, it should make it easy for a diverse community to use and contribute to it.

In short: not a relational database (even a virtual one). (Not only relational, not a database, not a server. And extremely extensible.)

I'd like Optiq to be able to do some of the cool things that Teiid does (like automatically migrating tables to make joins easier, and automatically creating and maintaining materialized views) and I think I make some mention of those patterns in my blog post. But my goal is to make it possible by plugging rules into the framework, rather than those being core capabilities.

Julian

Paul said...

Have you taken a look at Mesos, Spark, Shark and the rest at UC Berkeley's AmpLab? http://amplab.cs.berkeley.edu/ These provide much of the schema, SQL, query optimization and other features you mention. No transaction support, however. They are running an "AmpCamp" where they describe & demo their projects.

Julian Hyde said...

Paul,

Yes, AmpLab is doing some great stuff. But they had to build the whole stack. Optiq would allow people to mix and match components from several stacks (Spark, Hadoop, Splunk).

simon-french said...

Julian,
We are looking at using Mondrian at the heart of our new reporting system.

Our real time system stores medical related documents as XML. As we were unsure about Oracle XMLs features at the time we opted to store the XML as LOBS, but took key parts of the data (start date, end date, document type, code, etc) and threw that data in alongside the xml and indexed on it for search purposes.

This worked very well up until now! We were told we would never have to report on the data.

We did a little experimentation a while back using Hibernate and EhCache, basically writing Hibernate Criteria queries, then making use of the tight integration between EhCache and hibernate to keep as much data in memory as possible. Upon the Hibernate re-hydration process the XML's associated JaxB object was looked up in a separate cache, and if not present the XML was converted into Jaxb Objects in a new thread and stored for later use.
This got us pretty far as we could make use of the database to filter a good part within the query and then use Java code if necessary later.

But flexible it was not.

It turns out most of that meta data we store in that table will work very well as facts.

I should also point out that although we will probably querying upto 5 million or so facts, only a small proportion will need the additional xml data knowledge, and we are willing to acknowledge the initial hit of the data will be slow, but thereafter fast due to the caching.

From what I have read Optiq might very well bridge the gap and allow us to make use Mondrian with a customized Optiq driver? Presumably we could expose parts of the XML as additional columns in the results sets.

This will work really well as we need to report on real time data, and it pains me to think we might have to introduce an ETL process for the additional data.

Your thoughts would be greatly appreciated

Regards

Simon

Julian Hyde said...

Simon,

Funny you should mention Optiq against Mondrian. I have been trying to keep the projects separate, but there seem to be quite a few cases where they can work together.

I wonder whether your problem might be solved with a "programmatic view", where the first few columns come from an underlying table, and the last few columns come from some programmatic means, such as getCachedDocument(id).getAttributeFoo().

If you think that would solve your problem I can work on it with you.

Optiq is not a runtime system or a storage structure. It doesn't make things faster; it just lets you choose better execution patterns against your existing storage structures. So in general, I'd say that if your data is not organized for efficient reporting access (and especially for efficient OLAP access, since OLAP often needs to read large amounts of data) Optiq is not going to solve the problem. But if your data is basically sound, and the only performance problem is an occasional reference to XML data, then I think the solution I sketched out above will help a lot.

A few weeks ago I prototyped an Optiq schema that loads data into in-memory tables the first time it is read. (I use a compact in-memory representation -- column store, mostly arrays of primitive values, and dictionary-based compression if applicable). I use a small number of large Java objects because I'd like it to work well with a data grid such as ehCache or Infinispan in future.)

I have been testing Mondrian against this in-memory cache cum database. (Initial tests are encouraging, but there is still quite a lot of tuning to be done. If it works out, future versions of Mondrian might use Optiq in-memory tables as a smarter version of Mondrian's distributed cache.)

If you think your data needs to be radically restructured for Mondrian to use it, you might consider this "ETL on read" solution. Again, let me know if I can help out.

simon-french said...

Hi Julian,

Apologies for mixing Mondrian / Optiq together.
My initial impression was that Optiq might well be a vehicle for bringing the best of RDBMS and NO-SQL together.

ETL on read would certainly work for me.

With a little more information being passed along
(key, table, the underlying connection/datasource) and maybe some CDATA from cube representation would be perfect.

As a side note I came across cqengine today. I suspect you already have in memory indexing well taken care of, but you might want to take a brief look at what cqengine has to offer
https://code.google.com/p/cqengine/

Regards

Simon