Tuesday, June 29, 2010

SQL past and future

Ken North, writing in Dr. Dobb's Journal, gives a nice overview of the long and storied history of SQL. The piece helps one understand the wave of mergers among the big database vendors, and make sense of current trends in database and database-like software. And I'd like to offer my opinion about where SQL and database management systems are headed.

North looks into the claims that 'the database is dead' and finds that — yet again — reports of its death were greatly exaggerated:
Forrester Research recently estimated the total database market (licenses, support, consulting) would grow from $27 billion in 2009 to $32 billion by 2012. SQL technology is entrenched in many organizations and across millions of web sites. Perhaps that explains why, during the past decade, IBM, Oracle, Sun and SAP made billion-dollar investments in a ‘dead’ technology.
However, I do believe that the relational database is currently in crisis. Relational databases have been the mainstay of data management for over twenty years, but Oracle and its cohorts have no answer for "Big Data", the massive onslaught of information from the web and sensors.

The NoSQL movement is solving these problems by challenging some of the assumptions held by RDBMS vendors. At SQLstream, we regard ourselves as part of the NoSQL movement even though we are huge fans of SQL, because we are challenging the biggest assumption of them all: that you have to put data on disk before you can analyze it.

It's a shame that North doesn't mention streaming SQL, because it fits perfectly into the grand arc of the SQL language: adopt new problems, express them declaratively, and solve them first with special-purpose database engines and finally by adapting the architecture of the big, general-purpose database engines. This last step sometimes takes many years to happen, but it happened for transaction processing, object database, and data warehousing, and I have no doubt that it will happen for streaming relational data.

One of the reasons that SQL has remained relevant is SQL standards process; products built on one database can be run on another database and, perhaps more important, skill sets acquired on one engine can be applied to another. When the dust settles, and the big databases have learned hard architectural lessons, I think a lot of these new problems will be solved in SQL. 

Unlike Mike Stonebraker, I do think that organizations will want to put all these different forms of data into one database management system. That database will of course be a facade spread over many servers, disks, data organizations and query processing engines, but will offer centralized management and allow the different forms of data to be combined. They will get their wish because the SQL language is so powerful at hiding differences in underlying data organization.

When the dust has settled, the SQL language will have changed and adapted yet again, and maybe there will be some new names at the top of the roster of database vendors, but we will once again be solving most of our data management problems using declarative queries beginning with the word "SELECT". SQL is dead; long live SQL!

7 comments:

rpbouman said...

Nice piece!

I disagree here though:

"One of the reasons that SQL has remained relevant is SQL standards process; products built on one database can be run on another database"

I mean, the last SQL standard is the 2008 edition (AFAIK), yet all major RDBMS-es claim no more than support for a subset of SQL:99. At the same time, all major RDBMS-es offer a ton of idiosyncrasies which you typically need to use to get decent performance for particular tasks.

As a result, it is rare to find non-rivial SQL that actually runs on multiple RDBMS-es with the same or equivalent effect (Sidenote: there's an excellent comparison site here http://troels.arvin.dk/db/rdbms/).

To be fair, most differences are found in DDL and in the functions, so once the schema is defined you can write cross-RDBMS applications without jumping through too much hoops as long as you use SQL only as a storage and retrieval interface (not data transformation)

I do agree with this:

"...perhaps more important, skill sets acquired on one engine can be applied to another."

I've been trying to think of other reasons why SQL is so successful, or at least, ubiquitous, and I must say I have trouble coming up with good reasons. Although I like SQL, it has it's share of problems IMO, especially where it comes to integrating it with application programming languages (object oriented or otherwise). Anyway - I'm going off-topic here, I should write about that myself sometime.

Anonymous said...

kinda surprised that nobody has jumped on the XML support in DB2 (is it also in DB2 Express?). it supports Xpath and optional automatic schema validation but more importantly it supports indexes too. Seems like it could keep a lot of use cases for NoSQL (mainly the less rigid schema) in the context of an SQL RDBMS.

Unknown said...

Long live SQL !

Anonymous said...

@Roland: while SQL RDBMS tend to make schema changes hard, they make changing the queries easy. A lot of what you gain from schema "less" NoSQL databases is bought with having to refactor your schema (or making alterations to it) whenever you add a new query to your app.

Furthermore its easy to just do simple one shot queries to browse through your data, migrate data around etc.

NoSQL solves a lot of problems that are hard to do with SQL RDBMS. But it also requires more time to do many of the things that are easy with an SQL RDBMS.

Julian Hyde said...

I agree with you that vendor compliance levels leave a lot to be desired. But I view it as a glass half full: their core SQL implementations are more or less the same, and when they implement non-core features they tend to use the same syntax.

A few examples:

1. Oracle, DB2, SQL Server and SQLstream support windowed aggregation using syntax very close to that in the SQL:2003 standard (e.g. "SELECT sum(quantity) OVER (PARTITION BY orderId)". The syntax is difficult to learn, but the concept is extremely powerful, and when you've learned it for one database, you can use it on others.

2. Oracle, DB2, SQL Server support the GROUPING SETS mechanism to compute multiple levels of aggregation in the same query. Again, the syntax is very similar among vendors.

3. MySQL allows backslash in character literals. It may help their users, but no other vendor does this. In my opinion, breaking interoperability for such a minor feature is not worth it.

4. Oracle still does not support VALUES in queries (except for INSERT). So, for single-row queries, one has to fall back on the poor old DUAL table. Obviously, these queries fall flat on their face on other databases.

5. Several databases still do not support the BOOLEAN data type.

6. Oracle still does not allow AS keyword in the FROM clause (e.g. 'SELECT * FROM emp AS e').

Your point that "it is rare to find non-trivial SQL that actually runs on multiple RDBMS-es" is well taken. But it is mitigated by the fact that a lot of that SQL is DDL run at install time, and also by the fact that a lot of SQL run these days is generated by engines. Mondrian is one such engine. Mondrian's dialect support more complicated than I would like, but the SQL standard has at least made the problem tractable.

The SQL standard is so rich that no vendor is going to support everything -- even the huge vendors. So we shouldn't say that the standard has failed if not everyone implements every feature. However, it has failed if a vendor adds a feature that is covered by the standard and chooses to do it another way. (Or, to be charitable, hasn't taken the time to read the huge and complex standard.)

Contrast that with the HTML standard, where full compliance is mandatory. A web site can't dictate what web browsers its users will use, so every browser needs to support the full HTML standard. You have criticized Internet Explorer 8 for failing to support new HTML features, and for supporting other features in a non-compatible way, and I agree with you. But SQL is a different kind of standard. A SQL client can choose which servers it will run on, and that makes full compliance much less important. It allows the standard writers to put features into the standard that most vendors will not implement for years, if at all; and that is a good thing, because it allows SQL to grow to meet new challenges, and allows specialized databases to grow up and still stay within the broad parameters of the standard.

As you say, the major vendors claim to support no more than SQL:99, yet they do support features from SQL:2003 (XML, window functions), SQL:2006 (more XML features) and SQL:2008 (TRUNCATE). You infer that they have stopped supporting the standard, but I my reading is different. The vendors view the core standard as more important and the new features, SQL:99, as the last major revision of the standard, has better brand recognition than SQL:2003 or SQL:2008. As I have noted, they continue to add features in more current versions of the standard.

I do wish that they would go back and ensure complete compliance with core, everyday features such as the BOOLEAN datatype, character literals, and the AS keyword. It would allow many more SQL statements to run unchanged on multiple databases, and it would remove a lot of code from Mondrian's code to support dialects. But all in all, I view the SQL standard as a glass much more than half full.

rpbouman said...

Julian,

I think you raise very good points, and I appreciate your point of view on discerning between different kinds of standards, as in the HTML vs SQL example. I think you're right, and I'm glad for the insight.

So, thanks! That was all :)

Unknown said...

Great post, Julian. Great to see technical discussion around the pro's and con's of SQL rather.

For folks interested in a hybrid approach of SQL integrated tightly with MapReduce programming model (trying to fuse the benefits of both approaches), there is some
discussion here.