Tuesday, January 19, 2010

Data in Flight

An article of mine, "Data in Flight," has been published in this month's Communications of the ACM. In it, I explain, in layman's terms, why I think streaming database technology is a game-changer.

Many pundits have latched on to the term CEP (Complex Event Processing) to describe this technology. CEP is a legitimate and important application, and I believe that streaming SQL is a good way to solve it, but the article tries to put a bit of space between the two concepts. Many problems benefit from the declarative, relational approach but involve data that arrives incrementally, and a streaming engine working (mainly) in memory can solve them much more efficiently than a database can. CEP is just one such application area. My article describes a few of those problems.

I'm all fired up about streaming databases, just as I was when I co-founded SQLstream. I've worked in the database field for over 20 years, and I think it's the most exciting thing to happen in databases in a generation. (Yes, it's more important than data warehousing and, cough, object databases.)

Streaming SQL technology is rapidly becoming part of the standard toolkit for solving data management problems. If you're not familiar with the technology, reading the article is a good way to come up to speed. Enjoy!

8 comments:

Anonymous said...

Great article in one of my favorite journals.

Preston L. Bannister said...

RSS meets SQL (er, excuse me - meant to say "Atom"). Yes. Maybe.

Daniel Lemire said...

Congratulations. This is fantastic exposure.

Hans said...

I like the "continuous ETL" idea; I've used streaming SQL in this way.

There is something to be said for integrating an amount of procedural language capability with SSQL, because some things are much, much harder in SSQL.

Julian Hyde said...

Preston,

Yes, they're both about push, and data syndication. To a first approximation, the web (and social media) run on semi-structured data (XML, RSS, Atom) and business runs on structured data (SQL). So, we are seeing movements to make both forms of data flow efficiently in real time.

RSS and Atom are actually disappointing when you look closely. Although people call them 'feeds', in reality an RSS/Atom feed is just a web page, and any subscriber has to poll periodically to find updates. Folks like FriendFeed are working to resolve the deficiencies in the RSS/Atom protocols and make web feeds genuinely 'push'.

HTML5 web sockets are making that transition easier.

Julian

Julian Hyde said...

Hans,

I didn't mention it in the article, because I was trying to stay reasonably vendor-neutral, but SQLstream has a number of ways to extend the system using Java plugins.

There are SQL/MED adapters (to talk to external systems), user-defined functions (UDFs), and user-defined transforms (UDXs). A UDX is a piece of Java that reads from a stream (or streams) and writes rows to a stream, but once it has been registered you can use it in a query as an extra relational operator.

For example,

SELECT STREAM *
FROM stream1 OVER (
  RANGE INTERVAL '1' HOUR PRECEDING) AS a
JOIN STREAM(
  MyOperator(
    (SELECT STREAM *
     FROM stream2
     WHERE x > 5),
    77)) AS b
ON a.x = b.y

These plugins allow you to incorporate procedural language capability into your data flow, but plug the pieces together using declarative SQL.

SQL can't do everything, but as you know from relational databases, the few basic operators can be combined in powerful ways. We are still discovering the operator set for streaming queries -- for example, standard SQL doesn't have a way to rank a value within a sliding window -- and UDXs can be a useful stopgap. We can get the app written using a UDX, then change a few lines of SQL when the operator is implemented in the next release.
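As a rough illustration of the procedural logic such a stopgap might contain, here is a minimal Java sketch that ranks each arriving value against the rows in a trailing time window. It is plain Java and does not use SQLstream's plugin interfaces; the class name SlidingWindowRank, the add method, and the inclusive window boundary are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not part of any SQLstream API): ranks each arriving
 * value against the rows seen in a trailing time window.
 */
class SlidingWindowRank {
    /** One timestamped row; the field names are illustrative only. */
    private static final class Row {
        final long timestampMillis;
        final double value;

        Row(long timestampMillis, double value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    private final long windowMillis;
    private final Deque<Row> window = new ArrayDeque<>();

    SlidingWindowRank(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /**
     * Adds a row and returns its rank (1 = largest value) among the rows whose
     * timestamps lie in [timestampMillis - windowMillis, timestampMillis].
     * Assumes rows arrive in non-decreasing timestamp order.
     */
    int add(long timestampMillis, double value) {
        // Evict rows that have fallen out of the trailing window.
        while (!window.isEmpty()
                && window.peekFirst().timestampMillis < timestampMillis - windowMillis) {
            window.removeFirst();
        }
        window.addLast(new Row(timestampMillis, value));

        // Rank = 1 + number of rows in the window with a strictly larger value.
        int rank = 1;
        for (Row row : window) {
            if (row.value > value) {
                rank++;
            }
        }
        return rank;
    }
}
```

With a window of 1,000 ms, feeding in the values 10, 5, and 7 at times 0, 500, and 900 yields ranks 1, 2, and 2. The linear scan keeps the sketch short; a sorted structure would be a better fit for large windows.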

Julian

Anonymous said...

Being a newbie to the world of streams, I found the article a good read.

Bob Folkerts said...

Julian, this seems like a natural tie-in to the recent 'web hooks' discussions. That helps to move beyond the RSS as page and into data as streaming content. I'm trying to see how to stream metadata in SQL streams and use that to control media streams, but my brain is way too tired. This is exciting stuff.