Julian Hyde on Streaming Data, Open Source OLAP. And stuff.: 01/01/2009

Friday, January 30, 2009

Recursive event-driven demo

I just gave a demo of SQLstream to Joe di Paolantonio. I gave Joe a couple of the usual demos, including the one where we read feeds from Twitter, Google news and various RSS feeds. Joe sent a real-time commentary as a series of tweets, and these duly showed up in the demo.

A bit of a stunt, but it was cool, in a geeky Hofstadter-esque recursion kind of way. Then we had lunch; good chinese food, and blissfully offline.

Thursday, January 29, 2009

Event-driven marketing

David Raab writes a great piece on SQLstream, its internals, and how it can be applied to event-driven marketing.

Monday, January 26, 2009

SQLstream release 2.0 hit the streets today. It's actually our third release, but it's the first one we've made a fuss about. Since release 1.1, we have hardened the product at customer deployments, introduced 64 bit support to allow larger working memory, and added two major SQL extensions: streaming aggregation and user-defined transforms.

Streaming aggregation adds support for the GROUP BY construct to streaming SQL, and allows us to compute totals on a periodic basis, for example computing hourly subtotals. SQLstream acts as a continuously operating conduit between operational systems and the data warehouse, replacing the traditional batch-based ETL process and populating the fact table and aggregate tables simultaneously. This gives us a natural synergy with my other project, the Mondrian OLAP engine, and we have been getting good take-up among Pentaho's customers.

Streaming aggregation builds on relational operators in SQLstream 1.1 such as joins, windowed aggregations and unions. Those relational operators allowed you to build queries such as fraud detection, looking for anamolous rows in real time. Those were 'needle in the haystack' kinds of problem, and the new features also allow you to do build a high-performance 'water main' between your operational system and data warehouse.

User-defined transforms allow you to define new relational operators in Java and incorporate them into streaming SQL statements. SQLstream 1.1 had user-defined functions, to allow you to compute scalar quantities in Java, and syntactically, user-defined transforms are simply functions that have JDBC ResultSets and PreparedStatements as parameters. If a function has a PreparedStatement as a parameter, SQLstream lets you include it in the FROM clause of a SQL statement as a data source, alongside regular streams, tables, and views. Similarly, if a function has a ResultSet as a parameter, then you can pass in a cursor based on a SELECT statement as an argument.

User-defined transforms are an excellent example of our ongoing collaboration with the open-source Eigenbase project. User-defined transforms were originally developed for LucidDB to operate on traditional stored relational data, with SQL:2003-compliant syntax, and we extended them to handle streaming relational data, but keeping the syntax the same. For more about user-defined transforms in LucidDB, see the excellent documentation at Eigenbase.

Release 2.0 is a major milestone for SQLstream, and is the culmination of several years of development. It allows you to tackle in industry-standard SQL some application areas that previously required guile and custom coding. Go to www.sqlstream.com and see whether there is a fit with your real-time BI application.

Thursday, January 22, 2009

Pentaho Analysis Tool

Some folks are working on an olap4j-based viewer as an alternative to JPivot, called Pentaho Analysis Tool.

The key developers Tom and Luc tell me that they noticed that the halogen project hadn't changed in a few months, so they took the halogen source code (based on GWT, by the way) and started to take it in the direction of the OLAP viewer they'd like to see.

(Edit: There are actually three key developers. I forgot to mention Paul Stöllberger. Sorry Paul!)

No hard feelings! I, and some other key Pentaho folks, are delighted that this project is happening, and will support it any way we can.

It's ironic that when Pentaho seeded a project to build an olap4j-based viewer, they chose an organic, open-sourcey name 'halogen', yet these folks (none of whom works for Pentaho) chose a name that whiffs of corporate branding.

A rose by any other name, as they say. And despite the name, the viewer should work on top of any olap4j data source (which today means Mondrian and any OLAP engine with an XMLA interface).

Gartner releases Magic Quadrant for Business Intelligence

Gartner have released their 2009 Magic Quadrant for Business Intelligence (via DBMS2).

Pentaho has not made it onto the quadrant diagram yet (I suppose because they have not crossed the $20M revenue threshold), but earns its own paragraph in the accompanying commentary:

Pentaho, after just four years in existence, has put together a comprehensive open-source BI platform that includes data integration and data mining capabilities. In 2008, Pentaho was noticeably more aggressive, openly competing against traditional BI platform vendors. Like Jaspersoft, Pentaho is affordable and also offers a subscription-based model that avoids an initial large payment for the software license. Some of the significant features Pentaho introduced in 2008 include an automatic table designer that analyzes relational schemas and data patterns, performs a cost-benefit analysis of aggregation at different levels, and generates and populates those aggregate tables. Despite a handful of large customers, Pentaho reference survey respondents more frequently indicated that they had more departmental deployments (versus enterprisewide) and smaller data volumes compared with the other vendors.

Nice that the aggregate table designer gets a call-out. It's very important in helping Mondrian scale to enterprise-scale data warehouses. (And besides, it was a lot of work to write!)

The report calls out the quality of Pentaho's customer support:

This was in evidence in the MQ reference survey, as both Jaspersoft and particularly Pentaho scored strongly on the customer support question — higher than any of the megavendors.

Wednesday, January 07, 2009

Hard-won lessons in Mondrian query optimization

Mondrian is generally very smart in how it chooses to implement queries. Over the last month or so, I have learned some lessons about how hard can be to make Mondrian smarter.

As a ROLAP engine (I prefer to call it 'ROLAP with caching'), Mondrian's evaluation strategy has always been a blend of in-memory processing, caching, and native SQL execution. Naturally there is always SQL involved, because Mondrian doesn't store any of its own data, but the question is how much of the processing Mondrian pushes down to the DBMS and how much it does itself, based on data in its cache.

The trends are towards native SQL execution. Data volumes are growing across the board, Mondrian is being deployed to larger enterprises with large data sets (in some cases displacing more established, and expensive, engines). Mondrian cannot keep up with the growth by simply pulling more data into memory and throwing one or two more CPU cores at the problem.

Luckily a new breed of database engines, including Aster Data, Greenplum, Infobright, Kickfire, LucidDB, Netezza and Vertica, are helping to solve the data problem with innovative architectures and algorithms. To exploit the power of the database engine, Mondrian's ability to generate native SQL is more important than ever.

I have spent the last few weeks struggling to make Mondrian handle a particular case more efficiently. It was ultimately unsuccessful, but it was a case where defeat teaches you more than victory.

Here is the actual MDX query:

WITH
SET [COG_OQP_INT_s9] AS
  'CROSSJOIN({[Store Size in SQFT].[Store Sqft].MEMBERS},[COG_OQP_INT_s8])'
SET [COG_OQP_INT_s8] AS
  'CROSSJOIN({[Yearly Income].[Yearly Income].MEMBERS},[COG_OQP_INT_s7])'
SET [COG_OQP_INT_s7] AS
  'CROSSJOIN({[Time].[Time].MEMBERS}, [COG_OQP_INT_s6])'
SET [COG_OQP_INT_s6] AS
  'CROSSJOIN({[Store].[Store Country].MEMBERS},[COG_OQP_INT_s5])'
SET [COG_OQP_INT_s5] AS
  'CROSSJOIN({[Promotions].[Promotions].MEMBERS}, [COG_OQP_INT_s4])'
SET [COG_OQP_INT_s4] AS
  'CROSSJOIN({[Promotion Media].[Promotion Media].MEMBERS},[COG_OQP_INT_s3])'
SET [COG_OQP_INT_s3] AS
  'CROSSJOIN({[Store Type].[Store Type].MEMBERS}, [COG_OQP_INT_s2])'
SET [COG_OQP_INT_s2] AS
  'CROSSJOIN({[Marital Status].[Marital Status].MEMBERS}, [COG_OQP_INT_s1])'
SET [COG_OQP_INT_s1] AS
  'CROSSJOIN({[Gender].[Gender].MEMBERS},
    {[Education Level].[Education Level].MEMBERS})'
SELECT {[Measures].[Unit Sales]} ON AXIS(0),
  NON EMPTY [COG_OQP_INT_s9] ON AXIS(1)
FROM [Sales]
WHERE ([Customers].[All Customers].[USA].[CA].[San Francisco].[Karen Moreland])

The query looks a bit fearsome, but is quite likely to occur in practice as a business user slices and dices on several attributes simultaneously. The rows axis is a CrossJoin of ten dimensions, but because of the filtering effect of the slicer (combined with NON EMPTY) the query evaluates to a single row. The goal is to make Mondrian generate a SQL statement to evaluate the axis.

Each way that I tried to write the logic, I ended up making decisions that made other optimizations invalid. It was difficult to make Mondrian see the big picture: that, although named sets are not supposed to inherit the context where they evaluated, in this case it was OK; and to recognize a complex expression (many nested CrossJoin operators, slicer, and implicit non-empty context), and convert the whole thing into a single SQL statement. For instance, in one attempt I succeeded in generating a SQL statement which evaluates very efficiently, but in so doing I had to let the non-empty context of the evaluator leak into places that it shouldn't... which broke quite a few existing queries, in particular queries involving calculated sets.

There are several conclusions for Mondrian's architecture. One conclusion is that we need to deal with filtering non-empty tuples as part of the expression, not as a flag in the evaluator (the data structure that contains, among other things, the set of members that form the context for evaluating an expression).

MDX has an operator, EXISTS, that specifies that empty tuples should be removed from a set. Then we can reason about queries by applying logic-preserving transformations (just the way that an RDBMS query optimizer works), which should be safer than today's ad hoc reasoning. For example, if I am a developer implementing an MDX function and the evaluator has nonEmpty=true, am I required to eliminate non-empty tuples or am I merely allowed to eliminate them? (In other words, will my caller return the wrong result if I forget to check the evaluator flag?) I often forget, so I suspect that filtering of empty tuples is performed inconsistently throughout the Mondrian code base; which is a shame, because eliminating empty tuples early can do a lot for performance.

I'd also like to use the same model for native SQL generation as for other forms of expression compilation. Native SQL generation currently happens at query execution time: when the function is evaluated, it figures out whether it can possibly translate the logic (and the constraints inherited from the evaluation context) into SQL. That is currently unavoidable, because the nonEmpty flag is only available in the evaluator, at query execution time. And we need to do some work at query execution time, if only to plug in the keys of the members in the current context as predicates in the SQL statement. But I've seen several cases where we need to be smarter.

One example is 'NON EMPTY [Level].Members' that always gets translated into SQL even though the level only has two members and they are in cache. Cost-based optimization would help there.

Another example is where there are many layers of MDX functions — say Filter on top of CrossJoin on top of Filter — and these could be rolled into a single SQL statement. The right approach is to build a SQL statement by accretion, but it is too expensive to do every time the expression is evaluated.

Further, as we add more rules for recognizing MDX constructs that can turn into SQL, we will reach decision points where we choose to have to choose whether to apply rule A or rule B. Solutions are (a) using costing to decide which rule to apply, and (b) applying both rules and seeing which ultimately generates a better outcome. Neither of these solutions are suitable for query execution time: they need an optimization stage, as part of query preparation.

It's ironic, considering I've been building SQL optimizers for years (the first at Broadbase, and the second the optimizer for the Eigenbase project, which is used by both LucidDB and SQLstream) that I have avoided giving Mondrian a true query optimizer for so long. I know it's a lot of work to build an optimizer, and it's foolish to start before you know what problem you need to solve.

Don't expect to see any changes in the short term; this kind of architectural change doesn't happen fast. My struggle over the past few weeks has been a big step in seeing the big picture, and realize that the considerable pain and effort of unifying Mondrian's query planning system is justified by the potential benefits in performance.