Wednesday, January 07, 2009

Hard-won lessons in Mondrian query optimization

Mondrian is generally very smart in how it chooses to implement queries. Over the last month or so, I have learned some lessons about how hard it can be to make Mondrian smarter.

As a ROLAP engine (I prefer to call it 'ROLAP with caching'), Mondrian's evaluation strategy has always been a blend of in-memory processing, caching, and native SQL execution. Naturally there is always SQL involved, because Mondrian doesn't store any of its own data, but the question is how much of the processing Mondrian pushes down to the DBMS and how much it does itself, based on data in its cache.

The trend is towards native SQL execution. Data volumes are growing across the board, and Mondrian is being deployed to larger enterprises with larger data sets (in some cases displacing more established, and more expensive, engines). Mondrian cannot keep up with this growth by simply pulling more data into memory and throwing one or two more CPU cores at the problem.

Luckily a new breed of database engines, including Aster Data, Greenplum, Infobright, Kickfire, LucidDB, Netezza and Vertica, are helping to solve the data problem with innovative architectures and algorithms. To exploit the power of the database engine, Mondrian's ability to generate native SQL is more important than ever.

I have spent the last few weeks struggling to make Mondrian handle a particular case more efficiently. The effort was ultimately unsuccessful, but it was a case where defeat teaches you more than victory.

Here is the actual MDX query:
WITH
SET [COG_OQP_INT_s9] AS
  'CROSSJOIN({[Store Size in SQFT].[Store Sqft].MEMBERS},[COG_OQP_INT_s8])'
SET [COG_OQP_INT_s8] AS
  'CROSSJOIN({[Yearly Income].[Yearly Income].MEMBERS},[COG_OQP_INT_s7])'
SET [COG_OQP_INT_s7] AS
  'CROSSJOIN({[Time].[Time].MEMBERS}, [COG_OQP_INT_s6])'
SET [COG_OQP_INT_s6] AS
  'CROSSJOIN({[Store].[Store Country].MEMBERS},[COG_OQP_INT_s5])'
SET [COG_OQP_INT_s5] AS
  'CROSSJOIN({[Promotions].[Promotions].MEMBERS}, [COG_OQP_INT_s4])'
SET [COG_OQP_INT_s4] AS
  'CROSSJOIN({[Promotion Media].[Promotion Media].MEMBERS},[COG_OQP_INT_s3])'
SET [COG_OQP_INT_s3] AS
  'CROSSJOIN({[Store Type].[Store Type].MEMBERS}, [COG_OQP_INT_s2])'
SET [COG_OQP_INT_s2] AS
  'CROSSJOIN({[Marital Status].[Marital Status].MEMBERS}, [COG_OQP_INT_s1])'
SET [COG_OQP_INT_s1] AS
  'CROSSJOIN({[Gender].[Gender].MEMBERS},
    {[Education Level].[Education Level].MEMBERS})'
SELECT {[Measures].[Unit Sales]} ON AXIS(0),
  NON EMPTY [COG_OQP_INT_s9] ON AXIS(1)
FROM [Sales]
WHERE ([Customers].[All Customers].[USA].[CA].[San Francisco].[Karen Moreland])
The query looks a bit fearsome, but is quite likely to occur in practice as a business user slices and dices on several attributes simultaneously. The rows axis is a CrossJoin of ten dimensions, but because of the filtering effect of the slicer (combined with NON EMPTY) the query evaluates to a single row. The goal is to make Mondrian generate a SQL statement to evaluate the axis.
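To see why pushing this down matters, it helps to estimate the size of the unfiltered axis. The member counts below are illustrative guesses for the FoodMart demo schema, not exact figures, but the order of magnitude is what counts:

```python
# Rough sketch: why evaluating this rows axis in memory is hopeless.
# Member counts per level are illustrative guesses, not exact FoodMart figures.
member_counts = {
    "Store Sqft": 20,
    "Yearly Income": 8,
    "Time": 24,
    "Store Country": 3,
    "Promotions": 50,
    "Promotion Media": 14,
    "Store Type": 6,
    "Marital Status": 2,
    "Gender": 2,
    "Education Level": 5,
}

naive_tuples = 1
for n in member_counts.values():
    naive_tuples *= n  # each CrossJoin multiplies cardinalities

print(naive_tuples)
```

Even with these modest per-level counts, the CrossJoin produces hundreds of millions of candidate tuples; yet with the slicer pinned to a single customer plus NON EMPTY, the axis collapses to one row. That gap between candidate tuples and surviving tuples is exactly the work a single SQL statement could do on the database's side.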

Each way that I tried to write the logic, I ended up making decisions that invalidated other optimizations. It was difficult to make Mondrian see the big picture: that, although named sets are not supposed to inherit the context in which they are evaluated, in this case it was OK; and to recognize a complex expression (many nested CrossJoin operators, a slicer, and an implicit non-empty context) and convert the whole thing into a single SQL statement. For instance, in one attempt I succeeded in generating a SQL statement which evaluates very efficiently, but in so doing I had to let the non-empty context of the evaluator leak into places that it shouldn't... which broke quite a few existing queries, in particular queries involving calculated sets.

There are several conclusions for Mondrian's architecture. One conclusion is that we need to deal with filtering non-empty tuples as part of the expression, not as a flag in the evaluator (the data structure that contains, among other things, the set of members that form the context for evaluating an expression).

MDX has an operator, EXISTS, that specifies that empty tuples should be removed from a set. If the filter were explicit in the expression like that, we could reason about queries by applying logic-preserving transformations (just the way that an RDBMS query optimizer works), which would be safer than today's ad hoc reasoning. For example, if I am a developer implementing an MDX function and the evaluator has nonEmpty=true, am I required to eliminate empty tuples or am I merely allowed to eliminate them? (In other words, will my caller return the wrong result if I forget to check the evaluator flag?) I often forget, so I suspect that filtering of empty tuples is performed inconsistently throughout the Mondrian code base; which is a shame, because eliminating empty tuples early can do a lot for performance.
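Here is a minimal sketch of what "filtering as part of the expression" could look like. These classes and the rewrite rule are an assumed design for illustration, not Mondrian's actual internals:

```python
# Sketch (assumed design, not Mondrian's real classes): represent the
# non-empty constraint as an explicit node in the expression tree, so that
# rewrite rules can reason about it instead of consulting an evaluator flag.
from dataclasses import dataclass

@dataclass(frozen=True)
class Members:          # e.g. [Gender].[Gender].MEMBERS
    level: str

@dataclass(frozen=True)
class CrossJoin:
    left: object
    right: object

@dataclass(frozen=True)
class NonEmpty:         # explicit "drop empty tuples", like MDX's EXISTS
    child: object

def push_non_empty(expr):
    """One logic-preserving rule: NonEmpty(CrossJoin(a, b)) may also filter
    each input first. A member that is empty in the current context can only
    produce empty tuples when paired, so pruning it early is safe; the outer
    NonEmpty still removes pairs that are empty in combination."""
    if isinstance(expr, NonEmpty) and isinstance(expr.child, CrossJoin):
        cj = expr.child
        return NonEmpty(CrossJoin(push_non_empty(NonEmpty(cj.left)),
                                  push_non_empty(NonEmpty(cj.right))))
    return expr

before = NonEmpty(CrossJoin(Members("Gender"), Members("Education Level")))
after = push_non_empty(before)   # NonEmpty now also wraps each input
```

With the constraint in the tree, "required versus allowed to filter" stops being a convention that every function author must remember and becomes a property the optimizer can check and exploit.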

I'd also like to use the same model for native SQL generation as for other forms of expression compilation. Native SQL generation currently happens at query execution time: when the function is evaluated, it figures out whether it can possibly translate the logic (and the constraints inherited from the evaluation context) into SQL. That is currently unavoidable, because the nonEmpty flag is only available in the evaluator, at query execution time. And we need to do some work at query execution time, if only to plug in the keys of the members in the current context as predicates in the SQL statement. But I've seen several cases where we need to be smarter.

One example is 'NON EMPTY [Level].Members', which always gets translated into SQL even when the level has only two members and they are already in cache. Cost-based optimization would help there.
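The shape of such a cost rule is simple; the threshold and the function below are hypothetical, not an actual Mondrian setting:

```python
# Hypothetical cost rule for the case above: a tiny level whose members are
# already cached should be filtered in memory rather than via a SQL round trip.
IN_MEMORY_THRESHOLD = 1000  # assumed tuning knob, not a real Mondrian property

def choose_strategy(level_cardinality, members_in_cache):
    if members_in_cache and level_cardinality <= IN_MEMORY_THRESHOLD:
        return "in-memory"   # scan the cached members; skip the database
    return "native-sql"      # too big or not cached: let the DBMS filter

print(choose_strategy(2, True))        # the two-member level from the example
print(choose_strategy(500000, True))   # a large level still goes to SQL
```

The hard part, of course, is not the comparison but obtaining trustworthy cardinality and cache statistics to feed into it.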

Another example is where there are many layers of MDX functions — say Filter on top of CrossJoin on top of Filter — and these could be rolled into a single SQL statement. The right approach is to build a SQL statement by accretion, but it is too expensive to do every time the expression is evaluated.
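To make "accretion" concrete, here is a minimal sketch of the idea: each layer of the MDX tree contributes joins and predicates to one growing statement instead of issuing its own query. The class and the FoodMart-style table and column names are illustrative assumptions, not Mondrian's actual SQL generator:

```python
# Sketch of building one SQL statement by accretion (assumed design).
class SqlAccretion:
    def __init__(self, fact_table):
        self.fact_table = fact_table
        self.joins = []        # dimension tables pulled in by each MDX layer
        self.predicates = []   # WHERE clauses contributed by Filters / slicer

    def add_join(self, table, condition):
        self.joins.append(f"JOIN {table} ON {condition}")
        return self

    def add_predicate(self, clause):
        self.predicates.append(clause)
        return self

    def to_sql(self):
        sql = f"SELECT * FROM {self.fact_table}"
        for j in self.joins:
            sql += " " + j
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        return sql

# A Filter-over-CrossJoin-over-Filter pipeline accretes into one statement:
stmt = (SqlAccretion("sales_fact")
        .add_join("customer", "sales_fact.customer_id = customer.customer_id")
        .add_predicate("customer.fullname = 'Karen Moreland'")   # the slicer
        .add_predicate("sales_fact.unit_sales IS NOT NULL"))     # NON EMPTY
print(stmt.to_sql())
```

Doing this accretion once, at plan time, rather than on every evaluation, is precisely why it wants to live in an optimization stage.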

Further, as we add more rules for recognizing MDX constructs that can be turned into SQL, we will reach decision points where we have to choose whether to apply rule A or rule B. Possible solutions are (a) using costing to decide which rule to apply, and (b) applying both rules and seeing which ultimately generates the better outcome. Neither solution is suitable for query execution time: they need an optimization stage, as part of query preparation.

It's ironic, considering that I've been building SQL optimizers for years (the first at Broadbase, and the second the optimizer for the Eigenbase project, which is used by both LucidDB and SQLstream), that I have avoided giving Mondrian a true query optimizer for so long. I know it's a lot of work to build an optimizer, and it's foolish to start before you know what problem you need to solve.

Don't expect to see any changes in the short term; this kind of architectural change doesn't happen fast. My struggle over the past few weeks has been a big step in seeing the big picture, and in realizing that the considerable pain and effort of unifying Mondrian's query planning system is justified by the potential benefits in performance.

12 comments:

Roland Bouman said...

Hi Julian!

great to see you writing about this difficult but interesting problem.

Keep it up ;)


regards,

Roland

John Kemp said...

Julian,

Great post!

It's helped us to better understand the challenges that you are facing with a problem that I have to believe all OLAP tools must address.

Thanks,

John

Daniel Lemire said...

Thank you for this post. As a database researcher, I can appreciate the difficulties.

JVS said...

It's about time. :)

Jon said...

Glad to see ongoing work in this area. The potential is great, and we're looking to make the most of it.

We're still at the point where we're trying to get a better feel for some of the more mundane aspects of this, such as:
- how large and complex (dimensional?) can our cube get, and still be reasonably handled in Mondrian without aggregate tables?

- how well can the 2.0 Aggregate Designer handle large complex cubes (so far, it seems to work for us on simpler cases, but eats enormous amounts of memory -- and runs out -- on our more demanding cases)? Also: if the Aggregate Designer chokes, is our only alternative working through the design and implementation of perhaps a hundred or more aggregate tables, by hand, for each of our 8 to 10 cubes?

Julian Hyde said...

Jon,

how large and complex (dimensional?) can our cube get, and still be reasonably handled in Mondrian without aggregate tables?

It depends on your database. The limiting factor is the speed at which your database can do a 'select ... from fact-table group by ...' query. Absent any special tricks on the part of the database, that query requires a full scan of the fact table, and therefore its running time increases at least linearly with the size of the fact table. If that takes longer than your business users are prepared to wait, you will need aggregate tables.

how well can the 2.0 Aggregate Designer handle large complex cubes (so far, it seem to work for us on simpler cases, but eats enormous amounts of memory -- and runs out -- on our more demanding cases)

The aggregate designer is a work in progress. It's solving a hard problem (one of those your CS professor loved to talk about), and so trades off memory for running time. The search space is exponential, but the algorithm cuts a few corners by using statistical approximation rather than traversing the whole search space.

The good news is that the algorithm is open source and pluggable. You can make tweaks to the algorithm, or indeed drop in a whole new algorithm, without affecting other parts of the aggregate designer (e.g. the SQL generation or the cost analysis).

Are there any smart CS students reading this blog? (Or even better, supervisors of smart CS students!) I challenge you to come up with a better algorithm! The source code is available at sourceforge.

Also: if the Aggregate Designer chokes, is our only alternative working through the design and implementation of perhaps a hundred or more aggregate tables, by hand, for each of our 8 to 10 cubes?

Try running the aggregate designer with shorter amounts of time. The algorithm tries to come up with a reasonable approach first, then refine it. Ideally, if the algorithm runs out of memory it would return the best solution it has found so far, but clearly that isn't happening.

Julian

Jon said...

Julian - thanks for the quick reply. We tried running it again, setting the time limit to 2 minutes. But it ran for 10 minutes and then ran out of memory and crashed. Either it's ignoring the time limit, or there's an atomic operation that's crashing before it comes up for air to check run time against the limit we specified.

We just moved the process over from a dev workstation to a server, and gave the JVM 8GB max to work with (the server only has 4GB physical RAM, so most of this is virtual). We set the limits to 20 tables max (a lot less than we'll probably need in the long run, but enough for a test), and set the max time to 20 seconds. Again, the process seemed to ignore our time limits, ran for 15 minutes, and then ran out of memory.

We just raised the JVM max mem to 12GB and we're kicking off the test again.

I'm wondering if we've done something wrong in our cube definition? We *did* see that by dropping the number of rows in the fact table from 7 million to 24,000 before running the Aggregation Designer we *were* able to get it to finish -- albeit with a less than optimal solution. Seems that the fully populated fact table has a big impact on the process.

FYI: we have 28 dimensions in the cube, most have 1 level and half a dozen have 3 levels; and there are 9 measures in the cube. Half the dimensions have about 100 members, and the other half have about 1000 each.

- Jon

Julian Hyde said...

Jon, I feel your pain. I don't know what's going on. Log a bug. Contributions welcome, etc.

Jon said...

Progress!!! We didn't delve into the Aggregation Designer source code, but we tried a workaround that seems to be yielding very nice results for the moment.

We put together some automation (using Kettle of course) that assists us at various stages in the following sequence of tasks:
- we define and save the cube schema (workbench)
- run through a process that shows the designer all the dimension tables, the levels to be computed, and the number of rows in each dimension table
- allow the designer to select or de-select the levels or dimensions, so that they will be considered or not during the aggregation designer session
- create a modified (simplified) copy of the cube schema as per the prior step's selections
- run Aggregation Designer, creating the aggregate tables, updated schema and aggregation scripts
- reassemble a schema that mondrian can use with the full cube design, along with the references to the newly defined aggregate tables.

Result? Success!

Our typical client (hospital) situation involves 7 cubes, each with perhaps 20 to 30 dimensions, and fact tables in the 2 million to 7 million row range. For this initial test, we created one cube with a couple of dozen dimensions and a fact table with 1.3 million rows.

We ran through the steps described above, simplifying a copy of the schema, etc -- and Aggregate Designer created 233 aggregate tables. Populating the aggregates took PostgreSQL about 8 minutes, and we were then ready to point Mondrian to the new cube, et voila! It worked!

For the moment, Mondrian is running on a workstation, pointing to the instance of PostgreSQL which is running on an old test server (only 3GB ram, non-RAID single SATA drive -- well below the spec of our production servers). Even with this limited setup, when we point JPivot to the Mondrian instance running on a workstation, MDX query results are extremely fast, even when they include calculated measures on large result sets.

We clearly have more experimentation ahead of us, especially before we get to the 60 million row cubes we have yet to build for some new clients, but results to date are VERY encouraging!

- Jon

Jon said...

Julian -

We're over our largest hurdle now with Aggregation Designer, so we're setting that aside for a moment to consider alternatives.

We've begun to look at the idea of keeping PostgreSQL for our transactional database, and even considering keeping it for the entire ETL process that builds our fact tables. It's at this last step that we're considering replicating the fact tables to a column-oriented DBMS. Since we're doing our best to avoid license fees, we're staying away from the Verticas and Netezzas in the market, and looking at:
- Infobright (ICE)
- LucidDB
- MonetDB

MonetDB sounds fast -- but my impression is that it's the least mature of the bunch. I like leading edge, but tend to back off a bit from the bleeding edge.

Infobright / ICE sounds like a solid product, but it's a pity there's no DML in ICE. We *could* simply create a pure read-only data warehouse, but it might be a bit limiting. Not sure yet.

That leaves LucidDB. I've already read a couple of the pieces about LucidDB and Pentaho (e.g., LucidDBAggregateDesigner, Kettle integration, etc).

Curious to know if you (or anyone reading this) has compared the three above, and would be even more interested to hear experiences of people who use them with Pentaho.

Thoughts / comments / replies most welcome!

- Jon

raveeeeeeeee said...

Interesting and Innovative.....

Thanks..

venkatesh said...

hi, I am facing one problem in MDX. Actually I don't know how to write an MDX query. My problem is how to write an OR condition in an MDX query.

WITH
set[~one] as {[idPaymentAccount.idPaymentAccount_Hierarchy].[3]}
set [~two] as {[serviceafterlife.serviceafterlifeHierarchy].[3]}

SET [~ROWS] AS
{[TimeDimension.By_week].[year].Members}
SELECT
NON EMPTY {[Measures].[authorizationBreakValue2], [Measures].[priceSupplier]} ON COLUMNS,
NON EMPTY [~ROWS] ON ROWS
FROM [ServiceAfterLife_all_1]
where crossjoin([~one],[~two])


This is my MDX. In the WHERE condition, how do I write an OR condition in MDX?
thank u