Tuesday, June 15, 2010

OLAP change notification, and the CellSetListener API

There has been an interesting design discussion on the olap4j forums about how an OLAP server could notify its client that the data set has changed. It is exciting because it would allow us to efficiently update OLAP displays in real-time.

We came up with an API, at the center of which is the new interface CellSetListener, which I have just checked into olap4j's subversion repository. (The API is experimental. That means you shouldn't expect to find a working implementation just yet, or assume that the API won't change radically before it is finalized, but it does mean we are still very much open to suggestions for improvements.)

Of course, OLAP notifications are a subject close to my heart, because they bring together my interests in SQLstream and mondrian. 'Push-based' computing is challenging, because every link in the chain needs to propagate the events to the next link. In a previous post I described how SQLstream could do continuous ETL, populate fact and aggregate tables incrementally, and notify mondrian that data items in its cache were out of date.

A mondrian implementation of the CellSetListener API would cause mondrian to internally re-evaluate all queries that have listeners and cover an affected area of the cache. If the results of those queries changed, mondrian would transmit those notifications to OLAP client applications such as Pentaho Analyzer or PAT. The client application would then change the value of the cell on the screen, and maybe change the cell's background color momentarily to attract the user's attention.

Getting data to change on the screen, in front of the end-user's eyes, within seconds of the data changing in the operational system, would be truly spectacular.

There are several links in the chain to make that happen. Two of the links, SQLstream and mondrian's cache control API, are already complete. We've just begun forging the next link.

9 comments:

Daniel Lemire said...

Typically, OLAP assumes static data and uses materialization to speed up results. One might say that's almost part of the definition of OLAP.

So, how do you avoid performance pitfalls?

Julian Hyde said...

Speaking as the designer of the olap4j API, I don't need to avoid the performance pitfalls; I just provide an API so that ingenious OLAP engine designers can provide the latest data.

But speaking as the designer of the Mondrian engine, here's how I'd do it. By the way, I do not agree that static data and materialization are part of the definition of OLAP. OLAP is a multidimensional view of data, and short query response times. Static data, materialization, and for that matter star schemas, have been a pragmatic best practice in achieving those goals but I don't see them as essential.

Materialization has a high cost if it occurs on disk. If there are N materialized views on top of each row in a fact table, and they are stored on disk, then updating that row will occur N+1 disk block writes.

If you batch up updates over a period of time, then updates for multiple rows will tend to hit the same disk blocks, and you can save some effort.

But the story is radically different if the materialization occurs in memory. Memory writes do not have a significant cost (unless you are processing hundreds of thousands of updates per second) so you do not need to worry about keeping N, the number of aggregates small. You can have lots of materializations of the fact table, at different granularities, and the main cost is memory.

Results from an OLAP engine are always materialized in memory at least once; namely the cell from which the OLAP engine generates its result. If the transactional system notifies the OLAP engine of each transaction, the OLAP engine can modify its cache and all aggregations in it.

This picture seems to work nicely for MOLAP, where all data is stored in multidimensional format, and less well for ROLAP, where the primary source of data is relational and data is stored in multidimensional format, in memory, only fleetingly.

But it is not relational format that makes materialized aggregations start to perform poorly; it is storage on disk. If a MOLAP engine has more data in multidimensional arrays than it has memory, those updating those arrays will require a write to disk, and the only efficient way to do that is to batch up many updates into the same disk write. That implies that what is on disk has to lag behind the latest transactional state.

Putting all this together, we arrive at an architecture where the OLAP engine has a cache that is up to date with the transactional system, and contains multiple in-memory aggregates at different granularities. If it is a ROLAP system, like Mondrian, then the data ultimately comes to rest in the relational database, but the relational database will lag behind the latest transactional state. The OLAP engine serves as a view for the application, serving up the most up to date data, because it knows which parts of the system are up to date.

Daniel Lemire said...

I do not agree that static data and materialization are part of the definition of OLAP.

Agreed but Jim Gray's data cube is all about materialization (go back and check his original data cube paper).

While this is not Mondrian's business, you can probably get a lot of work done just by keeping the fact table in RAM, without pre-materializing permanently a lot of views. Qlikview seems to be doing it well.

Papa said...

I find this post rather exciting, as real-time olap was the initial decision for our ActivePivot software ( http://www.quartetfs.com/activepivot ).

We materialize low level aggregates in the main memory, index them with compressed bitmaps also in the main memory so that higher level aggregates can be computed in milliseconds. The result is a set of fully incremental data structures that support fast throughputs of transactionnal updates and can be queried multidimensionnally. And a subscription engine is built on top of that to register MDX queries and push updated cells to subscribers.

We implement MDX and XMLA but had to develop some proprietary push protocol, along with a custom web frontend to show those green/red blinking cells. Now it looks like we may incorporate this layer behind Olap4J interfaces, a library used by several of our customers.

It is good to see a significant step towards real-time OLAP.



I will also use this comment as an opportunity to greet and thank you Julian for the continuous innovation and excellence of the Mondrian project, and Daniel for your inspiring work on bitmap indexes.

Julian Hyde said...

@Papa:

Thanks for the kind words. I'd be grateful if you could carefully review the proposed olap4j API. Since you have already implemented this support, you probably have a good idea what works.

Also, can you drop me an email -- I'd like to see if there are any other ways we can work together.

Julian

Daniel Lemire said...

@Papa (and @Julian)

I've convinced a student to start working on real-time OLAP in the context of proposed API upgrade.

Any chance you might share prototypical code? Or anything at all?

I don't think she'll have time to produce a full prototype, but she might be able to run some experiments and produce some feedback.

Without any software, her work might be a bit more conceptual and maybe less useful.

Julian Hyde said...

Daniel,

The olap4j API is checked in already.

I was thinking of a simple (brute force) implementation in mondrian (at first). For each statement being watched, store the results on the server side. Each time there is a notification from the CacheControl API, re-execute these statements, compare the results, and notify the client if there is a difference. For extra credit, compute that difference.

Plenty of ways to improve this implementation... each of which could be a paper...

Julian

Daniel Lemire said...

The olap4j API is checked in already.

Yes. Which is why I think there is a enough meat for an academic project.

Antoine CHAMBILLE said...

Julian we gave a first look at the new Olap4J real-time interfaces. We obviously need to spend more time and possibly implement those in an extension of the XMLA driver implementation. But it looks fine, more ambitious actually than what we currently do. We currently support updates of cells within the current cell set, but when the cell set boundaries change (like a new member appearing on an axis) we reexecute the query.

I want to get back to you next week when I am back from my current Hong-Kong business trip.



Daniel, ActivePivot is a commercial software and I cannot freely drop code like you mentionned. But we are very interested in working on practical projects/evaluations with academic partners. I am myself based in Paris and already have partnerships going on with Supélec and Télécom Paris french engineering schools. And QuartetFS also has offices in London, New-York and Singapour.

Please send me an email at my QuartetFS address if you want to talk about that ( ach@quartetfs.com )



By the way my name is Antoine ;) I don't remember why "Papa" was my registered nickname in blogger.