Saturday, January 14, 2012

Changes to Mondrian's caching architecture


I checked in some architectural changes to Mondrian's cache this week.

First the executive summary:

1. Mondrian should do the same thing as it did before, but scale up better to more concurrent queries and more cores.

2. Since this is a fairly significant change in the architecture, I'd appreciate it if you kicked the tires, to make sure I didn't break anything.

Now the longer version.

When we introduced external caches in Mondrian 3.3, we knew we were putting a strain on the caching architecture. That architecture had needed modernization for a while, and external caches made matters worse. First, a call to an external cache can take a significant amount of time: depending on the cache, it might perform network I/O, and so take several orders of magnitude longer than a memory access. Second, alongside external caching we introduced in-cache rollup, and both features forced us to beef up the in-memory indexes needed to organize the cache segments.

Previously we used a critical-section approach: any thread that wanted to access an object in the cache locked out the entire cache. As the cache data structures became more complex, those operations took longer. To improve scalability, we adopted a radically different architectural pattern, called the Actor Model. Basically, one thread, called the cache manager, is dedicated to looking after the cache index. Any query thread that wants to find a segment in the cache, add a segment to the cache, create a segment by rolling up existing segments, or flush the cache sends a message to the cache manager.
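In code, the pattern looks something like the sketch below: a minimal, hypothetical actor whose message interface and class names are mine, not Mondrian's.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A minimal sketch of the actor pattern; names are illustrative.
class CacheManagerActor implements Runnable {
    interface Message {
        void process(Map<String, Object> index);
    }

    private final BlockingQueue<Message> queue =
        new LinkedBlockingQueue<Message>();
    // Owned exclusively by the actor thread, so a plain HashMap suffices;
    // no synchronization, no concurrent collections.
    private final Map<String, Object> segmentIndex =
        new HashMap<String, Object>();

    // Query threads call this; it is the only way into the actor.
    public void send(Message message) {
        queue.add(message);
    }

    public void run() {
        try {
            for (;;) {
                // Messages are handled one at a time, on this thread only.
                queue.take().process(segmentIndex);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Lookups, additions, rollups and flushes are then just different implementations of Message, and the query thread blocks (or goes off and does other work) until the reply comes back.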

Ironically, the cache manager itself does not get segments from the external cache. As I said earlier, external cache accesses can take a while, and the cache manager is super-busy. Instead, the cache manager tells the client which segment key to ask the external cache for, and the client does the asking. When a client gets a segment, it stores it in its private storage (good for the duration of a query) so it doesn't need to ask the cache manager again. Since a segment can contain thousands of cells, even large queries typically make only a few requests to the cache manager.
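The query thread's side of that protocol might look roughly like this; ExternalCache, SegmentKey and SegmentBody are stand-ins for the real types:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a query thread's cache client.
class QueryThreadCacheClient {
    interface ExternalCache {
        SegmentBody get(SegmentKey key); // may involve network I/O
    }
    static class SegmentKey {}
    static class SegmentBody {}

    private final ExternalCache externalCache;
    // Private, query-scoped storage: once fetched, a segment is reused
    // for the rest of the query without another round trip.
    private final Map<SegmentKey, SegmentBody> local =
        new HashMap<SegmentKey, SegmentBody>();

    QueryThreadCacheClient(ExternalCache externalCache) {
        this.externalCache = externalCache;
    }

    // Called with a key obtained from the cache manager.
    SegmentBody fetch(SegmentKey key) {
        SegmentBody body = local.get(key);
        if (body == null) {
            body = externalCache.get(key); // the slow call happens here,
            if (body != null) {            // not on the manager thread
                local.put(key, body);
            }
        }
        return body;
    }
}
```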

The external cache isn't just slow; it is also porous. It can have a segment one minute and forget it the next. The Mondrian query thread that gets the cache miss tells the cache manager to remove the segment from its index (so Mondrian doesn't ask for it again), and formulates an alternative strategy to find the data. Maybe the required cell exists in another cached segment; maybe it can be obtained by rolling up other segments in the cache (though they, too, could have gone missing without notice). If all else fails, we can generate SQL to populate the required segment from the database (a fact table or, if possible, an aggregate table).
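That fallback chain might be sketched as follows; the strategy names are hypothetical, standing in for logic that in reality is spread across several components:

```java
// Illustrative sketch of miss handling after the cache manager has been
// told to drop the lost segment from its index.
class CellLoader {
    interface Strategy {
        SegmentBody attempt(SegmentKey key); // returns null on failure
    }
    static class SegmentKey {}
    static class SegmentBody {}

    private final Strategy findInOtherCachedSegment;
    private final Strategy rollUpCachedSegments;
    private final Strategy generateSql; // fact table or aggregate table

    CellLoader(Strategy find, Strategy rollUp, Strategy sql) {
        this.findInOtherCachedSegment = find;
        this.rollUpCachedSegments = rollUp;
        this.generateSql = sql;
    }

    SegmentBody load(SegmentKey key) {
        SegmentBody body = findInOtherCachedSegment.attempt(key);
        if (body == null) {
            // Donor segments may themselves have gone missing.
            body = rollUpCachedSegments.attempt(key);
        }
        if (body == null) {
            body = generateSql.attempt(key); // last resort: hit the database
        }
        return body;
    }
}
```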

Since the cache manager is too busy to talk to the external cache, it is certainly too busy to execute SQL statements. From the cache manager's perspective, SQL queries take an eternity (several million CPU cycles each), so it farms them out to a pool of worker threads and marks the segment as 'loading'. If another query thread asks the cache manager for a cell that would be in that segment, it receives a Future&lt;SegmentBody&gt; that will be populated as soon as the segment arrives. When the segment arrives, the query thread pushes it into the cache and tells the cache manager to update the segment's state from 'loading' to 'ready'.
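Using java.util.concurrent, the dispatch might be sketched like this (an illustrative class, not Mondrian's actual code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: the cache manager farms SQL loads out to worker threads and
// hands waiting query threads a Future.
class SegmentLoadDispatcher {
    static class SegmentKey {}
    static class SegmentBody {}

    private final ExecutorService sqlWorkers =
        Executors.newFixedThreadPool(10);
    // While a load is in flight, later requests for the same segment get
    // the same Future instead of triggering a second SQL query.
    private final Map<SegmentKey, Future<SegmentBody>> loading =
        new HashMap<SegmentKey, Future<SegmentBody>>();

    // Called only on the cache-manager thread, so no locking is needed.
    Future<SegmentBody> requestLoad(
        SegmentKey key, Callable<SegmentBody> sqlLoad) {
        Future<SegmentBody> future = loading.get(key);
        if (future == null) {
            future = sqlWorkers.submit(sqlLoad); // segment is 'loading'
            loading.put(key, future);
        }
        return future;
    }

    // Sent as a message once the query thread has pushed the segment into
    // the external cache: state goes from 'loading' to 'ready'.
    void markReady(SegmentKey key) {
        loading.remove(key);
    }
}
```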

The Actor Model is a radically different architecture. First, let's look at the benefits. Since one thread manages the entire subsystem, you can remove all locking. This is liberating. Within the subsystem, you can code things very simply, rather than perverting your data structures for thread-safety. You don't even need concurrency-safe data structures like CopyOnWriteArrayList; you can just use the fastest data structure that does the job. Once you remove concurrency controls such as 'synchronized' blocks and access the structure from only one thread, it becomes miraculously faster. How can that be? The data structure now stays resident in that thread's CPU cache, and removing the concurrency controls also removed the memory barriers that forced every change to be written through the L1 and L2 caches to RAM, which is up to 200 times slower.
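An illustrative before-and-after (not actual Mondrian code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Before: safe for many threads, but every write copies the whole array
// and every access crosses a memory barrier.
class SharedIndex {
    final List<String> segmentNames = new CopyOnWriteArrayList<String>();
}

// After: touched only by the cache-manager thread, so no barriers and no
// copying; the data stays hot in that thread's CPU cache.
class ActorOwnedIndex {
    final List<String> segmentNames = new ArrayList<String>();
}
```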

Migrating to the Actor Model wasn't without its challenges. First of all, you need to decide which data structures and actions should be owned by the actor. I believe we got that one right. I found that mostly the same things needed to be done, but by different threads than before; so the task was mainly about moving code around. We needed to refine the data structures passed between "query", "cache manager" and "worker" threads, to make sure they were immutable. If, for instance, you want the query thread to find other useful work to do while it is waiting for a segment, it shouldn't be modifying a data structure that it has put into the cache manager's request queue. In a future blog post, I'll describe in more detail the challenges and benefits of migrating one component of a complex software system to the Actor Model.
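For example, a request message might look something like this hypothetical class:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable message, safe to drop into the cache manager's
// request queue while the sending thread goes off to do other work.
final class SegmentRequest {
    private final String measure;
    private final Map<String, String> coordinates;

    SegmentRequest(String measure, Map<String, String> coordinates) {
        this.measure = measure;
        // Defensive copy, so no thread can mutate the message after it
        // has been queued.
        this.coordinates = Collections.unmodifiableMap(
            new HashMap<String, String>(coordinates));
    }

    String getMeasure() { return measure; }
    Map<String, String> getCoordinates() { return coordinates; }
}
```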

Not all caches are equal. Some, like JBoss Infinispan, can share cache items (in our case, segments containing cell values) between nodes in a cluster, and use redundancy to ensure that cache items are never lost. Infinispan calls itself a "data grid", which I at first dismissed as mere marketing, but I became convinced that it is genuinely a different kind of beast from a regular cache. To support data grids, we added hooks so that a cache can tell Mondrian about segments that have been added to other nodes in a cluster. This way, Mondrian becomes a genuine cluster: if I execute query X on node 1, it will put segments into the data grid that will make the query you are about to submit, query Y on node 2, execute faster.
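Conceptually, the hook is a listener along these lines (an illustrative interface, not the actual SPI):

```java
// Illustrative hook by which an external cache (e.g. a data grid such as
// Infinispan) can tell Mondrian about segments created or deleted on
// other nodes, so each node can keep its cache index up to date.
interface SegmentCacheListener {
    enum EventType { ENTRY_CREATED, ENTRY_DELETED }

    void handle(EventType eventType, String segmentKey);
}
```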

As you can tell from the enthusiastic length of this post, I am very excited about this change to Mondrian's architecture. Outwardly, Mondrian executes MDX queries just as it ever did. But the internal engine scales better when running on a modern CPU with many cores; the cache behaves much more predictably when external caches are in use; and you can create clusters of Mondrian nodes that share their work and memory.

The changes will be released soon as Mondrian version 3.4, but you can help by downloading from the main line (or from CI), kicking the tires, and letting us know if you find any problems.

[Edited 2012/1/16, to fix version number.]

4 comments:

rossjudson said...

Years ago I adopted a similar approach to query execution, in a sort-of OLAP execution engine. Each query would be recursively broken down into smaller components and added to a "request" queue.

There was a multi-part loop for optimization -- performing both breakdown of query components into smaller components, and also sometimes stitching multiple components back together into larger ones so that a single query to the database could be used. Every query component had "coordinates" that described what was needed, so a large data-flow graph could share as required, across items.

Once small enough to be actionable, query components would be diverted off into queues appropriate to executing that kind of component. Database-sourced information, in particular, would end up on separate threads. This allowed a lot of computation to take place while the database stuff was running.

The whole thing is referred to as an eddy-query optimizer, I think.

On the whole it worked pretty well, and certainly executed most of the query model as quickly as possible.

I like the twist you've got in there -- using the data grid as a caching tier, and sharing that across many instances of Mondrian.

As I recall the trickiest part was figuring out how to incorporate real-time data into efficient updates to the cache, and doing that incrementally off of real-time feeds.

Julian Hyde said...

The version number in my original post was wrong. These changes will be released as Mondrian version 3.4. I have corrected the blog post.

Vladimir Rodionov said...

Why would you not delegate cache concurrency control to the cache provider itself? Many things can be done more efficiently there. You have just single-threaded a performance-critical path.

Julian Hyde said...

Vladimir,

The cache provider doesn't know much about what it is caching. It doesn't know how to create a segment that Mondrian wants but that is not currently in the cache (by querying the database or rolling up other cached segments).

So, Mondrian needs other subsystems to do that.

The data structures that maintain the correspondence among the subsystems (pending SQL queries, and an index of segments so that we can find cells or create segments by rolling up finer-granularity segments) need to be synchronized, and they live outside the cache.

I don't think caching providers provide a way of managing concurrency between themselves and other subsystems.

Julian