Sunday, April 05, 2009

A what-if scenario: musing about adding writeback capability to Mondrian

The Pentaho Partner Summit last week was a great chance to meet people who are using — and being successful — with Mondrian.

As always, people are thinking of using it in ways that I hadn't imagined. A couple of comments got me thinking about adding writeback support, something we'd long talked about, but seriously considered implementing.

Writeback allows the OLAP end-user to modify cell values and see the effects ripple through their spreadsheet. As you can imagine, it is useful for doing what-if analysis, especially budgeting.

If the cell is a sum of finer-grained cells, we need to modify those finer-grained cells also, and all of the totals of other dimensionalities created from those finer-grained cells, otherwise things just don't add up. This is hard to implement, because you sometimes need to modify a lot of cells, and even harder for ROLAP engines like Mondrian, because such engines don't store cells, they read directly from the unaggregated fact table.

First, I went looking for existing APIs for writeback.

Microsoft offers support for writeback via the UPDATE CUBE MDX statement. As always with Microsoft's MDX support, it's difficult to tell whether this is 'standard MDX', but the command seems to be well thought-out. The fact that it is an MDX command rather than an API call allows them to use an MDX expression as the rule by which to pro-rate changes to child cells.

I also looked at the JPalo Java API. (I've always wanted to work more closely with Palo. Although they're an OLAP engine, they have a different architecture (C and MOLAP) and core target audience (Excel users), and they're open source, so I see a lot of benefit to them and us if we pool resources. I invited them to join the olap4j process early on, but they preferred to define their own Palo-specific Java API. I'm still hopeful.)

I downloaded their most recent release from SourceForge and found that it was a release out of date (2.0 versus 2.5) and didn't contain the source code. There is a more up-to-date version in subversion. In DbConnection I found the setDataNumericSplashed method:

 /**
  * Sets the given double value at the specified cell.
  * The splashMode paramater is only important for consolidated cells and
  * determines how the value is scattered among the consolidated elements.
  * Please use the defined class constants for valid values. Although more
  * modes are currently defined only three are supported, namely:
  * SPLASH_MODE_DEFAULT, SPLASH_MODE_BASE_SET and SPLASH_MODE_BASE_ADD
  * @param cube {@link CubeInfo} representation
  * @param coordinates {@link ElementInfo} representations which specify the
  * coordinates
  * @param value the new value
  * @param splashMode the splash mode, use defined class constants
  */
 public void setDataNumericSplashed(CubeInfo cube, ElementInfo[] coordinate, double value, int splashMode);
I couldn't find any more documentation than that, but 'splash mode' seems to be equivalent to Microsoft's update strategies USE_EQUAL_ALLOCATION etc.

There are several remaining questions. What are the right changes to the olap4j API to support writeback? Support for the UPDATE CUBE statement is the leading contender. I'd love to hear what the olap4j community — especially the folks building the Pentaho Analysis Tool — think of this API, and how they would expose writeback in their UI.

I presume we'll need a scheme for transaction management. End-users will want to save their work, come back another day and continue where they left off. Several end-users might be using Mondrian at the same time, and want to see their numbers, not anyone else's. So, I think we'll need to introduce a concept I'd call a 'scenario', which is a property of a connection and can be persisted.

We'll need to figure out how to implement writeback within a Mondrian's ROLAP-with-caching architecture. Writing to the fact table is not tenable, because the modified cells can be of a multitude of dimensionalities. Neither is writing to an aggregate table, for the same reason. Ideal would be to write to disk a minimal description of the cells the user has modified — in XML, say — and do the other magic in the caching layer.

Lastly, I just need to find time to implement it.

5 comments:

Ethan Jewett said...

Julian,

This would be pretty cool. I'm trying to get time to try out Mondrian (coming from SAP BW world). In our planning projects we find it useful to be able to write back to the nodes themselves (I'd call this a postable-node hierarchy) rather than allocating out the full value to the constituent cells at write-time. This allows us to plan at a different level than actuals and also leaves room for a lot of flexibility around the allocation algorithm that we chose to use (or not use).

Unfortunately, I'm not optimistic that MDX supports the concept of postable-node hierarchies, so this behavior has to be modeled by creating a dummy-child for each node and writing back the value to that node.

It's definitely a complicated area, but the first step is a working implementation.

Ethan

Julian Hyde said...

Ethan,

I don't fully understand your postable-node hierarchy scenario, but I wonder whether it would be possible to write attributes of the members of the parent-child hierarchy (just as current salary is an attribute of an employee) and have calculated members derived from that property. That would be fairly easy to do, even without writeback support.

If the data values depend on a subset of the hierarchies -- for example, exchange rate depends on currency and time -- then use a cube, and join it with the rest of the data using a virtual cube.

As long as you are writing values that correspond to column values in a dimension or fact table, things are straightforward. You just need to flush the cache (or relevant part of the cache) to see the values.

Julian

Ethan Jewett said...

Perhaps I'm the one who's not following what you are proposing.

I had in mind a planning or budgeting scenario that involves writeback of data to a cube, and I was specifically commenting on the problem of writing back data to a cell that is made up of the aggregate of finer-grained values.

To take a simple example, say you have a two-level hierarchy in the product dimension that has individual products rolling up to brands (products are the children, brands are the parents). Revenue numbers will come in at the product level ($100 for Widget A, $200 for Widget B), but a company will often want to plan revenue at the brand level (plan $400 in yearly revenue for Widgets).

In the case of planning, it may be advantageous to hold off on allocating out the plan at the brand level down to the product level (base members), as we may want to use alternative allocations, or there may not be any advantage in allocation at all so we can just avoid a tough problem.

Now I'm not sure that this sort of scenario is what you were talking about at all :-)

It sounds like what you are talking about in your comment may be writing back to dimensions and changing attributes of dimension members for what-if scenarios (employee salary, for example). In this case I don't understand how the roll-up of finer-grained cells is a concern.

MS's docs make the distinction between cube and dimension writeback here: http://msdn.microsoft.com/en-us/library/ms144770.aspx

Martin Stangeland said...
This comment has been removed by the author.
Martin Stangeland said...

ulian,

I'm really glad I saw your post. Like Ethan I come from a SAP BW world where this functionality allows the planner to write transactional data back to the cube. I've been missing this functionality as part of our OSBI offering.

In the TDWI market report thay present an overview of BI functionality. Budgeting and planning is one of the main categories.
I really think this functionality would move OSBI even closer to proprietary BI offerings. This functionality would give us an even stronger argument for OSBI. OSBI could offer a "Closed-Loop" in Corporate Performance Planning/Monitoring just like the biggest BI vendors.

The planning process require a top-down approach as well as bottom-up approach.

Top-Down approach:Allow high-level planners to set target levels on low granularity. Publish planning numbers and have second-level planners to do the finer planning, increasing granularity. Alternatively implement distribution functions the automatically distribute values. As this is a planning functionality we do not need to distribute it to the lowest granularity we post data in production systems.
Bottom-up approach:Either an iterative part of the top-down approach or one could start the planning process by allowing planners to register their expected target values on their planning levels and present these to the higher level planners.

I will be following your writeback posts with great interest!

Martin