Julian Hyde on Streaming Data, Open Source OLAP. And stuff.: 04/01/2009

Thursday, April 30, 2009

How Mondrian evaluates expressions

When it comes to expression evaluation, Mondrian keeps things simple. It doesn't tend to cache the results of expressions, but calculates them each time they are evaluated. Eventually the calculation tunnels through all calculated members and ends up at an atomic cell. Atomic cells are retrieved from the database, and stored in the cell value cache, so they are only calculated once.

By the way, an atomic cell is not necessarily at the lowest level of the hierarchy; Mondrian would prefer to load cells at a coarse granularity, and leave the hard work of aggregating values to the database, or even better, an aggregate table. And Mondrian does its best to retrieve atomic cells in batches. It gathers together requests for lots of cells of the same granularity and generates a single SQL statement to retrieve them all at once.

Mondrian's 'keep it simple' scheme comes unstuck when a particular calculation is repeated many times over. Nick Goodman came up with a classic example of this in bug MONDRIAN-552. The query is as follows:

with member [Measures].[Profit Change] as
   ([Measures].[Profit], [Time].CurrentMember)
   - ([Measures].[Profit], [Time].PrevMember)
member [Measures].[Running Total] as
   ([Measures].[Profit], [Time].CurrentMember)
   + ([Measures].[Running Total], [Time].PrevMember)
member [Measures].[Average Daily Running Total] as
   Avg(
       Descendants(
           [Time].CurrentMember, [Time.Weekly].[Day])
      [Measures].[Running Total])
select
   {[Measures].[Profit Change],
     [Measures].[Running Total],
     [Measures].[Average Daily Running Total]} ON COLUMNS,
   {[Time.Weekly].[Week].Members} ON ROWS
from [Sales]

Note how [Measures].[Running Total] is recursive. The running total for week 3 is defined as the running total for week 2 plus the profit for week 3. To calculate the average running total for week 99, Mondrian computes profit for the first 99 weeks and to calculate the average running total for week 100, Mondrian computes profit 100 for the first 100 weeks. There's lots of wasted effort: Mondrian has computed profit 50,000 times when it could have done it just 100 times and cached the results.

The solution is simple: wrap the calculation for [Measures].[Running Total] in the Cache() function, and Mondrian will compute the value only once.

You will see that in the bug I come up with a couple of proposals for making Mondrian better. I don't think Mondrian should automatically cache every expression, because caching costs time and memory, and most expressions are only evaluated once or twice. And by the way, you should use the Cache function sparingly, for the same reason.

But it would be nice if Mondrian could automatically detect some cases where expression caching is desirable. The proposed 'cache' property of a calculated member would have three values: 0 (never cache), 1 (always cache) and null (Mondrian should use its best judgment). Most calculated members would leave the caching up to Mondrian, so we would need to come up with a simple, effective rule that governs caching before we implemented this feature. What do you think the rule should be?

Monday, April 27, 2009

PAT 0.2

PAT (Pentaho Analysis Tool) project renews my faith in open source. A team of folks from Pentaho's community have got together and are cooking up a new UI for Mondrian. Due to the magic of olap4j, it will work against other OLAP engines too.

They just released version 0.2, and Tom's release notes are an amusing and informative history of the project. Download from the project home page.

Friday, April 10, 2009

Formatting MDX as plain text

When Mondrian tools output MDX results as text, such as in the cmdRunner utility, we've been using the same old crappy format for years. For example, the query

select
  crossjoin(
    {[Time].[1997].[Q1],  [Time].[1997].[Q2].[4]},
    {[Measures].[Unit Sales], [Measures].[Store  Sales]}) on 0,
  {[USA].[CA].[Los Angeles],
   [USA].[WA].[Seattle],
   [USA].[CA].[San Francisco]} on 1
FROM [Sales]

is formatted as

Axis #0:
{}
Axis #1:
{[Time].[1997].[Q1], [Measures].[Unit  Sales]}
{[Time].[1997].[Q1], [Measures].[Store  Sales]}
{[Time].[1997].[Q2].[4], [Measures].[Unit  Sales]}
{[Time].[1997].[Q2].[4], [Measures].[Store Sales]}
Axis #2:
{[Store].[All Stores].[USA].[CA].[Los Angeles]}
{[Store].[All  Stores].[USA].[WA].[Seattle]}
{[Store].[All Stores].[USA].[CA].[San  Francisco]}
Row #0: 6,373
Row #0: 13,736.97
Row #0: 1,865
Row #0: 3,917.49
Row #1: 6,098
Row #1: 12,760.64
Row #1: 2,121
Row #1: 4,444.06
Row #2: 439
Row #2: 936.51
Row #2: 149
Row #2: 327.33

I've just checked in an alternative formatter that makes the result look more like a pivot table. The same query would come out like this:

                     1997       1997        1997        1997
                     Q1         Q1          Q2          Q2
                                            4           4
                     Unit Sales Store Sales Unit Sales Store Sales
=== == ============= ========== =========== ========== ===========
USA CA Los Angeles   6,373      13,736.97   1,865      3,917.49
USA WA Seattle       6,098      12,760.64   2,121      4,444.06
USA CA San Francisco 439        936.51      149        327.33

Two questions:

1. Should we move this code into the olap4j code base? (It would seem to make sense because it doesn't require any mondrian internals to do the job, and the processing requires a 'grid model' similar to query models already part of olap4j. But I don't want to 'dump' code that is not generally useful.)

2. What do people feel is the ideal format for formatting MDX results as text? As a starting point, another couple of possible formats are below.

"Oracle" format

                     1997
                     Q1                      Q2
                                            4
                     Unit Sales Store Sales Unit Sales Store Sales
=== == ============= ========== =========== ========== ===========
USA CA Los Angeles        6,373   13,736.97      1,865    3,917.49
    WA Seattle            6,098   12,760.64      2,121    4,444.06
    CA San Francisco        439      936.51        149      327.33

"MySQL" format

|                          | 1997                                                |
|                          | Q1                       |  Q2                      |
|                          |                          | 4                        |
|                          | Unit Sales | Store Sales | Unit Sales | Store Sales |
+-----+----+---------------+------------+-------------+------------+-------------+
| USA | CA | Los Angeles   |      6,373 |   13,736.97 |      1,865 |   3,917.49  |
|     | WA | Seattle       |      6,098 |   12,760.64 |     2,121  |    4,444.06 |
|     | CA | San Francisco |        439 |     936.51  |        149 |      327.33 |

Tuesday, April 07, 2009

The last MDX holdout folds, but true OLAP interop is still a long way off

Oracle, the last major OLAP vendor to embrace MDX, has finally added MDX support to its server. The MDX Provider for Oracle OLAP, developed in partnership with Simba, implements the OLE DB for OLAP API and the MDX query language, and went beta this week.

The most obvious application of this technology, and I'm sure the initial revenue driver, will be to allow end-users to use Excel 2007 as their client for slicing and dicing.

Simba's architecture diagram shows the MDX provider loaded onto the same machine as the Excel client. It wouldn't seem technically difficult to run the MDX provider as a server, and have multiple clients connect via OLE DB for OLAP or via XML for Analysis. (Licensing may be a different matter.)

This announcement means that now it is possible to talk MDX to every major OLAP server. (Are there any OLAP servers that do not speak MDX? I can't think of any.) The OLAP market has moved very slowly towards standardization, but this is a significant moment, even a tipping point. In a conversation five years ago, Oracle executives agreed that MDX was a fine language, but said they would not support it, because that would be to acknowledge that Microsoft was the thought-leader in the OLAP marketplace. It's that old PR strategy: deny in public, agree in private. And in a sense their strategy worked, because without a standard language, the OLAP market could not begin to commoditize.

There is still a long way to go towards OLAP interoperability. Servers differ widely in their support of MDX. Unlike SQL, the MDX language is not in the hands of an independent standards organization; even the originators of the de facto standard, Microsoft, have not released a specification for MDX or XMLA for several years.

A query language is no good without an API to issue queries, and APIs only exist in Microsoft's own technologies: COM (OLE DB for OLAP), .NET (adomd.net) and web services (XMLA).

I have been advocating olap4j as the standard API for Java-based OLAP, but it has yet to receive public backing from vendors outside the open source community. And there are no OLAP APIs for languages such as python, perl, and php.

The final point of concern is the emergence of Simba as virtually the sole supplier of MDX, OLE DB for OLAP and XMLA technology. Simba is an excellent company, who understand MDX very well, and have invested in building a technology stack. But they also benefit from a close relationship with Microsoft. (Remember those specifications for MDX and XMLA I referred to earlier? Though they have not seen public updates for several years, I'm sure those specifications still exist behind the walls of Castle Redmond, and are available to Microsoft's partners.)

As far as I am aware, Simba have been responsible for all of the projects in the last few years to bolt MDX support on to existing servers and applications. (With a sole exception: I was never able to find out where JasperSoft sourced the technology for its ODBO Connect product.)

To summarize, this is a milestone moment in the development of OLAP technology, but there is still cause for concern. OLAP APIs exist only for a small number of languages, vendors show little inclination to provide true interoperability, and the key technology is provided by a small number of players.

You can help. If you are a user of OLAP technology it is in your interests to see the emergence of standards in the OLAP marketplace. So, please ask your vendor what they are doing about interoperability. Ask them whether there are OLAP clients, other than their own, that run on their server. And ask them for APIs to connect to their server from all of the languages you use in your organization. Then, we may move a little closer to the goal of OLAP for all.

Sunday, April 05, 2009

A what-if scenario: musing about adding writeback capability to Mondrian

The Pentaho Partner Summit last week was a great chance to meet people who are using — and being successful — with Mondrian.

As always, people are thinking of using it in ways that I hadn't imagined. A couple of comments got me thinking about adding writeback support, something we'd long talked about, but seriously considered implementing.

Writeback allows the OLAP end-user to modify cell values and see the effects ripple through their spreadsheet. As you can imagine, it is useful for doing what-if analysis, especially budgeting.

If the cell is a sum of finer-grained cells, we need to modify those finer-grained cells also, and all of the totals of other dimensionalities created from those finer-grained cells, otherwise things just don't add up. This is hard to implement, because you sometimes need to modify a lot of cells, and even harder for ROLAP engines like Mondrian, because such engines don't store cells, they read directly from the unaggregated fact table.

First, I went looking for existing APIs for writeback.

Microsoft offers support for writeback via the UPDATE CUBE MDX statement. As always with Microsoft's MDX support, it's difficult to tell whether this is 'standard MDX', but the command seems to be well thought-out. The fact that it is an MDX command rather than an API call allows them to use an MDX expression as the rule by which to pro-rate changes to child cells.

I also looked at the JPalo Java API. (I've always wanted to work more closely with Palo. Although they're an OLAP engine, they have a different architecture (C and MOLAP) and core target audience (Excel users), and they're open source, so I see a lot of benefit to them and us if we pool resources. I invited them to join the olap4j process early on, but they preferred to define their own Palo-specific Java API. I'm still hopeful.)

I downloaded their most recent release from SourceForge and found that it was a release out of date (2.0 versus 2.5) and didn't contain the source code. There is a more up-to-date version in subversion. In DbConnection I found the setDataNumericSplashed method:

/**
  * Sets the given double value at the specified cell.
  * The splashMode paramater is only important for consolidated cells and
  * determines how the value is scattered among the consolidated elements.
  * Please use the defined class constants for valid values. Although more
  * modes are currently defined only three are supported, namely:
  * SPLASH_MODE_DEFAULT, SPLASH_MODE_BASE_SET and SPLASH_MODE_BASE_ADD
  * @param cube {@link CubeInfo} representation
  * @param coordinates {@link ElementInfo} representations which specify the
  * coordinates
  * @param value the new value
  * @param splashMode the splash mode, use defined class constants
  */
public void setDataNumericSplashed(CubeInfo cube, ElementInfo[] coordinate, double value, int splashMode);

I couldn't find any more documentation than that, but 'splash mode' seems to be equivalent to Microsoft's update strategies USE_EQUAL_ALLOCATION etc.

There are several remaining questions. What are the right changes to the olap4j API to support writeback? Support for the UPDATE CUBE statement is the leading contender. I'd love to hear what the olap4j community — especially the folks building the Pentaho Analysis Tool — think of this API, and how they would expose writeback in their UI.

I presume we'll need a scheme for transaction management. End-users will want to save their work, come back another day and continue where they left off. Several end-users might be using Mondrian at the same time, and want to see their numbers, not anyone else's. So, I think we'll need to introduce a concept I'd call a 'scenario', which is a property of a connection and can be persisted.

We'll need to figure out how to implement writeback within a Mondrian's ROLAP-with-caching architecture. Writing to the fact table is not tenable, because the modified cells can be of a multitude of dimensionalities. Neither is writing to an aggregate table, for the same reason. Ideal would be to write to disk a minimal description of the cells the user has modified — in XML, say — and do the other magic in the caching layer.

Lastly, I just need to find time to implement it.