Wednesday, October 22, 2008

Pentaho 2.0 brings good things

This week Pentaho released version 2.0 of its BI Suite, and it contains two major features that the mondrian community will love.

First, the Pentaho User Console (PUC), a web-based environment where end users can create, view, save, and share BI content. Content is arranged into folders, and includes operational reports created with Pentaho Reports and dimensional analytics created with Pentaho Analysis (mondrian). Users can also create subscriptions to receive reports regularly by email. PUC is simple and elegant. I predict that it will quickly become the face of Pentaho for end users.


Second, the Pentaho Aggregate Designer, which automatically creates a set of aggregate tables to accelerate a mondrian schema. I believe that the Aggregate Designer takes mondrian to a new level of scalability. Let me explain why.

There are several architectures possible for multidimensional analysis engines, and they each have their strengths and weaknesses. Mondrian's architecture is best described as 'ROLAP with caching'. A ROLAP engine, short for 'relational online analytical processing', stores its data in a relational database (RDBMS) and accesses it via SQL. It follows that the RDBMS does most of the heavy-duty processing, such as JOIN and GROUP BY operations, while the ROLAP engine deals with presentation, caching, and calculations that can be expressed in a multidimensional model but cannot easily be converted into SQL.

The chief advantage of ROLAP is its simplicity. A ROLAP engine does not have its own storage engine: everything is in the RDBMS. In particular, the load process is simply a matter of loading the RDBMS, and if the contents of the RDBMS change you just need to flush mondrian's cache to see the up-to-date data. Provided that the RDBMS scales, you can scale mondrian to greater numbers of concurrent users by running multiple instances of mondrian in a farm of web servers.

This great strength is also a great weakness: mondrian is beholden to the RDBMS for performance. Consider, in particular, that first query of the day, the one that scans all 100 million rows in the fact table to generate a three-segment pie chart on the CEO's dashboard:
SELECT
customer.region,
sum(fact.store_sales)
FROM sales AS fact
JOIN customer ON fact.cust_id = customer.cust_id
GROUP BY customer.region
Without aggregate tables, that query takes however long the RDBMS takes to scan 100 million rows -- perhaps one minute, perhaps ten -- but the CEO is not prepared to wait that long. Aggregate tables are the answer. They contain the pre-computed results of such queries, and are declared in mondrian's schema so that mondrian knows how to generate SQL that makes use of them.
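
To make that concrete, suppose a table of pre-computed totals exists; call it agg_sales_by_region, with one row per region (the name is purely illustrative). Mondrian can then answer the dashboard query by scanning a handful of rows instead of 100 million:
-- same result as the query above, but read from the (illustrative) aggregate table
SELECT
agg.region,
sum(agg.store_sales)
FROM agg_sales_by_region AS agg
GROUP BY agg.region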

The problem is that aggregate tables are hard to use. Mondrian has supported aggregate tables for several releases, but very few people have made effective use of them. The steps are as follows.

First of all, choose an effective set of aggregate tables. The possibilities are literally exponential: in a schema with N attributes (hierarchy levels) there are 2^N possible aggregate tables. If you choose too many, you will use too much disk space and spend too long loading them every night. If you choose too few, many queries will fall through the net and end up using a full scan of the fact table. Many aggregates can be derived from other aggregates, so it is possible to economize, but there are pitfalls if you do it by hand. (I will write further about the algorithm the Aggregate Designer uses to choose a near-optimal set of aggregate tables in a future post.)
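
To illustrate the point about deriving aggregates from other aggregates: a year-level aggregate can be rolled up from a month-level one rather than from the 100-million-row fact table. A sketch, with made-up table and column names; the classic hand-rolled pitfall is forgetting that the fact count must be summed from the finer table rather than recomputed with count(*):
-- agg_sales_month_region holds one row per (year, month, region);
-- agg_sales_year_region holds one row per (year, region)
INSERT INTO agg_sales_year_region (the_year, region, store_sales, fact_count)
SELECT
m.the_year,
m.region,
sum(m.store_sales),
sum(m.fact_count)
FROM agg_sales_month_region AS m
GROUP BY m.the_year, m.region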

Next, create the aggregate tables in the RDBMS and add mapping elements such as <AggName> to mondrian's schema. Last, write SQL statements to populate the aggregate tables as part of your ETL process.
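
By hand, those last two steps might look something like this for a single month-by-region aggregate (again a sketch: the time dimension table, column names, and data types are invented for illustration; fact_count is the row count column that mondrian's aggregate tables carry):
-- create the aggregate table
CREATE TABLE agg_sales_month_region (
the_year INT,
the_month INT,
region VARCHAR(30),
store_sales DECIMAL(10,2),
fact_count INT
);
-- populate it from the fact table during the nightly ETL run
INSERT INTO agg_sales_month_region
SELECT
t.the_year,
t.the_month,
c.region,
sum(f.store_sales),
count(*)
FROM sales AS f
JOIN time_by_day AS t ON f.time_id = t.time_id
JOIN customer AS c ON f.cust_id = c.cust_id
GROUP BY t.the_year, t.the_month, c.region;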

These steps are possible by hand, but very difficult for mere mortals to get right. The Aggregate Designer automates all of these steps. Once you have chosen a mondrian schema, and a particular cube in that schema to optimize, the algorithm analyzes the data in the star schema underlying that cube, and generates a set of aggregate tables. If you have a particular set of aggregate tables in mind, you can create these before running the algorithm, and the algorithm will create additional aggregate tables, taking yours into account.


Each aggregate table is rated according to its cost (a combination of the number of bytes on disk and the time it will take to populate) and its benefit (the effort that will be saved at run time, over a typical query load, by having the aggregate table). The Aggregate Designer displays the set of tables it has chosen as a graph of cost against benefit: usually convex, reflecting the fact that the first tables it suggests are the ones with the most favorable cost/benefit ratios.

When the algorithm has run, the Aggregate Designer can add the definitions of the aggregate tables to the mondrian schema. You can either create and populate the tables immediately, or save a script of 'CREATE TABLE' and 'INSERT INTO {aggregate table} SELECT ...' statements. You can even generate Pentaho Data Integration (Kettle) steps to perform the ETL process.

Pentaho User Console and Pentaho Aggregate Designer are both available in the Pentaho open source BI suite version 2.0. Download the suite, or check out the live demo. They are compatible with mondrian 3.0.4.11371, which is available as part of Pentaho 2.0 or for separate download.

4 comments:

Anonymous said...

What is the download link for Mondrian 3.0.5? It's not on Sourceforge..

Thanks!

Julian Hyde said...

Oops, I meant to say mondrian 3.0.4.11371, not 3.0.5. I have corrected my post. It is available for download here.

Peter Tran said...

Hi Julian,

I wonder if it would be more efficient to make use of database-specific syntax to create aggregate tables. For example, in Oracle we use the Create Table As Select construct to recreate the aggregate table. It's much more efficient than pre-creating the table and using an insert/select construct.

-Peter

Julian Hyde said...

Peter,

We welcome contributions. The Aggregate Designer has a very modular/pluggable design so it should be easy to drop in other dialects and styles of implementing aggregate tables. For example, some Oracle users may wish to generate CREATE MATERIALIZED VIEW, so that the aggregate tables populate themselves automatically.
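
For instance, something along these lines (an untested sketch; the table name is illustrative, and an incremental or on-commit refresh would additionally need materialized view logs on the base tables):
CREATE MATERIALIZED VIEW agg_sales_by_region
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT customer.region, sum(fact.store_sales) AS store_sales, count(*) AS fact_count
FROM sales fact
JOIN customer ON fact.cust_id = customer.cust_id
GROUP BY customer.region;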

Julian