Julian Hyde on Streaming Data, Open Source OLAP. And stuff.: 10/01/2008

There is a natural fit between university students and researchers and the open source community. They are smart, educated, short of cash, and want to make the world a better place; and some of them, at least, have plenty of spare time.

More seriously, open source projects are a great platform for software research. By starting with a mature software platform, the researchers can spend less time recreating existing functionality, and get to the new, interesting stuff faster. The findings of such projects are more applicable to the real world because the new ideas have been tested in realistic architectures and on data sets of a reasonable size. In the area of spatial (GIS) applications alone, there are several projects, including the work of Joel da Silva and others at Pernambuco, Brazil, the GeWOlap project, and GeoMondrian.

Still, the majority of the mondrian and Pentaho communities are from industry. I would love to get more committers and active community members from academia.

Open Source Boot Camp (OSBOOTCAMP) is a mini-conference which is trying to put that right. I will be participating in a panel on open-source database development at Berkeley this Thursday 30th October, along with open-source advocates Bill Maimone (Ingres), Josh Berkus (PostgreSQL), Mark Atwood (MySQL) and John Sichi (LucidDB).

This week Pentaho released version 2.0 of its BI Suite, and it contains two major features that the mondrian community will love.

First, the Pentaho User Console, a web-based environment where end users can create, view, save, and share BI content. Content is arranged into folders, and includes operational reports created with Pentaho Reports, and dimensional analytics created with Pentaho Analysis (mondrian). Users can also create subscriptions to receive reports regularly by email. PUC is simple and elegant. I predict that it will quickly become the face of Pentaho for end users.

Second, the Pentaho Aggregate Designer, which automatically creates a set of aggregate tables to accelerate a mondrian schema. I believe that the Aggregate Designer takes mondrian to a new level of scalability. Let me explain why.

There are several architectures possible for multidimensional analysis engines, and they each have their strengths and weaknesses. Mondrian's architecture is best described as 'ROLAP with caching'. A ROLAP engine, short for 'relational online analytical processing', stores its data in a relational database (RDBMS) and accesses it via SQL. It follows that the RDBMS does most of the heavy-duty processing, such as JOIN and GROUP BY operations, while the ROLAP engine deals with presentation, caching, and calculations that can be expressed in a multidimensional model but cannot easily be converted into SQL.

The chief advantage of ROLAP is its simplicity. A ROLAP engine does not have its own storage engine: everything is in the RDBMS. In particular, the load process is simply a matter of loading the RDBMS, and if the contents of the RDBMS change you just need to flush mondrian's cache to see the up to date contents. Provided that the RDBMS scales, you can scale mondrian to greater numbers of concurrent users by having multiple instances of mondrian in a farm of web servers.

This great strength is also a great weakness. It means that mondrian is beholden to the RDBMS for performance. In particular, that first query of the day, the one that scans all 100 million rows in the fact table to generate a three segment pie chart on the CEO's dashboard:

SELECT
customer.region,
sum(fact.store_sales)
FROM sales AS fact
JOIN customer ON fact.cust_id = sales.cust_id
GROUP BY customer.region

Without aggregate tables, that query takes however long the RDBMS takes to scan 100 million rows -- perhaps 1 minute, or ten minutes -- but the CEO is not prepared to wait that long. Aggregate tables are the answer. They contain the pre-computed result of such queries, and are declared in mondrian's schema so that mondrian knows how to generate SQL to make use of them.

The problem is that aggregate tables are hard to use. Mondrian has supported aggregate tables for several releases, but very few people have made effective use of them. The steps are as follows.

First of all, choose an effective set of aggregate tables. The possibilities are literally exponential: in a schema with N attributes (hierarchy levels) there are 2^N possible aggregate tables. If you choose too many, you will use too much disk space and spend too long loading them every night. If you choose too few, many queries will fall through the net and end up using a full scan of the fact table. Many aggregates can be derived from other aggregates, so it is possible to economize, but there are pitfalls if you do it by hand. (I will write further about the algorithm the Aggregate Designer uses to choose a near-optimal set of aggregate tables in a future post.)

Next, create the aggregate tables in the RDBMS and add mapping elements such as to mondrian's schema. Last, write SQL statements to populate the aggregate tables as part of your ETL process.

These steps are possible by hand, but very difficult for mere mortals to get right. The Aggregate Designer automates all of these steps. Once you have chosen a mondrian schema, and a particular cube in that schema to optimize, the algorithm analyzes the data in the star schema underlying that cube, and generates a set of aggregate tables. If you have a particular set of aggregate tables in mind, you can create these before running the algorithm, and the algorithm will create additional aggregate tables, taking yours into account.

Each aggregate table is categorized according to its cost (a combination of the number of bytes on disk and the time it will take to populate) and its benefit (the effort that will be saved at run time, over a typical query load, by having the aggregate table). The Aggregate Designer displays the set of tables it has chosen as a graph: usually convex, reflecting the fact that the first tables suggested are the ones with the most favorable cost/benefit ratios.

When the algorithm has run, Aggregate Designer can add the definitions of the aggregate tables into the mondrian schema. You can either create and populate the tables immediately, or save a scripts of 'CREATE TABLE' and 'INSERT INTO {aggregate table} SELECT ...' statements. You can even generate Pentaho Data Integration (Kettle) steps to perform the ETL process.

Pentaho User Console and Pentaho Aggregate Designer are both available in the Pentaho open source BI suite version 2.0. Download the suite, or check out the live demo. They are compatible with mondrian 3.0.4.11371, which is available as part of Pentaho 2.0 or for separate download.

Julian Hyde on Streaming Data, Open Source OLAP. And stuff.

Monday, October 27, 2008

Academia meets Open Source

Wednesday, October 22, 2008

Pentaho 2.0 brings good things