Julian Hyde on Streaming Data, Open Source OLAP. And stuff.: 06/01/2009

Tuesday, June 30, 2009

SQLstream powers Firefox 3.5 realtime downloads monitor

Mozilla launched Firefox 3.5 today, and with it, a neat applet, powered by SQLstream, to monitor downloads in real time.

You can see the results at Mozilla's download stats page.

A few weeks ago, Apple's Hyperwall was awe-inspiring as a piece of visual art, but it was less impressive as a piece of real-time data integration, because the data was delayed five minutes from the app store.

SQLstream gathers data from Mozilla's download centers around the world, assigns each record a latitude and longitude, and summarizes the information in a continuously executing SQL query. Data is read with sub-second latencies, and then aggregated (using SQLstream's streaming GROUP BY operator) into summary records each describing a second of activity.

A server-side Java program reads the data using JDBC, serializes it as JSON, and transmits it to all connected web clients. Clients render the charts using the Canvas tag, newly introduced in HTML 5. The results are very impressive visually, but to a back-end guy like myself, the plumbing is impressive too.

The amazing thing is that SQLstream makes this so easy. Our official company blurb talks about "shortening data integration projects from months to weeks", but this project took just a couple of days of work.

By the way, don't try to view the page in Microsoft's Internet Explorer. Ten years ago, Internet Explorer led the charge to enhance the capabilities of the web browser, introducing dynamic HTML (DHTML), XML handling in the browser, ActiveX controls and other capabilities, but those days are over. With HTML 5 there is a renaissance in web standards; Firefox is leading the pack, with other 'modern' browsers such as Safari, Opera and Chrome not far behind.

Sunday, June 28, 2009

Black screen, white pointer on Vista

Last night I had a problem where Vista gives me a black screen and white pointer. You can move the pointer around, but you can't do anything useful (except, as noted below, press the Shift key 5 times). I tried restarting in safe mode, and I got a black screen with 'Safe mode' in each corner of the screen, but otherwise the same experience.

I had a huge sinking feeling. I've had this problem twice before in the last twelve months. On the other two occasions, Dell technical support asked the usual questions for an hour or so, then told me to re-install Vista. An operating system that needs to be re-installed every 6 months is not a productive operating system, even if the operating system is great in between times. Which Vista isn't, anyway.

Luckily, this time I found this post at Microsoft TechNet. The problem was exactly as described in the post, and so was the cause (corrupted windows event log files) and the solution (rename the directory, or delete the event log files).

I'm pleased to say I discovered the same hack that they did: press the shift key five times, which gives you the 'Do you want to turn on Sticky Keys?' dialog. (Yes, this is literally the ONLY meaningful interaction you can have with what is obviously an instance of Vista which is functioning but just not listening.) Then you click on the 'Go to the Ease of Access Center do disable the keyboard shortcut', and because this brings up an explorer window, you can then type in the address bar to launch all kinds of other commands.

Thanks to towz and others who posted to that forum; proving that even for Microsoft products, the crowd can sometimes provide better technical support than the professionals.

Wednesday, June 17, 2009

Spatial OLAP using GeoMondrian

I received an email from Thierry Badard of Laval University, Québec:

After the release of a new version of its open source spatial ETL tool, GeoKettle yesterday (please see the announcement for more details), the GeoSOA research group at Laval University, Quebec, Canada is proud to announce the availibility as new open source projects of GeoMondrian, the first implementation of a Spatial OLAP (SOLAP) server and Spatialytics, a lightweight cartographic component which enables navigation in SOLAP data cubes.

GeoKettle, GeoMondrian and Spatialytics are components of the complete geospatial BI (Business Intelligence) software stack developed by the GeoSOA research group.

For some screenshots of the project, see Fabio D'Ovidio's blog.

This is the kind of news that makes me proud to have gotten involved in open source. I'm not an expert on spatial software, so I could never have written a spatial OLAP engine; I didn't even realize the need existed. I'm delighted that people who are experts in the field could build upon my efforts and all of the other people who have contributed to Mondrian and JPivot over the years.

I hear that they are also involved in a Google Summer of Code (GSoC) project to integrate with Pentaho Community Dashboard Framework (CDF). That makes a lot of sense.

And of course I would be happy to receive contributions back to Mondrian if it makes it easier for them to maintain the code base.

I wish the GeoMondrian, GeoKettle and Spatialytics projects every success, and look forward to them bringing BI to a new audience.

Thursday, June 11, 2009

Cell writeback in Mondrian

Writeback is a feature that allows you to modify OLAP cell values and see the effects ripple through the data set, automatically modifying child and parent cells, and also cells derived using calculations. This allows you to perform 'what if' analysis and applications such as budgeting.

I have added experimental support for writeback to Mondrian.

In Mondrian's case, the term 'writeback' is a bit misleading. In a ROLAP system such as Mondrian, writing back to the database would be difficult, since values are stored in a fact table but we allow cells of any granularity to be modified. One modified cell might contain thousands of fact table rows. So, we don't write cells back to the database, but just retain the modified cells in memory, and propagate the modifications to related cells.

Here's how to use the experimental writeback support. Some of the details may change later as we make the feature more usable.

First, enable writeback for your cube. Create a dimension called 'Scenario', and a measure called 'Atomic Cell Count':

<Cube name='Sales'>
    <Dimension name='Scenario' foreignKey='time_id'>
        <Hierarchy primaryKey='time_id' hasall='true'>
            <InlineTable alias='_dummy'>
                <ColumnDefs>
                    <ColumnDef name='foo' type='Numeric'/>
                </ColumnDefs>
                <Rows/>
            </InlineTable>
            <Level name='Scenario' column='foo'/>
        </Hierarchy>
    </Dimension>
    <!-- Other dimensions... -->
   <Measure name='Atomic' aggregator='count'/>
    <!-- Other measures... -->
</Cube>

(Yes, this is a lot of crud to add to your cube definition, and it's temporary. In future, we will let you flag a cube as 'writeback enabled', and a [Scenario] dimension and [Atomic Cell Count] measure will be created automatically. Also, we will make it easier for you to create dimensions that have only calculated members, without resorting to inline tables.)

Next, create a Scenario:

Connection connection;
Scenario scenario = connection.createScenario();
int scenarioId = scenario.getId();

(The Scenario API will soon move to olap4j: before mondrian-4.0, I hope. This includes the class mondrian.olap.Scenario, the method mondrian.olap.Cell.setValue(), and the method mondrian.olap.Connection.createScenario(). It will be optional for an olap4j driver to support writeback, but Mondrian's olap4j driver will, of course.)

Write a query that uses the scenario. Assuming that scenarioId above was 1, the query

SELECT [Measures].[Unit Sales] ON COLUMNS,    {[Product],
     [Product].Children,
     [Product].[Drink].Children} ON ROWS
FROM [Sales]
WHERE [Scenario].[1]

returns

[Product]                 [Unit Sales]
========================= ============
(All)                          266,773
 + Drink                        24,597
 +--+ Alcoholic Beverages        6,838
 +--+ Beverages                 13,573
 +--+ Dairy                      4,186
 + Food                        191,940
 + Non-Consumable               50,236

Choose one of the cells returned from the query and modify its value. For example, let's reduce the sales of Drink by 1,000 from 24,597 to 23,597:

Result result = connection.executeQuery(...);
Cell cell = result.getCell(new int[] {0, 1});
cell.setValue(23597, AllocationPolicy.EQUAL_ALLOCATION);

Execute the query again, and it returns

[Product]                 [Unit Sales]
========================= ============
(All)                          265,773
 + Drink                        23,597
 +--+ Alcoholic Beverages        6,563
 +--+ Beverages                 12,990
 +--+ Dairy                      4,043
 + Food                        191,940
 + Non-Consumable               50,236

The value for Drink is 23,597, as expected, and the values of its children have been correspondingly reduced.

How the value is allocated to the children (and in fact all descendants) is decided by the allocation policy. In this case, we specified EQUAL_ALLOCATION, which means that all atomic cells have the same value.

An atomic cell is the finest grained value that can be viewed multidimensionally; for this cube, it is an instance of a particular customer buying a particular product, on a particular promotion, on a particular day, in a particular store. That makes for an awful lot of of atomic cells, but there may be fewer atomic cells than fact table rows. If the fact table does not have a primary key on (customer, product, time, promotion, store) some cells may have more than one fact table row.

If instead we had written

cell.setValue(23597, AllocationPolicy.EQUAL_INCREMENT);

the query would have returned

[Product]                 [Unit Sales]
========================= ============
(All)                          265,773
 + Drink                        23,597
 +--+ Alcoholic Beverages        6,560
 +--+ Beverages                 13,022
 +--+ Dairy                      4,015
 + Food                        191,940
 + Non-Consumable               50,236

We notice that Beverages has not been reduced as much under EQUAL_INCREMENT policy than EQUAL_ALLOCATION policy; the average value for atomic cells of Beverages must be greater than for Drink as a whole.

Allocation policies are defined consistent with Analysis Services' UPDATE CUBE statement. Mondrian does not currently implement WEIGHTED_ALLOCATION or WEIGHTED_INCREMENT policies.

Treating scenarios as a dimension is an elegant and powerful idea. Using the Scenario dimension, you can easily switch from one scenario to another, or you can compare scenarios side-by-side.

Note that you can also set a connection's current scenario. This effectively becomes the default value for the Scenario dimension in that connection, so you do not need to specify Scenario in the slicer. However, there still needs to be an explicit scenario in the context when you call Cell.setValue(). I'm not sure whether the benefit of having a scenario for a connection outweighs the benefit/confusion, and we may discontinue this feature.

Remember, this is still an experimental feature. There is some cleanup to be done, some performance tuning, and the API needs to be moved into olap4j. But most importantly, it's not useful until a user interface, such as PAT or JPivot, supports scenarios and modifying cell values.