Julian Hyde on Streaming Data, Open Source OLAP. And stuff.: xmla4js

Thursday, January 14, 2010

xmla4js

Yesterday I attended Roland Bouman's webinar introducing xmla4js. Xmla4js is a library for connecting to OLAP servers in JavaScript. All you need is an OLAP server that speaks XMLA (and most of them do).

It's classic web 2.0 technology. It does virtually nothing, yet changes everything. There is very little code (about 20K), an old-school enterprise architect would regard it as a trivial piece of protocol glue, and yet it opens the door to all kinds of mashups. Those mashups will get powerful OLAP data into the hands, and onto the screens, of the business users who care about that data.

Roland is a practical guy and a great communicator, so the presentation (and the download from google code) includes several examples of those mashups. I urge you to take a look at the recorded webinar.

A few issues came up during the webinar that are worth mulling over.

1. Query model

One thing that is missing is a query model. A query model allows you to represent the state of the current query, apply a transformation (say sorting on column #3, or adding a hierarchy to an axis), then generate a new MDX statement to send to the OLAP server. The demo Roland showed had a rudimentary query model, but in order to do more complex analyses, that query model will run out of road very quickly.

It's a problem that xmla4js shares with olap4j (as the PAT developers know only too well). I'd like to find a way to pool resources.

We could create a query model that works both on olap4j (in Java) and on xmla4js (in JavaScript). There would be two implementations, but at least the transformations can be specified in a language-neutral way, and we could write a single test suite that could exercise both implementations.

2. Cube metadata

Roland bemoaned the fact that getting the metadata for a cube (including dimensions, hierarchies, levels, measures) takes several XMLA round-trips.

He has a good point. Those round-trips may make the page load several times slower. We could easily extend Mondrian's cube metadata request so that you can ask for those extra elements.

If it proves useful, other XMLA engines (such as PALO) could do the same, and heck, if the XMLA council is not asleep in their castle, they could add it to the next version of the XMLA spec. (Well, we can hope.)

3. Results in JSON

Roland pointed out that XMLA is a verbose and inconvenient data format for JavaScript to consume. The "industry standard" for that environment is JSON. It has a similar attributes/values/nested sets structure to XML but is easier to parse: because it is syntactically valid JavaScript you just execute the JSON code to get the value.

Mondrian's XMLA servlet is written to generate elements, attributes, and nested collections of elements, and precious little of the code directly generates XML. It wouldn't be too much work to generate JSON instead. The JSON would have the same structure as the XMLA, sans the irritating namespaces that are necessary in XML.

For example, the JSON response from MDSCHEMA_CUBES could look like this:

"DiscoverResponse": {
  "return": {
    "root": {
      "row": [
        {
          "CATALOG_NAME": "FoodMart",
          "SCHEMA_NAME": "FoodMart",
          "CUBE_NAME": "HR",
          "CUBE_TYPE": "CUBE",
          "IS_DRILLTHROUGH_ENABLED": true,
          "IS_WRITE_ENABLED": false,
          "IS_LINKABLE": false,
          "IS_SQL_ENABLED": false,
          "DESCRIPTION": "FoodMart Schema - HR Cube"
        },
        {
          "CATALOG_NAME": "FoodMart",
          "SCHEMA_NAME": "FoodMart",
          "CUBE_NAME": "Sales",
          "CUBE_TYPE": "CUBE",
          "IS_DRILLTHROUGH_ENABLED": true,
          "IS_WRITE_ENABLED": false,
          "IS_LINKABLE": false,
          "IS_SQL_ENABLED": false,
          "DESCRIPTION": "FoodMart Schema - Sales Cube"
        },
      ]
    }
  }

I'd like to hear Roland's (and the Mondrian, Pentaho, olap4j and PAT community's) take on these points. Thanks again Roland for an informative webinar and a great new addition to the open source BI technology stack.

13 comments:

rpbouman said...: Hi Julian!

thanks for the kind words - it feels really, really good to hear you like it!

And, you did a great job of summarizing the problems that remain.

Now, from all things you mention, it seems 2 and 3 are not a big issue to solve. I agree. I also like the idea of using the same metadata that you have for XML/A - I don't think it will add much value to re-invent that wheel again.

As for solving point 2 - It seems to me there are 2 obvious ways of doing it:
1) The schema rowsets remain distinct, and the response just returns multiple sets as siblings of each other.

2) The schema rowsets are nested, by letting the rows have a new property that contains the rowset that you would have obtained by applying the values from the "ancestor" rowsets as restrictions. So a record from the MDSCHEMA_CUBES rowset would have a "MDSCHEMA_DIMENSIONS" property, containing those records from the MDSCHEMA_DIMENSIONS rowset that would have been obtained by applying restrictions for the CATALOG_NAME, SCHEMA_NAME, and CUBE_NAME columns using values from the corresponding columns in the containing MDSCHEMA_CUBES record. And from there, you could go down in a similar way to the MDSCHEMA_HIERARCHIES rowset, MDSCHEMAS_LEVELS rowset, then MDSCHEMA_MEMBERS and finally MDSCHEMA_PROPERTIES.

Now, I have a slight preference for this last option, but that's only because I am thinking that that is a more natural fit for javascript (and probably all object oriented languages). But at the same time, I think that the actual response format is not as important as the fact that all information is passed in a single round trip. And I can imagine some disadvantages to the nested format too - it may be harder to generate on the server side, for example for those OLAP servers that use a relational backend for storing their cube metadata.

Perhaps the best option would be to have a metadata object on the client side that allows the application to access the data in both ways, which would have the responsibility of hiding the actual format of the response to the application.

So far, I don't see many problems. But the Query model is a different kettle of fish. I completely agree something needs to be done there, but so far my own attempts to that have been unsatisfactory.; 1/14/2010 1:49 PM
vmalic said...: What happens when you have dimension with 100-200k or more elements and transfer all of it in one single round trip? Do you have experience with that?; 1/15/2010 12:38 AM
rpbouman said...: @vmalic: sure, there will be cases where it's not a good idea to get all metadata up to the last bit.

But surely, we can agree that the purely structural elements: datasources, catalogs, schemas, cubes, dimensions (lets consider Measures as a special case of dimension), and levels should be obtainable with a single request. Right now, XML/A needs at least 2 requests (1 for the datasources, and one for the levels; the app could reconstruct the existence of schema, catalog, cube, dimension and hierarchy from that schema rowset). But to get user-friendly names for all these things, you need a separate request for each rowset. That just leads to a lot of overhead to code it, and a lot of waiting for the user.

Now, of course it should be possible to curtail to what depth the metadata should be obtained: in case you'd choose to not do the nested representation of the schema rowsets and just ship multitple rowsets as siblings in the response, a trivial extension to XML/A would be to allow the RequestType to be a list or array of schema rowset names (instead of requiring exactly one).

I agree that for the members we may need to be a bit more cautious.

Now that you mention these large dimensions: I think that the current mechanism of restrictions offered by XML/A is also not very effective for monster-dimensions. And I think that the higher level structural information I just mentioned will be just a fraction of the amount of data volume you have to deal with for a monster dimension. So this proposed extension does not actually introduce a problem that wans't there before.

Basically, the most effective filtering you can do today is by using restrictions based the level. But a -to me- obvious restriction, namely by parent member is not defined by the standard. Another obvious way of curtailing large datasets, through pagination, is also not provided by the current XML/A specification.

While I think it could be interesting to try and solve these problems too, my main focus is now on the purely structual metadata. But thanks for bringing this up because I do think you have a good point here.; 1/15/2010 2:43 AM
vmalic said...: Thanks for the comment! We(Palo) will keep an eye on xmla4js.; 1/15/2010 5:31 AM
rpbouman said...: @Vladislav: thanks! Feel free to participate. The ticket/issue system on the site may be used for feature requests too.; 1/15/2010 6:51 AM
Julian Hyde said...: I've implemented #3, results in JSON, in change 13323 on the mondrian-release/3.2 branch. (Requests still need to be in SOAP.)

You can get an idea of how the JSON responses look from the two test cases I added to XmlaBasicTest.ref.xml.

Julian; 1/16/2010 1:50 AM
rpbouman said...: Julian, you move *fast*! Great work! I can understand why the request still has to be SOAP for now, and in that light, i think the choice for an extra property is exactly the right decision.

I browsed the diff, but couldn't find the actual definition of the enum values. I'd like to propose the standard MimeTypes for that, so either application/xml or text/xml for a regular SOAP exchange, and application/json for JSON. To me this seems appropriate and in accordance with the proper Content-Type HTTP header for these responses. Besides, I think the mime-type notation might be a cue that could remove any doubt as to what is meant (but YMMV).

If I may raise one word of caution: XML/A already specifies "Language" in another context (i18n, see http://msdn.microsoft.com/en-us/library/ms187205(SQL.90).aspx) so perhaps we can choose a slightly less generic name for the property. The comment in the diff already provides some inspiration: "RPCLanguage". Some, IMO, sensible alternatives: if the enum values are mime types, then perhaps simply MimeType or ContentType or maybe HTTPContentType seem valid to me too.

Anyway, I'm excited to see this taking off already - thanks!; 1/16/2010 10:04 AM
Julian Hyde said...: Fixed in change 13329.

The servlet now looks at the Accept HTTP header. It will respond in JSON if 'Accept: application/json' is among the headers. Other values accepted are 'application/soap+xml', 'application/xml' and 'text/xml'; these are equivalent, but 'application/soap+xml' is preferred.

I renamed the Language property as you suggested. Now, if the XMLA request contains ResponseMimeType among the properties, it will override the Accept HTTP header.

The servlet also sets the ContentType HTTP header in the response to 'application/json' or 'text/xml' as appropriate.; 1/17/2010 12:36 PM
Julian Hyde said...: This comment has been removed by the author.; 1/18/2010 2:53 AM
Julian Hyde said...: I've implemented #2, deep metadata, in change 13331.

With the <Deep>true</Deep> property, the MDSCHEMA_CUBES request returns the dimensions, hierarchies, levels, measures, and named sets inside each cube. It works in JSON mode too.; 1/18/2010 3:01 AM
rpbouman said...: Julian - wow, just wow! :)

This is making me very happy.

I'll try and get it to run shortly. I I am very interested in benchmarking it when its inside a web page to see how much difference this makes in a typical page loading scenario.; 1/18/2010 5:53 AM
Unknown said...: Huh, thats coming along quite fast.

What i would be interested in is the support for those features in the olap4j xmla driver. and how we can implement that concerning that other xmla servers don't support those features. do we want the xmla driver to be able to use those features automatically, if they are available or do we need something like mondrian/sap bw/... xmla server specific meta data layer (like kettle has for the supported db types. like: truncate table works in mysql with "truncate table" but "delete from" in teradata)

i know that we will (hopefully soon) be able to use olap4j connections in the platform that use the mondrian olap4j driver, but i think it would be nice if the xmla driver would know about this stuff as well

-paul; 1/18/2010 8:18 AM
Julian Hyde said...: Paul,

I definitely think the XMLA driver should know about different servers, exploit their beneficial features and work around their deficiencies. Like Kettle, Mondrian has dialecting for the purpose of adapting to the DBMS it is talking to.

The olap4j-XMLA driver could definitely exploit these features. Luc does most of the work on the XMLA driver these days, so ultimately it's his call.

The good news is that the change will be transparent to applications written on top of olap4j (as for xmla4js). When the XMLA driver exploits the nested metadata call to get all of the metadata in the cube in one shot, the applications will just run faster.

Julian; 1/20/2010 7:31 PM