Julian Hyde on Streaming Data, Open Source OLAP. And stuff.: 09/01/2007

Sunday, September 30, 2007

SDForum BI SIG

I shall be giving a talk entitled "Building Scalable OLAP applications with Mondrian and Pentaho" to the Business Intelligence SIG of SDForum. The talk will be at 7pm on October 16th 2007 in the Cubberley Community Center in Palo Alto, CA. All are welcome; details at the SDForum site.

Thursday, September 27, 2007

OpenMRS

OpenMRS stands for 'Open Medical Records System', and is a major open-source project aiming to build medical information systems for developing countries. Their general goal is to provide a low-cost (free) solution for tracking patient conditions. Their specific goal is to fight AIDS in Africa.

Their website is http://openmrs.org/wiki/OpenMRS.

The project is gaining a lot of momentum. They have strong links with organizations such as the World Health Organization (WHO), corporate sponsorship from companies such as Google, and are already deployed in 12 countries.

The OpenMRS architects are are in the process of adding analytics using mondrian. I have committed to helping them out, and Matt Casters (architect of Pentaho Data Integration aka Kettle) is also involved. They are looking for committers do build the mondrian schema and analytics, and that's why I'm reaching out to the mondrian community.

Maybe you have used mondrian in a commercial project, and are looking to use those skills in a project which makes the world a better place. Maybe you have some experience with databases and data modeling, but want to hone those skills on a real project and improve your CV. If you are student or have limited income, they have some stipends available.

I urge you to get involved with this project. Go to the website and sign up! If you know anyone who would be interested in this project, please forward this to them.

New access-control features in mondrian

Data warehouse administrators often want to give different users access to different subsets of the data. The OLAP model is a powerful paradigm for exploring data, and this paradigm also allows complex access-control rules to be expressed easily.

Mondrian already implements a rich set of access-control primitives (see mondrian schema guide), but we are considering some features to make mondrian more expressive in future releases.

Introduction: Access control in mondrian-2.4

In the current version of mondrian, roles are defined by elements in the mondrian schema file. A role can have access granted, or denied, for cubes, hierarchies, and members of hierarchies.

The type of access granted to an object is inherited by its children. For example, if you deny access to a schema, then access to cubes in that schema will be denied by default, but you can override by granting access to specific cubes.

Most objects have just 'all' or 'none' access, but there is finer-grained access for hierarchies. If a role has 'custom' access to a hierarchy, it can spell out specific the members it has access to.

Mondrian does not perform authentication, or have a concept of a user. It is the responsibility of the container (usually a webserver or application server) to check that the user is allowed to access the mondrian instance, and to provide mondrian with a role under which the user should execute queries.

Deficiencies in access-control

There are several deficiencies in mondrian's current access control model. Here are some of the major ones:

Partial cells. Need a policy where if a role can only see some of the children of a member, the total for that member includes only accessible children.
Hidden cells. Need a policy where if a role can only see some of the children of a member, the total for that member is not accessible.
Union roles. If a user belongs to more than one role, there is no easy way to run queries in the combined privileges of the roles.
Intrinsic roles. Need a way to define which members a role has access to using a rule rather than an explicit list of members in the schema file.
External roles. Need a way to define roles outside of the schema file, say in a database table.

Rollup policy

This proposed feature covers 1 'Partial cells' and 2 'Hidden cells' above.

Suppose that Fred belongs to a role which can see [USA].[CA] and [USA].[OR] but not [USA].[WA]. Fred runs the query

SELECT {[Measures].[Unit Sales]} ON COLUMNS,
  {[[Store].[USA], Store].[USA].Children} ON ROWS
FROM [Sales]

Under the current access-control policy, called 'full', the query returns

           | Unit Sales
-----------+-----------
[USA]      |    266,773
[USA].[CA] |     74,748
[USA].[OR] |     67,659

Note that [USA].[WA] is not returned, per the access-control policy, but the total includes the total from Washington (124,366) that Fred cannot see. For some applications, this is not appropriate. In particular, if the dimension has a small number of members, the end-user may be able to deduce the values of the members which they do not have access to.

To remedy this, a role can apply a 'rollup policy' to a hierarchy. The policy describes how a total is calculated for a particular member if the current role can only see some of that member's children:

Full. The total for that member includes all children. This is the only policy today, and will remain the default policy.
Partial. The total for that member includes only accessible children.
Hidden. If any of the children are inaccessible, the total is hidden.

Results under 'partial' policy:

           | Unit Sales
-----------+-----------
[USA]      |    142,407
[USA].[CA] |     74,748
[USA].[OR] |     67,659

Results under 'hidden' policy:

           | Unit Sales
-----------+-----------
[USA]      |          -
[USA].[CA] |     74,748
[USA].[OR] |     67,659

The policy is specified per role and hierarchy. In the following example, the role sees partial totals for the [Store] hierarchy but full totals for [Product].

<Role name="South Pacific manager">
<SchemaGrant access="none">
    <CubeGrant cube="Sales" access="all">
      <HierarchyGrant hierarchy="[Store]" access="custom" rollupPolicy="partial"
topLevel="[Store].[Store Country]">
        <MemberGrant member="[Store].[USA].[CA]" access="all"/>
        <MemberGrant member="[Store].[USA].[CA].[Los Angeles]" access="none"/>
      </HierarchyGrant>
      <HierarchyGrant hierarchy="[Customers]" access="custom" rollupPolicy="full"
topLevel="[Customers].[State Province]" bottomLevel="[Customers].[City]">
        <MemberGrant member="[Customers].[USA].[CA]" access="all"/>
        <MemberGrant member="[Customers].[USA].[CA].[Los Angeles]"
access="none"/>
      </HierarchyGrant>
      <HierarchyGrant hierarchy="[Gender]" access="none"/>
    </CubeGrant>
</SchemaGrant>
</Role>

This example also shows existing features, such as how hierarchy grants can be restricted using topLevel and/or bottomLevel attributes, and how a role can be prevented from seeing a hierarchy using access="none". See the schema guide for a description of these features.

It is important that Mondrian applies access-control transparently. Within a particular role and access-control scheme, the value of a particular cell is always the same; it is not possible to circumvent access-control by defining calculations, for instance.

Intrinsic roles

Currently mondrian roles are extrinsic: they describe explicitly the members which they can see. If the number of members is large, or the set of members changes often, it is more convenient to define the members by their properties.

An intrinsic role defines which members it can see according to the properties of the member. For example, this role can see only products whose color is red.

<Role name="Red Part manager">
<SchemaGrant access="none">
    <CubeGrant cube="Sales" access="all">
      <HierarchyGrant hierarchy="[Product]"
access="custom">
        <MemberGrant level="[Product].[Product Name]" access="all"/>
        <Expression>
            <[!CDATA[
Filter([Product].[Product Name].Members,

[Product].Properties("Color") = 'red') ]]>
         </Expression>
    </HierarchyGrant>

As before, parent members are visible if one of their children is visible, and the values of cells for those members are determined by the rollup policy.

Combining roles

Many systems have an access-control model where a user belongs to many roles simultaneously; when a user attempts an action, such as accessing a file or executing a query, the system either selects the appropriate role or uses the sum of the privileges of all of the roles. Mondrian does not work well with these systems, because we require that queries are executed in precisely one role.

The obvious solution is for mondrian to automatically combine roles roles when executing queries into one 'super role'. The semantics are subtle. First each role figures out which members it can see (which, for extrinsic roles may involve rules which say they cannot see certain members), then the members visible to the roles are combined.

For example, suppose that role R1 can see all cities in California except San Francisco and Los Angeles, and role R2 can see all cities whose population is less than 1M. If Fred has roles R1 and R2, can he see San Francisco? Yes: R1 cannot see San Francisco, R2 can see it because it has fewer than 1M inhabitants, so it is in the union of the members in R1 and R2.

Discussion

The partial and hidden rollup policies are more expensive to evaluate than full. Full requires just one point in multidimensional space to be fetched from disk, which can be satisfied using a SQL statement or a cache lookup, whereas partial and hidden require a more complex MDX expression, sometimes involving many cells.

Various implementations for partial and hidden are possible. One suggested scheme would add a special 'role_id' column to aggregate tables, which would contain a set of data for each role. This scheme would likely perform well, but because it is custom made for access control, it would tend to work against other strategies such as caching and semi-joins.

The preferred implementation is one which leverages the strengths of the existing MDX language. Some of these queries will require special 'tricks', not currently implemented in mondrian, to evaluate efficiently, but at least these tricks will benefit a wide range of queries beside access control, and therefore be thoroughly tested.

We plan to implement the partial and hidden policies by having the schema-reader introduce a calculated member for each partial total. For instance, under a 'partial' policy, above query would be executed as if Fred had written

WITH MEMBER [Store].[USA Partial] AS 'Aggregate([Store].[USA].Children)'
SELECT {[Measures].[Unit Sales]} ON COLUMNS,
{[Store].[USA Partial], [Store].[USA].Children} ON ROWS
FROM [Sales]

In the naive evaluation strategy, the calculated member is expanded away as mondrian recursively evaluates expressions. In a more sophisticated strategy, the predicates which define the calculated member - including the access-control policies which filter the results of the expression '[Store].[USA].Children' - will be generated into a SQL query.

Evaluating expressions in memory is inefficient when there are many thousands of members. This is particularly easy to do when you can define a role according to attributes of members, for example 'Red products'. We need to find ways to optimize those kinds of queries, by pushing the predicates down to SQL, and ideally by expressing the expressions in a dimensional manner so that the results can be cached.

Optimization #1: dimensional shift

This optimization involves recognizing that a set of members is being filtered by a property which is also dimension. The filtered set can be replaced by a member of the dimension.

For example,

   Filter([Customer].[Name].Members, [Customer].CurrentMember.[Gender] = 'F')

becomes

   [Gender].[F]

This optimization saves a scan over a large set of members, replacing it with an access to a single cache cell.

Optimization #2, push predicates down to SQL

In some filter expressions the predicate is too complex to be converted a dimension. If the number of members in the set is large, it is worthwhile to let the DBMS evaluate the entire expression. For example,

   Filter(
   [Customer].[Name].Members,
   [Customer].[Gender] = 'F' or [Customer].[Age] > 50)

The 'or' prevents the dimensional shift from taking place. The best optimization would be to (a) translate the condition into the SQL WHERE clause

   SELECT ... FROM customer WHERE gender = 'F' and age > 50

The DBMS can use evaluation techniques such as indexes, and it incurs less I/O if we crunch the data as close to the disk as possible.

These kinds of predicates are very likely to occur when using intrinsic access-control on hierarchies.

Optimization #3, cache cells whose coordinates are complex members

After the above query has run, store the results in the cache. The should fabricate special members like 'Red product' so that these cells can be kept in the dimensional cache.

This optimization will benefit totals generated by access-controlled hierarchies, but also totals over calculated sets such as top 10 customers, and other commonly occurring patterns.

Conclusion

I have discussed some of the deficiencies in mondrian's current access-control model, some features which could be introduced to address them, and sketches for how those features could be implemented. The 'rollup policy' feature is planned for mondrian-3.0 (see mondrian roadmap), but the other features are currently unplanned.

Friday, September 21, 2007

A nice way to populate those pesky closure tables

If you use mondrian's parent-child hierarchies, you will know that performance sucks unless you create closure tables. Closure tables expand the hierarchy, and allow mondrian to the operations required to roll-up a parent-child hierarchy using raw SQL: really fast.

The problem is populating the things. Closure tables contain what computer science profs. call a transitive closure of the parent-child relation (hence their name), and transitive closures aren't something which relational databases are very good at computing (which is why the database performs so much better when they're around). You can't just define a view, or write a simple SQL statement to populate them. Up til now, you'd have to use a stored procedure (if your database supports them) or write some gnarly JDBC code.

Now Matt Casters has added a Closure Generator step to Pentaho Data Integration (aka Kettle), which should make everyone's life easier. Yet another reason to use Kettle to load your mondrian schema. The new step debuts in Pentaho 3.0.0-RC1, but it should work with any version of mondrian.

Which, I suppose, means we all get to that Friday evening beer a little earlier. I'll drink to that.

Sunday, September 16, 2007

Open standards and open-source: The ballad of BIRT and olap4j

This week, I criticized the Eclipse BIRT project for choosing JOLAP as the API between their Dynamic Crosstab support and their cubing engine. I believe that an open-source project should be judged by the extent of its adoption by the community -- after all, if a project fails to serve its community, the community will simply go elsewhere -- and so I hesitate to criticize a project which is clearly pleasing its community.

My argument hinges on the particular importance of open standards for open-source projects, and what I believe is the obligation of an open-source project to embrace standards.

Open standards, open-source

I am not the first person to observe that open standards help open-source projects succeed. An open-source project is likely to find a ready market if it is a better, cheaper, plug-in replacement for an existing software module, and if there is a standard, the boundaries of that module are already well-defined.

But open standards are almost invariably created by commercial software vendors. The open-source community is beholden to these vendors, and the vendors are often in no particular hurry to create open standards.

Barriers to open standards

In public, vendors claim to love open standards as if they were virtues comparable with apple pie and motherhood; in private, they fight them tooth and nail. A couple of years ago, a software vendor asked me to create an MDX interface to their product using mondrian, to satisfy the growing number of their customers who wanted MDX. In private, they admitted that MDX was a good idea, and they would move to it if the majority of their customers demanded it, while in public they derided MDX as inferior to their API. They vendor knew that there is a tipping point when a market moves to an open-standard, and wanted to delay that tipping point as long as possible.

Sadly, this thinking is not confined to commercial software. In commercial and open-source software alike, if a software vendor/project is in a dominant market position, they can choose to ignore open standards. If they deliver a collection of software components connected by proprietary interfaces, customers have to take all or nothing.

Open-source projects build in functionality for which they could have used an open standard API. There are some legitimate reasons for this. Sometimes license balkanization can be blamed, and it would be legally impossible to distribute the two components under the same license; sometimes an open standard exists, but there is no suitable open-source implementation of that standard. But all too commonly, open-source developers fall victim to not-invented-here syndrome, and build a technology where a good open-source alternative exists.

The pressures are particularly intense when the open-source project is backed by a commercial organization, as is the case with Eclipse BIRT. (BIRT is run by Actuate Corporation, a commercial BI vendor, and all of its leaders and key developers work for that company.) The designers naturally want to provide a coherent experience for their end-users, but there is also a temptation to create an architecture which excludes technology of the parent organization's competitors.

I am proud of the efforts we have made to embrace standards in the Pentaho BI Suite, of which mondrian is a part. Among the standards used in the Pentaho suite are CWM, XForm, XPDL, XSLT, XPath, SVG, and XUL, as well as all the standards implicit in the Java and J2EE environments, such as JDBC and JNDI.

Pentaho CTO James Dixon has pointed out that open-source is about more than giving your source code away for free: it requires a process which is transparent (allowing others to know where the project is going) and open (allowing others to participate). Such a process encourages re-use of and contribution to other projects, and in-house projects are ruthlessly 'right-sized' into the smallest module which can be self-contained and sustain its own community. Dependencies between projects are minimized, to make them usable by the largest possible audience. This is a true open-source culture, and it naturally embraces open standards.

A new open standard: olap4j

So it is saddening that there is no API for Java-based OLAP. The only effective OLAP standards are OLE DB for OLAP (fine if you're running Windows) and XML for Analysis (fine if you're prepared to write a program to handle reams of XML). Java developers were promised JSR-069, also known as JOLAP, but the vendors involved (principally IBM, Oracle and Hyperion) failed to agree on a final version. No server vendor released an implementation of the API, and no client tool was released which used it. After it became apparent that JOLAP was truly dead, I removed mondrian's JOLAP interface; mondrian-2.2 was the last release with JOLAP support.

As I noted above, standards are traditionally created by commercial vendors, yet in the case of Java-based OLAP, the vendors have singularly failed to create a standard. The need for an interoperable OLAP client still exists, so I believe it now falls to the open-source community to create one.

Last year, I proposed creating an API called olap4j, and since then, there has been an initial draft of the specification, and work is under way for drivers for both mondrian and for generic XMLA servers.

olap4j and BIRT

As an important part of the open-source BI movement, I naturally tried to include Eclipse BIRT in the olap4j process. Unfortunately the spirit of openness and transparency seems to be lacking in the BIRT project. I have made several attempts, privately and publicly, to engage with BIRT leaders and start collaboration between our projects. They have not responded to my requests, and I can only speculate why that might be. There must be a temptation to create a single BI suite, all parts of which are supplied by Actuate (albeit under an open-source license) rather than present an architecture based on open standards where the users have a genuine choice of components.

olap4j would be a superior API for BIRT to use in order to access dimensional data. JOLAP is dead: it never became finalized as a specification, and there has been no further development, releases of the specification, or implementations since the specification stalled in 2004. olap4j has an active community, interoperates with XMLA, and has multiple implementations available or under development.

The benefit to BIRT's community would be considerable. They would be able to use BIRT's facilities for designing reports, acknowledged as among the best in open-source BI, and run them against the OLAP server of their choice. This would allow reporting users to choose the best OLAP server for their performance requirements and data volumes. Users who already own an OLAP server could use BIRT as an alternative interface to the reporting stack provided by their OLAP vendor.

BIRT could even continue to develop a cubing engine. If successful, that engine could be used as an alternative to mondrian or XMLA by olap4j clients.

Conclusion

I am not optimistic that we will see a change of heart from BIRT, but I wanted to lay things out as I see them.

If you are a member of the open-source BI community -- whether you use Eclipse, Pentaho or just mondrian -- please let the BIRT team know that their decision is making BIRT less open, delivering less value to BIRT's users, and will damage the open-source BI for everyone.