Wednesday, June 01, 2011

Removing Mondrian's 'high cardinality dimension' feature

I would like to remove the 'high cardinality dimension' feature in mondrian 4.0.

To specify that a dimension is high-cardinality, you set the highCardinality attribute of the Dimension element to true. This will cause mondrian to scan over the dimension, rather than trying to load all of the children of a given parent member into memory.

The goal is a worthy one, but the implementation, which makes iterators look like lists, has a number of architectural problems. It duplicates code. Because it allows backtracking only by a fixed amount, it works with small dimensions but fails unpredictably with larger ones. And because the lists are based on iterators, restarting an iteration multiple times (e.g. from within a crossjoin) can re-execute complex SQL statements.
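To make the last two problems concrete, here is a minimal sketch of a list-like view over an iterator with a bounded backtracking window. It is not mondrian's actual code: the class WindowedList, its window parameter, and the restarts counter are hypothetical names invented for illustration, and the Supplier stands in for whatever would run the SQL query.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical list-like view over an iterator; illustration only, not mondrian code. */
final class WindowedList<T> {
    // Each call to source.get() stands in for executing a (possibly expensive) SQL query.
    private final Supplier<Iterator<T>> source;
    // Maximum number of already-read elements that can still be revisited.
    private final int window;
    private final List<T> buffer = new ArrayList<>();
    private Iterator<T> iterator;
    // Index of the next element that will be read from the underlying iterator.
    private int nextIndex;
    // Number of times the source has been opened, i.e. how often the "query" ran.
    int restarts;

    WindowedList(Supplier<Iterator<T>> source, int window) {
        this.source = source;
        this.window = window;
        rewind();
    }

    /** Starts the iteration again, which re-opens the source. */
    void rewind() {
        iterator = source.get();
        buffer.clear();
        nextIndex = 0;
        restarts++;
    }

    /** Looks like List.get, but only works for indexes inside the window or ahead of it. */
    T get(int index) {
        int oldest = nextIndex - buffer.size();
        if (index < oldest) {
            throw new IllegalStateException(
                "index " + index + " has fallen out of the backtracking window");
        }
        // Read forward until the requested element is in the buffer.
        while (nextIndex <= index) {
            if (!iterator.hasNext()) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            buffer.add(iterator.next());
            if (buffer.size() > window) {
                buffer.remove(0); // forget the oldest element
            }
            nextIndex++;
        }
        return buffer.get(index - (nextIndex - buffer.size()));
    }
}
```

With a window of 2 over five elements, a caller can read element 4 and then step back to element 3, but asking for element 0 again throws. The only way back is rewind(), which opens the source a second time. A consumer that rewinds once per outer row, as a crossjoin might, would pay that cost on every pass.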

There are other architectural features designed to help with large dimensions. Many functions can operate in an 'iterable' mode (except that here the iterators are explicit). And for many of the most data-intensive operators, such as crossjoin, filter, semijoin (non-empty), and topcount, we can push down the operator to SQL, and thereby reduce the number of records coming out of the RDBMS.

It's always hard to remove a feature. But over the years we have seen numerous inconsistencies, and if we removed this feature in mondrian 4.0, we could better focus our resources.

If you are using this feature and getting significant performance benefit, I would like to hear from you. I would like to understand your use case, and either direct you to another feature that solves the problem, or try to develop an alternative solution in mondrian 4.0. The best place to comment on these use cases is the Jira case MONDRIAN-949.

2 comments:

LuisVi said...

Hi Julian. What I see with highCardinality in our case is that it doesn't use the time dimension filter. If you have year and month in the filter, mondrian doesn't use them.
This is a problem with big tables of over 5 million rows. The queries spend a lot of time in MySQL.

Julian Hyde said...

So you're saying that with 'high cardinality' flag turned on, Mondrian's performance is bad? Doesn't that prove my point that we should remove this feature?