Friday, July 10, 2009

Functional dependency optimizations in Mondrian

Eric McDermid just checked in a nice new feature into Mondrian which optimizes the SQL generated by MySQL. It takes advantage of the fact that in MySQL, if some of your GROUP BY columns are unique, you can leave the other columns out of the GROUP BY clause, and MySQL does less work.

In some cases, a lot less work. MySQL implements GROUP BY by sorting, and since this reduces the volume of data being sorted, Eric reports significant performance improvements. Unfortunately it only works on MySQL, since MySQL is the database I know which has this feature.

See the latest schema documentation for more details.

I'll note that we reserve the right to change the syntax a little in future versions. In mondrian-4.0 we're adding physical schemas, which will include much more information about tables and relationships, so it would make sense to declare unique keys along with that. But rest assured that even if we do change the syntax, the feature will still be present.

5 comments:

rpbouman said...

Hi!

Nice feature!

(In an ideal world, MySQL would itself detect the functional dependency and not include redundant columns in the sort, but that's another story)

Anyway, I can see one potential problem. MySQL doesn't actually check whether the non-aggegrated columns in the SELECT list are in fact functionally dependent upon the columns in the GROUP BY clause, and some users like to protect themselves from writing non-sense GROUP BY's by including the ONLY_FULL_GROUP_BY option in the sql_mode (see: http://dev.mysql.com/doc/refman/5.0/en/server-sql-mode.html#sqlmode_only_full_group_by)
If this setting is in effect, MySQL will give you an error in case not all non-aggregated columns from the SELECT list are also included in the GROUP BY clause, so this could possibly clash with this feature. Error messages go like this in this case:

"ERROR 1055 (42000): '..column-name...' isn't in GROUP BY"

This is just a FYI - I'm not suggesting mondrian should check if its safe to use the feature (although that is of course a possibility), but if you get bug reports with this error message, then at least you know what is causing it.

I hope this helps. Kind regards,

Roland.

Eric McDermid said...

Good point, Roland. As it happens, we've already got it covered. Mondrian checks the sql_mode when creating a connection. If it includes ONLY_FULL_GROUP_BY then functionally dependent columns will still be included in the GROUP BY clause, just as they are on other databases.

It is possible to change the MySQL sql_mode on the fly, but such changes only affect the current session and (when set globally) future sessions, not existing ones. So even if the user sets ONLY_FULL_GROUP_BY while Mondrian is running, there shouldn't be any risk.

Of course, if someone changes the data so that those columns are no longer functionally dependent while Mondrian is running, there will be problems. But that constitutes lying to your OLAP server, which is generally considered a bad idea anyway. :)

By the way, there is another aspect of this feature that is applicable to all databases, not just MySQL. In those cases where all level properties are functionally dependent, and the query is at a depth where we can tell all columns will be unique, Mondrian can now eliminate the GROUP BY entirely.

rpbouman said...

Hi Eric!


first of all, thank you for implementing this feature! It's a great idea, and I am sure this will improve performance in a lot of cases.

Second, thanks for your explanation. It's great to hear you already anticipated this and took care of it. Kudos! Indeed, data could change while running, but I would consider that user error.

Thanks, and kind regards,

Roland.

Luc Boudreau said...

Since you mention Mondrian 4.0 schema changes; is there any wiki or documentation? A lot of us would like to either be informed up-front or even might want to pitch in a few ideas and improvements.

Keep us informed will y'a? Cheers.

Julian Hyde said...

Luc, fair comment. A functional specification is long overdue.

I've been mainly working on code, because I need to find out options are workable (and can be made to work) before write the documentation. But I know now enough to be able to write a functional specification, explaining the goals of the new schema metamodel, give some examples, and talk about upgrade path.

Look for something on this blog in the next few weeks.

Julian