Wednesday, April 04, 2012

How should Mondrian get table and column statistics?

When evaluating queries, Mondrian sometimes needs to make decisions about how to proceed, and in particular, what SQL to generate. One decision is which aggregate table to use for a query (or whether to stick with the fact table), and another is whether to "round out" a cell request for, say, 48 states and 10 months of 2011 to the full segment of 50 states and 12 months.

These decisions are informed by the volume actual data in the database. The first decision uses row counts (the numbers of rows in the fact and aggregate tables) and the second uses column cardinalities (the number of distinct values in the "month" and "state" columns).

Gathering statistical information is an imperfect science. The obvious way to get the information is to execute some SQL queries:
-- row count of the fact table
select count(*) from sales_fact_1997;

-- count rows in an aggregate table
select count(*) from agg_sales_product_brand_time_month;

-- cardinality of the [Customer].[State] attribute
select count(distinct state) from customer;
These queries can be quite expensive. (On many databases, a row count involves reading every block of the table into memory and summing the number of rows in each. A query for a column's cardinality involves an entry scan of an index; or, worse, a table scan followed by an expensive sort if there is no such index.)

Mondrian doesn't need the exact value, but need needs an approximate value (say correct within a factor of 3) in order to proceed with the query.

Mondrian has a statistics cache, so the statistics calls only affect the "first query of the day", when Mondrian has been re-started, or is using a new schema. (If you are making use of a dynamic schema processor, it might be that every user effectively has their own schema. In this case, every user will experience their own slow "first query of the day".)

We have one mechanism to prevent expensive queries: you can provide estimates in the Mondrian schema file. When you are defining an aggregate table, specify the approxRowCount attribute of the <AggName> XML element, and Mondrian will skip the row count query. When defining a level, if you specify the approxRowCount attribute of the <Level> XML element (the <Attribute> XML element in mondrian-4), Mondrian will skip the cardinality query. But it is time-consuming to fill in those counts, and they can go out of date as the database grows.

I am mulling over a couple of features to ease this problem. (These features are not committed for any particular release, or even fully formed. Your feedback to this post will help us prioritize them, shape them so that they are useful for how you manage Mondrian, and hopefully trim their scope so that they are reasonably simple for us to implement.)

Auto-populate volume attributes

The auto-populate feature would read a schema file, run queries on the database to count every fact table, aggregate table, and the key of every level, and populate the approxRowCount attributes in the schema file. It might also do some sanity checks, such as that the primary key of your dimension table doesn't have any unique values, and warn you if they are violated.

Auto-populate is clearly a time-consuming task. It might take an hour or so to execute all of the queries. You could run it say once a month, at a quiet time of day. But at the end, the Mondrian schema would have enough information that it would not need to run any statistics queries at run time.

Auto-populate has a few limitations. Obviously, you need to schedule it, as a manual task, or a cron job. Then you need to make sure that the modified schema file is propagated into the solution repository. Lastly, if you are using a dynamic schema processor to generate or significantly modify your schema file, auto-populate clearly cannot populate sections that have not been generated yet.

Pluggable statistics

The statistics that Mondrian needs probably already exist. Every database has a query optimizer, and every query optimizer needs statistics such as row counts and column cardinalities to make its decisions. So, that ANALYZE TABLE (or equivalent) command that you ran after you populated the database (you did run it, didn't you?) probably calculated these statistics and stored them somewhere.

The problem is that that "somewhere" is different for each and every database. In Oracle, they are in ALL_TAB_STATISTICS and ALL_TAB_COL_STATISTICS tables; in MySQL, they are in INFORMATION_SCHEMA.STATISTICS. And so forth.

JDBC claims to provide the information through the DatabaseMetaData.getIndexInfo method. But it doesn't work for all drivers. (The only one I tried, MySQL, albeit a fairly old version, didn't give me any row count statistics.)

Let's suppose we introduced an SPI to get table and column statistics:
package mondrian.spi;

import javax.sql.DataSource;

interface StatisticsProvider {
   int getColumnCardinality(DataSource dataSource, String catalog, String schema, String table, String[] columns);
   int getTableCardinality(DataSource dataSource, String catalog, String schema, String table);
}
and several implementations:
  • A fallback implementation SqlStatisticsProvider that generates "select count(distinct ...) ..." and "select count(*) ..." queries.
  • An implementation JdbcStatisticsProvider that uses JDBC methods such as getIndexInfo
  • An implementation that uses each database's specific tables, OracleStatisticsProvider, MySqlStatisticsProvider, and so forth.
Each Dialect could nominate one or more implementations of this SPI, and try them in order. (Each method can return -1 to say 'I don't know'.)

Conclusion

Statistics are an important issue for Mondrian. In the real world, missing statistics are more damaging than somewhat inaccurate statistics. If statistics are inaccurate, Mondrian will execute queries inefficiently, but the difference with optimal performance is negligible if the statistics are within an order of magnitude; missing statistics cause Mondrian to generate potentially expensive SQL statements, especially during that all-important first query of the day.

A couple of solutions are proposed.

The auto-population tool would solve the problem in one way, at the cost of logistical effort to schedule the running of the tool.

The statistics provider leverages databases' own statistics. It solves the problem of diversity the usual open source way: it provides an SPI and lets the community provide implementations that SPI for their favorite database.

10 comments:

MattK said...

Could Mondrian get useful information from an EXPLAIN of the query?

Joshua Tolley said...

Various databases' implementations of EXPLAIN differ at least as much as their implementations of query optimizer statistics. From my perspective, it makes lots of sense to implement an interface, like you mentioned, and allow different database-specific implementations to copy the statistics from the database's optimizer. But it will probably need some not-necessarily-optimal alternative, for databases where that implementation doesn't exist, or isn't configured.

Julian Hyde said...

I agree, Joshua. I suppose one could, in theory, generate a star join against the fact table and various candidate aggregate tables, and actually submit whichever one that EXPLAIN says will be cheapest. But you'd still have to parse the EXPLAIN output, which is different for each database.

I was hesitating to introduce a mechanism which is database-specific, because people hate to have to write or modify their own subclass of Dialect even more than they hate editing mondrian.properties files.

Unknown said...

The auto population of the cardinality would be a good generic way that works with all databases but might be expensive time and effort wise.

I think the second option would be generally the better solution. It requires more coding work, but we could leverage each databases best way to handle that problem.

We could even start with a separate SPI for column store db's that can perform a count() more quickly than standard rdbms.

Judging from the dialect classes I think we could cover most users with about 8 different implementations.

Once those SPI's exist I would still revisit the idea of auto generating the approxRowCount attribute with those SPI's. So you don't have to perform those queries that often.

-Paul

Roland Bouman said...

It seems to me the pluggable statistics idea is most desirable, provided there are standard implementations available (i.e. "do COUNT(*) and COUNT(DISTINCT ..) queries", and "ask the driver")).

It also seems to me the autopopulate feature and the pluggable statistics features don't exclude each other - you could have some mondrian command that says: use whatever statistics provider is currently plugged in to calculate statistics, and stick them into the schema (or output a new schema with the populated statistics so people can save the modified schema). Mondrian could then rely on the statistics in the schema if present, and get fresh ones if not.

Here's another idea to marry "autopopulate" with the pluggable statistics idea: why not create a statistics server provider that reads the statistics from a local file, and create a tool or mondrian command that can generate/refresh that file. That way, statistics are pulled entirely out of the schema, but can still easily be tweaked by a knowledgeable user.

Carlos Lopez said...

Considering the time and efforts it will take to auto-populate the volume attributes the best choice will be to go with the pluggable statistics since these are already in-place on the database.

mprudhom said...

Using the cardinality from getIndexInfo sounds like a brilliant idea! I went ahead and tested out the how well-supported getIndexInfo is with recent versions of our test databases, and most of them (including recent versions of the MySQL driver) seem to accurately report the table cardinality. The only exception was Sybase, which doesn't report anything for SQLAnywhere and throws an exception with ASE (both using the jTDS driver).

I've pasted the output from my experiments at http://pastebin.com/rCyPjU4e

DanD said...

It would be useful to allow Mondrian to get this data from a table in the database. The same process that routinely updates the base fact and aggregates could then update that table with the counts as well. Mondrian can then get the counts for all aggregates in one very fast query.

Julian Hyde said...

I have implemented the StatisticsProvider SPI. See commit 3129b4e on the master branch.

By default it uses SQL queries to get statistics, as today.

If you want to use a different provider, set either mondrian.statistics.provider or mondrian.statistics.provider.DATABASE (replacing DATABASE with the name of your database, e.g. MYSQL or ORACLE) in mondrian.properties.

DanD,

Try creating a provider that looks for a table, say called "mondrian_statistics" with columns schema_name, table_name, column_name. If it works contribute it!

mprodhom,

I was hoping that MySQL's getIndexInfo would provide a record with TYPE = tableIndexStatistic (0). That is, summarizing the whole table, not a particular index). It didn't. Your output shows that Oracle does, but MySQL doesn't.

Look in JdbcStatisticsProvider, and you will see that if there is not a tableIndexStatistic row, we assume that the non-unique index with the most rows is close enough to the true value.

Carlos,

Agreed. After this fix, we will still use approxRowCount attributes, if present.

Paul (aka Unknown),

There is still no implementation of the SPI that reads statistics. (E.g. OracleAnalyzedStatisticsProvider.)
(Unless that is what comes out through JDBC getIndexInfo -- I couldn't tell.) Contributions welcome!

Also, an implementation that reads a text file, formatted say

customer.cardinality=10281
customer.gender.cardinality=2
customer.state.cardinality=16

would also be most welcome.

Julian

Rashmi C G said...

can u give me the steps to populate data into mondrain from begining..