I was on the way home from an all-hands at Pentaho's Orlando, Florida headquarters, where new CEO Quentin Gallivan had outlined his strategy for the company. I also got to spend time with the many smart folks from all over the world who work for Pentaho, among them Roland Bouman, formerly an evangelist for MySQL, now with Pentaho, but still passionately advocating for open source databases, open source business intelligence, and above all, keeping it simple.
Roland and I got talking about how to map Mondrian onto operational schemas. Though not designed as star schemas, some operational schemas nevertheless have a structure that can support a cube, with a central fact table surrounded by star or snowflake dimension tables. Often the one thing missing is a time dimension table. Since these time dimension tables look very much the same, how easy would it be for Mondrian to generate them on the fly? Not that difficult, I thought, as the captain turned off the "fasten seatbelts" sign and I opened my laptop. Here's what I came up with.
Here's how you declare a regular time dimension table in Mondrian 4:
<PhysicalSchema>
<Table name='time_by_day'/>
<!-- Other tables... -->
</PhysicalSchema>
Mondrian sees the table name 'time_by_day', checks that it exists, and finds the column definitions from the JDBC catalog. The table can then be used in various dimensions in the schema.
An auto-generated time dimension is similar:
<PhysicalSchema>
<AutoGeneratedDateTable name='time_by_day_generated' startDate='2012-01-01' endDate='2014-01-31'/>
<!-- Other tables... -->
</PhysicalSchema>
The first time Mondrian reads the schema, it notices that the table is not present in the schema, and creates and populates it. Here is the DDL and data it produces.
CREATE TABLE `time_by_day_generated` (
`time_id` Integer NOT NULL PRIMARY KEY,
`yymmdd` Integer NOT NULL,
`yyyymmdd` Integer NOT NULL,
`the_date` Date NOT NULL,
`the_day` VARCHAR(20) NOT NULL,
`the_month` VARCHAR(20) NOT NULL,
`the_year` Integer NOT NULL,
`day_of_month` VARCHAR(20) NOT NULL,
`week_of_year` Integer NOT NULL,
`month_of_year` Integer NOT NULL,
`quarter` VARCHAR(20) NOT NULL)
JULIAN | YYMMDD | YYYYMMDD | DATE | DAY_OF_WEEK_NAME | MONTH_NAME | YEAR | DAY_OF_MONTH | WEEK_OF_YEAR | MONTH | QUARTER |
---|---|---|---|---|---|---|---|---|---|---|
2455928 | 120101 | 20120101 | 2012-01-01 | Sunday | January | 2012 | 1 | 1 | 1 | Q1 |
2455929 | 120102 | 20120102 | 2012-01-02 | Monday | January | 2012 | 2 | 1 | 1 | Q1 |
2455930 | 120103 | 20120103 | 2012-01-03 | Tuesday | January | 2012 | 3 | 1 | 1 | Q1 |
The columns present are all of the time-dimension domains:
Domain | Default column name | Default data type | Example | Description |
---|---|---|---|---|
JULIAN | time_id | Integer | 2454115 | Julian day number (0 = January 1, 4713 BC). Additional attribute 'epoch', if specified, changes the date at which the value is zero. |
YYMMDD | yymmdd | Integer | 120219 | Decimal date with two-digit year |
YYYYMMDD | yyyymmdd | Integer | 20120219 | Decimal date with four-digit year |
DATE | the_date | Date | 2012-12-31 | Date literal |
DAY_OF_WEEK_NAME | the_day | String | Friday | Name of day of week |
MONTH_NAME | the_month | String | December | Name of month |
YEAR | the_year | Integer | 2012 | Year |
DAY_OF_MONTH | day_of_month | String | 31 | Day ordinal within month |
WEEK_OF_YEAR | week_of_year | Integer | 53 | Week ordinal within year |
MONTH | month_of_year | Integer | 12 | Month ordinal within year |
QUARTER | quarter | String | Q4 | Name of quarter |
Suppose you wish to choose specific column names, or have more control over how values are generated. You can do that by including a
<ColumnDefs>
element within the table, and <ColumnDef>
elements within that — just like a regular <Table>
element.For example,
<PhysicalSchema>
<AutoGeneratedDateTable name='time_by_day_generated' startDate='2008-01-01 endDate='2020-01-31'>
<ColumnDefs>
<ColumnDef name='time_id'>
<TimeDomain role='JULIAN' epoch='1996-01-01'/>
</ColumnDef>
<ColumnDef name='my_year'>
<TimeDomain role='year'/>
</ColumnDef>
<ColumnDef name='my_month'>
<TimeDomain role='MONTH'/>
</ColumnDef>
<ColumnDef name='quarter'/>
<ColumnDef name='month_of_year'/>
<ColumnDef name='week_of_year'/>
<ColumnDef name='day_of_month'/>
<ColumnDef name='the_month'/>
<ColumnDef name='the_date'/>
</ColumnDefs>
<Key>
<Column name='time_id/>
</Key>
</AutoGeneratedDateTable>
<!-- Other tables... -->
</PhysicalSchema>
The first three columns have nested
<TimeDomain>
elements that tell the generator how to populate them.The other columns have the standard column name for a particular time domain, and therefore the
<TimeDomain>
element can be omitted. For
instance,<ColumnDef name='month_of_year'/>
is shorthand for
<ColumnDef name='month_of_year' type='int'>
<TimeDomain role="month"/>
</ColumnDef>
The nested
<Key>
element makes that column valid as the
target of a link (from a foreign key in the fact table, for instance),
and also declares the column as a primary key in the CREATE TABLE
statement. This has the pleasant side-effect, on all databases I know
of, of creating an index. If you need other indexes on the generated
table, create them manually.The
<TimeDomain>
element could be extended further. For instance, we
could add a locale attribute. This would allow different translations
of month and weekday names, and also support locale-specific
differences in how week-in-day and day-of-week numbers are
calculated.Note that this functionality is checked into the mondrian-lagunitas branch, so will only be available as part of Mondrian version 4. That release is still pre-alpha. We recently started to regularly build the branch using Jenkins, and you should see the number of failing tests dropping steadily over the next weeks and months. Already over 80% of tests pass, so it's worth downloading the latest build to kick the tires on your application.
9 comments:
Wow, that's fast :)
Great work Julian, this will make many BI consultants happy! I think this is going to be especially useful to quickly prototype existing OLTP schemas.
As a minor variant on this idea, you can also make a table that consists of one column (the_date). You can then build a view where all of the other columns are date function calls. I will admit that it amused me to have a database row that is only 4 bytes, but this is a pretty trivial advantage. By using functions, the names can take advantage of localization rules built into the database engine.
An drawback is that the syntax of the view becomes specific to a RSBMS. It also require that the fact table have a date column.
Is there an obvious problem with this approach?
Bob,
Yes, there's a big problem. Databases can't calculate the inverse of the function. For example, suppose you want to see all of the sales that happened in 2011. If you run the query
select * from sales join time
on sales.the_date = time_by_day.the_date
where to_year(time.the_date) = 2011
then the database will probably do a full scan of the time_by_day table. If instead the time_by_day table had a the_year column, it could do an index scan.
You might say, what's the difference -- the table only has 3000 rows for 10 years, and that's a small amount of memory on a modern system. But database query optimizers can't easily combine functions, whereas they can combine indexes.
Kimball introduced the "star schema" methodology, including time dimension tables, about 15 years ago, largely to deal with the stupidity of database query optimizers. Database query optimizers have improved a lot, but they still need all the help they can get.
If you don't believe me, try some experiments against a reasonable sized database.
That is a good point, but one that I can work around in many RDMSs.
In Postgresql, I can index functions of the date column, so I can index any of the columns in the view, as long as the functions are deterministic. In my test query, the query optimizer was able to do an index scan when grouped by a column in the view. As predicted, it did make an improvement in the query time (600 ms without index to 400 ms with index). I must admit that surprised me a bit, since i thought that expensive aggregate was over the fact table. But I can get the same performance with either a table or a view since I have essentially the same indexes.
I have always though of the star schema as a logical design ,more than a technique to improve performance. So thanks for that insight.
Bob,
Agreed. That approach would probably work pretty well.
But as they say: once you know the rules, you can start breaking them. I'd like everyone to start off with a date dimension table.
Can you please post your DDL for fact table, including the functional indexes? I think people would find it useful.
Julian
Hi, Julian.
Selecting a whole year having days in dimension column is much more simple than
where to_year(time.the_date) = 2011
You can use
time.the_date between and
If question is in indexing - you will have index range scan, not table fullscans.
So as far as I understand- goal of explicit dimension table for date storage is that you can have bitmap index on dimension column (Oracle case) instead of tree index.
Am I right? may be you know other reason?
Esquire,
You'd think so. But:
(1) Mondrian isn't smart enough to know that to find the dates between two points, it just needs to a range scan on date_id. (I can't find all customers in San Francisco by using a range of customer_id... how is Mondrian to know that the date dimension is different?)
(2) Database query optimizers and operators sometimes don't work as well on ranges as '=' predicates.
I'm not claiming that other approaches won't work. I just claim that generated date dimensions are the simplest. And then one can move onto solving the harder problems in the schema.
Hi Julian,
I checked this feature (latest lagunitas), but it seems that a primary key is not created in the DB. I tried , keyColumn with my custom . The same issue with 'standard' columns (without and ).
ex,
Please log a jira case.
Post a Comment