Tuesday, February 21, 2012

Auto-generated date dimension tables

It seems that whenever I have a cross-continent flight, Mondrian gets a new feature. This particular flight was from Florida back home to California, and this particular feature is a time-dimension generator.

I was on the way home from an all-hands at Pentaho's Orlando, Florida headquarters, where new CEO Quentin Gallivan had outlined his strategy for the company. I also got to spend time with the many smart folks from all over the world who work for Pentaho, among them Roland Bouman, formerly an evangelist for MySQL, now with Pentaho, but still passionately advocating for open source databases, open source business intelligence, and above all, keeping it simple.

Roland and I got talking about how to map Mondrian onto operational schemas. Though not designed as star schemas, some operational schemas nevertheless have a structure that can support a cube, with a central fact table surrounded by star or snowflake dimension tables. Often the one thing missing is a time dimension table. Since these time dimension tables look very much the same, how easy would it be for Mondrian to generate them on the fly? Not that difficult, I thought, as the captain turned off the "fasten seatbelts" sign and I opened my laptop. Here's what I came up with.

Here's how you declare a regular time dimension table in Mondrian 4:

<PhysicalSchema>
  <Table name='time_by_day'/>
  <!-- Other tables... -->
</PhysicalSchema>


Mondrian sees the table name 'time_by_day', checks that it exists, and finds the column definitions from the JDBC catalog. The table can then be used in various dimensions in the schema.

An auto-generated time dimension is similar:

<PhysicalSchema>
  <AutoGeneratedDateTable name='time_by_day_generated' startDate='2012-01-01' endDate='2014-01-31'/>
  <!-- Other tables... -->
</PhysicalSchema>


The first time Mondrian reads the schema, it notices that the table is not present in the schema, and creates and populates it. Here is the DDL and data it produces.

CREATE TABLE `time_by_day_generated` (
  `time_id` Integer NOT NULL PRIMARY KEY,
  `yymmdd` Integer NOT NULL,
  `yyyymmdd` Integer NOT NULL,
  `the_date` Date NOT NULL,
  `the_day` VARCHAR(20) NOT NULL,
  `the_month` VARCHAR(20) NOT NULL,
  `the_year` Integer NOT NULL,
  `day_of_month` VARCHAR(20) NOT NULL,
  `week_of_year` Integer NOT NULL,
  `month_of_year` Integer NOT NULL,
  `quarter` VARCHAR(20) NOT NULL)

JULIANYYMMDDYYYYMMDDDATEDAY_OF_WEEK_NAMEMONTH_NAMEYEARDAY_OF_MONTHWEEK_OF_YEARMONTHQUARTER
2455928120101201201012012-01-01SundayJanuary2012111Q1
2455929120102201201022012-01-02MondayJanuary2012211Q1
2455930120103201201032012-01-03TuesdayJanuary2012311Q1

The columns present are all of the time-dimension domains:

DomainDefault column nameDefault data typeExampleDescription
JULIAN time_id Integer 2454115Julian day number (0 = January 1, 4713 BC). Additional attribute 'epoch', if specified, changes the date at which the value is zero.
YYMMDD yymmdd Integer 120219Decimal date with two-digit year
YYYYMMDD yyyymmdd Integer 20120219Decimal date with four-digit year
DATE the_date Date 2012-12-31Date literal
DAY_OF_WEEK_NAMEthe_day String FridayName of day of week
MONTH_NAME the_month String DecemberName of month
YEAR the_year Integer 2012Year
DAY_OF_MONTH day_of_monthString 31Day ordinal within month
WEEK_OF_YEAR week_of_yearInteger 53Week ordinal within year
MONTH month_of_yearInteger 12Month ordinal within year
QUARTER quarter String Q4Name of quarter

Suppose you wish to choose specific column names, or have more control over how values are generated. You can do that by including a <ColumnDefs> element within the table, and <ColumnDef> elements within that — just like a regular <Table> element.

For example,

<PhysicalSchema>
  <AutoGeneratedDateTable name='time_by_day_generated' startDate='2008-01-01 endDate='2020-01-31'>
    <ColumnDefs>
      <ColumnDef name='time_id'>
        <TimeDomain role='JULIAN' epoch='1996-01-01'/>
      </ColumnDef>
      <ColumnDef name='my_year'>
        <TimeDomain role='year'/>
      </ColumnDef>
      <ColumnDef name='my_month'>
        <TimeDomain role='MONTH'/>
      </ColumnDef>
      <ColumnDef name='quarter'/>
      <ColumnDef name='month_of_year'/>
      <ColumnDef name='week_of_year'/>
      <ColumnDef name='day_of_month'/>
      <ColumnDef name='the_month'/>
      <ColumnDef name='the_date'/>
    </ColumnDefs>
    <Key>
      <Column name='time_id/>
    </Key>
  </AutoGeneratedDateTable>
  <!-- Other tables... -->
</PhysicalSchema>


The first three columns have nested <TimeDomain> elements that tell the generator how to populate them.

The other columns have the standard column name for a particular time domain, and therefore the <TimeDomain> element can be omitted. For instance,

<ColumnDef name='month_of_year'/>

is shorthand for

<ColumnDef name='month_of_year' type='int'>
  <TimeDomain role="month"/>
</ColumnDef>


The nested <Key> element makes that column valid as the target of a link (from a foreign key in the fact table, for instance), and also declares the column as a primary key in the CREATE TABLE statement. This has the pleasant side-effect, on all databases I know of, of creating an index. If you need other indexes on the generated table, create them manually.

The <TimeDomain> element could be extended further. For instance, we could add a locale attribute. This would allow different translations of month and weekday names, and also support locale-specific differences in how week-in-day and day-of-week numbers are calculated.

Note that this functionality is checked into the mondrian-lagunitas branch, so will only be available as part of Mondrian version 4. That release is still pre-alpha. We recently started to regularly build the branch using Jenkins, and you should see the number of failing tests dropping steadily over the next weeks and months. Already over 80% of tests pass, so it's worth downloading the latest build to kick the tires on your application.

9 comments:

Roland Bouman said...

Wow, that's fast :)

Great work Julian, this will make many BI consultants happy! I think this is going to be especially useful to quickly prototype existing OLTP schemas.

Bob Folkerts said...

As a minor variant on this idea, you can also make a table that consists of one column (the_date). You can then build a view where all of the other columns are date function calls. I will admit that it amused me to have a database row that is only 4 bytes, but this is a pretty trivial advantage. By using functions, the names can take advantage of localization rules built into the database engine.

An drawback is that the syntax of the view becomes specific to a RSBMS. It also require that the fact table have a date column.

Is there an obvious problem with this approach?

Julian Hyde said...

Bob,

Yes, there's a big problem. Databases can't calculate the inverse of the function. For example, suppose you want to see all of the sales that happened in 2011. If you run the query

select * from sales join time
on sales.the_date = time_by_day.the_date
where to_year(time.the_date) = 2011

then the database will probably do a full scan of the time_by_day table. If instead the time_by_day table had a the_year column, it could do an index scan.

You might say, what's the difference -- the table only has 3000 rows for 10 years, and that's a small amount of memory on a modern system. But database query optimizers can't easily combine functions, whereas they can combine indexes.

Kimball introduced the "star schema" methodology, including time dimension tables, about 15 years ago, largely to deal with the stupidity of database query optimizers. Database query optimizers have improved a lot, but they still need all the help they can get.

If you don't believe me, try some experiments against a reasonable sized database.

Bob Folkerts said...

That is a good point, but one that I can work around in many RDMSs.

In Postgresql, I can index functions of the date column, so I can index any of the columns in the view, as long as the functions are deterministic. In my test query, the query optimizer was able to do an index scan when grouped by a column in the view. As predicted, it did make an improvement in the query time (600 ms without index to 400 ms with index). I must admit that surprised me a bit, since i thought that expensive aggregate was over the fact table. But I can get the same performance with either a table or a view since I have essentially the same indexes.

I have always though of the star schema as a logical design ,more than a technique to improve performance. So thanks for that insight.

Julian Hyde said...

Bob,

Agreed. That approach would probably work pretty well.

But as they say: once you know the rules, you can start breaking them. I'd like everyone to start off with a date dimension table.

Can you please post your DDL for fact table, including the functional indexes? I think people would find it useful.

Julian

Esquire said...

Hi, Julian.

Selecting a whole year having days in dimension column is much more simple than

where to_year(time.the_date) = 2011

You can use

time.the_date between and

If question is in indexing - you will have index range scan, not table fullscans.

So as far as I understand- goal of explicit dimension table for date storage is that you can have bitmap index on dimension column (Oracle case) instead of tree index.

Am I right? may be you know other reason?

Julian Hyde said...

Esquire,

You'd think so. But:

(1) Mondrian isn't smart enough to know that to find the dates between two points, it just needs to a range scan on date_id. (I can't find all customers in San Francisco by using a range of customer_id... how is Mondrian to know that the date dimension is different?)

(2) Database query optimizers and operators sometimes don't work as well on ranges as '=' predicates.

I'm not claiming that other approaches won't work. I just claim that generated date dimensions are the simplest. And then one can move onto solving the harder problems in the schema.

ex said...

Hi Julian,

I checked this feature (latest lagunitas), but it seems that a primary key is not created in the DB. I tried , keyColumn with my custom . The same issue with 'standard' columns (without and ).

Julian Hyde said...

ex,

Please log a jira case.