Julian Hyde on Streaming Data, Open Source OLAP. And stuff.

Table macros

2014-05-05T15:04:00.000-07:00

Table macros are a new Optiq feature (since release 0.6) that combine the efficiency of tables with the flexibility of functions.

Optiq offers a convenient model for presenting data from multiple external sources via a single, efficient SQL interface. Using adapters, you create a schema for each external source, and a table for each data set within a source.

But sometimes the external data source does not consist of a fixed number of data sets, known ahead of time. Consider, for example, Optiq’s web adapter, optiq-web, which makes any HTML table in any web page appear as a SQL table. Today you can create an Optiq model and define within it several tables.

Optiq-web’s home page shows an example where you can create a schema with tables “Cities” and “States” (based on the Wikipedia pages List of states and territories of the United States and List of United States cities by population) and execute a query to find out the proportion of California’s population that lives in cities:

SELECT COUNT(*) "City Count", 
SUM(100 * c."Population" / s."Population") "Pct State Population" 
FROM "Cities" c, "States" s 
WHERE c."State" = s."State" AND s."State" = 'California';

But what if you want to query a URL that isn’t in the schema? A table macro will allow you to do this:

SELECT * FROM TABLE(
web('http://en.wikipedia.org/wiki/List_of_countries_by_population'));

web is a function that returns a table. That is, a Table object, which is the definition of a table. In Optiq, a table definition doesn’t need to be assigned a name and put inside a schema, although most do; this is a free-floating table. A table just needs to be able to describe its columns, and to be able to convert itself to relational algebra. Optiq invokes it while the query is being planned.

Here is the WebTableMacro class:

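import java.util.HashMap;
import java.util.Map;

public class WebTableMacro {
  // Invoked by Optiq while the query is being planned.
  public Table eval(String url) {
    Map operands = new HashMap();
    operands.put("url", url);
    return new WebTable(operands, null);
  }
}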

And here is how you define a WEB function based upon it in your JSON model:

{
  version: '1.0',
  defaultSchema: 'ADHOC',
  schemas: [
    { 
      name: 'ADHOC',
      functions: [
        { 
          name: 'WEB',
          className: 'com.example.WebTableMacro'
        } 
      ]
    }
  ]
}

Table macros are a special kind of table function. They are defined in the same way in the model, and invoked in the same way from a SQL statement. A table function can be used at prepare time if (a) its arguments are constants, and (b) the table it returns implements TranslatableTable. If it fails either of those tests, it will be invoked at runtime; it will still produce results, but will have missed out on the advantages of being part of the query optimization process.

What kind of advantages can the optimization process bring? Suppose a web page that produces a table supports URL parameters to filter on a particular column and sort on another. We could write planner rules that take a FilterRel or SortRel on top of a WebTableScan and convert them into a scan with extra URL parameters. A table that came from the web function would be able to participate in that process.
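To make the idea of such a rule concrete, here is a minimal sketch in plain Java. It deliberately does not use Optiq's planner API: WebScan, Filter and the "filter.<column>=<value>" URL parameter convention are all hypothetical stand-ins, chosen only to show a filter on top of a scan being collapsed into a single scan with an extra URL parameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Sketch only: hypothetical stand-ins, not Optiq classes. */
class WebPushDownSketch {
  /** A scan of the HTML table found at a URL. */
  static final class WebScan {
    final String url;
    WebScan(String url) { this.url = url; }
  }

  /** An equality filter (column = value) sitting on top of a scan. */
  static final class Filter {
    final WebScan input;
    final String column;
    final String value;
    Filter(WebScan input, String column, String value) {
      this.input = input;
      this.column = column;
      this.value = value;
    }
  }

  /** The "rule": replaces Filter(WebScan) with one WebScan whose URL
   *  carries the filter. The parameter naming is an assumption. */
  static WebScan pushFilterIntoScan(Filter filter) {
    String separator = filter.input.url.contains("?") ? "&" : "?";
    return new WebScan(filter.input.url + separator
        + "filter." + filter.column + "=" + encode(filter.value));
  }

  /** URL-encodes a parameter value. */
  private static String encode(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always available
    }
  }
}
```

A real rule would also have to check that the page actually supports filtering on that column, and leave the filter in place when it does not.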

The name ‘table macros’ is inspired by Lisp macros — functions that are invoked at compile time rather than run time. Macros are an extremely powerful feature in Lisp and I hope they will prove to be a powerful addition to SQL. But to SQL users, a more familiar name might be ‘parameterized views’.

Views and table macros are both expanded to relational algebra before the query is optimized. Views are specified in SQL, whereas table macros invoke user code (it takes some logic to handle those parameters). Under the covers, Optiq’s views are implemented using table macros. (They always have been — we’ve only just got around to making table macros a public feature.)

To sum up. Table macros are a powerful new Optiq feature that extends the reach of Optiq to data sources that have not been pre-configured into an Optiq model. They are a generalization of SQL views, and share with views the efficiency of expanding relational expressions at query compilation time, where they can be optimized. Table macros will help bring a SQL interface to yet more forms of data.

Improvements to Optiq's MongoDB adapter

2014-03-19T13:39:00.002-07:00

It’s been a while since I posted to this blog, but I haven’t been idle. Quite the opposite; I’ve been so busy writing code that I haven’t had time to write blog posts. A few months ago I joined Hortonworks, and I’ve been improving Optiq on several fronts, including several releases, adding a cost-based optimizer to Hive and some other initiatives to make Hadoop faster and smarter.

More about those other initiatives shortly. But Optiq’s mission is to improve access to all data, so here I want to talk about improvements to how Optiq accesses data in MongoDB. Optiq can now translate SQL queries to extremely efficient operations inside MongoDB.

MongoDB 2.2 introduced the aggregation framework, which allows you to compose queries as pipelines of operations. They have basically implemented relational algebra, and we wanted to take advantage of this.

As the following table shows, most of those operations map onto Optiq’s relational operators. We can exploit that fact to push SQL query logic down into MongoDB.

MongoDB operator	Optiq operator
$project	ProjectRel
$match	FilterRel
$limit	SortRel.limit
$skip	SortRel.offset
$unwind	-
$group	AggregateRel
$sort	SortRel
$geoNear	-

In the previous iteration of Optiq’s MongoDB adapter, we could push down project, filter and sort operators as $project, $match and $sort. A bug report pointed out that it would be more efficient if we evaluated $match before $project. As I fixed that bug yesterday, I decided to push down limit and offset operations. (In Optiq, these are just attributes of a SortRel; a SortRel sorting on 0 columns can be created if you wish to apply limit or offset without sorting.)
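As an illustration of the mapping, the sketch below renders the three attributes of a sort (sort columns, offset, limit) as pipeline stages. It is not the adapter's code; the class name, the method and the convention of passing a negative number for "absent" are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: renders ORDER BY / OFFSET / FETCH as MongoDB pipeline stages. */
class SortStagesSketch {
  /**
   * @param sortSpec sort document such as "{CDC: -1}", or null when
   *                 sorting on 0 columns
   * @param offset   rows to skip, or a negative number if absent
   * @param fetch    maximum rows to return, or a negative number if absent
   */
  static List<String> stages(String sortSpec, long offset, long fetch) {
    List<String> stages = new ArrayList<>();
    if (sortSpec != null) {
      stages.add("{$sort: " + sortSpec + "}");
    }
    // OFFSET applies before FETCH, so $skip must precede $limit.
    if (offset > 0) {
      stages.add("{$skip: " + offset + "}");
    }
    // MongoDB expects a positive $limit, so FETCH 0 would need
    // separate handling that this sketch leaves out.
    if (fetch >= 0) {
      stages.add("{$limit: " + fetch + "}");
    }
    return stages;
  }
}
```

For example, stages("{CDC: -1}", -1, 5) yields a $sort stage followed by a $limit stage, while stages(null, 10, 5) yields only $skip and $limit, which is the "sorting on 0 columns" case.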

That went well, so I decided to go for the prize: pushing down aggregations. This is a big performance win because the output of a GROUP BY query is often a lot smaller than its input. It is much more efficient for MongoDB to aggregate the data in memory, returning a small result, than to return a large amount of raw data to be aggregated by Optiq.

Now queries involving SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, OFFSET, FETCH (or LIMIT if you prefer the PostgreSQL-style syntax), not to mention sub-queries, can be evaluated in MongoDB. (JOIN, UNION, INTERSECT, MINUS cannot be pushed down because MongoDB does not support those relational operators; Optiq will still evaluate those queries, pushing down as much as it can.)

Let's see some examples of push-down in action.

Given the query:

SELECT state, COUNT(*) AS c
FROM zips
GROUP BY state

Optiq evaluates:

db.zips.aggregate(
{$project: {STATE: '$state'}},
{$group: {_id: '$STATE', C: {$sum: 1}}},
{$project: {STATE: '$_id', C: '$C'}})

and returns

STATE=WV; C=659
STATE=WA; C=484
...

Now let’s add a HAVING clause to find out which states have more than 1,500 zip codes:

SELECT state, COUNT(*) AS c
FROM zips
GROUP BY state
HAVING COUNT(*) > 1500

Optiq adds a $match operator to the previous query's pipeline:

db.zips.aggregate(
{$project: {STATE: '$state'}},
{$group: {_id: '$STATE', C: {$sum: 1}}},
{$project: {STATE: '$_id', C: '$C'}},
{$match: {C: {$gt: 1500}}})

and returns

STATE=NY; C=1596
STATE=TX; C=1676
STATE=CA; C=1523

Now the pièce de résistance. The following query finds the top 5 states in terms of number of cities (and remember that each city can have many zip-codes).

SELECT state, COUNT(DISTINCT city) AS cdc
FROM zips
GROUP BY state
ORDER BY cdc DESC
LIMIT 5

COUNT(DISTINCT x) is difficult to implement because it requires the data to be aggregated twice — once to compute the set of distinct values, and once to count them within each group. For this reason, MongoDB doesn’t implement distinct aggregations. But Optiq translates the query into a pipeline with two $group operators. For good measure, we throw in ORDER BY and LIMIT clauses.

The result is an awe-inspiring pipeline that includes two $group operators (implementing the two phases of aggregation for distinct-count), and finishes with $sort and $limit.

db.zips.aggregate(
{$project: {STATE: '$state', CITY: '$city'}},
{$group: {_id: {STATE: '$STATE', CITY: '$CITY'}}},
{$project: {_id: 0, STATE: '$_id.STATE', CITY: '$_id.CITY'}},
{$group: {_id: '$STATE', CDC: {$sum: {$cond: [ {$eq: ['CITY', null]}, 0, 1]}}}},
{$project: {STATE: '$_id', CDC: '$CDC'}},
{$sort: {CDC: -1}}, {$limit: 5})
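The two phases are easier to see on a handful of in-memory rows. The following Java sketch mirrors the two $group stages: the first pass groups by (state, city) to remove duplicates, and the second groups by state and counts the pairs whose city is not null. It only illustrates the technique; the class and method names are invented for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Sketch only: COUNT(DISTINCT city) GROUP BY state, done in two phases. */
class DistinctCountSketch {
  /** Each row is {state, city}; a city appears once per zip code. */
  static Map<String, Integer> countDistinctCities(List<String[]> rows) {
    // Phase 1: group by (state, city), which eliminates duplicates.
    Set<List<String>> distinctPairs = new LinkedHashSet<>();
    for (String[] row : rows) {
      distinctPairs.add(Arrays.asList(row[0], row[1]));
    }
    // Phase 2: group by state, adding 1 for each non-null city and
    // 0 for a null city (the role played by $cond in the pipeline).
    Map<String, Integer> counts = new TreeMap<>();
    for (List<String> pair : distinctPairs) {
      int increment = pair.get(1) == null ? 0 : 1;
      counts.merge(pair.get(0), increment, Integer::sum);
    }
    return counts;
  }
}
```

Sorting the resulting map by count and keeping the first five entries would correspond to the final $sort and $limit stages.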

I had to jump through some hoops to get this far, because MongoDB’s expression language can be baroque. In one case I had to generate

{$ifNull: [null, 0]}

in order to include the constant 0 in a $project operator. And I was foiled by MongoDB bug SERVER-4589 when trying to access the values inside the zips table's loc column, which contains (latitude, longitude) pairs represented as an array.

In conclusion, Optiq on MongoDB now does a lot of really smart stuff. It can evaluate any SQL query, and push down a lot of that evaluation to be executed efficiently inside MongoDB.

I encourage you to download Optiq and try running some sophisticated SQL queries (including those generated by the OLAP engine I authored, Mondrian).

Efficient SQL queries on MongoDB

2013-06-17T17:15:00.000-07:00

How do you integrate MongoDB with other data in your organization? MongoDB is great for building applications, and it has its own powerful query API, but it's difficult to mash up data between MongoDB and other tools, or to make tools that speak SQL, such as Pentaho Analysis (Mondrian), connect to MongoDB.

Building a SQL interface isn't easy, because MongoDB's data model is such a long way from SQL's model. Here are some of the challenges:

MongoDB doesn't have a schema. Each database has a number of named 'collections', which are the nearest thing to a SQL table, but each row in a collection can have a completely different set of columns.
In MongoDB, data can be nested. Each row consists of a number of fields, and each field can be a scalar value, null, a record, or an array of records.
MongoDB supports a number of relational operations, but doesn't use the same terminology as SQL: the find method supports the equivalent of SELECT and WHERE, while the aggregate method supports the equivalent of SELECT, WHERE, GROUP BY, HAVING and ORDER BY.
For efficiency, it's really important to push as much of the processing as possible down to MongoDB's query engine, without the user having to re-write their SQL.
But MongoDB doesn't support anything equivalent to JOIN.
MongoDB can't access external data.

I decided to tackle this using Optiq. Optiq already has a SQL parser and a powerful query optimizer that is powered by rewrite rules. Building on Optiq's core rules, I can add rules that map tables onto MongoDB collections, and relational operations onto MongoDB's find and aggregate operators.

What I produced is effectively a JDBC driver for MongoDB. Behind it is a hybrid query-processing engine that pushes as much of the query processing as possible down to MongoDB, and does whatever is left (such as joins) in the client.

Let's give it a try. First, install MongoDB, and import MongoDB's zipcode data set:

$ curl -o /tmp/zips.json http://media.mongodb.org/zips.json
$ mongoimport --db test --collection zips --file /tmp/zips.json
Tue Jun  4 16:24:14.190 check 9 29470
Tue Jun  4 16:24:14.469 imported 29470 objects

Log into MongoDB to check it's there:

$ mongo
MongoDB shell version: 2.4.3
connecting to: test
> db.zips.find().limit(3)
{ "city" : "ACMAR", "loc" : [ -86.51557, 33.584132 ], "pop" : 6055, "state" : "AL", "_id" : "35004" }
{ "city" : "ADAMSVILLE", "loc" : [ -86.959727, 33.588437 ], "pop" : 10616, "state" : "AL", "_id" : "35005" }
{ "city" : "ADGER", "loc" : [ -87.167455, 33.434277 ], "pop" : 3205, "state" : "AL", "_id" : "35006" }
> exit
bye

Now let's see the same data via SQL. Download and install Optiq:

$ git clone https://github.com/julianhyde/optiq.git
$ cd optiq
$ mvn install

Optiq comes with a sample model in JSON format, and the sqlline SQL shell. Connect using the mongo-zips-model.json Optiq model, and use sqlline's !tables command to list the available tables.

$ ./sqlline
sqlline> !connect jdbc:optiq:model=mongodb/target/test-classes/mongo-zips-model.json admin admin
Connecting to jdbc:optiq:model=mongodb/target/test-classes/mongo-zips-model.json
Connected to: Optiq (version 0.4.13)
Driver: Optiq JDBC Driver (version 0.4.13)
Autocommit status: true
Transaction isolation: TRANSACTION_REPEATABLE_READ
sqlline> !tables
+------------+--------------+-----------------+---------------+
| TABLE_CAT  | TABLE_SCHEM  |   TABLE_NAME    |  TABLE_TYPE   |
+------------+--------------+-----------------+---------------+
| null       | mongo_raw    | zips            | TABLE         |
| null       | mongo_raw    | system.indexes  | TABLE         |
| null       | mongo        | ZIPS            | VIEW          |
| null       | metadata     | COLUMNS         | SYSTEM_TABLE  |
| null       | metadata     | TABLES          | SYSTEM_TABLE  |
+------------+--------------+-----------------+---------------+

Each collection in MongoDB appears here as a table. There are also the COLUMNS and TABLES system tables provided by Optiq, and a view called ZIPS defined in mongo-zips-model.json.

Let's try a simple query. How many zip codes in America?

sqlline> SELECT count(*) FROM zips;
+---------+
| EXPR$0  |
+---------+
| 29467   |
+---------+
1 row selected (0.746 seconds)

Now a more complex one. How many states have a city called Springfield?

sqlline> SELECT count(DISTINCT state) AS c FROM zips WHERE city = 'SPRINGFIELD';
+-----+
|   C |
+-----+
| 20  |
+-----+
1 row selected (0.549 seconds)
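Because the interface is JDBC, the same query can be issued from a Java program. The sketch below uses only the standard java.sql API with the connection string and credentials from the sqlline session above. It assumes that the Optiq JDBC driver is on the classpath and registers itself with DriverManager, and that MongoDB is running with the zips collection loaded; the class and method names are invented for this example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Sketch only: runs the Springfield query through JDBC. */
class SpringfieldCountSketch {
  static final String QUERY =
      "SELECT count(DISTINCT state) AS c FROM zips WHERE city = 'SPRINGFIELD'";

  /** Builds the connect string used in the sqlline session. */
  static String jdbcUrl(String modelPath) {
    return "jdbc:optiq:model=" + modelPath;
  }

  /** Connects using the given model file and returns the count.
   *  Assumes the Optiq driver is available on the classpath. */
  static int countStates(String modelPath) throws SQLException {
    try (Connection connection =
             DriverManager.getConnection(jdbcUrl(modelPath), "admin", "admin");
         Statement statement = connection.createStatement();
         ResultSet resultSet = statement.executeQuery(QUERY)) {
      resultSet.next(); // an aggregate query without GROUP BY returns one row
      return resultSet.getInt(1);
    }
  }
}
```

Calling countStates("mongodb/target/test-classes/mongo-zips-model.json") from the Optiq source directory should correspond to the sqlline query shown above.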

Let's use the SQL EXPLAIN command to see how the query is implemented.

sqlline> !set outputformat csv
sqlline> EXPLAIN PLAN FOR
. . . .> SELECT count(DISTINCT state) AS c FROM zips WHERE city = 'SPRINGFIELD';

'PLAN'
'EnumerableAggregateRel(group=[{}], C=[COUNT($0)])
  EnumerableAggregateRel(group=[{0}])
    EnumerableCalcRel(expr#0..4=[{inputs}], expr#5=['SPRINGFIELD'], expr#6=[=($t0, $t5)], STATE=[$t3], $condition=[$t6])
      MongoToEnumerableConverter
        MongoTableScan(table=[[mongo_raw, zips]], ops=[[<{city: 1, state: 1, _id: 1}, {$project ...}>]])
'
1 row selected (0.115 seconds)

The last line of the plan shows that Optiq calls MongoDB's find operator asking for the "city", "state" and "_id" fields. The first three lines of the plan show that the filter and aggregation are implemented using Optiq's built-in operators, but we're working on pushing them down to MongoDB.

Finally, quit sqlline.

sqlline> !quit
Closing: net.hydromatic.optiq.jdbc.FactoryJdbc41$OptiqConnectionJdbc41

Optiq and its MongoDB adapter shown here are available on github. If you are interested in writing your own adapter, check out optiq-csv, a sample adapter for Optiq that makes CSV files appear as tables. It has its own tutorial on writing adapters.

Check back at this blog over the next few months, and I'll show how to write views and advanced queries using Optiq, and how to use Optiq's other adapters.

Gathering requirements for olap4j 2.0

2013-06-03T10:35:00.000-07:00

It's time to start thinking about olap4j version 2.0.

My initial goal for olap4j version 1.0 was to decouple application developers from Mondrian's legacy API. We've far surpassed that goal. Many applications are using olap4j to connect to OLAP servers like Microsoft SQL Server Analysis Services, Palo and SAP BW. And projects are leveraging the olap4j-xmlaserver sister project to provide an XMLA interface on their own OLAP server. The need is greater than ever to comply with the latest standards.

The difference between products and APIs is that you can't change APIs without pissing people off. Even if you improve the API, you force the developers of the drivers to implement the improvements, and the users of the API get upset because they don't have their new drivers yet. There are plenty of improvements to make to olap4j, so let's try to do it without pissing too many people off!

Since olap4j version 1.0, there has been a new release of Mondrian (well, 4.0 is not released officially yet, but the metamodel and API are very nearly fully baked) and a new release of SQL Server Analysis Services, the home of the de facto XMLA standard.

Also, the Mondrian team have spun out their XMLA server as a separate project (olap4j-xmlaserver) that can run against any olap4j driver. If this server is to implement the latest XMLA specification, it needs the underlying olap4j driver to give it all the metadata it needs.

Here's an example of the kind of issue that we'd like to fix. In olap4j 1.x, you can't tell whether a hierarchy is a parent-child hierarchy. People have asked for a method

boolean isParentChild();

Inspired by the STRUCTURE attribute of the MDSCHEMA_HIERARCHIES XMLA request, we instead propose to add

enum Structure {
  FULLYBALANCED,
  RAGGEDBALANCED,
  RAGGED,
  NETWORK
}
Structure getStructure();

We can't add this without requiring a new revision of all drivers, but let's be careful to gather all the requirements so we can do it just this once.

Here are my goals for olap4j 2.0:

Support Analysis Services 2012 metamodel and XMLA as of Analysis Services 2012.
Create an enum for each XMLA enum. (Structure, above, is an example.)
Support Mondrian 4.0 metamodel. Many of the new Mondrian features, such as measure groups and attributes, are already in SSAS and XMLA.
Allow user-specified metadata, such as those specified in Mondrian's schema as annotations, to be passed through the olap4j API and XMLA driver.
We'll know that we've done the right thing if we can remove MondrianOlap4jExtra.

I'd also like to maintain backwards compatibility. As I already said, drivers will need to be changed. But any application that worked against olap4j 1.1 should work against olap4j 2.0, and any driver for olap4j 2.0 should also function as an olap4j 1.x driver. That should simplify things for the users.

I'll be gathering a detailed list of API improvements in the olap4j 2.0 specification. If you have ideas for what should be in olap4j version 2.0, now is the time to get involved!

Need help

2013-04-30T18:28:00.002-07:00

I was amused by this note I just received via email.

Subject: Need help
To: jhyde@users.sourceforge.net
From: <retracted>

Respected Sir, we are doing a final year project as Student Data warehouse for BE degree and we came to know to about olap4j at the end of our project , we are presently in unknown way and we are seeking your help, since we are left with only 15 days for project submission, so if we could get any sample application which is built on olap4j ,will help us to understanding in usage of APIs for our project ,since i find it too difficult in usage of APIs and we are out of time , so any help from your side would greatly be appreciated and remembered
Thanks in advance

I get quite a few like these. (I suppose they are a fact of life for any open source developer.)

The spelling in this one is much better than most. And usually the subject line is more like 'Need help, please, please!!!!!'. But I always wonder how anyone who uses punctuation in such an arbitrary way could ever write code that works. Probably the author's supervisor is wondering the same thing.

Optiq latest

2013-03-01T13:29:00.000-08:00

Optiq has been developing steadily over the past few months. Those of you who watch github will know most of this already, but I thought I'd summarize what's been going on.

(This post originally appeared as an email to the optiq-dev mailing list. Since I compose email messages a lot faster than blog posts, and the email message contained a lot of stuff that you'd all find interesting, it made sense to recycle it. Hope you don't mind.)

There are two exciting new projects using Optiq:

I have been working in private with Chris Wensel on a project to use Optiq to provide a SQL interface to Cascading, and last week we announced Lingual.
I am working on a SQL interface for the Apache Drill project.

This week I attended the Strata conference in Santa Clara, and met lots of people who are interested in Optiq for various reasons. There are at least 4 back-end platforms or front-end languages that people would like to see. I can't describe them all here, but watch this space. Some exciting stuff will play out in this forum over the next few months.

One of my personal favorite projects is to get Optiq running on compressed, in-memory tables managed by a JSR-107-compliant cache/data-grid such as ehCache or Infinispan. ArrayTable and CloneSchema are the beginnings of that project. The end result will be a high-performance, distributed, in-memory SQL database... how cool is that? (Certainly, my own Mondrian project will be able to make massive use of it.)

And, some people were asking for the Splunk adapter (the so-called "JDBC driver for Splunk") to be improved. Good to hear that it's proving useful.

Now regarding the code.

One person noted that "mvn clean install" should just work for any maven-based project, and it doesn't. He's right. It should. I fixed it. Now it does.

I made some breaking API changes last week, so I upped the version to 0.2.

Expect the version numbers to continue to move erratically, because in our current development mode, it doesn't seem to make sense to have specific milestones. We're basically working on general stability rather than a few big features. We are trying to maintain backwards compatibility, but if we need to change API, we'll do it. I'll help dependent projects such as Lingual and Drill migrate to the new API, and make it as easy as possible for the rest of you.

Over the last week I've been working on the code generation that powers SQL scalar expressions and built-in functions. This code generation is, obviously, used by the Java provider, but it can also be used by other providers. For instance, Lingual generates Java strings for filters that it passes to Janino. I've been working on OptiqSqlOperatorTest to get more and more of the built-in SQL functions to pass, and I've added OptiqAssert.planContains so that we can add tests to make sure that the minutiae of Java code generation are as efficient as possible.

I still need to tell you about the extensions I've been making to Optiq SQL to support Drill (but useful to any project that wants to use nested data or late-binding schemas), but that will have to wait for its own blog post. Watch this space.

Announcing Lingual

2013-02-26T15:31:00.001-08:00

The last few months, I've been collaborating on a project with Chris Wensel, the author of Cascading. Last week we announced Lingual, an open source project that puts a SQL interface on top of Cascading.

Architecturally, Lingual combines the Cascading engine with my own Optiq framework. Optiq provides the SQL interface (including JDBC) and reads table and column definitions from Cascading's metadata store, and a few custom Optiq rules map relational operations (project, filter, join and so forth) onto a Cascading operator graph. The queries are executed, on top of Hadoop, using Cascading's existing runtime engine.

Not everyone has heard of Cascading, so let me explain what it is, and why I think it fits well with Optiq. Cascading is a Java API for defining data flows. You write a Java program to build data flows using constructs such as pipes, filters, and grouping operators; Cascading converts that data flow to a MapReduce job, and runs it on Hadoop. Cascading was established early, picked the right level of abstraction to be simple and useful, and has grown to industry strength as it matured.

As a result, companies who are doing really serious things with Hadoop often use Cascading. Some of the very smartest Hadoop users are easily smart enough to have built their own Hadoop query language, but they did something even smarter — they layered DSLs such as Scalding and Cascalog on top of Cascading. In a sense, Optiq-powered SQL is just another DSL for Cascading. I'm proud to be in such illustrious company.

Newbies always ask, "What is Hadoop?" and then a few moments later, "Is Hadoop a database?". (The answer to the second question is easy. Many people would love Hadoop to be an "open source Teradata", but wanting it doesn't make it so. No Virginia, Hadoop is not a database.)

A cottage industry has sprung up of bad analogies for Hadoop, so forgive me if I make another one: Hadoop is, in some sense, an operating system for the compute cluster. After mainframes, minicomputers, and PCs, the next generation of hardware is the compute cluster. Hadoop is the OS, and MapReduce is the assembly language for that hardware — all powerful, but difficult to write and debug. UNIX came about to serve the then-new minicomputers, and crucial to its success was the C programming language. C allowed developers to be productive while writing code almost as efficient as assembler, and it allowed UNIX to move beyond its original PDP-7 hardware.

Cascading is the C of the Hadoop ecosystem. Sparse, elegantly composable, low-level enough to get the job done, but it abstracts away the nasty stuff unless you really want to roll up your sleeves.

It makes a lot of sense to put SQL on top of Cascading. There has been a lot of buzz recently about SQL on Hadoop, but we're not getting caught up in the hype. We are not claiming that Lingual will give speed-of-thought response times (Hadoop isn't a database, remember?), nor that it will make advanced predictive analytics easy to write (Lingual is not magic). But Hadoop is really good at storing, processing, cleaning and exporting data at immense scale. Lingual brings that good stuff to a new audience.

A large part of that SQL-speaking audience is machines. I'd guess that 80% of the world's SQL statements are generated by tools. Machine-generated SQL is pretty dumb, so it is essential that you have an optimizer. (As author of a tool that speaks SQL, Mondrian, and of several SQL engines, namely Broadbase, LucidDB and SQLstream, I have been on both sides of this problem.) Once you have an optimizer, you can start doing clever stuff like re-organizing your data to make the queries run faster. Maybe the optimizer will even help.

Lingual is not a "SQL-like language". Because it is based on Optiq, Lingual is a mature implementation of ANSI/ISO standard SQL. This is especially important for those SQL-generating tools, which cannot rephrase a query to work around a bug or missing feature. As part of our test suite, we ran Mondrian on PostgreSQL, and captured the SQL queries it issued and the results the database gave. Then we replayed those queries — over 6,200 of them — to Lingual and checked that Lingual gave the same results. (By the way, putting Optiq and Cascading together was surprisingly easy. The biggest challenge we had was removing the Postgres-isms from thousands of generated queries.)
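The capture-and-replay approach described above can be sketched in a few lines of Java. This is not the Lingual test suite. It assumes that each captured query and its recorded result have been reduced to strings, and that the engine under test is available as a function from query text to result text; all names here are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch only: replays recorded queries and reports mismatches. */
class ReplaySketch {
  /**
   * @param recorded captured query text mapped to the result the
   *                 reference database gave for it
   * @param engine   the engine under test, as query text to result text
   * @return the queries whose replayed result differs from the recording
   */
  static List<String> replay(Map<String, String> recorded,
      Function<String, String> engine) {
    List<String> mismatches = new ArrayList<>();
    for (Map.Entry<String, String> entry : recorded.entrySet()) {
      String actual = engine.apply(entry.getKey());
      if (!entry.getValue().equals(actual)) {
        mismatches.add(entry.getKey());
      }
    }
    return mismatches;
  }
}
```

An empty list means the engine reproduced every recorded result. A real harness would also need to normalize row order for queries that have no ORDER BY.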

Lingual is not the only thing I've been working on. (You can tell when I'm busy by the deafening silence on this blog.) I've also been working on Apache Drill, using Optiq to extend SQL for JSON-shaped data, and I'll blog about this shortly. Also, as Optiq is integrated with more data formats and engines, the number of possibilities increases. If you happen to be at Strata conference tomorrow (Wednesday), drop me a line on twitter and we can meet up and discuss. Probably in the bar.

Explaining holidays to a 3 year old

2013-01-01T00:02:00.001-08:00

Your birthday involves presents and cake. The day after your birthday you are allowed to say "I am x", where x is the number of candles on your cake.

4th of July is the birthday of our country, which is called America. America has decided that it doesn't want presents or a cake, just fireworks.

Thanksgiving is a big meal with your family. We eat a turkey, which is basically a very large chicken.

Christmas celebrates the first birthday, a very long time ago, of a baby who was very special, a bit like a king but without a palace or anything, who grew up to be very, very nice indeed, so nice that he made everyone else a bit nicer. And there were shepherds and animals. Also, kind of like a birthday for everyone, because everyone gets presents, although giving them is the important part. Also, we have an indoor tree with lights and things on it. Also, we have a meal exactly like Thanksgiving.

New Year. We all need new calendars tomorrow. Train calendars are best.

Happy New Year, everyone. I hope you get the train calendar your heart desires.

Mondrian in Action in Action

2012-10-31T16:05:00.000-07:00

I couldn't resist. When the Mondrian in Action book is published in the spring, it will look something like this...

Artist's conception of the upcoming bestseller, "Mondrian in Action"

Mondrian in Action

2012-10-31T12:35:00.000-07:00

I am delighted to announce an upcoming book all about Mondrian, called "Mondrian in Action". Some chapters are available in electronic form now, and the final print version is scheduled to hit the shelves in Spring, 2013. For one day only, there is a 50% discount if you pre-order the book and join the early-access program.

Whoever heard of a successful open source project that didn't have a book? Mondrian has become successful without one... but our long-suffering users have had to piece together documentation from the scrappy online documentation, forum posts and mailing list archives. For years, my answer has been, "A book is a great idea, but I'm just too busy writing software!" Finally, I've teamed up with two Mondrian and Pentaho experts, Bill Back (@billbackbi) and Nicholas Goodman (@nagoodman), and we've set out to create the definitive guide. Now all the information will be in one place, right there on your desk between your keyboard and your coffee mug.

Mondrian in Action serves several audiences. It explains to end-users and CIOs how Mondrian analytics can unlock the value in business data. For schema developers and DBAs, it describes in depth how to create and administer a Mondrian system. Mondrian in Action covers the upcoming Mondrian 4 release, and includes chapters on security, multi-tenancy, and integration with other tools such as the Pentaho Business Analytics suite and Saiku, advanced analytics and visualizations, and integration with Big Data technologies.

The book is part of the acclaimed "In Action" series from Manning Publications. Books in this series are known for their direct approach to the subject, and concrete, practical examples. For countless open source projects, the definitive guide is an "In Action" book.

Speaking of open source, the Manning Early-Access Program (MEAP) is a 'release early, release often' process. This helps the community shape the book while it is being written. One chapter of the book is available to all. If you pre-order the book, you will get electronic access to draft chapters as they are completed, and a print copy when it is released. There is a forum where you can post questions and give feedback on the book.

Manning are offering a 50% discount for one day only. (It expires at midnight, eastern time, on November 1st.) Go to the publisher's site, http://www.manning.com/back, and enter discount code dotd1101au.

Pesky quoted identifiers in SQL

2012-06-10T14:35:00.000-07:00

The SQL that Mondrian generates has, until now, been different from the SQL that most people would write by hand. Most people don't use spaces or punctuation in table and column names, and don't enclose identifiers in quotation marks when writing SQL DDL, DML or queries. Mondrian, on the other hand, religiously quotes every identifier, whether it needs it or not.

The two styles are not compatible because on many databases (Oracle is one example) unquoted identifiers are implicitly converted to upper-case. If you use lower-case table and column names in Mondrian's schema, they will not match the upper-case identifiers created during DDL.

For instance, if you create a table in Oracle using

CREATE TABLE emp ( empno INTEGER, ename VARCHAR2(30), deptno INTEGER);

then Oracle creates a table called EMP with columns EMPNO, ENAME and DEPTNO. When you query it using

SELECT ename FROM emp WHERE deptno = 20;

the effect is as if you had written

SELECT ENAME FROM EMP WHERE DEPTNO = 20;

Now, if you'd told Mondrian that the table was called "emp", Mondrian tries to be helpful. It generates the query

SELECT "ename" FROM "emp" WHERE "deptno" = 20;

Of course, there is no table called "emp", only one called "EMP", so on case-sensitive databases such as Oracle this causes an error. You then need to go back to your schema and change

<Table name="emp"/>

to

<Table name="EMP"/>

and all other table and column names in your schema. Yuck!

There is now a simpler way. The Schema XML element has a quoteSql attribute:

<Schema name='FoodMart' metamodelVersion='4.0' quoteSql='false'>

If you set quoteSql='false', Mondrian will not quote identifiers when generating SQL. (Actually, it will still quote them if they contain spaces and such. But we recommend that if you use quoteSql='false', you use sensible table names containing only alphanumeric characters and '_'.)
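To make the rule concrete, here is a minimal sketch in Java of the decision described above. It is not Mondrian's actual code: the class IdentifierQuoter and its methods are hypothetical, and the test for a "sensible" name (letters, digits and '_', not starting with a digit) is my assumption about what "spaces and such" covers. It ignores reserved words.

```java
/** Hypothetical helper; illustrates the quoteSql rule, not Mondrian's real code. */
final class IdentifierQuoter {
  private final boolean quoteSql;

  IdentifierQuoter(boolean quoteSql) {
    this.quoteSql = quoteSql;
  }

  /** Returns the identifier as it would appear in generated SQL. */
  String quote(String identifier) {
    if (quoteSql || !isSimple(identifier)) {
      // Double any embedded quote characters, as the SQL standard requires.
      return '"' + identifier.replace("\"", "\"\"") + '"';
    }
    return identifier;
  }

  /** Assumption: letters, digits and '_' only, not starting with a digit. */
  private static boolean isSimple(String identifier) {
    return identifier.matches("[A-Za-z_][A-Za-z0-9_]*");
  }
}
```

With quoteSql set to false, "emp" is emitted unquoted (so Oracle folds it to EMP), while a name such as "sales fact" is still quoted.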

More details can be found in MONDRIAN-887. It is only fixed in the lagunitas branch (i.e., mondrian-4.0 alpha), and only in new-style schemas (not mondrian-3 style schemas automatically upgraded). Give it a try and let me know how it works for you.

From the ashes of the database revolution...

2012-06-07T04:45:00.001-07:00

With NoSQL and Hadoop, the database world has undergone a revolution. The fighting reached its peak a couple of years ago, but things have calmed down since, and now is a good time to take stock of old and new style data management technologies.

From this revolution, we can learn a lot about what databases should, and should not be. At the end of this post, I propose a system, called Optiq, that would restore to NoSQL/Hadoop systems some of the good features of databases.

Learning from history

Revolutions tend to follow patterns. George Orwell allegorized the progress of the Russian Revolution in his novel Animal Farm. He described the injustices that were the trigger for the revolution, the new egalitarian value system established after the revolution, and the eventual corruption of those values. Revolutions are an opportunity to introduce new ideas, not all of them good ones. For example, the French Revolution put in place a decimal system, and though they kept the kilogramme and the metre, they were forced to quickly relinquish the 10 hour day and the 10 day week when the workers discovered that they'd been conned out of 30% of their weekend time.

We see all of these forces at play in the database revolution. The triggers for the revolution were the requirements that traditional RDBMSs could not meet (or were not meeting at the time). The revolution adopted a new paradigm, introduced some new ideas, and threw out some old ideas. I am interested in which of those old ideas should be reinstated under the new regime.

I am a database guy. I was initially skeptical about the need for a revolution, but a couple of years ago I saw that Hadoop and NoSQL were gaining traction and momentum, had some good ideas, and were here to stay. My prediction is that traditional and new data management systems will grow more and more similar in appearance over the next 5-10 years. Traditional RDBMSs will adopt some of the new ideas, and the new systems will support features that make them palatable to customers accustomed to using traditional databases.

But first, some terminology. (As George Orwell would agree, what you call something is almost as important as what it is.)

I call the new breed of systems "data management systems", not "databases". The shift implies something less centralized, more distributed, and about processing as well as just storing and querying data. Or maybe I'm confusing terminology with substance.
I distinguish NoSQL systems from Hadoop, because Hadoop is not a data management system. Hadoop is a substrate upon which many awesome things can, and will, be built, including ETL and data management systems.
NoSQL systems are indeed databases, but they throw out several of the key assumptions of traditional databases.
I'm steering clear of the term "Big data" for reasons I've already made clear.

The good stuff

In the spirit of post-revolutionary goodwill, let's steer clear of our pet gripes and list out what is best about the old and new systems.

Good features from databases

SQL language allows integration with other components, especially components that generate queries and need to work on multiple back-ends.
Management of redundant data (such as indexes and materialized views), and physically advantageous data layout (sorted data, clustered data, partitioned tables)
ACID transactions
High-performance implementations of relational operators
Explicit schema, leading to concise, efficient queries

Good features from Hadoop/NoSQL systems

Easy scale-out on commodity hardware
Non-relational data
User-defined and non-relational operators
Data is not constrained by schema

Scale and transactions

Scale-out is key. The new data systems all run at immense scale. If traditional databases scaled easily and cheaply, the revolution would probably not have happened.

There are strong arguments for and against supporting ACID transactions. Everyone agrees that transactions have high value: without them, it is more difficult to write bug-free applications. But the revolutionaries assert that ACID transactions have to go, because it is impossible to implement them efficiently. Newer research suggests that there are ways to implement transactions at acceptable cost.

In my opinion, transactions are not the main issue, but are being scapegoated because of the underlying problem of scalability. We would not be having the debate — indeed, the whole NoSQL movement may not have occurred — if conventional databases had been able to scale as their users wanted.

To be honest, I don't have a lot of skin in this game. As an analytic database technology, Optiq is concerned more with scalability than transactions. But it's interesting that transactions, like the SQL language, were at first declared to be enemies of the revolution, and are now being rehabilitated.

Schema

Relational databases require a fixed schema. If your data has a schema, and the schema does not change over time, this is a good thing. Your queries can be more concise because you are not defining the same fields, types, and relationships every time you write a query.

Hadoop data does not have a schema (although you can impose one, after the event, using tools such as Pig and Hive).

The ideal would seem to be that you can provide a schema if the data conforms to a fixed format, provide a loose schema if, say, records have variable numbers of fields, or operate without one. In Hadoop, as in ETL tools, data is schema-less in early stages of a pipeline, stronger typing is applied in later stages of the pipeline, as fields are parsed and assigned names, and records that do not conform to the required schema are eliminated.
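As an illustration of that progression, here is a small Java sketch of a late pipeline stage that applies a schema to raw, schema-less lines and drops the records that do not conform. The class SchemaOnRead, the Emp record shape (empno INTEGER, name VARCHAR) and the comma-separated format are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pipeline stage: raw lines in, typed records out. */
final class SchemaOnRead {
  /** A typed record, as seen by later stages of the pipeline. */
  static final class Emp {
    final int empno;
    final String name;

    Emp(int empno, String name) {
      this.empno = empno;
      this.name = name;
    }
  }

  /** Parses each line into an Emp; lines that do not fit are eliminated. */
  static List<Emp> applySchema(List<String> rawLines) {
    List<Emp> result = new ArrayList<>();
    for (String line : rawLines) {
      String[] fields = line.split(",", -1);
      if (fields.length != 2) {
        continue; // wrong number of fields
      }
      try {
        int empno = Integer.parseInt(fields[0].trim());
        result.add(new Emp(empno, fields[1].trim()));
      } catch (NumberFormatException e) {
        // empno is not an integer: the record does not conform, so skip it
      }
    }
    return result;
  }
}
```

Earlier stages can keep passing the raw strings around; only the stage that needs named, typed fields pays the cost of parsing.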

Location, control and organization of data

Traditional databases own their storage. Their data resides in files, sometimes entire file systems, that can only be accessed by the database. This allows the database to tightly control the access to and organization of the data. But it means that the data cannot be shared between systems, even between databases made by the same vendor.

Hadoop clusters are expensive, but if several applications share the same cluster, the utilization is kept high, and the cost is spread across more departments' budgets. Applications may share data sets, not just processing resources, but they access the data in place. (That place may or may not be HDFS.) Compared to copying data into an RDBMS, sharing data saves both time and money.

Lastly, the assumption that data is shared encourages applications to use straightforward formats for their data. A wide variety of applications can read the data, even those not envisioned when the data format was designed.

SQL and query planning

SQL is the hallmark of an RDBMS (at least for those of us too young to remember QUEL). SQL is complicated to implement, so upstart open source projects, in their quest to implement the simplest thing that could possibly work, have been inclined to make do with less powerful "SQL-like" languages. Those languages tend to disappoint when it comes to interoperability and predictability.

But I contend that SQL support is a consequence of a solid data management architecture, not the end in itself. A data management system needs to accept new data structures and organizations, and apply them without rewriting application code. It therefore needs a query planner. A query planner, in turn, requires a metadata catalog and a theoretically well-behaved logical language, usually based on relational algebra, for representing queries. Once you have built these pieces, it is not a great leap to add SQL support.

The one area where SQL support is essential is tool integration. Tools, unless written for that specific database, want to generate SQL as close to the SQL standard as possible. (I speak from personal experience, having written Mondrian dialects for more than a dozen "standards compliant" databases.) Computer-generated SQL is not very smart (for example, you will often see trivial conditions like "WHERE 1 = 1" and duplicate expressions in the SELECT clause) and therefore needs to be optimized.

Flat relational data

There is no question that (so-called "flat") relational data is easier for the database to manage. And, we are told, Ted Codd decreed forty years ago that relational data is all we should ever want. Yet I think that database users deserve better.

Codd's rules about normalization have been used to justify a religious war, but I think his point was this. If you maintain multiple copies of the same information, you'll get into trouble when you try to update it. One particular, and insidious, form of redundant information is the implicit information in the ordered or nested data.

That said, we're grown ups. We know that there are risks to redundancy, but there are significant benefits. The risks are reduced if the DBMS helps you manage that redundancy (what are indexes, anyway?), and the benefits are greater if your database is read much more often than it is updated. Why should the database not return record sets with line-items nested inside their parent orders, if that's what the application wants? No reason that I can think of.
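For instance, turning a flat result into nested records is only a few lines of Java. This is a sketch under assumptions: the classes NestedOrders, Order and LineItem are invented for the example and are not part of any system discussed here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical example of returning line-items nested inside their orders. */
final class NestedOrders {
  /** One row of the flat (relational) result. */
  static final class LineItem {
    final int orderId;
    final String product;

    LineItem(int orderId, String product) {
      this.orderId = orderId;
      this.product = product;
    }
  }

  /** A parent record that carries its line-items as a nested collection. */
  static final class Order {
    final int orderId;
    final List<LineItem> items = new ArrayList<>();

    Order(int orderId) {
      this.orderId = orderId;
    }
  }

  /** Groups flat rows by order, preserving the order of first appearance. */
  static List<Order> nest(List<LineItem> flat) {
    Map<Integer, Order> byId = new LinkedHashMap<>();
    for (LineItem item : flat) {
      byId.computeIfAbsent(item.orderId, id -> new Order(id)).items.add(item);
    }
    return new ArrayList<>(byId.values());
  }
}
```

The point is not that this is hard to write, but that every application ends up writing it if the data management system will not.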

In summary, a data management system should allow "non-flat" data, and operations on that data, while keeping a semantics based, as far as possible, on the relational algebra.

Introducing Optiq

Optiq aims to add the "good ideas" from traditional databases onto a new-style Hadoop or NoSQL architecture.

To a client application, Optiq appears to be a database that speaks SQL and JDBC, but Optiq is not a database. Whereas a database controls storage, processing, resource allocation and scheduling, Optiq cedes these powers to the back-end systems, which we call data providers.

Optiq is not a whole data management system. It is a framework that can mediate with one or more data management systems. (Optiq could be configured and distributed with a scheduler, metadata layer, data structures, and algorithms, so that it comes out of the box looking like a database. In fact, we hope and expect that some people will use it that way. But that is not the only way it is intended to be used.)

The core of the framework is the extensible query planner. It allows providers to specify their own type systems, operators, and optimizations (for example, switching to a materialized view, or eliminating a sort if the underlying file is already sorted). It also allows applications to define their own functions and operators, so that their application logic can run in the query-processing fabric.
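To give a feel for what a provider-supplied optimization might look like, here is a toy Java sketch of the "eliminate a sort if the file is already sorted" case. It does not use Optiq's real planner API; the types Rel, Scan, Sort and Rule are hypothetical stand-ins invented for this illustration.

```java
import java.util.Optional;

/** Toy planner rule; all types here are hypothetical, not Optiq's API. */
final class SortEliminationSketch {
  /** A node in a tree of relational operators. */
  interface Rel {}

  /** Scan of a file; sortedBy is the column the file is sorted on, or null. */
  static final class Scan implements Rel {
    final String file;
    final String sortedBy;

    Scan(String file, String sortedBy) {
      this.file = file;
      this.sortedBy = sortedBy;
    }
  }

  /** Sorts its input on a single column. */
  static final class Sort implements Rel {
    final Rel input;
    final String column;

    Sort(Rel input, String column) {
      this.input = input;
      this.column = column;
    }
  }

  /** A rule returns an equivalent, cheaper tree, or empty if it does not apply. */
  interface Rule {
    Optional<Rel> apply(Rel rel);
  }

  /** Removes a Sort whose input is a Scan already sorted on the same column. */
  static final Rule REMOVE_REDUNDANT_SORT = rel -> {
    if (rel instanceof Sort) {
      Sort sort = (Sort) rel;
      if (sort.input instanceof Scan
          && sort.column.equals(((Scan) sort.input).sortedBy)) {
        return Optional.of(sort.input);
      }
    }
    return Optional.empty();
  };
}
```

A real planner would match patterns more generally and weigh alternatives by cost, but the shape is the same: a rule looks at a fragment of the operator tree and proposes an equivalent one.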

An example

You might describe Optiq as a database with the hood open, accessible to anyone who wants to tinker with the engine. Here is a simple example:

Class.forName("net.hydromatic.optiq.jdbc.Driver");
Connection connection = DriverManager.getConnection("jdbc:optiq:");
OptiqConnection optiqConnection =
    connection.unwrap(OptiqConnection.class);
JavaTypeFactory typeFactory = optiqConnection.getTypeFactory();
optiqConnection.getRootSchema().add(
    "HR", new CsvSchema("/var/flatfiles/hr", typeFactory));
ResultSet resultSet = connection.createStatement().executeQuery(
    "SELECT e.name, e.sal, d.name AS department\n"
    + "FROM hr.emps AS e, hr.depts AS d\n"
    + "WHERE e.deptno = d.deptno\n"
    + "ORDER BY e.empno");
while (resultSet.next()) {
  System.out.println(
      "emp=" + resultSet.getString(1)
      + ", sal=" + resultSet.getInt(2)
      + ", department=" + resultSet.getString(3));
}
resultSet.close();

The program requires a directory, /var/flatfiles/hr, containing the files EMPS.csv and DEPTS.csv. Each file has a header record describing the fields, followed by several records of data.

There is no other data or metadata, and in fact CsvSchema is an extension, not a built-in part of the system.

When the connection is opened, the virtual database is empty. There are no tables, nor even any schemas. The getRootSchema().add( ... ) call registers a schema with a given name. It is like mounting a file-system.

Once the CsvSchema is registered with the connection with the name "HR", Optiq can retrieve the table and column metadata to parse and optimize the query. When the query is executed, Optiq calls CsvSchema's implementations of linq4j's Enumerable interface to get the contents of each table, applies built-in Java operators to join and sort the records, and returns the results through the usual JDBC ResultSet interface.

This example shows that Optiq contains a full SQL parser, planner and implementations of query operators, but it makes so few assumptions about the form of data and location of metadata that you can drop in a new storage plugin in a few lines of code.

Design principles

The design of the Optiq framework is guided by the following principles.

Do not try to control the data, but if you know about the data organization, leverage it.
Do not require a schema, but if you know about the shape of the data, leverage it.
Provide the SQL query language and JDBC interface, but allow other languages/interfaces.
Support linq4j as a backend, but allow other protocols.
Delegate policy to the data providers.

Let's see how Optiq brings the "good ideas" of databases to a NoSQL/Hadoop provider.

Applying these principles to schemas, Optiq can operate with no, partial, or full schema. Data providers can determine their own type system, but are generally expected to be able to operate on records of any type: that may be a single string or binary field, and may contain nested collections of records. Since Optiq does not control the data, if operating on a schema-less provider like Hadoop, Optiq would apply its schema to already loaded data, as Pig and Hive do. If Optiq is assured that the data is clean (for example, a particular field is always an integer) then it may be able to optimize.

Optiq's type system allows records to contain nested records, and provides operators to construct and destruct nested collections. Whereas SQL/JDBC queries do not stretch the type system, linq4j gives Optiq a workout: it needs to support the Java type system and operations such as selectMany and groupBy that operate on collection types.

Lastly, on breaking down the rigid boundary between database and application code.

My goal in data-oriented programming is to allow applications, queries, and extension functions and operators to be written in the same language — and if possible using the same programming model, and on the same page of code — and distributed to where query processing is taking place.

The paradigms should be the same, as far as possible. (MapReduce fails this test. Even though MapReduce is Java, one would not choose to write algorithms in this way if there was not the payoff of a massively scalable, fault-tolerant execution infrastructure. Scalding is an example of a DSL that succeeds in making queries fairly similar to "ordinary programming".)

That said, Optiq is not going to fully solve this problem. It will be a research area for years to come. LINQ made a good start. Optiq has a query planner, and is open and extensible for front-end query languages, user-defined operators, and user-defined rules. Those tools should allow us to efficiently and intelligently push user code into the fabric of the query-processing system.

Conclusion

Optiq attempts to create a high-level abstraction on top of Hadoop/NoSQL systems that behaves like a database but does not dilute the strengths of the data provider. But it brings in only those features of databases necessary to create that abstraction; it is a framework, not a database.

Watch this space for further blog posts and code. Or catch me at Hadoop Summit next week and ask me for a demo.

A first look at linq4j

2012-04-23T20:47:00.000-07:00

This is a sneak peek of an exciting new data management technology. linq4j (short for "Language-Integrated Query for Java") is inspired by Microsoft's LINQ technology, previously only available on the .NET platform, and adapted for Java. (It also builds upon ideas I had in my earlier Saffron project.)

I launched the linq4j project less than a week ago, but already you can do select, filter, join and groupBy operations on in-memory and SQL data.

In this demo, I write and execute sample code against the working system, and explain the differences between the key interfaces Iterable, Enumerable, and Queryable.

For those of you who want to get a closer look at the real code, here's one of the queries shown in the demo:

DatabaseProvider provider =
    new DatabaseProvider(Helper.MYSQL_DATA_SOURCE);
provider.emps
    .where(
        new Predicate1<Employee>() {
          public boolean apply(Employee v1) {
            return v1.manager;
          }
        })
    .join(
        provider.depts,
        new Function1<Employee, Integer>() {
          public Integer apply(Employee a0) {
            return a0.deptno;
          }
        },
        new Function1<Department, Integer>() {
          public Integer apply(Department a0) {
            return a0.deptno;
          }
        },
        new Function2<Employee, Department, String>() {
          public String apply(Employee v1, Department v2) {
            return v1.name + " works in " + v2.name;
          }
        }
    )
    .foreach(
        new Function1<String, Void>() {
          public Void apply(String a0) {
            System.out.println(a0);
            return null;
          }
        }
    );

and here is its (not yet implemented) sugared syntax:

List<String> strings =
    from emp in provider.emps,
    join dept in provider.depts on emp.deptno == dept.deptno
    where emp.manager
    orderBy emp.name
    select emp.name + " works in " + dept.name;

For more information, visit the linq4j project's home page.

Data-oriented programming for the rest of us

2012-04-13T02:24:00.000-07:00

I have been a fan of LINQ for several years (my Saffron project covered many of the same themes) but I've had difficulty explaining why it isn't just a better Hibernate. In his article “Why LINQ Matters: Cloud Composability Guaranteed” (initially in ACM Queue, now in April's CACM), Brian Beckman puts his finger on it.

The idea is composability.

He writes:

Encoding and transmitting such trees of operators across tiers of a distributed system have many specific benefits, most notably:

Bandwidth savings from injecting filters closer to producers of data and streams, avoiding transmission of unwanted data back to consumers.

Computational efficiency from performing calculations in the cloud, where available computing power is much greater than in clients.

Programmability from offering generic transform and filter services to data consumers, avoiding the need for clairvoyant precanning of queries and data models at data-producer sites.

Databases have been doing this kind of stuff for years. There is a large performance difference between stored and in-memory data, and often several ways to access it, so the designers of the first databases took the decision about which algorithm to use out of the hands of the programmer. They created a query language out of a few theoretically well-behaved (and, not coincidentally, composable) logical operators, a set of composable physical operators to implement them, and a query planner to convert from one to the other. (Some call this component a “query optimizer”, but I prefer the more modest term.) Once the query planner was in place, they could re-organize not only the algorithms, but also the physical layout of the data (such as indexes and clustered tables) and the physical layout of the system (SMP and shared-nothing databases).

These days, there are plenty of other programming tasks that can benefit from the intervention of a planner that understands the algorithm. The data does not necessarily reside in a database (indeed, may not live on disk at all), but needs to be processed on a distributed system, connected by network links of varying latency, by multi-core machines with lots of memory.

What problems benefit from this approach? Problems whose runtime systems are complex, and where the decisions involve large factors. For example, “Is it worth writing my data to a network connection, which has 10,000x the latency of memory, if this will allow me to use 1000x more CPUs to process it?”. Yes, there are a lot of problems like that these days.
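The question can be turned into a back-of-envelope calculation. The Java sketch below uses a deliberately crude per-record cost model of my own (cost of fetching the record plus cost of computing on it); the class name, the parameters and the model itself are assumptions for illustration, not anything a real planner is claimed to use.

```java
/** Crude, hypothetical cost model for the "ship it to the cluster?" decision. */
final class DistributionCostSketch {
  /**
   * Compares processing one record locally against shipping it elsewhere.
   * Costs are in one arbitrary unit (say, one memory access = 1).
   */
  static boolean worthDistributing(
      double memoryAccessCost,
      double computeCost,
      double networkLatencyFactor,
      double cpuFactor) {
    // Local: read from memory, compute on one CPU.
    double local = memoryAccessCost + computeCost;
    // Remote: pay the network penalty, but divide the compute across CPUs.
    double remote = memoryAccessCost * networkLatencyFactor
        + computeCost / cpuFactor;
    return remote < local;
  }
}
```

With the factors in the question (10,000x latency, 1000x CPUs), this model says distribution pays off only once the computation per record costs roughly 10,000 times a memory access. The break-even point moves by orders of magnitude as the factors change, which is why the decision is better left to a planner than hard-coded.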

Composability

Beckman's shout-out to composability is remarkable because it is something the database and programming language communities can agree on. But though they may agree about the virtues of composability, they took it in different directions. The database community discovered composability years ago, but then set their query language into stone, so you couldn't add any more operators. Beckman is advocating writing programs using composable operators, but does not provide a framework for optimizing those operator trees.

LINQ stands for “Language-INtegrated Query”, but for these purposes, the important thing about LINQ is not that it is “language integrated”. It really doesn't matter whether the front end to a LINQ system uses a “select”, “where” and “from” operator reminiscent of SQL:

var results = from c in SomeCollection
              where c.SomeProperty < 10
              select new {c.SomeProperty, c.OtherProperty};

or higher-order operators on collections:

var results =
     SomeCollection
        .Where(c => c.SomeProperty < 10)
        .Select(c => new {c.SomeProperty, c.OtherProperty});

or actual SQL embedded in JDBC:

ResultSet results = statement.executeQuery(
    "SELECT SomeProperty, OtherProperty\n"
      + "FROM SomeCollection\n"
      + "WHERE SomeProperty < 10");

All of the above formulations are equivalent, and each can be converted into the same intermediate form, a tree of operators.
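To show what such a tree might look like, here is a minimal Java sketch of the intermediate form for the query above. The classes Scan, Filter and Project are hypothetical and much simpler than the operator classes of LINQ or of any real planner; conditions and field lists are kept as plain strings for brevity.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical operator tree for the example query. */
final class OperatorTreeSketch {
  /** A node in the tree of operators. */
  interface Rel {}

  static final class Scan implements Rel {
    final String collection;

    Scan(String collection) {
      this.collection = collection;
    }

    @Override public String toString() {
      return "Scan(" + collection + ")";
    }
  }

  static final class Filter implements Rel {
    final Rel input;
    final String condition;

    Filter(Rel input, String condition) {
      this.input = input;
      this.condition = condition;
    }

    @Override public String toString() {
      return "Filter(" + condition + ", " + input + ")";
    }
  }

  static final class Project implements Rel {
    final Rel input;
    final List<String> fields;

    Project(Rel input, List<String> fields) {
      this.input = input;
      this.fields = fields;
    }

    @Override public String toString() {
      return "Project(" + fields + ", " + input + ")";
    }
  }

  /** The tree that all three formulations of the query would reduce to. */
  static Rel example() {
    return new Project(
        new Filter(new Scan("SomeCollection"), "SomeProperty < 10"),
        Arrays.asList("SomeProperty", "OtherProperty"));
  }
}
```

Whichever front end produced it, the planner sees only this tree.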

What matters is what happens next: a planner behind the scenes converts the operator tree into an optimal algorithm. The planner understands what the programmer is asking for, the physical layout of the data sources, the statistics about the size and structure of the data, the resources available to process the data, and the algorithms available to accomplish that. The effect will be that the program always executes efficiently, even if the data and system are re-organized after the program has been written.

Query planner versus compiler

Composability is the secret sauce that powers query planners, including the one in LINQ. At first sight, a query planner seems to have a similar purpose to a programming language compiler. But a query planner is aiming to reap the large rewards, so it needs to consider radical changes to the operator tree. Those changes are only possible if the operators are composable, and sufficiently well-behaved to be described by a small number of transformation rules. A compiler does not consider global changes, so does not need a simple, composable language.

The differences between compiler and query planner go further. They run in different environments, and have different goals. Compared to a typical programming language compiler, a query planner...

... plans later. A compiler optimizes at the time that the program is compiled; query planners optimize just before it is executed.
... uses more information. A compiler uses the structure of the program; query planners use more information on the dynamic state of the system.
... is involved in task scheduling. Whereas a compiler is quite separate from the task scheduler in the language's runtime environment, the line between query planners and query schedulers is blurred. Resource availability is crucial to query planning.
... optimizes over a greater scope. A compiler optimizes individual functions or modules; query planners optimize the whole query, or even the sequence of queries that make up a job.
... deals with a simpler language. Programming languages aim to be expressive, so have many times more constructs than query languages. Query languages are (not by accident) simple enough to be optimized by a planner. (This property is what Beckman calls “composability”.)
... needs to be more extensible. A compiler's optimizer only needs to change when the language or the target platform changes, whereas a query planner needs to adapt to new front-end languages, algorithms, cost models, back-end data systems and data structures.

These distinctions over-generalize a little, but I am trying to illustrate a point. And I am also giving query planners an unfair advantage, contrasting a “traditional” compiler with a “still just a research project” planner. (Modern compilers, in particular just-in-time (JIT) compilers, share some of the dynamic aspects of query planners.) The point is that a compiler and a planner have different roles, and one should not imagine that one can do the job of the other.

The compiler allows you to write your program at a high level of abstraction in a rich language; its task is to translate that complex programming language into a simpler machine representation. The planner allows your program to adapt to its runtime environment, by looking at the big picture. LINQ allows you to have both; its architecture provides a clear call-out from the compiler to the query planner. But it can be improved upon, and points to a system superior to LINQ, today's database systems, and other data management systems such as Hadoop.

A manifesto

1. Beyond .NET. LINQ only runs on Microsoft's .NET framework, yet Java is arguably the standard platform for data management. There should be front-ends for other JVM-based languages such as Scala and Clojure.

2. Extensible planner. Today's database query planners work with a single query language (usually SQL), with a fixed set of storage structures and algorithms, usually requiring that data is brought into their database before they will query it. Planners should allow application developers to add operators and rules. By these means, a planner could accept various query languages, target various data sources and data structures, and use various runtime engines.

3. Rule-driven. LINQ has already rescued data-oriented programming from the database community, and proven that a query planner can exist outside of a database. But to write a LINQ planner, you need to be a compiler expert. Out of the frying pan and into the fire. Planners should be configurable by people who are neither database researchers nor compiler writers, by writing simple rules and operators. That would truly be data-oriented programming for the rest of us.

"Big Data" is dead... long live Big Data Architecture

2012-04-11T13:16:00.003-07:00

Now that just about every data-management and business intelligence product claims that it handles "Big Data", the term is approaching zero information content.

So, I'm shorting the term "Big Data". In the next few months, the marketers will realize that their audience realize that the term means nothing and, in accordance with Monash's First Law of Commercial Semantics, they'll start coming up with new terms.

Have any of those terms been spotted in the wild yet?

Though I'm still not clear what exactly Big Data is, I am fond of the term "Big Data Architecture". That term describes — fairly concisely, to the people who I want to understand me — the idea of a system where scalability is so important that it's best not to assume that there is only one of anything; where scalability is so important that it's worth revisiting all your assumptions; and where the raw performance of each component in the system is not paramount, because if the components can be composed in a scalable fashion, the system will meet its performance goals.

This architecture is going to be the standard for the kind of systems I build, so I think I'll be using the term "Big Data Architecture" for many years to come. If you can come up with a good alternative to that one, I might just buy you a pint.

How should Mondrian get table and column statistics?

2012-04-04T13:03:00.000-07:00

When evaluating queries, Mondrian sometimes needs to make decisions about how to proceed, and in particular, what SQL to generate. One decision is which aggregate table to use for a query (or whether to stick with the fact table), and another is whether to "round out" a cell request for, say, 48 states and 10 months of 2011 to the full segment of 50 states and 12 months.

These decisions are informed by the volume of actual data in the database. The first decision uses row counts (the numbers of rows in the fact and aggregate tables) and the second uses column cardinalities (the number of distinct values in the "month" and "state" columns).
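Here is a rough Java sketch of how the two decisions could use those numbers. It is not Mondrian's actual logic: the class AggChooser, the "fewest rows wins" policy and the rounding threshold are assumptions I have made up for illustration, and the sketch assumes every candidate table is able to answer the query.

```java
import java.util.Map;

/** Hypothetical decision helpers; not Mondrian's real heuristics. */
final class AggChooser {
  /** Picks the candidate table (fact or aggregate) with the fewest rows. */
  static String smallest(Map<String, Long> rowCounts) {
    String best = null;
    long bestCount = Long.MAX_VALUE;
    for (Map.Entry<String, Long> e : rowCounts.entrySet()) {
      if (e.getValue() < bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    return best;
  }

  /**
   * Decides whether to fetch every value of a column rather than just the
   * requested ones. The threshold (a fraction between 0 and 1) is an assumption.
   */
  static boolean roundOut(int requested, int cardinality, double threshold) {
    return requested >= threshold * cardinality;
  }
}
```

With a threshold of 0.8, a request for 48 of 50 states or 10 of 12 months would be rounded out, whereas a request for 3 of 12 months would not. Note that neither decision needs exact counts.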

Gathering statistical information is an imperfect science. The obvious way to get the information is to execute some SQL queries:

-- row count of the fact table
select count(*) from sales_fact_1997;

-- count rows in an aggregate table
select count(*) from agg_sales_product_brand_time_month;

-- cardinality of the [Customer].[State] attribute
select count(distinct state) from customer;

These queries can be quite expensive. (On many databases, a row count involves reading every block of the table into memory and summing the number of rows in each. A query for a column's cardinality involves an entry scan of an index; or, worse, a table scan followed by an expensive sort if there is no such index.)

Mondrian doesn't need the exact value, but it does need an approximate value (say correct within a factor of 3) in order to proceed with the query.

Mondrian has a statistics cache, so the statistics calls only affect the "first query of the day", when Mondrian has been re-started, or is using a new schema. (If you are making use of a dynamic schema processor, it might be that every user effectively has their own schema. In this case, every user will experience their own slow "first query of the day".)
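The caching behaviour is easy to picture with a small Java sketch. This is not Mondrian's implementation; the class StatisticsCache and its loader callback are hypothetical, and a real cache would also hold column cardinalities and be scoped to a schema.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

/** Hypothetical row-count cache: the expensive query runs once per table. */
final class StatisticsCache {
  private final Map<String, Long> rowCounts = new ConcurrentHashMap<>();
  private final ToLongFunction<String> loader;

  /** The loader stands in for "select count(*) from <table>". */
  StatisticsCache(ToLongFunction<String> loader) {
    this.loader = loader;
  }

  /** Returns the cached count, running the loader only on the first request. */
  long rowCount(String table) {
    return rowCounts.computeIfAbsent(table, t -> loader.applyAsLong(t));
  }
}
```

Because the cache starts empty after a restart or for a new schema, the first query that needs a given statistic pays for the loader call, and later queries do not.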

We have one mechanism to prevent expensive queries: you can provide estimates in the Mondrian schema file. When you are defining an aggregate table, specify the approxRowCount attribute of the <AggName> XML element, and Mondrian will skip the row count query. When defining a level, if you specify the approxRowCount attribute of the <Level> XML element (the <Attribute> XML element in mondrian-4), Mondrian will skip the cardinality query. But it is time-consuming to fill in those counts, and they can go out of date as the database grows.

I am mulling over a couple of features to ease this problem. (These features are not committed for any particular release, or even fully formed. Your feedback to this post will help us prioritize them, shape them so that they are useful for how you manage Mondrian, and hopefully trim their scope so that they are reasonably simple for us to implement.)

Auto-populate volume attributes

The auto-populate feature would read a schema file, run queries on the database to count the rows of every fact table and aggregate table and the distinct keys of every level, and populate the approxRowCount attributes in the schema file. It might also do some sanity checks, such as that the primary key of your dimension table doesn't have any duplicate values, and warn you if they are violated.

Auto-populate is clearly a time-consuming task. It might take an hour or so to execute all of the queries. You could run it say once a month, at a quiet time of day. But at the end, the Mondrian schema would have enough information that it would not need to run any statistics queries at run time.

Auto-populate has a few limitations. Obviously, you need to schedule it, as a manual task, or a cron job. Then you need to make sure that the modified schema file is propagated into the solution repository. Lastly, if you are using a dynamic schema processor to generate or significantly modify your schema file, auto-populate clearly cannot populate sections that have not been generated yet.

Pluggable statistics

The statistics that Mondrian needs probably already exist. Every database has a query optimizer, and every query optimizer needs statistics such as row counts and column cardinalities to make its decisions. So, that ANALYZE TABLE (or equivalent) command that you ran after you populated the database (you did run it, didn't you?) probably calculated these statistics and stored them somewhere.

The problem is that that "somewhere" is different for each and every database. In Oracle, they are in ALL_TAB_STATISTICS and ALL_TAB_COL_STATISTICS tables; in MySQL, they are in INFORMATION_SCHEMA.STATISTICS. And so forth.

JDBC claims to provide the information through the DatabaseMetaData.getIndexInfo method. But it doesn't work for all drivers. (The only one I tried, MySQL, albeit a fairly old version, didn't give me any row count statistics.)

Let's suppose we introduced an SPI to get table and column statistics:

package mondrian.spi;

import javax.sql.DataSource;

interface StatisticsProvider {
   int getColumnCardinality(DataSource dataSource, String catalog, String schema, String table, String[] columns);
   int getTableCardinality(DataSource dataSource, String catalog, String schema, String table);
}

and several implementations:

A fallback implementation SqlStatisticsProvider that generates "select count(distinct ...) ..." and "select count(*) ..." queries.
An implementation JdbcStatisticsProvider that uses JDBC methods such as getIndexInfo.
An implementation that uses each database's specific tables, OracleStatisticsProvider, MySqlStatisticsProvider, and so forth.

Each Dialect could nominate one or more implementations of this SPI, and try them in order. (Each method can return -1 to say 'I don't know'.)
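As a rough sketch of that try-them-in-order logic, assuming a simplified version of the interface (no DataSource argument, table cardinality only): StatisticsChain and TableStatistics are hypothetical names, not part of any Mondrian release.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: asks each provider in turn until one knows the answer. */
final class StatisticsChain {
    /** Simplified stand-in for the proposed SPI; -1 means "I don't know". */
    interface TableStatistics {
        int getTableCardinality(String catalog, String schema, String table);
    }

    private final List<TableStatistics> providers;

    StatisticsChain(TableStatistics... providers) {
        this.providers = Arrays.asList(providers);
    }

    /** Returns the first non-negative answer, or -1 if no provider knows. */
    int getTableCardinality(String catalog, String schema, String table) {
        for (TableStatistics provider : providers) {
            int rowCount = provider.getTableCardinality(catalog, schema, table);
            if (rowCount >= 0) {
                return rowCount;
            }
        }
        return -1;
    }
}
```

A dialect would list the cheap providers first (the database's own statistics tables, then JDBC metadata) and put the SQL fallback last, so the expensive queries only run when nothing else has an answer.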

Conclusion

Statistics are an important issue for Mondrian. In the real world, missing statistics are more damaging than somewhat inaccurate statistics. If statistics are inaccurate, Mondrian will execute queries inefficiently, but the difference with optimal performance is negligible if the statistics are within an order of magnitude; missing statistics cause Mondrian to generate potentially expensive SQL statements, especially during that all-important first query of the day.

A couple of solutions are proposed.

The auto-population tool would solve the problem in one way, at the cost of logistical effort to schedule the running of the tool.

The statistics provider leverages databases' own statistics. It solves the problem of diversity the usual open source way: it provides an SPI and lets the community provide implementations of that SPI for their favorite database.

Auto-generated date dimension tables

2012-02-21T21:04:00.000-08:00

It seems that whenever I have a cross-continent flight, Mondrian gets a new feature. This particular flight was from Florida back home to California, and this particular feature is a time-dimension generator.

I was on the way home from an all-hands at Pentaho's Orlando, Florida headquarters, where new CEO Quentin Gallivan had outlined his strategy for the company. I also got to spend time with the many smart folks from all over the world who work for Pentaho, among them Roland Bouman, formerly an evangelist for MySQL, now with Pentaho, but still passionately advocating for open source databases, open source business intelligence, and above all, keeping it simple.

Roland and I got talking about how to map Mondrian onto operational schemas. Though not designed as star schemas, some operational schemas nevertheless have a structure that can support a cube, with a central fact table surrounded by star or snowflake dimension tables. Often the one thing missing is a time dimension table. Since these time dimension tables look very much the same, how easy would it be for Mondrian to generate them on the fly? Not that difficult, I thought, as the captain turned off the "fasten seatbelts" sign and I opened my laptop. Here's what I came up with.

Here's how you declare a regular time dimension table in Mondrian 4:

<PhysicalSchema>
  <Table name='time_by_day'/>
  <!-- Other tables... -->
</PhysicalSchema>

Mondrian sees the table name 'time_by_day', checks that it exists, and finds the column definitions from the JDBC catalog. The table can then be used in various dimensions in the schema.

An auto-generated time dimension is similar:

<PhysicalSchema>
  <AutoGeneratedDateTable name='time_by_day_generated' startDate='2012-01-01' endDate='2014-01-31'/>
  <!-- Other tables... -->
</PhysicalSchema>

The first time Mondrian reads the schema, it notices that the table is not present in the schema, and creates and populates it. Here is the DDL and data it produces.

CREATE TABLE `time_by_day_generated` (
  `time_id` Integer NOT NULL PRIMARY KEY,
  `yymmdd` Integer NOT NULL,
  `yyyymmdd` Integer NOT NULL,
  `the_date` Date NOT NULL,
  `the_day` VARCHAR(20) NOT NULL,
  `the_month` VARCHAR(20) NOT NULL,
  `the_year` Integer NOT NULL,
  `day_of_month` VARCHAR(20) NOT NULL,
  `week_of_year` Integer NOT NULL,
  `month_of_year` Integer NOT NULL,
  `quarter` VARCHAR(20) NOT NULL)

JULIAN	YYMMDD	YYYYMMDD	DATE	DAY_OF_WEEK_NAME	MONTH_NAME	YEAR	DAY_OF_MONTH	WEEK_OF_YEAR	MONTH	QUARTER
2455928	120101	20120101	2012-01-01	Sunday	January	2012	1	1	1	Q1
2455929	120102	20120102	2012-01-02	Monday	January	2012	2	1	1	Q1
2455930	120103	20120103	2012-01-03	Tuesday	January	2012	3	1	1	Q1

The columns present are all of the time-dimension domains:

Domain	Default column name	Default data type	Example	Description
JULIAN	time_id	Integer	2454115	Julian day number (0 = January 1, 4713 BC). Additional attribute 'epoch', if specified, changes the date at which the value is zero.
YYMMDD	yymmdd	Integer	120219	Decimal date with two-digit year
YYYYMMDD	yyyymmdd	Integer	20120219	Decimal date with four-digit year
DATE	the_date	Date	2012-12-31	Date literal
DAY_OF_WEEK_NAME	the_day	String	Friday	Name of day of week
MONTH_NAME	the_month	String	December	Name of month
YEAR	the_year	Integer	2012	Year
DAY_OF_MONTH	day_of_month	String	31	Day ordinal within month
WEEK_OF_YEAR	week_of_year	Integer	53	Week ordinal within year
MONTH	month_of_year	Integer	12	Month ordinal within year
QUARTER	quarter	String	Q4	Name of quarter
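To show how these domains can be derived from a single date, here is a sketch using java.time. It is not the generator's actual code. DateTableSketch is a made-up name, and I am assuming English day and month names, weeks that start on Sunday (which agrees with the sample rows above), and that the 'epoch' attribute simply means "days since that date".

```java
import java.time.LocalDate;
import java.time.format.TextStyle;
import java.time.temporal.ChronoUnit;
import java.time.temporal.IsoFields;
import java.time.temporal.JulianFields;
import java.time.temporal.WeekFields;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of a date-dimension row generator. */
final class DateTableSketch {
    /** One row, in the same column order as the default table. */
    static List<Object> row(LocalDate d) {
        int yyyymmdd =
            d.getYear() * 10000 + d.getMonthValue() * 100 + d.getDayOfMonth();
        return Arrays.<Object>asList(
            d.getLong(JulianFields.JULIAN_DAY),                     // JULIAN
            yyyymmdd % 1000000,                                     // YYMMDD
            yyyymmdd,                                               // YYYYMMDD
            d,                                                      // DATE
            d.getDayOfWeek().getDisplayName(TextStyle.FULL, Locale.ENGLISH),
            d.getMonth().getDisplayName(TextStyle.FULL, Locale.ENGLISH),
            d.getYear(),                                            // YEAR
            String.valueOf(d.getDayOfMonth()),                      // DAY_OF_MONTH
            d.get(WeekFields.SUNDAY_START.weekOfYear()),            // WEEK_OF_YEAR
            d.getMonthValue(),                                      // MONTH
            "Q" + d.get(IsoFields.QUARTER_OF_YEAR));                // QUARTER
    }

    /** JULIAN value relative to an epoch: zero on the epoch date itself. */
    static long julian(LocalDate d, LocalDate epoch) {
        return ChronoUnit.DAYS.between(epoch, d);
    }

    /** All rows from start to end, inclusive. */
    static List<List<Object>> rows(LocalDate start, LocalDate end) {
        List<List<Object>> result = new ArrayList<>();
        for (LocalDate d = start; !d.isAfter(end); d = d.plusDays(1)) {
            result.add(row(d));
        }
        return result;
    }
}
```

For 2012-01-01 this yields 2455928, 120101, 20120101, Sunday, January and Q1, the same values as the first sample row.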

Suppose you wish to choose specific column names, or have more control over how values are generated. You can do that by including a <ColumnDefs> element within the table, and <ColumnDef> elements within that — just like a regular <Table> element.

For example,

<PhysicalSchema>
  <AutoGeneratedDateTable name='time_by_day_generated' startDate='2008-01-01' endDate='2020-01-31'>
    <ColumnDefs>
      <ColumnDef name='time_id'>
        <TimeDomain role='JULIAN' epoch='1996-01-01'/>
      </ColumnDef>
      <ColumnDef name='my_year'>
        <TimeDomain role='year'/>
      </ColumnDef>
      <ColumnDef name='my_month'>
        <TimeDomain role='MONTH'/>
      </ColumnDef>
      <ColumnDef name='quarter'/>
      <ColumnDef name='month_of_year'/>
      <ColumnDef name='week_of_year'/>
      <ColumnDef name='day_of_month'/>
      <ColumnDef name='the_month'/>
      <ColumnDef name='the_date'/>
    </ColumnDefs>
    <Key>
      <Column name='time_id'/>
    </Key>
  </AutoGeneratedDateTable>
  <!-- Other tables... -->
</PhysicalSchema>

The first three columns have nested <TimeDomain> elements that tell the generator how to populate them.

The other columns have the standard column name for a particular time domain, and therefore the <TimeDomain> element can be omitted. For instance,

<ColumnDef name='month_of_year'/>

is shorthand for

<ColumnDef name='month_of_year' type='int'>
  <TimeDomain role="month"/>
</ColumnDef>

The nested <Key> element makes that column valid as the target of a link (from a foreign key in the fact table, for instance), and also declares the column as a primary key in the CREATE TABLE statement. This has the pleasant side-effect, on all databases I know of, of creating an index. If you need other indexes on the generated table, create them manually.

The <TimeDomain> element could be extended further. For instance, we could add a locale attribute. This would allow different translations of month and weekday names, and also support locale-specific differences in how week-in-year and day-of-week numbers are calculated.

Note that this functionality is checked into the mondrian-lagunitas branch, so will only be available as part of Mondrian version 4. That release is still pre-alpha. We recently started to regularly build the branch using Jenkins, and you should see the number of failing tests dropping steadily over the next weeks and months. Already over 80% of tests pass, so it's worth downloading the latest build to kick the tires on your application.

olap4j releases version 1.0.1, switches to Apache license

2012-02-07T23:04:00.000-08:00

I am pleased to announce the release of olap4j version 1.0.1.

As the version number implies, this is basically a maintenance release. It is backwards compatible with version 1.0.0, meaning that any driver or application written for olap4j 1.0.0 should work with 1.0.1.

There is a year's worth of bug fixes, which should help the stability and performance of the XMLA driver in particular.

But more significant than the code changes is the change in license. Olap4j is now released under the Apache License, Version 2.0 (ASL). Our goal is to maximize the number of applications that use olap4j, and the number of drivers. ASL is a more permissive license than olap4j's previous license, Eclipse Public License (EPL), so it helps drive adoption.

For instance, under ASL, if you create a driver by forking an existing driver, you are not required to publish your modified source code, and you may embed the driver in a non-ASL project or product. We hope that this will increase the number of commercial olap4j drivers. (Of course, we hope you will see the wisdom of contributing back your changes, but you are not obliged to.)

Before you ask: it is quite coincidental that this license change occurred in the same week that Pentaho Data Integration (Kettle) also switched to the Apache Software License, although I'm sure that Pentaho's motivations were similar to ours.

Thanks to everyone who has contributed fixes and valuable feedback since olap4j 1.0.0, and in particular to Luc for wrangling the release out of the door.

Changes to Mondrian's caching architecture

2012-01-14T16:05:00.000-08:00

I checked in some architectural changes to Mondrian's cache this week.

First the executive summary:

1. Mondrian should do the same thing as it did before, but scale up better to more concurrent queries and more cores.

2. Since this is a fairly significant change in the architecture, I'd appreciate if you kicked the tires, to make sure I didn't break anything.

Now the longer version.

Since we introduced external caches in Mondrian 3.3, we were aware that we were putting a strain on the caching architecture. The caching architecture has needed modernization for a while, but external caches made it worse. First, a call to an external cache can take a significant amount of time: depending on the cache, it might do a network I/O, and so take several orders of magnitude longer than a memory access. Second, we introduced external caching and introduced in-cache rollup, and for both of these we had to beef up the in-memory indexes needed to organize the cache segments.

Previously we'd used a critical section approach: any thread that wanted to access an object in the cache locked out the entire cache. As the cache data structures became more complex, those operations were taking longer. To improve scalability, we adopted a radically different architectural pattern, called the Actor Model. Basically, one thread, called the Cache Manager, is dedicated to looking after the cache index. Any query thread that wants to find a segment in the cache, add a segment to the cache, create a segment by rolling up existing segments, or flush the cache, sends a message to the Cache Manager.

Ironically, the cache manager does not get segments from external caches. As I said earlier, external cache accesses can take a while, and the cache manager is super-busy. The cache manager tells the client the segment key to ask the external cache for, and the client does the asking. When a client gets a segment, it stores it in its private storage (good for the duration of a query) so it doesn't need to ask the cache manager again. Since a segment can contain thousands of cells, even large queries typically only make a few requests to the cache manager.

The external cache isn't just slow; it is also porous. It can have a segment one minute, and forget it the next. The Mondrian query thread that gets the cache miss will tell the cache manager to remove the segment from its index (so Mondrian doesn't ask for it again), and formulate an alternative strategy to find it. Maybe the required cell exists in another cached segment; maybe it can be obtained by rolling up other segments in cache (but they, too, could have gone missing without notice). If all else fails, we can generate SQL to populate the required segment from the database (a fact table, or if possible, an aggregate table).

Since the cache manager is too busy to talk to the external cache, it is certainly too busy to execute SQL statements. From the cache manager's perspective, SQL queries take an eternity (several million CPU cycles each), so it farms out SQL queries to a pool of worker threads. The cache manager marks that segment as 'loading'. If another query thread asks the cache manager for a cell that would be in that segment, it receives a Future<SegmentBody> that will be populated as soon as the segment arrives. When that segment returns, the query thread pushes the segment into the cache, and tells the cache manager to update the state of that segment from 'loading' to 'ready'.
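Here is a much-simplified sketch of that pattern in Java. It is not Mondrian's code: SegmentCacheActor is a hypothetical name, segment keys and bodies are plain strings, and the loader function stands in for the SQL statement. The point is only that the index is touched by one thread, while the slow loading happens on a worker pool.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical sketch of a cache index owned by a single actor thread. */
final class SegmentCacheActor {
    // Only the actor thread reads or writes this map, so no locking is needed.
    private final Map<String, CompletableFuture<String>> index = new HashMap<>();
    private final ExecutorService actor = Executors.newSingleThreadExecutor();
    private final ExecutorService workers = Executors.newFixedThreadPool(4);
    private final Function<String, String> loader;

    SegmentCacheActor(Function<String, String> loader) {
        this.loader = loader;
    }

    /** Called by query threads: sends a message to the actor and returns at once. */
    CompletableFuture<String> request(String segmentKey) {
        return CompletableFuture
            .supplyAsync(() -> lookupOrStartLoad(segmentKey), actor)
            .thenCompose(future -> future);
    }

    /** Runs on the actor thread only. */
    private CompletableFuture<String> lookupOrStartLoad(String segmentKey) {
        CompletableFuture<String> segment = index.get(segmentKey);
        if (segment == null) {
            // Mark the segment as 'loading' and farm the slow work out to a worker.
            segment = CompletableFuture.supplyAsync(
                () -> loader.apply(segmentKey), workers);
            index.put(segmentKey, segment);
        }
        return segment;
    }

    void shutdown() {
        actor.shutdown();
        workers.shutdown();
    }
}
```

In this sketch the 'loading' and 'ready' states are simply whether the future has completed; a second query thread asking for the same segment gets the same future rather than triggering a second load.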

The Actor Model is a radically different architecture. First, let's look at the benefits. Since one thread is managing an entire subsystem, you can just remove all locking. This is liberating. Within the subsystem, you can code things very simply, rather than perverting your data structures for thread-safety. You don't even need to use concurrency-safe data structures like CopyOnWriteArrayList; you can just use the fastest data structure that does the job. Once you remove concurrency controls such as 'synchronized' blocks, and access from only one thread, the data structure becomes miraculously faster. How can that be? The data structure now resides in the thread's cache, and when you removed the concurrency controls, you were also removing memory barriers that forced changes to be written through L1 and L2 cache to RAM, which is up to 200 times slower.

Migrating to the Actor Model wasn't without its challenges. First of all, you need to decide which data structures and actions should be owned by the actor. I believe we got that one right. I found that most of the same things needed to be done, but by different threads than previously, so the task was mainly about moving code around. We needed to refine the data structures that were passed between "query", "cache manager" and "worker" threads, to make sure that they were immutable. If, for instance, you want the query thread to find other useful work to do while it is waiting for a segment, it shouldn't be modifying a data structure that it put into the cache manager's request queue. In a future blog post, I'll describe in more detail the challenges and benefits of migrating one component of a complex software system to the Actor Model.

Not all caches are equal. Some, like JBoss Infinispan, are able to share cache items (in our case, segments containing cell values) between nodes in a cluster, and to use redundancy to ensure that cache items are never lost. Infinispan calls itself a "data grid", which at first I dismissed as mere marketing, but I became convinced that it is genuinely a different kind of beast than a regular cache. To support data grids, we added hooks so that a cache can tell Mondrian about segments that have been added to other nodes in a cluster. This way, Mondrian becomes a genuine cluster. If I execute query X on node 1, it will put segments into the data grid that will make the query you are about to submit, query Y on node 2, execute faster.

As you can tell by the enthusiastic length of this post, I am very excited about this change to Mondrian's architecture. Outwardly, Mondrian executes the same MDX queries the same as it ever did. But the internal engine can scale better when running on a modern CPU with many cores; due to the external caches, the cache behaves much more predictably; and you can create clusters of Mondrian nodes that share their work and memory.

The changes will be released soon as Mondrian version ~~3.3.1~~ 3.4, but you can help by downloading from the main line (or from CI), kicking the tires, and letting us know if you find any problems.

[Edited 2012/1/16, to fix version number.]

How Mondrian names hierarchies

2011-08-24T10:40:00.000-07:00

You may or may not be aware of the property mondrian.olap.SsasCompatibleNaming. It controls the naming of elements, in particular how Mondrian names hierarchies when there are multiple hierarchies in the same dimension.

Let's suppose that there is a dimension called 'Time', and it contains hierarchies called 'Time' and 'Weekly'.

If SsasCompatibleNaming is false, the dimension and the first hierarchy will both be called '[Time]', and the other hierarchy will be called '[Time.Weekly]'.

If SsasCompatibleNaming is true, the dimension will be called '[Time]', the first hierarchy will be called '[Time].[Time]', and the other hierarchy will be called '[Time].[Weekly]'.
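The two rules can be written down as a small function. This is only a sketch of the naming convention as described here, not Mondrian's implementation; HierarchyNaming and uniqueName are names invented for the example.

```java
/** Hypothetical sketch of the two hierarchy-naming conventions. */
final class HierarchyNaming {
    static String uniqueName(
        String dimension, String hierarchy, boolean ssasCompatibleNaming)
    {
        if (ssasCompatibleNaming) {
            // Every hierarchy is qualified by its dimension.
            return "[" + dimension + "].[" + hierarchy + "]";
        }
        if (hierarchy.equals(dimension)) {
            // Old style: the first hierarchy shares the dimension's name.
            return "[" + dimension + "]";
        }
        // Old style: other hierarchies use a dotted name in one pair of brackets.
        return "[" + dimension + "." + hierarchy + "]";
    }
}
```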

As you can see, SsasCompatibleNaming makes life simpler, if slightly more verbose, because it gives each element a distinct name. There are knock-on effects, beyond the naming of hierarchies. The most subtle and confusing effect is in the naming of levels when the dimension, hierarchy and level all have the same name. If SsasCompatibleNaming is false, then [Gender].[Gender].Members is asking for the members of the gender level, and yields two members. If SsasCompatibleNaming is true, then [Gender].[Gender].Members is asking for the members of the gender hierarchy, and yields three members (all, F and M).

Usually, however, Mondrian is forgiving in how it resolves names, and if elements have different names, it will usually find the element you intend.

The default value is false. However, that leads to naming behavior which is not compatible with other MDX implementations, in particular Microsoft SQL Server Analysis Services (versions 2005 and later).

From mondrian-4 onwards, the property will be set to true. (You won't be able to set it to false.) This makes sense, because in mondrian-4, with attribute-hierarchies, there will typically be several hierarchies in each dimension. We will really need to get our naming straight.

What do we recommend? If you are using Pentaho Analyzer, Saiku or JPivot today, we recommend that you use the default value, false. But if you are writing your own MDX (or have built your own client), try setting the value to true. The new naming convention actually makes more sense, and moving to it now will minimize the disruption when you move to mondrian-4.

I am just about to check in a change that uses a new and better name-resolution algorithm. It will be more forgiving, and more standards-compliant, in how it resolves the names of calculated members. However, it might break compatibility, so it will only be enabled if SsasCompatibleNaming is true.

Are you using this property today? Let us know how it's working for you.

Real-Time Seismic Monitoring

2011-07-22T10:51:00.000-07:00

Marc Berkowitz wrote a blog post describing an application of SQLstream to power a seismic monitoring project that is a collaboration between several leading research institutions.

The project is interesting in several respects:

The project involves signal processing. Unlike the "event-processing" application that we see most often at SQLstream, events arrive at a regular rate (generally 40 readings every second, per sensor). In signal processing, events are more likely to be processed using complex mathematical formulas (such as Fourier transforms) than by boolean logic (event A happened, then event B happened). Using SQLstream's user-defined function framework, we were easily able to accommodate this form of processing.
It illustrates how a stream-computing "fabric" can be created, connecting multiple SQLstream processing nodes using RabbitMQ.
One of the reasons for building a distributed system was to allow an agile approach. Researchers can easily deploy new algorithms without affecting the performance or correctness of other algorithms running in the cloud.
Another goal of the distributed system was performance and scalability. Nodes can easily be added to accommodate greater numbers of sensors. The system is not embarrassingly parallel, but we were still able to parallelize the solution effectively.
Lastly, the system needs to be both continuous and real-time. "Continuous" meaning that data is processed as it arrives; a smoother, more predictable and more efficient mode of operation than ETL. "Real-time" because some of the potential outputs of the system, such as tsunami alerts, need to be delivered as soon as possible in order to be useful.

In all, a very interesting case study of what SQLstream is capable of. Marc plans to make follow-up posts describing the solution in more detail, so stay tuned.

Yellowfin BI release 5.2 moves to olap4j

2011-06-09T10:25:00.000-07:00

According to their press release, Yellowfin BI version 5.2 "includes a significant OLAP overhaul, with the introduction of OLAP4j and support for PALO, BW as well as enhanced connectivity for SQL Server 2005+".

Nice to see olap4j gaining wider adoption. Though not too surprising, given the connectivity options that it opens up. And bear in mind that because olap4j is open source, for every product that mentions olap4j in a press release, there may be dozens or hundreds of others that are using it and not talking about it publicly.

Increased adoption is good, whether or not vendors choose to announce it. We know that if vendors run into issues, they will log them and someone will fix them. It makes olap4j better for everyone.

Roll your own high-performance Java collections classes

2011-06-03T12:58:00.000-07:00

The Java collections framework is great. You can create maps, sets, lists with various element types, various performance characteristics (e.g. if you want O(1) insert, use a linked list), iterate over them, and you can decorate them to give them other behaviors.

But suppose that you want to create a high-performance, memory-efficient immutable list of integers. You'd write

List<Integer> list =
  Collections.unmodifiableList(
    new ArrayList<Integer>(
      Arrays.asList(1000, 1001, 1002)));

There will be 6 objects allocated in the JVM: three Integer objects, an array Object[3] to hold the Integers, an ArrayList, and an UnmodifiableRandomAccessList. Not to mention the Arrays.ArrayList and Integer[3] used to construct the list and quickly thrown away.

The resulting list is no longer high-performance. A call to say 'int n = list.get(2)' requires 3 method calls (UnmodifiableRandomAccessList.get, ArrayList.get, Integer.intValue) and 3 indirections. And the sheer number of objects created reduces the chance that a given stretch of code will be able to operate solely from the contents of L1 cache.

So, what next? Should I write my own class, like this?

public class UnmodifiableNativeIntArrayList
  implements List<Integer>
{
  ...
}

Well, maybe. But there are rather a lot of variations to cover, and each one needs to be hand-coded and tested.
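For the record, one such hand-coded variant might look like the sketch below. It is just an illustration of the idea (the class name and the getInt method are my own invention): it extends AbstractList, which rejects modifications by default, and stores the values in an int[] so that only two objects are allocated.

```java
import java.util.AbstractList;
import java.util.RandomAccess;

/** Hypothetical sketch: an unmodifiable list backed directly by an int array. */
final class UnmodifiableIntArrayList
    extends AbstractList<Integer>
    implements RandomAccess
{
    private final int[] values;

    UnmodifiableIntArrayList(int... values) {
        // Defensive copy, so that callers cannot change the list afterwards.
        this.values = values.clone();
    }

    /** Unboxed access: one method call, one indirection. */
    public int getInt(int index) {
        return values[index];
    }

    /** Boxed access, for code that only knows about java.util.List. */
    @Override public Integer get(int index) {
        return values[index];
    }

    @Override public int size() {
        return values.length;
    }
}
```

Callers that know the concrete type can write 'int n = list.getInt(2)' and avoid boxing altogether; everyone else still sees an ordinary List<Integer> whose add and set methods throw UnsupportedOperationException.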

Do I use library code? I searched and turned up Apache Commons Primitives, Primitive Collections for Java (PCJ), and GNU Trove (trove4j). Of these, only GNU Trove is still active.

None of the libraries supports features such as maps with two or more keys, unmodifiable collections, synchronized collections, flat collections (similar to Apache Flat3Map). It's not surprising that they don't: each combination of features would require its own class, so the size of the jar file would grow exponentially.

So, I'd like to propose an alternate approach. You configure a factory, specifying the precise kind of collection you would like, and the factory generates the collection class in bytecode. You can use the factory to quickly create as many instances of the collection as you wish. The collection implements the Java collections interfaces, plus additional interfaces that allow you to efficiently access the collection without boxing/unboxing.

The above example would be written as follows:

// Initialize the factory when the program is loaded.
// Then the bytecode gets generated just once.
static final Factory factory =
  new FactoryBuilder()
    .list()
    .elementType(Integer.TYPE)
    .modifiable(false)
    .factory();

int[] ints = {1000, 1001, 1002};
IntList list = factory.createIntList(ints);

Variants are expressed as FactoryBuilder methods:

FactoryBuilder FactoryBuilder.list()
FactoryBuilder FactoryBuilder.map()
FactoryBuilder FactoryBuilder.set()
FactoryBuilder FactoryBuilder.keyType(Class...) (for maps only)
FactoryBuilder FactoryBuilder.valueType(Class...) (for maps only)
FactoryBuilder FactoryBuilder.elementType(Class...) (for list and set only)
FactoryBuilder FactoryBuilder.sorted(boolean) (cf. the difference between Set and SortedSet)
FactoryBuilder FactoryBuilder.deterministic(boolean) (cf. the difference between HashMap and LinkedHashMap)
FactoryBuilder FactoryBuilder.modifiable(boolean)
FactoryBuilder FactoryBuilder.fixedSize(boolean) (cf. the difference between Flat3Map and Map)
FactoryBuilder FactoryBuilder.synchronized(boolean)

And so forth. Additional variants could be added as the project evolved. Templates could be fine-tuned for particular combinations of variants.

The projects I mentioned above clearly use a template system, and we could use and extend those templates. The Janino facility can easily convert the generated Java code into bytecode. And the JVM would be able to apply JIT (just-in-time compilation) to these classes; in fact, these classes would be more amenable to compilation, because they would be compact and final.

The existing projects have invested a lot of effort designing high-performance collections. I'd like to build on that work; this project could even be an extension to those projects.

I'd like to hear if you're interested in working with me on this.

Removing Mondrian's 'high cardinality dimension' feature

2011-06-01T16:26:00.000-07:00

I would like to remove the 'high cardinality dimension' feature in mondrian 4.0.

To specify that a dimension is high-cardinality, you set the highCardinality attribute of the Dimension element to true. This will cause mondrian to scan over the dimension, rather than trying to load all of the children of a given parent member into memory.

The goal is a worthy one, but the implementation — making iterators look like lists — has a number of architectural problems: it duplicates code; because it allows backtracking for a fixed amount, it works with small dimensions but unpredictably fails with larger ones; and because lists are based on iterators, re-starting an iteration multiple times (e.g. from within a crossjoin) can re-execute complex SQL statements.

There are other architectural features designed to help with large dimensions. Many functions can operate in an 'iterable' mode (except that here the iterators are explicit). And for many of the most data-intensive operators, such as crossjoin, filter, semijoin (non-empty), and topcount, we can push down the operator to SQL, and thereby reduce the number of records coming out of the RDBMS.

It's always hard to remove a feature. But over the years we have seen numerous inconsistencies, and if we removed this feature in mondrian 4.0, we could better focus our resources.

If you are using this feature and getting significant performance benefit, I would like to hear from you. I would like to understand your use case, and either direct you to another feature that solves the problem, or try to develop an alternative solution in mondrian 4.0. The best place to make comments about these use cases is on the Jira case MONDRIAN-949.

Scripted plug-ins in LucidDB and Mondrian

2011-05-31T00:03:00.000-07:00

I saw a demo last week of scripted user-defined functions in LucidDB, and was inspired this weekend to add them to Mondrian.

Kevin Secretan of DynamoBI has just contributed some extensions to LucidDB to allow you to call script code (such as JavaScript or Python) in any place where you can have a user-defined function, procedure, or transform. This feature builds on a JVM feature introduced in Java 1.6, scripting engines.

Scripted functions may be a little slower than Java user-defined functions, but what they lose in performance they more than make up in flexibility. Writing user-defined functions in Java has always been laborious: you need to write a Java class, compile it, put it in a jar, put the jar on the server's class path, and restart the server. Each time you find a bug, you need to repeat that process, and that can easily take a number of minutes each cycle. Because scripted functions are compiled on the fly, you can cycle faster, and spend more of your valuable time working on the actual application.

I am speaking about LucidDB (and SQLstream) here, but the same problems exist for Mondrian plug-ins. Scripting is an opportunity to radically speed up development of application extensions, because everything can be done in the schema file. (Or via the workbench... but that part isn't implemented yet.)

Mondrian has several plug-in types, all today implemented using a Java SPI. I chose to make scriptable those plug-ins that are defined in a mondrian schema file: user-defined function, member formatter, property formatter, and cell formatter. A small syntax change to the schema file allows you to choose whether to implement these plug-ins by specifying the name of a Java class (as before) or an inline script.

As an example, here is the factorial function defined in JavaScript:

<UserDefinedFunction name="Factorial">
  <Script language="JavaScript">
    function getParameterTypes() {
      return new Array(new mondrian.olap.type.NumericType());
    }
    function getReturnType(parameterTypes) {
      return new mondrian.olap.type.NumericType();
    }
    function execute(evaluator, arguments) {
      var n = arguments[0].evaluateScalar(evaluator);
      return factorial(n);
    }
    function factorial(n) {
      return n &lt;= 1 ? 1 : n * factorial(n - 1);
    }
  </Script>
</UserDefinedFunction>

A user-defined function ironically requires several functions in order to provide the metadata needed by the MDX type system. The member, property and cell formatters are simpler. They require just one function, so mondrian dispenses with the function header, and requires just the 'return' expression inside the Script element. For example, here is a member formatter:

<Level name="name" column="column">
  <MemberFormatter>
    <Script language="JavaScript">
      return member.getName().toUpperCase();
    </Script>
  </MemberFormatter>
</Level>

You can of course write multiple statements, if you wish. Since JavaScript is embedded in the JVM, your code can call back into Java methods, and use the full runtime Java library.

There are examples of cell formatters and property formatters in the latest schema guide.

If you are concerned about performance, you could always translate this code back to a Java UDF when it is fully debugged. However, you might be pleasantly surprised by the performance of JavaScript: I was able to invoke a script function about 20,000 times per second. And I hear that there is a Janino "scripting engine" that compiles Java code into bytecode on the fly. In principle, it should be as fast as a real Java UDF.

I'd love to hear from anyone who has tried Janino, or in fact any other scripting engine, with the Mondrian or LucidDB scripted functions.

By the way, you can expect to see scripted functions in a release of SQLstream not too far in the future. The Eigenbase project makes it easy to propagate features between projects, and this feature is too good not to share.