Wednesday, March 19, 2014

Improvements to Optiq's MongoDB adapter

It’s been a while since I posted to this blog, but I haven’t been idle. Quite the opposite; I’ve been so busy writing code that I haven’t had time to write blog posts. A few months ago I joined Hortonworks, and I’ve been improving Optiq on several fronts, including several releases, adding a cost-based optimizer to Hive and some other initiatives to make Hadoop faster and smarter.

More about those other initiatives shortly. But Optiq’s mission is to improve access to all data, so here I want to talk about improvements to how Optiq accesses data in MongoDB. Optiq can now translate SQL queries to extremely efficient operations inside MongoDB.

MongoDB 2.2 introduced the aggregation framework, which allows you to compose queries as pipelines of operations. They have basically implemented relational algebra, and we wanted to take advantage of this.

As the following table shows, most of those operations map onto Optiq’s relational operators. We can exploit that fact to push SQL query logic down into MongoDB.

MongoDB operatorOptiq operator
$projectProjectRel
$matchFilterRel
$limitSortRel.limit
$skipSortRel.offset
$unwind-
$groupAggregateRel
$sortSortRel
$geoNear-

In the previous iteration of Optiq’s MongoDB adapter, we could push down project, filter and sort operators as $project, $match and $sort. A bug pointed out that it would be more efficient if we evaluated $match before $project. As I fixed that bug yesterday, I decided to push down limit and offset operations. (In Optiq, these are just attributes of a SortRel; a SortRel sorting on 0 columns can be created if you wish to apply limit or offset without sorting.)

That went well, so I decided to go for the prize: pushing down aggregations. This is a big performance win because the output of a GROUP BY query is often a lot smaller than its input. It is much more efficient for MongoDB aggregate the data in memory, returning a small result, than to return a large amount of raw data to be aggregated by Optiq.

Now queries involving SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, OFFSET, FETCH (or LIMIT if you prefer the PostgreSQL-style syntax), not to mention sub-queries, can be evaluated in MongoDB. (JOIN, UNION, INTERSECT, MINUS cannot be pushed down because MongoDB does not support those relational operators; Optiq will still evaluate those queries, pushing down as much as it can.)

Let's see some examples of push-down in action.

Given the query:
SELECT state, COUNT(*) AS c
FROM zips
GROUP BY state
Optiq evaluates:
db.zips.aggregate(
   {$project: {STATE: '$state'}},
   {$group: {_id: '$STATE', C: {$sum: 1}}},
   {$project: {STATE: '$_id', C: '$C'}})
and returns
STATE=WV; C=659
STATE=WA; C=484
...
Now let’s add a HAVING clause to find out which states have more than 1,500 zip codes:
SELECT state, COUNT(*) AS c
FROM zips
GROUP BY state
HAVING COUNT(*) > 1500
Optiq adds a $match operator to the previous query's pipeline:
db.zips.aggregate(
   {$project: {STATE: '$state'}},
   {$group: {_id: '$STATE', C: {$sum: 1}}},
   {$project: {STATE: '$_id', C: ‘$C'}},
   {$match: {C: {$gt: 1500}}})
and returns
STATE=NY; C=1596
STATE=TX; C=1676
STATE=CA; C=1523
Now the pièce de résistance. The following query finds the top 5 states in terms of number of cities (and remember that each city can have many zip-codes).
SELECT state, COUNT(DISTINCT city) AS cdc
FROM zips
GROUP BY state
ORDER BY cdc DESC
LIMIT 5
COUNT(DISTINCT x) is difficult to implement because it requires the data to be aggregated twice — once to compute the set of distinct values, and once to count them within each group. For this reason, MongoDB doesn’t implement distinct aggregations. But Optiq translates the query into a pipeline with two $group operators. For good measure, we throw in ORDER BY and LIMIT clauses.

The result is an awe-inspiring pipeline that includes two $group operators (implementing the two phases of aggregation for distinct-count), and finishes with $sort and $limit.

db.zips.aggregate(
  {$project: {STATE: '$state', CITY: '$city'}},
  {$group: {_id: {STATE: '$STATE', CITY: '$CITY'}}},
  {$project: {_id: 0, STATE: '$_id.STATE', CITY: '$_id.CITY'}},
  {$group: {_id: '$STATE', CDC: {$sum: {$cond: [ {$eq: ['CITY', null]}, 0, 1]}}}},
  {$project: {STATE: '$_id', CDC: '$CDC'}},
  {$sort: {CDC: -1}}, {$limit: 5})
I had to jump through some hoops to get this far, because MongoDB’s expression language can be baroque. In one case I had to generate
{$ifNull: [null, 0]}
in order to include the constant 0 in a $project operator. And I was foiled by MongoDB bug SERVER-4589 when trying to access the values inside the zips table's loc column, which contains (latitude, longitude) pairs represented as an array.

In conclusion, Optiq on MongoDB now does a lot of really smart stuff. It can evaluate any SQL query, and push down a lot of that evaluation to be executed efficiently inside MongoDB.

I encourage you to download Optiq and try running some sophisticated SQL queries (including those generated by the OLAP engine I authored, Mondrian).

Monday, June 17, 2013

Efficient SQL queries on MongoDB

How do you integrate MongoDB with other data in your organization? MongoDB is great for building applications, and it has its own powerful query API, but it's difficult to mash up data between MongoDB and other tools, or to make tools that speak SQL, such as Pentaho Analysis (Mondrian), connect to MongoDB.

Building a SQL interface isn't easy, because MongoDB's data model is such a long way from SQL's model. Here are some of the challenges:
  • MongoDB doesn't have a schema. Each database has a number of named 'collections', which are the nearest thing to a SQL table, but each row in a collection can have a completely different set of columns.
  • In MongoDB, data can be nested. Each row consists of a number of fields, and each field can be a scalar value, null, a record, or an array of records.
  • MongoDB supports a number of relational operations, but doesn't use the same terminology as SQL: the find method supports the equivalent of SELECT and WHERE, while the aggregate method supports the equivalent of SELECT, WHERE, GROUP BY, HAVING and ORDER BY.
  • For efficiency, it's really important to push as much of the processing down to MongoDB's query engine, without the user having to re-write their SQL.
  • But MongoDB doesn't support anything equivalent to JOIN.
  • MongoDB can't access external data.

I decided to tackle this using Optiq. Optiq already has a SQL parser and a powerful query optimizer that is powered by rewrite rules. Building on Optiq's core rules, I can add rules that map tables onto MongoDB collections, and relational operations onto MongoDB's find and aggregate operators.

What I produced is a effectively a JDBC driver for MongoDB. Behind it is a hybrid query-processing engine that pushes as much of the query processing down to MongoDB, and does whatever is left (such as joins) in the client.

Let's give it a try. First, install MongoDB, and import MongoDB's zipcode data set:
$ curl -o /tmp/zips.json http://media.mongodb.org/zips.json
$ mongoimport --db test --collection zips --file /tmp/zips.json
Tue Jun  4 16:24:14.190 check 9 29470
Tue Jun  4 16:24:14.469 imported 29470 objects
Log into MongoDB to check it's there:
$ mongo
MongoDB shell version: 2.4.3
connecting to: test
> db.zips.find().limit(3)
{ "city" : "ACMAR", "loc" : [ -86.51557, 33.584132 ], "pop" : 6055, "state" : "AL", "_id" : "35004" }
{ "city" : "ADAMSVILLE", "loc" : [ -86.959727, 33.588437 ], "pop" : 10616, "state" : "AL", "_id" : "35005" }
{ "city" : "ADGER", "loc" : [ -87.167455, 33.434277 ], "pop" : 3205, "state" : "AL", "_id" : "35006" }
> exit
bye
Now let's see the same data via SQL. Download and install Optiq:
$ git clone https://github.com/julianhyde/optiq.git
$ mvn install
Optiq comes with a sample model in JSON format, and the sqlline SQL shell. Connect using the mongo-zips-model.json Optiq model, and use sqlline's !tables command to list the available tables.
$ ./sqlline
sqlline> !connect jdbc:optiq:model=mongodb/target/test-classes/mongo-zips-model.json admin admin
Connecting to jdbc:optiq:model=mongodb/target/test-classes/mongo-zips-model.json
Connected to: Optiq (version 0.4.13)
Driver: Optiq JDBC Driver (version 0.4.13)
Autocommit status: true
Transaction isolation: TRANSACTION_REPEATABLE_READ
sqlline> !tables
+------------+--------------+-----------------+---------------+
| TABLE_CAT  | TABLE_SCHEM  |   TABLE_NAME    |  TABLE_TYPE   |
+------------+--------------+-----------------+---------------+
| null       | mongo_raw    | zips            | TABLE         |
| null       | mongo_raw    | system.indexes  | TABLE         |
| null       | mongo        | ZIPS            | VIEW          |
| null       | metadata     | COLUMNS         | SYSTEM_TABLE  |
| null       | metadata     | TABLES          | SYSTEM_TABLE  |
+------------+--------------+-----------------+---------------+
Each collection in MongoDB appears here as a table. There are also the COLUMNS and TABLES system tables provided by Optiq, and a view called ZIPS defined in mongo-zips-model.json.

Let's try a simple query. How many zip codes in America?
sqlline> SELECT count(*) FROM zips;
+---------+
| EXPR$0  |
+---------+
| 29467   |
+---------+
1 row selected (0.746 seconds
Now a more complex one. How many states have a city called Springfield?
sqlline> SELECT count(DISTINCT state) AS c FROM zips WHERE city = 'SPRINGFIELD';
+-----+
|   C |
+-----+
| 20  |
+-----+
1 row selected (0.549 seconds)
Let's use the SQL EXPLAIN command to see how the query is implemented.
sqlline> !set outputformat csv
sqlline> EXPLAIN PLAN FOR
. . . .> SELECT count(DISTINCT state) AS c FROM zips WHERE city = 'SPRINGFIELD';
'PLAN' 'EnumerableAggregateRel(group=[{}], C=[COUNT($0)]) EnumerableAggregateRel(group=[{0}]) EnumerableCalcRel(expr#0..4=[{inputs}], expr#5=['SPRINGFIELD'], expr#6=[=($t0, $t5)], STATE=[$t3], $condition=[$t6]) MongoToEnumerableConverter MongoTableScan(table=[[mongo_raw, zips]], ops=[[<{city: 1, state: 1, _id: 1}, {$project ...}>]]) ' 1 row selected (0.115 seconds)

The last line of the plan shows that Optiq calls MongoDB's find operator asking for the "city", "state" and "_id" fields. The first three lines of the plan show that the filter and aggregation are implemented using in Optiq's built-in operators, but we're working on pushing them down to MongoDB.

Finally, quit sqlline.
sqlline> !quit
Closing: net.hydromatic.optiq.jdbc.FactoryJdbc41$OptiqConnectionJdbc41

Optiq and its MongoDB adapter shown here are available on github. If you are interested in writing your own adapter, check out optiq-csv, a sample adapter for Optiq that makes CSV files appear as tables. It has own tutorial on writing adapters.

Check back at this blog over the next few months, and I'll show how to write views and advanced queries using Optiq, and how to use Optiq's other adapters.

Monday, June 03, 2013

Gathering requirements for olap4j 2.0

It's time to start thinking about olap4j version 2.0.

My initial goal for olap4j version 1.0 was to decouple application developers from Mondrian's legacy API. We've far surpassed that goal. Many applications are using olap4j to connect to OLAP servers like Microsoft SQL Server Analysis Services, Palo and SAP BW. And projects are leveraging the olap4j-xmlaserver sister project to provide an XMLA interface on their own OLAP server. The need is greater than ever to comply with the latest standards.

The difference between products and APIs is that you can't change APIs without pissing people off. Even if you improve the API, you force the developers of the drivers to implement the improvements, and the users of the API get upset because they don't have their new drivers yet. There are plenty of improvements to make to olap4j, so let's try to do it without pissing too many people off!

Since olap4j version 1.0, there has been a new release of Mondrian (well, 4.0 is not released officially yet, but the metamodel and API are very nearly fully baked) and a new release of SQL Server Analysis Services, the home of the de facto XMLA standard.

Also, the Mondrian team have spun out their XMLA server as a separate project (olap4j-xmlaserver) that can run against any olap4j driver. If this server is to implement the latest XMLA specification, it needs the underlying olap4j driver to give it all the metadata it needs.

Here's an example of the kind of issue that we'd like to fix. In olap4j 1.x, you can't tell whether a hierarchy is a parent-child hierarchy. People have asked for a method
boolean isParentChild();
Inspired by the the STRUCTURE attribute of the MDSCHEMA_HIERARCHIES XMLA request, we instead propose to add
enum Structure {
  FULLYBALANCED,
  RAGGEDBALANCED,
  RAGGED,
  NETWORK
}
Structure isParentChild();
We can't add this without requiring a new revision of all drivers, but let's be careful gather all the requirements so we can do it just this once.

Here are my goals for olap4j 2.0:

  • Support Analysis Services 2012 metamodel and XMLA as of Analysis Services 2012.
  • Create an enum for each XMLA enum. (Structure, above, is an example.)
  • Support Mondrian 4.0 metamodel. Many of the new Mondrian features, such as measure groups and attributes, are already in SSAS and XMLA.
  • Allow user-specified metadata, such as those specified in Mondrian's schema as annotations, to be passed through the olap4j API and XMLA driver.
  • We'll know that we've done the right thing if we can remove MondrianOlap4jExtra.
I'd also like to maintain backwards compatibility. As I already said, drivers will need to be changed. But any application that worked against olap4j 1.1 should work against olap4j 2.0, and any driver for olap4j 2.0 should also function as an olap4j 1.x driver. That should simplify things for the users.

I'll be gathering a detailed list of API improvements in the olap4j 2.0 specification. If you have ideas for what should be in olap4j version 2.0, now is the time to get involved!

Tuesday, April 30, 2013

Need help

I was amused by this note I just received via email.

Subject: Need help
To: jhyde@users.sourceforge.net
From: <retracted>

Respected Sir, we are doing a final year project  as Student Data warehouse for BE degree and we came to know to about olap4j at the end of our project  , we are presently in unknown way and we are seeking your help, since we are left with only 15 days for project submission, so if we could get any sample application which is built on olap4j ,will help us to understanding in usage of APIs for our project ,since i find it too difficult in usage of APIs and we are out of time , so any help from your side would greatly be appreciated and remembered
Thanks in advance
I get quite a few like these. (I suppose they are a fact of life for any open source developer.)

The spelling in this one is much better than most. And usually the subject like is more like 'Need help, please, please!!!!!'. But I always wonder how anyone who uses punctuation in such an arbitrary way could ever write code that works. Probably the author's supervisor is wondering the same thing.

Friday, March 01, 2013

Optiq latest


Optiq has been developing steadily over the past few months. Those of you who watch github will know most of this already, but I thought I'd summarize what's been going on.

(This post originally appeared as an email to the optiq-dev mailing list. Since I compose email messages a lot faster than blog posts, and the email message contained a lot of stuff that you'd all find interesting, it made sense to recycle it. Hope you don't mind.)

There are two exciting new projects using Optiq:
This week I attended the Strata conference in Santa Clara, and met lots of people who are interested in Optiq for various reasons. There are at least 4 back-end platforms or front-end languages that people would like to see. I can't describe them all here, but this space. Some exciting stuff will play out in this forum over the next few months.

One of my personal favorite projects is to get Optiq running on compressed, in-memory tables managed by a JSR-107-compliant cache/data-grid such as ehCache or Infinispan. ArrayTable and CloneSchema are the beginnings of that project. The end result will be a high-performance, distributed, in-memory SQL database... how cool is that? (Certainly, my own Mondrian project will be able to make massive use of it.)

And, some people were asking for the Splunk adapter (the so-called "JDBC driver for Splunk") to be improved. Good to hear that it's proving useful.

Now regarding the code.

One person noted that "mvn clean install" should just work for any maven-based project, and it doesn't. He's right. It should. I fixed it. Now it does.

I made some breaking API changes last week, so I upped the version to 0.2.

Expect the version numbers to continue to move erratically, because in our current development mode, it doesn't seem to make sense to have specific milestones. We're basically working on general stability rather than a few big features. We are trying to maintain backwards compatibility, but if we need to change API, we'll do it. I'll help dependent projects such as Lingual and Drill migrate to the new API, and make it as easy as possible for the rest of you.

Over the last week I'd been working on the code generation that powers SQL scalar expressions and built-in functions. This code generation is, obviously, used by the Java provider, but it can also be used by other providers. For instance, Lingual generates Java strings for filters that it passes to Janino. I've been working on OptiqSqlOperatorTest to get more and more of the built-in SQL functions to pass, and I've added OptiqAssert.planContains so that we can add tests to make sure that the minitiae of java code generation are as efficient as possible.

I still need to tell you about the extensions I've been making to Optiq SQL to support Drill (but useful to any project that wants to use nested data or late-binding schemas), but that will have to wait for its own blog post. Watch this space.

Tuesday, February 26, 2013

Announcing Lingual

The last few months, I've been collaborating on a project with Chris Wensel, the author of Cascading. Last week we announced Lingual, an open source project that puts a SQL interface on top of Cascading.

Architecturally, Lingual combines the Cascading engine with my own Optiq framework. Optiq provides the SQL interface (including JDBC), reads table and column definitions from Cascading's metadata store, and few custom Optiq rules target relational operations (project, filter, join and so forth) onto a Cascading operator graph. The queries are executed, on top of Hadoop, using Cascading's existing runtime engine.

Not everyone has heard of Cascading, so let me explain what it is, and why I think it fits well with Optiq. Cascading is a Java API for defining data flows. You write a Java program to build data flows using constructs such as pipes, filters, and grouping operators, Cascading converts that data flow to a MapReduce job, and runs it on Hadoop. Cascading was established early, picked the right level of abstraction to be simple and useful, and has grown to industry strength as it matured.

As a result, companies who are doing really serious things with Hadoop often use Cascading. Some of the very smartest Hadoop users are easily smart enough to have built their own Hadoop query language, but they did something even smarter — they layered DSLs such as Scalding and Cascalog on top of Cascading. In a sense, Optiq-powered SQL is just another DSL for Cascading. I'm proud to be in such illustrious company.

Newbies always ask, "What is Hadoop?" and then a few moments later, "Is Hadoop a database?". (The answer to the second question is easy. Many people would love Hadoop to be an "open source Teradata", but wanting it doesn't make it so. No Virginia, Hadoop is not a database.)

A cottage industry has sprung up of bad analogies for Hadoop, so forgive me if I make another one: Hadoop is, in some sense, an operating system for the compute cluster. After mainframes, minicomputers, and PCs, the next generation of hardware is the compute cluster. Hadoop is the OS, and MapReduce is the assembly language for that hardware — all powerful, but difficult to write and debug. UNIX came about to serve the then-new minicomputers, and crucial to its success was the C programming language. C allowed developers to be productive while writing code almost as efficient as assembler, and it allowed UNIX to move beyond its original PDP-7 hardware.

Cascading is the C of the Hadoop ecosystem. Sparse, elegantly composable, low-level enough to get the job done, but it abstracts away the nasty stuff unless you really want to roll up your sleeves.

It makes a lot of sense to put SQL on top of Cascading. There has been a lot of buzz recently about SQL on Hadoop, but we're not getting caught up in the hype. We are not claiming that Lingual will give speed-of-thought response times (Hadoop isn't a database, remember?), nor will it make advanced predictive analytics will be easy to write (Lingual is not magic). But Hadoop is really good at storing, processing, cleaning and exporting data at immense scale. Lingual brings that good stuff to a new audience.

A large part of that SQL-speaking audience is machines. I'd guess that 80% of the world's SQL statements are generated by tools. Machine-generated SQL is pretty dumb, so it essential that you have an optimizer. (As author of a tool that speaks SQL — Mondrian — and several SQL engines — Broadbase, LucidDB, SQLstream — I have been on both sides of this problem.) Once you have an optimizer, you can start doing clever stuff like re-organizing your data to make the queries run faster. Maybe the optimizer will even help.

Lingual is not a "SQL-like language". Because it is based on Optiq, Lingual is a mature implementation of ANSI/ISO standard SQL. This is especially important for those SQL-generating tools, which cannot rephrase a query to work around a bug or missing feature. As part of our test suite, we ran Mondrian on PostgreSQL, and captured the SQL queries it issued and the results the database gave. Then we replayed those queries — over 6,200 of them — to Lingual and checked that Lingual gave the same results. (By the way, putting Optiq and Cascading together was surprisingly easy. The biggest challenge we had was removing the Postgres-isms from thousands of generated queries.)

Lingual is not the only thing I've been working on. (You can tell when I'm busy by the deafening silence on this blog.) I've also been working on Apache Drill, using Optiq to extend SQL for JSON-shaped data, and I'll blog about this shortly. Also, as Optiq is integrated with more data formats and engines, the number of possibilities increases. If you happen to be at Strata conference tomorrow (Wednesday), drop me a line on twitter and we can meet up and discuss. Probably in the bar.

Tuesday, January 01, 2013

Explaining holidays to a 3 year old

Your birthday involves presents and cake. The day after your birthday you are allowed to say "I am x", where x is the number of candles on your cake.

4th of July is the birthday of our country, which is called America. America has decided that it doesn't want presents or a cake, just fireworks.

Thanksgiving is a big meal with your family. We eat a turkey, which is basically a very large chicken.

Christmas celebrates the first birthday, a very long time ago, of a baby who was very special, a bit like a king but without a palace or anything, who grew up to be very, very nice indeed, so nice that he made everyone else a bit nicer. And there were shepherds and animals. Also, kind of like a birthday for everyone, because everyone gets presents, although giving them is the important part. Also, we have an indoor tree with lights and things on it. Also, we have a meal exactly like Thanksgiving.

New Year. We all need new calendars tomorrow. Train calendars are best.

Happy New Year, everyone. I hope you get the train calendar your heart desires.