Julian Hyde on Streaming Data, Open Source OLAP. And stuff.

Monday, May 05, 2014

Table macros

Table macros are a new Optiq feature (since release 0.6) that combine the efficiency of tables with the flexibility of functions.

Optiq offers a convenient model for presenting data from multiple external sources via a single, efficient SQL interface. Using adapters, you create a schema for each external source, and a table for each data set within a source.

But sometimes the external data source does not consist of a fixed number of data sets, known ahead of time. Consider, for example, Optiq’s web adapter, optiq-web, which makes any HTML table in any web page appear as a SQL table. Today you can create an Optiq model and define within it several tables.

Optiq-web’s home page shows an example where you can create a schema with tables “Cities” and “States” (based on the Wikipedia pages List of states and territories of the United States and List of United States cities by population) and execute a query to find out the proportion of the California’s population that live in cities:

SELECT COUNT(*) "City Count", 
SUM(100 * c."Population" / s."Population") "Pct State Population" 
FROM "Cities" c, "States" s 
WHERE c."State" = s."State" AND s."State" = 'California';

But what if you want to query a URL that isn’t in the schema? A table macro will allow you to do this:

SELECT * FROM TABLE(
web(‘http://en.wikipedia.org/wiki/List_of_countries_by_population’));

web is a function that returns a table. That is, a Table object, which is the definition of a table. In Optiq, a table definition doesn’t need to be assigned a name and put inside a schema, although most do; this is a free-floating table. A table just needs to be able to describe its columns, and to be able to convert itself to relational algebra. Optiq invokes it while the query is being planned.

Here is the WebTableMacro class:

public class WebTableMacro {
  public Table eval(String url) {
    Map operands = new HashMap();
    operands.put(“url”, url);
    return new WebTable(operands, null); 
  }
 }

And here is how you define a WEB function based upon it in your JSON model:

{
  version: '1.0',
  defaultSchema: ‘ADHOC’,
  schemas: [
    { 
      name: 'ADHOC',
      functions: [
        { 
          name: ‘WEB’,
          className: 'com.example.WebTableMacro'
        } 
      ]
    }
  ]
  }

Table macros are a special kind of table function. They are defined in the same in the model, and invoked in the same way from a SQL statement. A table function can be used at prepare time if (a) its arguments are constants, and (b) the table it returns implements TranslatableTable. If it fails either of those tests, it will be invoked at runtime; it will still produce results, but will have missed out on the advantages of being part of the query optimization process.

What kind of advantages can the optimization process being? Suppose a web page that produces a table supports URL parameters to filter on a particular column and sort on another. We could write planner rules that push take a FilterRel or SortRel on top of a WebTableScan and convert them into a scan with extra URL parameters. A table that came from the web function would be able to participate in that process.

The name ‘table macros’ is inspired by Lisp macros — functions that are invoked at compile time rather than run time. Macros are an extremely powerful feature in Lisp and I hope they will prove to be a powerful addition to SQL. But to SQL users, a more familiar name might be ‘parameterized views’.

Views and table macros are both expanded to relational algebra before the query is optimized. Views are specified in SQL, whereas table macros invoke user code (it takes some logic to handle those parameters). Under the covers, Optiq’s views are implemented using table macros. (They always have been — we’ve only just got around to making table macros a public feature.)

To sum up. Table macros are powerful new Optiq feature that extend the reach of Optiq to data sources that have not been pre-configured into an Optiq model. They are a generalization of SQL views, and share with views the efficiency of expanding relational expressions at query compilation time, where they can be optimized. Table macros will help bring a SQL interface to yet more forms of data.

Wednesday, March 19, 2014

Improvements to Optiq's MongoDB adapter

It’s been a while since I posted to this blog, but I haven’t been idle. Quite the opposite; I’ve been so busy writing code that I haven’t had time to write blog posts. A few months ago I joined Hortonworks, and I’ve been improving Optiq on several fronts, including several releases, adding a cost-based optimizer to Hive and some other initiatives to make Hadoop faster and smarter.

More about those other initiatives shortly. But Optiq’s mission is to improve access to all data, so here I want to talk about improvements to how Optiq accesses data in MongoDB. Optiq can now translate SQL queries to extremely efficient operations inside MongoDB.

MongoDB 2.2 introduced the aggregation framework, which allows you to compose queries as pipelines of operations. They have basically implemented relational algebra, and we wanted to take advantage of this.

As the following table shows, most of those operations map onto Optiq’s relational operators. We can exploit that fact to push SQL query logic down into MongoDB.

MongoDB operator	Optiq operator
$project	ProjectRel
$match	FilterRel
$limit	SortRel.limit
$skip	SortRel.offset
$unwind	-
$group	AggregateRel
$sort	SortRel
$geoNear	-

In the previous iteration of Optiq’s MongoDB adapter, we could push down project, filter and sort operators as $project, $match and $sort. A bug pointed out that it would be more efficient if we evaluated $match before $project. As I fixed that bug yesterday, I decided to push down limit and offset operations. (In Optiq, these are just attributes of a SortRel; a SortRel sorting on 0 columns can be created if you wish to apply limit or offset without sorting.)

That went well, so I decided to go for the prize: pushing down aggregations. This is a big performance win because the output of a GROUP BY query is often a lot smaller than its input. It is much more efficient for MongoDB aggregate the data in memory, returning a small result, than to return a large amount of raw data to be aggregated by Optiq.

Now queries involving SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, OFFSET, FETCH (or LIMIT if you prefer the PostgreSQL-style syntax), not to mention sub-queries, can be evaluated in MongoDB. (JOIN, UNION, INTERSECT, MINUS cannot be pushed down because MongoDB does not support those relational operators; Optiq will still evaluate those queries, pushing down as much as it can.)

Let's see some examples of push-down in action.

Given the query:

SELECT state, COUNT(*) AS c
FROM zips
GROUP BY state

Optiq evaluates:

db.zips.aggregate(
{$project: {STATE: '$state'}},
{$group: {_id: '$STATE', C: {$sum: 1}}},
{$project: {STATE: '$_id', C: '$C'}})

and returns

STATE=WV; C=659
STATE=WA; C=484
...

Now let’s add a HAVING clause to find out which states have more than 1,500 zip codes:

SELECT state, COUNT(*) AS c
FROM zips
GROUP BY state
HAVING COUNT(*) > 1500

Optiq adds a $match operator to the previous query's pipeline:

db.zips.aggregate(
{$project: {STATE: '$state'}},
{$group: {_id: '$STATE', C: {$sum: 1}}},
{$project: {STATE: '$_id', C: ‘$C'}},
{$match: {C: {$gt: 1500}}})

and returns

STATE=NY; C=1596
STATE=TX; C=1676
STATE=CA; C=1523

Now the pièce de résistance. The following query finds the top 5 states in terms of number of cities (and remember that each city can have many zip-codes).

SELECT state, COUNT(DISTINCT city) AS cdc
FROM zips
GROUP BY state
ORDER BY cdc DESC
LIMIT 5

COUNT(DISTINCT x) is difficult to implement because it requires the data to be aggregated twice — once to compute the set of distinct values, and once to count them within each group. For this reason, MongoDB doesn’t implement distinct aggregations. But Optiq translates the query into a pipeline with two $group operators. For good measure, we throw in ORDER BY and LIMIT clauses.

The result is an awe-inspiring pipeline that includes two $group operators (implementing the two phases of aggregation for distinct-count), and finishes with $sort and $limit.

db.zips.aggregate(
{$project: {STATE: '$state', CITY: '$city'}},
{$group: {_id: {STATE: '$STATE', CITY: '$CITY'}}},
{$project: {_id: 0, STATE: '$_id.STATE', CITY: '$_id.CITY'}},
{$group: {_id: '$STATE', CDC: {$sum: {$cond: [ {$eq: ['CITY', null]}, 0, 1]}}}},
{$project: {STATE: '$_id', CDC: '$CDC'}},
{$sort: {CDC: -1}}, {$limit: 5})

I had to jump through some hoops to get this far, because MongoDB’s expression language can be baroque. In one case I had to generate

{$ifNull: [null, 0]}

in order to include the constant 0 in a $project operator. And I was foiled by MongoDB bug SERVER-4589 when trying to access the values inside the zips table's loc column, which contains (latitude, longitude) pairs represented as an array.

In conclusion, Optiq on MongoDB now does a lot of really smart stuff. It can evaluate any SQL query, and push down a lot of that evaluation to be executed efficiently inside MongoDB.

I encourage you to download Optiq and try running some sophisticated SQL queries (including those generated by the OLAP engine I authored, Mondrian).

Monday, June 17, 2013

Efficient SQL queries on MongoDB

How do you integrate MongoDB with other data in your organization? MongoDB is great for building applications, and it has its own powerful query API, but it's difficult to mash up data between MongoDB and other tools, or to make tools that speak SQL, such as Pentaho Analysis (Mondrian), connect to MongoDB.

Building a SQL interface isn't easy, because MongoDB's data model is such a long way from SQL's model. Here are some of the challenges:

MongoDB doesn't have a schema. Each database has a number of named 'collections', which are the nearest thing to a SQL table, but each row in a collection can have a completely different set of columns.
In MongoDB, data can be nested. Each row consists of a number of fields, and each field can be a scalar value, null, a record, or an array of records.
MongoDB supports a number of relational operations, but doesn't use the same terminology as SQL: the find method supports the equivalent of SELECT and WHERE, while the aggregate method supports the equivalent of SELECT, WHERE, GROUP BY, HAVING and ORDER BY.
For efficiency, it's really important to push as much of the processing down to MongoDB's query engine, without the user having to re-write their SQL.
But MongoDB doesn't support anything equivalent to JOIN.
MongoDB can't access external data.

I decided to tackle this using Optiq. Optiq already has a SQL parser and a powerful query optimizer that is powered by rewrite rules. Building on Optiq's core rules, I can add rules that map tables onto MongoDB collections, and relational operations onto MongoDB's find and aggregate operators.

What I produced is a effectively a JDBC driver for MongoDB. Behind it is a hybrid query-processing engine that pushes as much of the query processing down to MongoDB, and does whatever is left (such as joins) in the client.

Let's give it a try. First, install MongoDB, and import MongoDB's zipcode data set:

$ curl -o /tmp/zips.json http://media.mongodb.org/zips.json
$ mongoimport --db test --collection zips --file /tmp/zips.json
Tue Jun  4 16:24:14.190 check 9 29470
Tue Jun  4 16:24:14.469 imported 29470 objects

Log into MongoDB to check it's there:

$ mongo
MongoDB shell version: 2.4.3
connecting to: test
> db.zips.find().limit(3)
{ "city" : "ACMAR", "loc" : [ -86.51557, 33.584132 ], "pop" : 6055, "state" : "AL", "_id" : "35004" }
{ "city" : "ADAMSVILLE", "loc" : [ -86.959727, 33.588437 ], "pop" : 10616, "state" : "AL", "_id" : "35005" }
{ "city" : "ADGER", "loc" : [ -87.167455, 33.434277 ], "pop" : 3205, "state" : "AL", "_id" : "35006" }
> exit
bye

Now let's see the same data via SQL. Download and install Optiq:

$ git clone https://github.com/julianhyde/optiq.git
$ mvn install

Optiq comes with a sample model in JSON format, and the sqlline SQL shell. Connect using the mongo-zips-model.json Optiq model, and use sqlline's !tables command to list the available tables.

$ ./sqlline
sqlline> !connect jdbc:optiq:model=mongodb/target/test-classes/mongo-zips-model.json admin admin
Connecting to jdbc:optiq:model=mongodb/target/test-classes/mongo-zips-model.json
Connected to: Optiq (version 0.4.13)
Driver: Optiq JDBC Driver (version 0.4.13)
Autocommit status: true
Transaction isolation: TRANSACTION_REPEATABLE_READ
sqlline> !tables
+------------+--------------+-----------------+---------------+
| TABLE_CAT  | TABLE_SCHEM  |   TABLE_NAME    |  TABLE_TYPE   |
+------------+--------------+-----------------+---------------+
| null       | mongo_raw    | zips            | TABLE         |
| null       | mongo_raw    | system.indexes  | TABLE         |
| null       | mongo        | ZIPS            | VIEW          |
| null       | metadata     | COLUMNS         | SYSTEM_TABLE  |
| null       | metadata     | TABLES          | SYSTEM_TABLE  |
+------------+--------------+-----------------+---------------+

Each collection in MongoDB appears here as a table. There are also the COLUMNS and TABLES system tables provided by Optiq, and a view called ZIPS defined in mongo-zips-model.json.

Let's try a simple query. How many zip codes in America?

sqlline> SELECT count(*) FROM zips;
+---------+
| EXPR$0  |
+---------+
| 29467   |
+---------+
1 row selected (0.746 seconds)

Now a more complex one. How many states have a city called Springfield?

sqlline> SELECT count(DISTINCT state) AS c FROM zips WHERE city = 'SPRINGFIELD';
+-----+
|   C |
+-----+
| 20  |
+-----+
1 row selected (0.549 seconds)

Let's use the SQL EXPLAIN command to see how the query is implemented.

sqlline> !set outputformat csv
sqlline> EXPLAIN PLAN FOR
. . . .> SELECT count(DISTINCT state) AS c FROM zips WHERE city = 'SPRINGFIELD';

'PLAN'
'EnumerableAggregateRel(group=[{}], C=[COUNT($0)])
  EnumerableAggregateRel(group=[{0}])
    EnumerableCalcRel(expr#0..4=[{inputs}], expr#5=['SPRINGFIELD'], expr#6=[=($t0, $t5)], STATE=[$t3], $condition=[$t6])
      MongoToEnumerableConverter
        MongoTableScan(table=[[mongo_raw, zips]], ops=[[<{city: 1, state: 1, _id: 1}, {$project ...}>]])
'
1 row selected (0.115 seconds)

The last line of the plan shows that Optiq calls MongoDB's find operator asking for the "city", "state" and "_id" fields. The first three lines of the plan show that the filter and aggregation are implemented using in Optiq's built-in operators, but we're working on pushing them down to MongoDB.

Finally, quit sqlline.

sqlline> !quit
Closing: net.hydromatic.optiq.jdbc.FactoryJdbc41$OptiqConnectionJdbc41

Optiq and its MongoDB adapter shown here are available on github. If you are interested in writing your own adapter, check out optiq-csv, a sample adapter for Optiq that makes CSV files appear as tables. It has own tutorial on writing adapters.

Check back at this blog over the next few months, and I'll show how to write views and advanced queries using Optiq, and how to use Optiq's other adapters.

Monday, June 03, 2013

Gathering requirements for olap4j 2.0

It's time to start thinking about olap4j version 2.0.

My initial goal for olap4j version 1.0 was to decouple application developers from Mondrian's legacy API. We've far surpassed that goal. Many applications are using olap4j to connect to OLAP servers like Microsoft SQL Server Analysis Services, Palo and SAP BW. And projects are leveraging the olap4j-xmlaserver sister project to provide an XMLA interface on their own OLAP server. The need is greater than ever to comply with the latest standards.

The difference between products and APIs is that you can't change APIs without pissing people off. Even if you improve the API, you force the developers of the drivers to implement the improvements, and the users of the API get upset because they don't have their new drivers yet. There are plenty of improvements to make to olap4j, so let's try to do it without pissing too many people off!

Since olap4j version 1.0, there has been a new release of Mondrian (well, 4.0 is not released officially yet, but the metamodel and API are very nearly fully baked) and a new release of SQL Server Analysis Services, the home of the de facto XMLA standard.

Also, the Mondrian team have spun out their XMLA server as a separate project (olap4j-xmlaserver) that can run against any olap4j driver. If this server is to implement the latest XMLA specification, it needs the underlying olap4j driver to give it all the metadata it needs.

Here's an example of the kind of issue that we'd like to fix. In olap4j 1.x, you can't tell whether a hierarchy is a parent-child hierarchy. People have asked for a method

boolean isParentChild();

Inspired by the the STRUCTURE attribute of the MDSCHEMA_HIERARCHIES XMLA request, we instead propose to add

enum Structure {
  FULLYBALANCED,
  RAGGEDBALANCED,
  RAGGED,
  NETWORK
}
Structure isParentChild();

We can't add this without requiring a new revision of all drivers, but let's be careful gather all the requirements so we can do it just this once.

Here are my goals for olap4j 2.0:

Support Analysis Services 2012 metamodel and XMLA as of Analysis Services 2012.
Create an enum for each XMLA enum. (Structure, above, is an example.)
Support Mondrian 4.0 metamodel. Many of the new Mondrian features, such as measure groups and attributes, are already in SSAS and XMLA.
Allow user-specified metadata, such as those specified in Mondrian's schema as annotations, to be passed through the olap4j API and XMLA driver.
We'll know that we've done the right thing if we can remove MondrianOlap4jExtra.

I'd also like to maintain backwards compatibility. As I already said, drivers will need to be changed. But any application that worked against olap4j 1.1 should work against olap4j 2.0, and any driver for olap4j 2.0 should also function as an olap4j 1.x driver. That should simplify things for the users.

I'll be gathering a detailed list of API improvements in the olap4j 2.0 specification. If you have ideas for what should be in olap4j version 2.0, now is the time to get involved!

Tuesday, April 30, 2013

Need help

I was amused by this note I just received via email.

Subject: Need help
To: jhyde@users.sourceforge.net
From: <retracted>

Respected Sir, we are doing a final year project as Student Data warehouse for BE degree and we came to know to about olap4j at the end of our project , we are presently in unknown way and we are seeking your help, since we are left with only 15 days for project submission, so if we could get any sample application which is built on olap4j ,will help us to understanding in usage of APIs for our project ,since i find it too difficult in usage of APIs and we are out of time , so any help from your side would greatly be appreciated and remembered
Thanks in advance

I get quite a few like these. (I suppose they are a fact of life for any open source developer.)

The spelling in this one is much better than most. And usually the subject like is more like 'Need help, please, please!!!!!'. But I always wonder how anyone who uses punctuation in such an arbitrary way could ever write code that works. Probably the author's supervisor is wondering the same thing.

Friday, March 01, 2013

Optiq latest

Optiq has been developing steadily over the past few months. Those of you who watch github will know most of this already, but I thought I'd summarize what's been going on.

(This post originally appeared as an email to the optiq-dev mailing list. Since I compose email messages a lot faster than blog posts, and the email message contained a lot of stuff that you'd all find interesting, it made sense to recycle it. Hope you don't mind.)

There are two exciting new projects using Optiq:

I have been working in private with Chris Wensel on a project to use Optiq to provide a SQL interface to Cascading, and last week we announced Lingual.
I am working a SQL interface for the Apache Drill project.

This week I attended the Strata conference in Santa Clara, and met lots of people who are interested in Optiq for various reasons. There are at least 4 back-end platforms or front-end languages that people would like to see. I can't describe them all here, but this space. Some exciting stuff will play out in this forum over the next few months.

One of my personal favorite projects is to get Optiq running on compressed, in-memory tables managed by a JSR-107-compliant cache/data-grid such as ehCache or Infinispan. ArrayTable and CloneSchema are the beginnings of that project. The end result will be a high-performance, distributed, in-memory SQL database... how cool is that? (Certainly, my own Mondrian project will be able to make massive use of it.)

And, some people were asking for the Splunk adapter (the so-called "JDBC driver for Splunk") to be improved. Good to hear that it's proving useful.

Now regarding the code.

One person noted that "mvn clean install" should just work for any maven-based project, and it doesn't. He's right. It should. I fixed it. Now it does.

I made some breaking API changes last week, so I upped the version to 0.2.

Expect the version numbers to continue to move erratically, because in our current development mode, it doesn't seem to make sense to have specific milestones. We're basically working on general stability rather than a few big features. We are trying to maintain backwards compatibility, but if we need to change API, we'll do it. I'll help dependent projects such as Lingual and Drill migrate to the new API, and make it as easy as possible for the rest of you.

Over the last week I'd been working on the code generation that powers SQL scalar expressions and built-in functions. This code generation is, obviously, used by the Java provider, but it can also be used by other providers. For instance, Lingual generates Java strings for filters that it passes to Janino. I've been working on OptiqSqlOperatorTest to get more and more of the built-in SQL functions to pass, and I've added OptiqAssert.planContains so that we can add tests to make sure that the minitiae of java code generation are as efficient as possible.

I still need to tell you about the extensions I've been making to Optiq SQL to support Drill (but useful to any project that wants to use nested data or late-binding schemas), but that will have to wait for its own blog post. Watch this space.

Tuesday, February 26, 2013

Announcing Lingual

The last few months, I've been collaborating on a project with Chris Wensel, the author of Cascading. Last week we announced Lingual, an open source project that puts a SQL interface on top of Cascading.

Architecturally, Lingual combines the Cascading engine with my own Optiq framework. Optiq provides the SQL interface (including JDBC), reads table and column definitions from Cascading's metadata store, and few custom Optiq rules target relational operations (project, filter, join and so forth) onto a Cascading operator graph. The queries are executed, on top of Hadoop, using Cascading's existing runtime engine.

Not everyone has heard of Cascading, so let me explain what it is, and why I think it fits well with Optiq. Cascading is a Java API for defining data flows. You write a Java program to build data flows using constructs such as pipes, filters, and grouping operators, Cascading converts that data flow to a MapReduce job, and runs it on Hadoop. Cascading was established early, picked the right level of abstraction to be simple and useful, and has grown to industry strength as it matured.

As a result, companies who are doing really serious things with Hadoop often use Cascading. Some of the very smartest Hadoop users are easily smart enough to have built their own Hadoop query language, but they did something even smarter — they layered DSLs such as Scalding and Cascalog on top of Cascading. In a sense, Optiq-powered SQL is just another DSL for Cascading. I'm proud to be in such illustrious company.

Newbies always ask, "What is Hadoop?" and then a few moments later, "Is Hadoop a database?". (The answer to the second question is easy. Many people would love Hadoop to be an "open source Teradata", but wanting it doesn't make it so. No Virginia, Hadoop is not a database.)

A cottage industry has sprung up of bad analogies for Hadoop, so forgive me if I make another one: Hadoop is, in some sense, an operating system for the compute cluster. After mainframes, minicomputers, and PCs, the next generation of hardware is the compute cluster. Hadoop is the OS, and MapReduce is the assembly language for that hardware — all powerful, but difficult to write and debug. UNIX came about to serve the then-new minicomputers, and crucial to its success was the C programming language. C allowed developers to be productive while writing code almost as efficient as assembler, and it allowed UNIX to move beyond its original PDP-7 hardware.

Cascading is the C of the Hadoop ecosystem. Sparse, elegantly composable, low-level enough to get the job done, but it abstracts away the nasty stuff unless you really want to roll up your sleeves.

It makes a lot of sense to put SQL on top of Cascading. There has been a lot of buzz recently about SQL on Hadoop, but we're not getting caught up in the hype. We are not claiming that Lingual will give speed-of-thought response times (Hadoop isn't a database, remember?), nor will it make advanced predictive analytics will be easy to write (Lingual is not magic). But Hadoop is really good at storing, processing, cleaning and exporting data at immense scale. Lingual brings that good stuff to a new audience.

A large part of that SQL-speaking audience is machines. I'd guess that 80% of the world's SQL statements are generated by tools. Machine-generated SQL is pretty dumb, so it essential that you have an optimizer. (As author of a tool that speaks SQL — Mondrian — and several SQL engines — Broadbase, LucidDB, SQLstream — I have been on both sides of this problem.) Once you have an optimizer, you can start doing clever stuff like re-organizing your data to make the queries run faster. Maybe the optimizer will even help.

Lingual is not a "SQL-like language". Because it is based on Optiq, Lingual is a mature implementation of ANSI/ISO standard SQL. This is especially important for those SQL-generating tools, which cannot rephrase a query to work around a bug or missing feature. As part of our test suite, we ran Mondrian on PostgreSQL, and captured the SQL queries it issued and the results the database gave. Then we replayed those queries — over 6,200 of them — to Lingual and checked that Lingual gave the same results. (By the way, putting Optiq and Cascading together was surprisingly easy. The biggest challenge we had was removing the Postgres-isms from thousands of generated queries.)

Lingual is not the only thing I've been working on. (You can tell when I'm busy by the deafening silence on this blog.) I've also been working on Apache Drill, using Optiq to extend SQL for JSON-shaped data, and I'll blog about this shortly. Also, as Optiq is integrated with more data formats and engines, the number of possibilities increases. If you happen to be at Strata conference tomorrow (Wednesday), drop me a line on twitter and we can meet up and discuss. Probably in the bar.