Julian Hyde on Streaming Data, Open Source OLAP. And stuff.: 09/01/2008

Friday, September 26, 2008

The only pie chart you can really trust

[From logicnest via infosthetics.]

Don't worry, this isn't turning into a humor site. Normal service will resume shortly. The last couple of days, I just needed a laugh.

Tuesday, September 23, 2008

From Mr. Goodman this morning. A blogger equivalent of “send this mail to 10 people you know”.

1. Take a picture of yourself right now.
2. Don’t change your clothes, don’t fix your hair... just take a picture.
3. Post that picture with NO editing.
4. Post these instructions with your picture.

PS I'm on workation a friend's cottage in Pacific Grove for a few days, so don't have my camera to hand. I had to learn how to send a picture from my phone to my PC via Bluetooth. Cool.

Sunday, September 21, 2008

Is Oracle about to embrace MPP?

Oracle's Larry Ellison has some major announcements to make during Oracle's OpenWorld conference this coming week in San Francisco. A few months ago he was promising to announce a "major database innovation", but declined to give further details, so the Oracle community has been speculating furiously.

With a keynote entitled "Extreme Performance," and product announcements coming in areas of grid computing and database acceleration, all the indications are that Oracle is getting serious about problems that require massive scalability, massive throughput, and low latency.

This is an area where Oracle has been falling behind. In Oracle's approach, which independent database analyst Curt Monash calls a "shared everything" architecture, multiple servers belong to the same Oracle Real Application Cluster (RAC) and share a common pool of memory and disk storage. But this approach does not allow Oracle to be run on hundreds or thousands of servers, which is how companies such as Google are solving problems which require large amounts of storage and processing. That sort of massively parallel processing (MPP) requires a "shared nothing" architecture, and internet companies have been rolling their own architectures out of simpler components.

The result is that Oracle "is way behind in the 'scale-out' world," said Paul Vallee, CEO of The Pythian Group, an Ottawa, Ontario-based database services provider. "MySQL is eating its lunch in terms of Internet-scaled deployments."

Oracle's own experts seem to agree. In the abstract for a talk "Oracle's New Database Accelerator: A Technical Overview", Ron Weiss writes:

"New and revolutionary solutions and methodologies are coming together to handle the exploding data volumes real-world systems are being required to store and serve up. Supporting ever-larger databases, with ever-increasing demands for getting "answers" faster, requires a new way to approach the problem."

Weiss's solution uses improvements to storage management, but I doubt that it would satisfy Google's requirements, or even the price/performance requirements of a medium-sized internet media company.

Meanwhile, those who have adopted shared-nothing architectures are feeling the pain too. Having stitched together hundreds or thousands of databases, the problem is how to populate and coordinate them. For example, internet companies' transaction rates are so high that it is not possible to load the day's data during an eight hour nightly load window, and besides, business owners want to see data in near real time. Organizations are adapting a 'trickle ETL' process to populate the data warehouse continuously and with low latency.

So data architects seem to be caught between a rock and a hard place. Either stick with Oracle's shared-everything (or indeed IBM DB2 or Microsoft SQL Server - they have the same approach) and live with the scalability limitations, or move to the wild frontier of shared-nothing, and be prepared to spend a lot of effort managing, populating and coordinating your farm of databases.

Ironically, the answer, as Larry Ellison and his cohorts taught us thirty years ago, is in the relational model. By extending the relational model beyond stored data to include streaming data, SQL can be used to efficiently manage data flowing into and between multiple databases, as well as storage and retrieval within those databases. This creates a scalable shared-nothing system, with databases decoupled from each other, but because the data flow is managed by declarative SQL, it is as manageable as a shared-everything system such as Oracle.

SQLstream is an implementation of this new SQL, and can be applied to continuous ETL, real-time BI and monitoring problems. For example, if there are many data sources for your ETL process, and many servers to be populated, SQLstream can act as a cross-hatch, load-balancing the data, aggregating, and routing each row to the correct database engine with low latency. And because SQLstream's SQL encompasses both data at rest and data in flight, it can correlate data in the warehouse with arriving data.

SQLstream is partnering with companies that are building next-generation data warehousing architectures on Oracle and on other databases. Aeturnum is an exciting new delivery partner for SQLstream with extensive expertise in data warehousing (Netezza) and business intelligence (Pentaho).

Come and see SQLstream at Oracle OpenWorld. We will be at the Aeturnum stand (2716 Moscone South) all this week.

Monday, September 15, 2008

Fall migration kicks in...

You can't fault nature.

We've been counting raptors for almost a month now, and the numbers have been really, really low. I've been promising everyone that the peak of raptor migration, will be within a day or two of the equinox, because that's how it always goes, but I was getting worried.

We weren't seeing enough birds. In particular, the accipiters (the Cooper's Hawk and Sharp-shinned Hawk), which form the bulk of the equinox peak, were nowhere to be seen. The counts consisted mainly of the Turkey Vultures and Red-tailed Hawks which are ubiquitous around Hawk Hill. Had something gone wrong? Had this summer's forest fires disrupted the breeding season and delayed the migration?

Look at the accipiter numbers for last week:

Sun 8: 2 sharpies, no coops
Mon 9: no birds (fog)
Tue 10: no sharpies, 1 coop
Wed 12: 1 sharpie, 1 coop
Thu 11: 8 sharpies, 3 coops
Fri 12: 5 sharpies, 3 coops

Then came my day on the hill, Saturday. It was a slow start, foggy at first, and overcast for most of the day, but the birds started coming. We had 40 sharpies, 8 coops. We also got a juvenile Golden Eagle (at 10.30am even -- conventional wisdom has it that eagles are late risers, like the thermals they soar upon), and a couple of Broad-winged Hawks. (According to the books, you won't see a broadie west of Kansas, but the Marin Headlands are very effective at channeling the few we do have into a narrow stream.)

Sat 13: 40 sharpies, 8 coops

And in the last couple of days, the trend has accelerated:

Sun 14: 111 sharpies, 29 coops
Mon 15: 133 sharpies, 44 coops

Yes folks, it looks like we have a fall migration after all. As sure as clockwork, the changing day-length is telling those birds to head south. Check the counts over the next week or two, you should see the spectacle continue.

(Note that all statistics quoted are copyright of the GGRO and may not be reproduced without permission.)

Wednesday, September 03, 2008

SQL extension to allow FIRST_VALUE and LAST_VALUE in GROUP BY query

At SQLstream, we have come across an interesting query pattern that seems to be difficult to express in standard SQL (SQL:2003 or SQL:2008). It turns out to be applicable to regular SQL as well as streaming SQL, and therefore it would make sense as an extension to the SQL standard.

First some background, for those of you who don't fall asleep every night reading the SQL standard. There are two kinds of aggregation in standard SQL: windowed aggregation, of the form

function(arg {, arg}) OVER window-specification

and grouped aggregation, which is of the form

function(arg {, arg})

and requires a GROUP BY clause. (If the GROUP BY clause is not present, 'GROUP BY ()' is assumed.)

According to the standard, these two forms should never meet. It is illegal to use a windowed aggregation in a SELECT that has a GROUP BY, or to mix grouped aggregation and windowed aggregation in the same SELECT. (It's OK to use one in a sub-query and another in an enclosing query.)

However, here is a very reasonable query that is difficult to express in standard SQL: Given a record of every trade on a stock exchange, give me the volume and closing price of each ticker symbol. You might try

SELECT
    day,
    ticker,
    SUM(shares) AS volume,
    LAST_VALUE(price) AS closingPrice
FROM Trades
GROUP BY day, ticker

but this is illegal SQL. Why is it illegal? Because the LAST_VALUE function (like FIRST_VALUE and RANK) is a windowed aggregate function and is only meaningful on an ordered set.

To introduce the notion of ordering, I propose that the following query should be valid:

SELECT
    day,
    ticker,
    SUM(shares) AS volume,
    LAST_VALUE(price) OVER (ORDER BY timeOfDay) AS closingPrice
FROM Trades
GROUP BY day, ticker

With the OVER clause, LAST_VALUE is now a windowed aggregate function within the context of a GROUP BY query, which was previously illegal. Every windowed aggregate is applied to a window, so what is the window in this case? We want the window to contain all of the rows with the same day and ticker value, and to be sorted by timeOfDay. In other words, the window inherits the GROUP BY columns as its implicit PARTITION BY clause. It is as if they had written

LAST_VALUE(price) OVER (PARTITION BY day, ticker ORDER BY timeOfDay)

Now, if you know that I work for SQLstream, you will guess that I am motivated to make this work for streaming queries. A streaming aggregation query over the Trades stream would look like this:

SELECT STREAM
    day,
    ticker,
    SUM(shares) AS volume,
    LAST_VALUE(price) OVER (ORDER BY timeOfDay) AS closingPrice
FROM Trades
GROUP BY day, ticker

This is identical to the traditional, non-streaming SQL above, except for the STREAM keyword that tells SQLstream that the result of the query should be a stream.

In idiomatic SQLstream SQL, we would typically express the query as follows:

SELECT STREAM
    FLOOR(t.ROWTIME TO DAY),
    ticker,
    SUM(shares) AS volume,
    LAST_VALUE(price) OVER () AS closingPrice
FROM Trades AS t
GROUP BY FLOOR(t.ROWTIME TO DAY), ticker

This form uses SQLstream's system ROWTIME column and the 'FLOOR(datetime expression TO time unit)' operator, and so can dispense with the day and timeOfDay columns. Also, streams are ordered by ROWTIME by default, so we can abbreviate 'OVER (ORDER BY ROWTIME)' to 'OVER ()'. This form is more terse, and more typical of how the query would be written in a SQLstream application, but the previous form works also.

The end result is powerful and, I think, consistent with the spirit of standard SQL.

Julian Hyde on Streaming Data, Open Source OLAP. And stuff.