Julian Hyde on Streaming Data, Open Source OLAP. And stuff.: 08/01/2008

Wednesday, August 27, 2008

Database virtualization, distributed caching and streaming SQL

James Kobielus writes in Network World about how the need for scalable real-time business intelligence will create a convergence of technologies centered on database virtualization:

"Real-time is the most exciting new frontier in business intelligence, and virtualization will facilitate low-latency analytics more powerfully than traditional approaches. Database virtualization will enable real-time business intelligence through a policy-driven, latency-agile, distributed-caching memory grid that permeates an infrastructure at all levels.

As this new approach takes hold, it will provide a convergence architecture for diverse approaches to real-time business intelligence, such as trickle-feed extract transform load (ETL), changed-data capture (CDC), event-stream processing and data federation. Traditionally deployed as stovepipe infrastructures, these approaches will become alternative integration patterns in a virtualized information fabric for real-time business intelligence."

Kobielus makes it clear that this "virtualized information fabric" is an ambitious program that will be accomplished only over a number of years, but the underlying trends are visible now: for example, the convergence of distributed caches with databases, as evidenced by Oracle's acquisition of Tangosol and Microsoft's recently announced Project Velocity.

This envisioned system contains so many moving parts that a new paradigm will be needed to link them together. I don't think that databases are the answer. They elegantly handle stored data, but founder when dealing with change, caching, and the kind of replication problems you encounter when implementing virtualized and distributed systems. For example, database triggers are the standard way of managing change in a database, and are still clunky fifteen years after they were introduced. Likewise, Enterprise Information Integration (EII) systems were an attempt to extend the database model to handle federated data, but only work well for a prescribed set of distribution patterns.

I wrote recently about how SQLstream can implement trickle-feed ETL and use the knowledge it gleans from the passing data to proactively manage the mondrian OLAP engine's cache. SQLstream also has adapters to implement change-data capture (CDC) and to manage data federation.
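To make the trickle-feed idea concrete, here is a minimal Python sketch of the pattern described above: rows are appended as they arrive, and only the cache cells they affect are flushed. It is an illustration under stated assumptions, not SQLstream's or mondrian's actual API. The `AggregateCache` class, the `trickle_load` function, and the (month, region) grain are all hypothetical names chosen for this example.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

CellKey = Tuple[str, str]  # (month, region): a hypothetical aggregation grain


class AggregateCache:
    """Toy stand-in for an OLAP engine's aggregate cache (not mondrian's API)."""

    def __init__(self) -> None:
        self._cells: Dict[CellKey, float] = {}

    def put(self, key: CellKey, total: float) -> None:
        self._cells[key] = total

    def get(self, key: CellKey) -> Optional[float]:
        return self._cells.get(key)

    def flush(self, keys: Iterable[CellKey]) -> None:
        # Drop only the named cells, leaving the rest of the cache warm.
        for key in keys:
            self._cells.pop(key, None)


def trickle_load(
    rows: Iterable[Dict[str, object]],
    fact_table: List[Dict[str, object]],
    cache: AggregateCache,
) -> Set[CellKey]:
    """Append each arriving row, then flush just the cells those rows touched."""
    touched: Set[CellKey] = set()
    for row in rows:
        fact_table.append(row)
        touched.add((str(row["month"]), str(row["region"])))
    cache.flush(touched)
    return touched
```

The design point is that the loader already sees every row, so it knows exactly which cached aggregates are stale and can avoid flushing the whole cache on each load.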

In SQLstream, the lingua franca for all of these integration patterns is SQL, whereas ironically, if you tried to achieve these things in Oracle or Microsoft SQL Server, you would end up writing procedural code: PL/SQL or Transact-SQL. Therefore streaming SQL - a variant of what Kobielus calls event-stream processing where, crucially, the language for event processing is SQL - seems the best candidate for that unifying paradigm.
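For readers unfamiliar with event-stream processing, the following Python sketch shows the kind of continuous, windowed aggregate that such systems compute. It is deliberately procedural, which is exactly the sort of code a streaming SQL dialect would let you replace with a single declarative query. The event shape, the `sliding_sum` name, and the window length are assumptions made for this example, and nothing here reflects SQLstream's syntax.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

Event = Tuple[float, str, float]  # (timestamp_seconds, key, amount): hypothetical shape


def sliding_sum(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, str, float]]:
    """For each event, yield (timestamp, key, trailing-window sum for that key).

    Assumes events arrive in non-decreasing timestamp order.
    """
    windows: Dict[str, Deque[Tuple[float, float]]] = {}
    totals: Dict[str, float] = {}
    for ts, key, amount in events:
        window = windows.setdefault(key, deque())
        window.append((ts, amount))
        totals[key] = totals.get(key, 0.0) + amount
        # Evict rows that have fallen out of the trailing window.
        while window and window[0][0] <= ts - window_seconds:
            _, old_amount = window.popleft()
            totals[key] -= old_amount
        yield ts, key, totals[key]
```

Each output row is produced as soon as its input event arrives, rather than after a batch load, which is the low-latency property the post is concerned with.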

Hawkwatch

It's that time of year when the days are getting imperceptibly shorter, birds start thinking of heading south, and a couple of hundred volunteer birders of the Golden Gate Raptor Observatory (GGRO) head to Hawk Hill to watch them.

Rufous-morph Red-Tailed Hawk over Hawk Hill

Hawk Hill is at the southern tip of the Marin Headlands overlooking the Golden Gate Bridge, which naturally funnels migrating raptors from a fifty-mile stretch into less than a mile. The result is a huge concentration of raptors. During peak season (which, not coincidentally, is usually within a day or two of the autumn equinox) you will typically see over 100 birds an hour from 12 or 13 species of raptors, including eagles, falcons, accipiters and some of the rarer buteo hawks.

It has to be seen to be believed, so if you're curious, come up to Hawk Hill and see for yourself. When you get there, on any day between late August and early December, as long as Hawk Hill isn't shrouded in fog, you'll find about a dozen volunteers with binoculars counting hawks.

The GGRO has been counting and banding raptors at this site for over twenty years. I have been volunteering for 6 years, and this year I have stepped up to be day leader of the Saturday II hawkwatch team. With the depth of hawk-watching experience on the team, it's not too onerous a job. The main responsibility is to ensure that the numbers are recorded systematically — this is a scientific study, after all — and, when the fog rolls in, to tell the team that it's time to hang up the binoculars for the day.

To help me keep up with the action, I built an RSS feed that contains a summary of each day's hawk watching, including a count of each species of raptor. The first week was a wash — fog every day — but we can expect to see numbers, particularly of the accipiters (Cooper's hawk and Sharp-shinned hawk), climbing rapidly over the next two or three weeks. Subscribe to that feed and you can get daily updates too; or even better, join us on the hill!

Tuesday, August 19, 2008

Mondrian on TimesTen

Funny what you find while googling for error messages: apparently mondrian runs on TimesTen.

One of the bizarre things about open source is that you have no way of knowing who is using your project, and on what platform. (Until they find something wrong, that is.)

Monday, August 18, 2008

Really urgent analytics

A Forrester report entitled "Really Urgent Analytics: The Sweet Spot for Real-Time Data Warehousing" makes the connection between event-stream processing (ESP) and data warehousing, and Intelligent Enterprise published a nice summary.

A traditional data warehouse contained huge amounts of data but was loaded infrequently: say monthly, or nightly at best. Modern businesses demand actions at lower latencies, and data warehousing professionals have been able to tune the traditional data warehouse load process to reduce latency.

But even when cranked up to the maximum, the load process cannot achieve latencies of less than a few seconds, whereas many business processes need their answers faster than that. And this performance comes at the cost of higher complexity, so it takes more time and effort to modify the load process to incorporate new data or ask new questions.

To solve the squeeze between lower latency and increasing complexity -- and, I would mention, ever-increasing data volumes and a trend towards distributed systems -- data warehousing needs a new architectural component, and Forrester rightly point to Event Stream Processing to fill that gap. I would add that, given the skill set of data warehousing professionals, it makes a lot of sense for that Event Stream Processing to be in SQL.

At SQLstream, we saw this need four years ago, and are dedicated to solving the latency-complexity problem using SQL.