Wednesday, August 27, 2008

Database virtualization, distributed caching and streaming SQL

James Kobielus writes in Network World about how the need for scalable real-time business intelligence will create a convergence of technologies centered on database virtualization:
"Real-time is the most exciting new frontier in business intelligence, and virtualization will facilitate low-latency analytics more powerfully than traditional approaches. Database virtualization will enable real-time business intelligence through a policy-driven, latency-agile, distributed-caching memory grid that permeates an infrastructure at all levels.

As this new approach takes hold, it will provide a convergence architecture for diverse approaches to real-time business intelligence, such as trickle-feed extract transform load (ETL), changed-data capture (CDC), event-stream processing and data federation. Traditionally deployed as stovepipe infrastructures, these approaches will become alternative integration patterns in a virtualized information fabric for real-time business intelligence."
Kobielus makes it clear that this "virtualized information fabric" is an ambitious program that will be accomplished only over a number of years, but the underlying trends are visible now. One example is the convergence of distributed caches with databases, as evidenced by Oracle's acquisition of Tangosol and Microsoft's recently announced Project Velocity.

This envisioned system contains so many moving parts that a new paradigm will be needed to link them together. I don't think that databases are the answer. They elegantly handle stored data, but founder when dealing with change, caching, and the kind of replication problems you encounter when implementing virtualized and distributed systems. For example, database triggers are the standard way of managing change in a database, and they are still clunky fifteen years after they were introduced. Enterprise Information Integration (EII) systems were an attempt to extend the database model to handle federated data, but they only work well for a prescribed set of distribution patterns.

I wrote recently about how SQLstream can implement trickle-feed ETL and use the knowledge it gleans from the passing data to proactively manage the mondrian OLAP engine's cache. SQLstream also has adapters to implement change-data capture (CDC) and to manage data federation.
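
To make the cache-management idea concrete, here is a minimal sketch in Python. It is not SQLstream's or mondrian's actual API: the names AggregateCache, apply_change and trickle_feed, and the per-region totals, are all hypothetical and chosen only for illustration. The point it tries to show is that the process feeding rows into the store already knows which cached aggregates each row affects, so it can patch them in place rather than flushing the cache.

```python
class AggregateCache:
    """Hypothetical cache of per-region totals (illustration only)."""

    def __init__(self, load_total):
        # load_total(region) stands in for an expensive query against the store.
        self._load_total = load_total
        self._totals = {}

    def get(self, region):
        # Load on first access, then serve from memory.
        if region not in self._totals:
            self._totals[region] = self._load_total(region)
        return self._totals[region]

    def apply_change(self, region, delta):
        # Patch the cached total if present; uncached regions load on demand.
        if region in self._totals:
            self._totals[region] += delta


def trickle_feed(rows, store, cache):
    """Apply each (region, delta) row to the store, then tell the cache."""
    for region, delta in rows:
        store[region] = store.get(region, 0) + delta
        cache.apply_change(region, delta)


store = {"east": 100}
cache = AggregateCache(lambda region: store.get(region, 0))
cache.get("east")                      # warms the cache for "east"
trickle_feed([("east", 5), ("west", 3)], store, cache)
```

After the feed, the cached total for "east" matches the store without being reloaded, and "west" is loaded only when somebody first asks for it.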

In SQLstream, the lingua franca for all of these integration patterns is SQL. Ironically, if you tried to achieve these things in Oracle or Microsoft SQL Server, you would end up writing procedural code: PL/SQL or Transact-SQL. Therefore streaming SQL, a variant of what Kobielus calls event-stream processing where, crucially, the event-processing language is SQL, seems the best candidate for that unifying paradigm.
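
For readers who have not met a streaming query, the following Python sketch shows roughly what one computes. It is only an illustration under my own assumptions, not SQLstream code: the function name sliding_sum, the (timestamp, key, amount) event shape and the 60-second window are all invented here. A streaming SQL engine would let you state the same thing declaratively as a windowed aggregate; the procedural version below is the bookkeeping you would otherwise write by hand.

```python
from collections import defaultdict, deque


def sliding_sum(events, window_seconds):
    """For each (timestamp, key, amount) event, yield the running total
    for that key over the trailing window. Events must arrive in
    timestamp order."""
    windows = defaultdict(deque)   # key -> events still inside the window
    totals = defaultdict(int)      # key -> sum of amounts in the window
    for ts, key, amount in events:
        window = windows[key]
        window.append((ts, amount))
        totals[key] += amount
        # Evict events that have fallen out of the trailing window.
        while window and window[0][0] <= ts - window_seconds:
            _, old_amount = window.popleft()
            totals[key] -= old_amount
        yield ts, key, totals[key]


events = [(0, "a", 10), (30, "a", 5), (61, "a", 1), (62, "b", 7)]
results = list(sliding_sum(events, window_seconds=60))
```

Note that the query never terminates in the usual sense: it emits a result row for every arriving event, which is what distinguishes it from a query over stored data.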

2 comments:

Samir said...

I didn't think I'd live to see the day when Julian Hyde would say "databases are the answer" to any problem. ;)

Anonymous said...

Hi Julian,

Thanks for the blog about database virtualization and distributed caching. It’s true that lots of applications need application caching to enhance the performance and scalability. Databases efficiently store the data but are not scalable.

NCache has been the caching solution of choice for various mission critical applications throughout the world since 2005. Its scalability, high availability and performance are the reasons it has earned the trust of developers, senior IT and management personnel within high profile companies. These companies are involved in a range of endeavors including ecommerce, financial services, health services and airline services.

Download a 60 day trial enterprise/developer version or totally FREE NCache Express from www.alachisoft.com/download.html

Team NCache