Julian Hyde on Streaming Data, Open Source OLAP. And stuff.: 12/01/2008

Monday, December 15, 2008

Streaming content feeds part 2: forging the Streaming Web

My previous blog post "Streaming analytics over content feeds (and how content feeds could be better)" drew some excellent comments, so I thought I'd follow up with some more thoughts about a protocol for streaming web content, and a vision that I'll dub the "Streaming Web".

To John Kalucki's points first. I absolutely agree that the driver for this protocol is latency. But it is difficult to answer the question "what latency is necessary?", because we don't yet know what applications people will devise.

(An illustration of how latency changes everything, from a very different business: when my wife worked for Niman Ranch, I was amazed to hear that they dispatch steaks via FedEx (packed in ice and insulation, and sent overnight); this would be out of the question using the USPS and a three day delivery time.)

I believe that real-time web content feeds are a game changer. I call it the Streaming Web — a web where every piece of content is accessible via a URL and you can subscribe to be alerted immediately if a piece of content changes. Every page would become a potential feed, and there would be agents that allow us to collect and filter content we are interested in: be it a friend's photo album or the price of a plane ticket.

A huge effort is required to make the Streaming Web a reality. The first steps, the web content formats such as RSS and Atom, are already in place. The next step is to introduce a protocol so that subscribers are notified of changes as soon as they happen.

John says:

What experience can you offer with feeds at a 50ms push latency vs a 180,000ms pull latency? If a machine is consuming the feed, not much. If a human is immediately consuming a feed, perhaps a great deal.

I agree that a human can benefit from low-latency content, although there is little benefit in content arriving faster than the human's think time (say 5,000ms). But if a computer is the consumer, ideal latencies span a broad spectrum: a mail server would operate more efficiently if it were allowed latencies in the minutes or hours, whereas an automated stock trading system needs information to arrive within 50ms.

Today, not much web content is of interest to automated stock trading systems. Most web content feeds today are textual — written by humans, and consumed by humans — but I believe that once we remove the latency constraints and introduce some standard protocols, we will start to see more structured data in feeds. Also, we will see algorithms for extracting information from textual feeds.

As for the right protocol for the job, I am not really the best judge, so I am going to punt for now, and focus on the architecture. Richard Taylor suggests XMPP. It seems to have the right qualifications, and I'm sure that it could be made to work technically. (And I see that XMPP is already a central part of the Twitter ecosystem.) It comes down to power versus simplicity: the power of an established standard versus the simplicity required to reach a new audience of developers.

I've been around long enough to see new approaches overturn "over-complex" existing technologies and then, in time, acquire the features that made their predecessors complicated. Take for example SOAP overturning CORBA, or PCs overturning minicomputers. I'm not going to take sides: these revolutions are part of the process of how technology moves forward. But it does seem that each revolution will only be successful if the new technology serves a new audience. And, to borrow Einstein's words, a protocol should be as simple as possible, but no simpler; otherwise, even if the technology finds its initial audience, it will not survive its growing pains.

I'm not a big fan of XML as a protocol for transmitting data over a network, mainly because it is bulky, and that makes it expensive to produce and consume at high data rates. But for this protocol, I would choose XML over a binary format. If you're a developer learning a new protocol, it's a lot easier to debug your code if you can read the messages being sent over the network as text.

Which brings us to the audience for this protocol. I do agree with Stefan Tilkov that "[f]or the majority of use cases, [the polling] approach is vastly superior to a push model". That majority is already well served, so I'm focusing on the minority that need low latency. I think those use cases are important, and we'll all be using them if the "streaming web" thing catches on.

To achieve low latency feeds, push is more efficient than high-frequency polling, but it is still more expensive than low-frequency polling, which is what people are doing today. So, if every web content aggregator and RSS reader switched to a low-latency push protocol overnight, the system would collapse.

But luckily, there is no need for those millions of clients who would like to receive low-latency feed updates to connect using this new protocol. If those clients are humans, they will be happy to receive their updates via XMPP or SMS, or slower protocols like email. A single server could speak the streaming web feed protocol to various source feeds, and route the results to thousands of end users via XMPP or SMS. This approach means that each source feed is serving a modest number of downstream servers.

I'd describe it as a 'wholesale' architecture. A food producer has a central depot, where it loads its goods onto the trucks of several client stores. The food company allows consumers to buy from the depot, if they are prepared to buy their goods in bulk, but most consumers opt for the convenience of visiting a local store and buying their goods in smaller quantities.

(If you're Twitter, no problem is ever small, so that 'modest number' is probably in the tens of thousands. But I suppose that problem can be solved using multiple tiers of servers and fanning out streams between one tier and the next.)
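
A minimal sketch of that tiered fan-out, in Python. The `Node` class, the relay names and the user inboxes are all hypothetical, made up for illustration; a real deployment would speak a network protocol between tiers rather than call functions in one process.

```python
class Node:
    """Receives an item and forwards it to each downstream subscriber."""

    def __init__(self, name):
        self.name = name
        self.downstream = []  # callables that accept one item
        self.sent = 0         # number of deliveries this node has made

    def subscribe(self, receiver):
        self.downstream.append(receiver)

    def receive(self, item):
        for receiver in self.downstream:
            receiver(item)
            self.sent += 1


# Six end users, each with an inbox (standing in for XMPP or SMS delivery).
inboxes = {f"user{i}": [] for i in range(6)}

# One source feed, two relay servers in the middle tier.
source = Node("source")
relays = [Node("relay0"), Node("relay1")]
for relay in relays:
    source.subscribe(relay.receive)

# Spread the end users across the relays.
for i, inbox in enumerate(inboxes.values()):
    relays[i % len(relays)].subscribe(inbox.append)

source.receive("new post")
```

Here the source makes only two deliveries, one per relay, yet all six users receive the item. Adding another tier multiplies the reach again without adding load at the source.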

The next step in the evolution of the architecture would be to introduce a query language. Queries present a more convenient interface for clients, and they would also have architectural advantages. For example, using a query, a client can specify more precisely which content it is interested in. It would save CPU effort on the client and possibly the server, and bandwidth for everyone, so there would be a strong incentive to use queries rather than raw feeds.
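
To make the bandwidth argument concrete, here is a rough Python sketch of a provider applying a subscriber's query before sending anything. The `filter_feed` function and its `category` and `keyword` parameters are assumptions for illustration, not part of any existing feed protocol; a real query language would be far more expressive.

```python
def filter_feed(rows, category=None, keyword=None):
    """Return only the rows that match the subscriber's (hypothetical) query."""
    result = []
    for row in rows:
        if category is not None and row["category"] != category:
            continue
        if keyword is not None and keyword.lower() not in row["title"].lower():
            continue
        result.append(row)
    return result


feed = [
    {"category": "U.S.",
     "title": "Ex-FBI agent faces 30 years to life for mob hit - CNN"},
    {"category": "More Top Stories",
     "title": "Bill Richardson chalks up another Cabinet job for the resume - Los Angeles Times"},
    {"category": "More Top Stories",
     "title": "Showdown in Hebron as settlers evicted - Jewish Telegraphic Agency"},
]

raw = filter_feed(feed)  # no query: the whole feed is sent
matched = filter_feed(feed, category="More Top Stories", keyword="hebron")
```

With no query, all three rows cross the network and the client discards what it does not want. With the query, only the single matching row is sent.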

Queries would also allow feeds to be virtualized: rather than talk directly to Blogger and TypePad, a client could talk to a third party that aggregates the content into a single feed.

Streaming SQL would be a good candidate for expressing these queries, but is by no means the only choice. And in fact the architecture and protocol would work well enough for clients that did not use queries and wanted to consume only raw feeds.

The resulting system, the Streaming Web, would enable applications yet to be imagined.

Friday, December 12, 2008

Streaming analytics over content feeds (and how content feeds could be better)

We have been experimenting with different web-based data sources for SQLstream. Seth Grimes saw the demo, and wrote a piece "BI on Content Feeds, a.k.a. Continuous (Twitter) Transformation" in Intelligent Enterprise.

Social networks and web content feeds such as RSS have, in a few short years, added a dynamic component to the vast static content on the web. As less-sophisticated users have become more accustomed to consuming them, these feeds have become a ubiquitous part of the web experience.

Web feeds have an information content that is at present untapped. In the same way that a radical new approach, the search engine, was needed to harness the static information content of the web, a streaming analytics solution will be needed to harness the information in feeds, and sooner rather than later.

SQLstream Studio showing web content feeds

The SQLstream prototype illustrates how several data formats (tweets from Twitter, USGS quake data in RSS format, news from Google's Atom feed, and so forth) can be integrated into SQLstream.

For each data format we built an adapter that implemented the SQL/MED specification, and using these adapters we mapped each feed into SQLstream as a foreign stream. Once data is in SQL format, you can build views on top of these streams to filter, join and aggregate records.

Now that we've done the hard part (getting the data feeds into a common format), there are plenty of ways to extract information from the feeds. For instance, it would be easy to find out which Twitter users are the most active over the last hour or the last seven days.
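
As an illustration of the "most active users" idea, here is a small Python sketch that counts tweets per user inside a time window. It is not how SQLstream does it (there you would write a view over the stream); the `most_active_users` function, the `(rowtime, user)` tuple layout and the sample data are all assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta


def most_active_users(tweets, now, window, top_n=3):
    """Count tweets per user whose rowtime falls in (now - window, now]."""
    cutoff = now - window
    counts = Counter(user for rowtime, user in tweets if cutoff < rowtime <= now)
    return counts.most_common(top_n)


now = datetime(2008, 12, 12, 12, 0, 0)
tweets = [
    (now - timedelta(minutes=5), "alice"),
    (now - timedelta(minutes=20), "bob"),
    (now - timedelta(minutes=30), "alice"),
    (now - timedelta(hours=2), "bob"),
    (now - timedelta(days=3), "carol"),
]

last_hour = most_active_users(tweets, now, timedelta(hours=1))
last_week = most_active_users(tweets, now, timedelta(days=7))
```

The same function answers both questions; only the window changes. A streaming engine would maintain these counts incrementally as rows arrive instead of rescanning a list.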

Or you could pull apart messages to discover word frequencies, and write a stream that detects words that are being used more frequently than usual (similar to Google zeitgeist but in real time).
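
A rough Python sketch of the word-frequency idea follows. It compares each word's share of recent messages against its share of a baseline period and flags words whose share has jumped. The function names, the `ratio` and `min_count` thresholds and the add-one smoothing are assumptions chosen for illustration, not a tuned detector.

```python
from collections import Counter

PUNCTUATION = ".,!?;:\"'()"


def tokenize(message):
    """Split a message into lower-case words, dropping surrounding punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in message.split())
    return [w for w in words if w]


def trending_words(baseline_msgs, recent_msgs, ratio=3.0, min_count=2):
    """Return words whose relative frequency rose by at least `ratio`."""
    baseline = Counter(w for m in baseline_msgs for w in tokenize(m))
    recent = Counter(w for m in recent_msgs for w in tokenize(m))
    base_total = sum(baseline.values())
    recent_total = sum(recent.values()) or 1
    trending = []
    for word, count in recent.items():
        if count < min_count:
            continue  # ignore words seen only once; too noisy
        recent_rate = count / recent_total
        # Add-one smoothing so a word absent from the baseline does not divide by zero.
        base_rate = (baseline[word] + 1) / (base_total + 1)
        if recent_rate / base_rate >= ratio:
            trending.append(word)
    return sorted(trending)


baseline_msgs = ["the weather is nice today", "the coffee is good", "lunch is at noon"]
recent_msgs = ["earthquake in the bay", "did you feel the earthquake", "earthquake!"]
hot = trending_words(baseline_msgs, recent_msgs)
```

In this sample, "earthquake" is flagged because it was absent from the baseline, while "the" is not, because it is just as common in both periods.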

But the prototype has some limitations: news items tend to arrive in bursts every couple of minutes, and many Twitter messages are missing. These are all limitations of the data sources with respect to latency (how soon messages arrive) and throughput (how many messages per second the system can handle), and the limitations stem from the inefficiencies of the web feed protocols.

You would think that something called a 'feed' would push content to subscribers as soon as it arrives, but in fact RSS and the other feed types in the prototype use a pull protocol. With a pull protocol, the subscriber needs to continually poll the feed to get the content (typically an XML document a few kilobytes long), parse the content, and figure out what, if anything, is new since the last poll.
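
The "figure out what is new" step can be sketched in a few lines of Python. This assumes each item carries a stable identifier (shown here as a hypothetical `id` key; RSS and Atom have their own identifier elements) and that the subscriber remembers every identifier it has seen.

```python
def new_items(polled_items, seen_ids):
    """Return the items not seen in earlier polls, and remember their ids."""
    fresh = [item for item in polled_items if item["id"] not in seen_ids]
    seen_ids.update(item["id"] for item in fresh)
    return fresh


seen = set()
first_poll = [{"id": "a", "title": "First"}, {"id": "b", "title": "Second"}]
second_poll = [{"id": "b", "title": "Second"}, {"id": "c", "title": "Third"}]
third_poll = list(second_poll)  # nothing changed on the server

fresh_1 = new_items(first_poll, seen)
fresh_2 = new_items(second_poll, seen)
fresh_3 = new_items(third_poll, seen)
```

The third poll downloads and parses the whole document only to learn that nothing is new, which is exactly the wasted work that makes frequent polling expensive.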

This process soaks up a lot of network bandwidth and resources for both the provider and the subscriber, and the cost goes up the more frequently we poll. Typically the provider has to throttle the feed to prevent their servers from being overwhelmed. For example, Twitter updates its feed only once per minute and limits the number of tweets on the page. At times of high volume, only a small percentage of tweets make it into the feed.

This may not sound that serious if the content is a Twitter conversation between friends, or a blog with one or two posts a week. But web feed protocols are becoming part of the IT infrastructure, and business users require lower latency, higher throughput and higher availability. (The existence of services like Gnip is evidence of the need to control the web content chaos.)

I would like to see the emergence of a genuine 'push' protocol for web-based content. It doesn't have to be particularly complicated. To illustrate what I have in mind, here is an example of a simple, stateless protocol, built using XML over HTTP, like the current feed formats. A subscriber sends a request

<readRequest>
  <minimumRowtime>2008-12-04 18:00:46.000</minimumRowtime>
  <maximumCount>1000</maximumCount>
  <maximumWait>10s</maximumWait>
</readRequest>

over HTTP, and the provider replies with a set of content records

<rows>
  <row>
    <rowtime>2008-12-04 18:00:46.217</rowtime>
    <category>U.S.</category>
    <title>Ex-FBI agent faces 30 years to life for mob hit - CNN</title>
  </row>
  <row>
    <rowtime>2008-12-04 18:00:46.714</rowtime>
    <category>More Top Stories</category>
    <title>Bill Richardson chalks up another Cabinet job for the resume - Los Angeles Times</title>
  </row>
  <row>
    <rowtime>2008-12-04 18:00:48.104</rowtime>
    <category>More Top Stories</category>
    <title>Showdown in Hebron as settlers evicted - Jewish Telegraphic Agency</title>
  </row>
</rows>

According to the protocol, the provider sends the results after 10 seconds, or when there are 1000 records to return, whichever occurs sooner. After it has received a result, the subscriber will typically ask for the next set of rows with a higher rowtime threshold.
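
Here is a minimal Python sketch of the subscriber's side of that exchange: building a `readRequest`, parsing a `rows` reply, and working out the threshold for the next request. The function names are made up for illustration, and the HTTP transport is left out. The example protocol does not say whether `minimumRowtime` is inclusive, so the sketch simply reports the latest rowtime seen and leaves that choice open.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ROWTIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def format_rowtime(ts):
    """Format a timestamp with millisecond precision, as in the sample messages."""
    return ts.strftime(ROWTIME_FORMAT)[:-3]  # trim microseconds to milliseconds


def build_read_request(minimum_rowtime, maximum_count=1000, maximum_wait="10s"):
    root = ET.Element("readRequest")
    ET.SubElement(root, "minimumRowtime").text = format_rowtime(minimum_rowtime)
    ET.SubElement(root, "maximumCount").text = str(maximum_count)
    ET.SubElement(root, "maximumWait").text = maximum_wait
    return ET.tostring(root, encoding="unicode")


def parse_rows(xml_text):
    """Turn a <rows> reply into a list of dicts."""
    rows = []
    for row in ET.fromstring(xml_text).findall("row"):
        rows.append({
            "rowtime": datetime.strptime(row.findtext("rowtime"), ROWTIME_FORMAT),
            "category": row.findtext("category"),
            "title": row.findtext("title"),
        })
    return rows


def next_minimum_rowtime(rows, previous):
    """Latest rowtime received, or the previous threshold if the reply was empty."""
    return max((r["rowtime"] for r in rows), default=previous)


start = datetime(2008, 12, 4, 18, 0, 46)
request = build_read_request(start)
response = """<rows>
  <row>
    <rowtime>2008-12-04 18:00:46.217</rowtime>
    <category>U.S.</category>
    <title>Ex-FBI agent faces 30 years to life for mob hit - CNN</title>
  </row>
  <row>
    <rowtime>2008-12-04 18:00:46.714</rowtime>
    <category>More Top Stories</category>
    <title>Bill Richardson chalks up another Cabinet job for the resume - Los Angeles Times</title>
  </row>
</rows>"""
rows = parse_rows(response)
threshold = next_minimum_rowtime(rows, start)
```

Because the subscriber carries the threshold from one request to the next, the provider needs to remember nothing about it between requests, which is what keeps the protocol stateless.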

Even though it is simple, the protocol ensures that data flows efficiently for feeds of all data rates. For a high volume feed, the 1000 record limit will be reached before the 10 second timeout, so latency naturally decreases. For a low volume feed, many requests may time out and return an empty result; but the 10 second wait limits the number of requests per minute that the server has to handle.
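
The two cases can be checked with a small Python model of the provider's decision. The `reply_after` function is a hypothetical helper that works on a list of arrival times instead of a live feed; it only models the "10 seconds or 1000 records, whichever occurs sooner" rule.

```python
def reply_after(arrival_offsets, maximum_count=1000, maximum_wait=10.0):
    """Model when the provider replies and with how many rows.

    arrival_offsets: sorted times, in seconds after the request, at which
    matching rows arrive. Returns (seconds_until_reply, rows_in_reply).
    """
    in_time = [t for t in arrival_offsets if t <= maximum_wait]
    if len(in_time) >= maximum_count:
        # The count limit is hit first: reply as soon as that row arrives.
        return in_time[maximum_count - 1], maximum_count
    # Otherwise wait out the timeout and send whatever has arrived.
    return maximum_wait, len(in_time)


# High-volume feed: 500 rows per second.
busy = reply_after([(i + 1) / 500 for i in range(5000)])

# Low-volume feed: the next row is 45 seconds away.
quiet = reply_after([45.0])

# Upper bound on requests per minute from one subscriber to a quiet feed.
max_requests_per_minute = 60 / 10.0
```

The busy feed replies after 2 seconds with a full batch of 1000 rows, while the quiet feed replies empty after the full 10 seconds, so a subscriber to a quiet feed makes at most 6 requests per minute.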

Naturally, I have in mind an even better protocol that allows subscribers to submit SQL queries, and of course every web site would have a SQLstream server behind the curtain. But seriously folks... I would be satisfied with a lot less than that. A simple, open protocol for streaming content syndication would unlock the web and make it the medium of choice for streaming as well as static content.