Monday, December 15, 2008

Streaming content feeds part 2: forging the Streaming Web

My previous blog post "Streaming analytics over content feeds (and how content feeds could be better)" drew some excellent comments, so I thought I'd follow up with some more thoughts about a protocol for streaming web content, and a vision that I'll dub the "Streaming Web".

To John Kalucki's points first. I absolutely agree that the driver for this protocol is latency. But it is difficult to answer the question "what latency is necessary?", because we don't yet know what applications people will devise.

(An illustration of how latency changes everything, from a very different business: when my wife worked for Niman Ranch, I was amazed to hear that they dispatch steaks via FedEx (packed in ice and insulation, and sent overnight); this would be out of the question using the USPS and a three day delivery time.)

I believe that real-time web content feeds are a game changer. I call it the Streaming Web — a web where every piece of content is accessible via a URL and you can subscribe to be alerted immediately if a piece of content changes. Every page would become a potential feed, and there would be agents that allow us to collect and filter content we are interested in: be it a friend's photo album or the price of a plane ticket.

A huge effort is required to make the Streaming Web a reality. The first steps, the web content formats such as RSS and Atom, are already in place. The next step is to introduce a protocol so that subscribers are notified of changes as soon as they happen.

John says:
"What experience can you offer with feeds at a 50ms push latency vs a 180,000ms pull latency? If a machine is consuming the feed, not much. If a human is immediately consuming a feed, perhaps a great deal."
I agree that a human can benefit from low-latency content, although there is little benefit for content arriving faster than the human's think time — say 5,000ms. But if a computer is the consumer, ideal latencies span a broad spectrum: a mail server would operate more efficiently if it is allowed latencies in the minutes or hours, whereas an automated stock trading system needs information to arrive within 50ms.

Today, not much web content is of interest to automated stock trading systems. Most web content feeds today are textual — written by humans, and consumed by humans — but I believe that once we remove the latency constraints and introduce some standard protocols, we will start to see more structured data in feeds. Also, we will see algorithms for extracting information from textual feeds.

As for the right protocol for the job, I am not really the best judge, so I am going to punt for now, and focus on the architecture. Richard Taylor suggests XMPP. It seems to have the right qualifications, and I'm sure that it could be made to work technically. (And I see that XMPP is already a central part of the Twitter ecosystem.) It comes down to power versus simplicity: the power of an established standard versus the simplicity required to reach a new audience of developers.

I've been around long enough to see new approaches overturn "over-complex" existing technologies and then, in time, acquire the features that made their predecessors complicated. Take for example SOAP overturning CORBA, or PCs overturning minicomputers. I'm not going to take sides: these revolutions are part of the process of how technology moves forward. But it does seem that each revolution will only be successful if the new technology serves a new audience. And, to borrow Einstein's words, a protocol should be as simple as possible, but no simpler; otherwise, even if the technology finds its initial audience, it will not survive its growing pains.

I'm not a big fan of XML as a protocol for transmitting data over a network, mainly because it is bulky, and that makes it expensive to produce and consume at high data rates. But for this protocol, I would choose XML over a binary format. If you're a developer learning a new protocol, it's a lot easier to debug your code if you can read the messages being sent over the network as text.

Which brings us to the audience for this protocol. I do agree with Stefan Tilkov that "[f]or the majority of use cases, [the polling] approach is vastly superior to a push model". That majority is already well served, so I'm focusing on the minority that need low latency. I think those use cases are important, and we'll all be using them if the "streaming web" thing catches on.

To achieve low-latency feeds, push is more efficient than high-frequency polling, but it is still more expensive than low-frequency polling, which is what people are doing today. So, if every web content aggregator and RSS reader switched to a low-latency push protocol overnight, the system would collapse.

But luckily, there is no need for those millions of clients who would like to receive low-latency feed updates to connect using this new protocol. If those clients are humans, they will be happy to receive their updates via XMPP or SMS, or slower protocols like email. A single server could speak the streaming web feed protocol to various source feeds, and route the results to thousands of end users via XMPP or SMS. This approach means that each source feed is serving a modest number of downstream servers.

I'd describe it as a 'wholesale' architecture. A food producer has a central depot, where it loads its goods onto the trucks of several client stores. The food company allows consumers to buy from the depot, if they are prepared to buy their goods in bulk, but most consumers opt for the convenience of visiting a local store and buying their goods in smaller quantities.
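To make the wholesale idea concrete, here is a minimal sketch of one such downstream server. Everything in it is an assumption for illustration: the `Relay` class, its method names, and the `Deliver` callback (which stands in for XMPP, SMS, or email delivery) are hypothetical, and no real feed protocol is involved. The point is only that the source feed sees one subscriber per relay, however many end users sit behind it.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical delivery callback: stands in for XMPP, SMS, or email.
Deliver = Callable[[str, str], None]  # (feed_url, item) -> None


class Relay:
    """One downstream server: one upstream subscription per feed, many end users."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Deliver]] = defaultdict(list)

    def subscribe(self, feed_url: str, deliver: Deliver) -> bool:
        """Register an end user for a feed.

        Returns True if this is the first subscriber for the feed, which is
        the only case in which the relay would need to contact the source.
        """
        first = feed_url not in self._subscribers
        self._subscribers[feed_url].append(deliver)
        return first

    def on_update(self, feed_url: str, item: str) -> int:
        """Handle one update from the source and fan it out to every end user.

        Returns the number of end users the item was delivered to.
        """
        targets = self._subscribers.get(feed_url, [])
        for deliver in targets:
            deliver(feed_url, item)
        return len(targets)
```

A relay with a thousand subscribers to the same feed still costs the source exactly one connection; the thousand deliveries happen on the relay's side, over whatever slower channel each end user prefers.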

(If you're Twitter, no problem is ever small, so that 'modest number' is probably in the tens of thousands. But I suppose that problem can be solved using multiple tiers of servers and fanning out streams between one tier and the next.)
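As a rough back-of-the-envelope check on the multi-tier idea, the sketch below counts how many tiers of servers are needed if every server, including the source, can feed a fixed number of downstream connections. The function name and the fan-out figures are assumptions chosen for illustration, not measurements of any real system.

```python
def tiers_needed(clients: int, fan_out: int) -> int:
    """Smallest number of tiers so that fan_out ** tiers >= clients.

    Tier 1 is the source itself serving up to `fan_out` connections;
    each further tier multiplies the reach by `fan_out` again.
    """
    if clients < 1:
        raise ValueError("clients must be at least 1")
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    tiers = 1
    reach = fan_out
    while reach < clients:
        reach *= fan_out
        tiers += 1
    return tiers
```

Under these assumptions, a source that can sustain 100 connections needs 2 tiers to reach 10,000 downstream servers and 3 tiers to reach 50,000. Each extra tier adds a hop of latency, so the fan-out per server is the number worth pushing up.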

The next step in the evolution of the architecture would be to introduce a query language. Queries present a more convenient interface for clients, but they would also have architectural advantages. For example, using a query, a client can specify more precisely which content it is interested in. It would save CPU effort on the client and possibly the server, and bandwidth for everyone, so there would be a strong incentive to use queries rather than raw feeds.

Queries would also allow feeds to be virtualized: rather than talk directly to Blogger and TypePad, a client could talk to a third party that aggregates the content into a single feed.

Streaming SQL would be a good candidate for expressing these queries, but is by no means the only choice. And in fact the architecture and protocol would work well enough for clients that did not use queries and wanted to consume only raw feeds.
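To illustrate what a query buys you, here is a small sketch that combines several feeds into one virtual feed and keeps only the items a client asked for. It uses a plain Python predicate rather than streaming SQL, and the `Item` fields, the `query` function, and the sample data are all hypothetical. It also reads the feeds one after another, whereas a real streaming merge would interleave items as they arrive.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass(frozen=True)
class Item:
    """A hypothetical feed item; real feeds would carry far more fields."""
    source: str
    title: str
    price: Optional[float] = None


def query(feeds: Iterable[Iterable[Item]],
          predicate: Callable[[Item], bool]) -> Iterator[Item]:
    """Present several feeds as one and yield only the matching items.

    Items that fail the predicate are dropped here, so they never have to
    be sent on to the client.
    """
    for item in itertools.chain.from_iterable(feeds):
        if predicate(item):
            yield item


# Usage: one virtual feed over two sources, filtered to cheap flights.
blogger = [Item("blogger", "Flights to Paris", 420.0),
           Item("blogger", "Holiday photos")]
typepad = [Item("typepad", "Flights to Rome", 310.0)]
cheap = list(query([blogger, typepad],
                   lambda i: i.price is not None and i.price < 400))
```

If the filtering runs on the aggregating server rather than on the client, only the matching items cross the network, which is where the bandwidth saving comes from.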

The resulting system, the Streaming Web, would enable applications yet to be imagined.

8 comments:

Kirk Wylie said...

XMPP is exactly the wrong protocol for anything in terms of Streaming Web. AMQP is the right protocol.

Kirk Wylie said...

Hey, Julian, actually, I think you're going for something that I've been thinking about and working on for quite some time. Rather than try to do it entirely in comments, I've written up a full posting about extending RESTful services with asynchronous notifications (whether they're just normal web content or something more architectural) on my blog. Have a look.

Summary for anybody else reading this: AMQP + REST == Win. Polling RSS == Fail.

Cheers,
Kirk

John Kalucki said...

I agree that XMPP is probably the wrong thing here, or for most any problem other than IM. Both Twitter and GNIP have backed away from XMPP as means of delivering streams.

Consider Stomp and AMQP: Interesting protocols, yet, as with XMPP, inadequate server implementations.

Kirk Wylie said...

@John: I think you're definitely right that STOMP has inadequate server support, but it's better than XMPP and quite easy to code against.

AMQP has a number of quite stable brokers, but the lack of finalization of the standard has I think held it back. However, RabbitMQ is an excellent choice and there are a fair number of people rolling it into production now with quite large workloads.

John Kalucki said...

@Kirk: Yes, there are a number of queues with interesting feature sets. After considerable load and latency testing, we found none that are mature or configurable enough for high-volume workloads. The hardware cost-per-client on most message queues is also quite spendy, and their behavior in certain common congestion cases is often unacceptable. One queue we evaluated scales very poorly with additional clients, making it impractical for anything approaching internet-scale fan-out.

There's a space out there for a clustered (redundant, scalable) service that could provide streaming fan-out against hostile and erratic clients. If it exists already, we haven't found it.

What I don't yet understand is why, for a public service, a protocol is needed atop HTTP. Why aren't HTTP query params and dumping JSON down the pipe sufficient? Why consider AMQP and Stomp?

-John

Kirk Wylie said...

@John: I think you're probably right, in that nobody's really cracked internet-scale MOM yet. I know there are people working at it, and while using some commercial products I think I could have a stab at it (using SonicMQ in a clustered mode with multiple layers of routing in between publishers and subscribers), I agree with you in that it's not there yet.

But the thing is that I believe that the MOM guys will get there before the XMPP guys. In an enterprise case you're not dealing with millions of simultaneous clients, but that's not something you can't work at, it's just not something that anybody's really attacked yet. My worldview is that standards like Stomp and AMQP give people the basis that they can use to actually test these things: just like Facebook found out with memcached, until you're pushing something to the outer limits, you don't know where it's going to fall over.

That being said, HTTP is a reasonable protocol for non-latency-critical MOM (and there's some work just emerging this week from some AMQP people on doing a RESTful tack on MOM). What it isn't is a great protocol for latency-sensitive applications in any way, for all the reasons I outlined in the post listed above. In short, polling is dumb. More importantly, trying to run every single thing in the world over HTTP when a binary protocol is much better is browser limitations coming to the fore.

AMQP is better than anything you can do over HTTP because:
1) It's binary and thus lower bandwidth. That matters, particularly if you care about internet scale.
2) It's designed from a semantic level (even the RESTms stuff I linked to above has to limit things) by MOM people for performance and semantic correctness; the problems aren't as much with HTTP as with the particular verbs, but HTTP and RESTful architectures require that you do things differently.
3) It's not a hack. While it's fine to play tricks with HTTP chunked encoding and stuff to get around browser JS security limitations, it's not ideal in any way, and if you don't absolutely have to play the game, why do so?

Julian Hyde said...

@Kirk and @John: Luckily it's not an either-or choice. It comes down to audience: the MOM solution isn't going to please the Web 2.0 audience, and the solution based on STOMP is not going to please the J2EE audience.

It's like comparing trains to bicycles; both have their strengths and passionate adherents, and luckily the world is big enough to accommodate both.

My goal in starting this thread was to get the content providers to deliver streaming content in a standard, accessible way. This thing will only catch fire if several people are offering the content in the same protocol.

History has shown that a web technology catches on when it is super-simple (yes, sometimes a little too simple to actually do the job) and amenable to being easily combined with other content. I'm talking mashups, and I can't see anyone building mashups on top of AMQP.

The protocol can be as simple as a few generally agreed HTTP header fields, and output in XML, comma-separated text, anything as long as it is obvious. And, the world being what it is, the protocol will need a name, so that folks can blog about it and find code fragments.

The enterprise/system software part of me loves the idea of using AMQP too. If I was hooking up SQLstream to a commercial source of real time data, say a stock ticker feed, that's what I'd use.

If you're a content provider reading this... give us a push feed...
