Monday, October 05, 2009

Pentaho Analyzer

Pentaho today announced a new OLAP viewer, called Pentaho Analyzer Enterprise Edition, based on LucidEra's ClearView component.

This is great news for Pentaho customers, the community, and the BI world at large. While Pentaho Analysis (Mondrian) is one of its strongest components, the current OLAP viewer (based on JPivot) has been one of its weakest.

The new viewer puts Pentaho at the top of the heap, in competition with best-of-breed OLAP viewers. It is designed to be intuitive for business users (yes, those people who don't speak MDX!), is built using the latest web technologies, and integrates seamlessly with Mondrian and the rest of the Pentaho suite.

It is going to revolutionize the experience of using OLAP within the Pentaho suite.

Naturally, there are concerns. First, the new viewer is only part of Pentaho's Enterprise Edition (EE) suite. If Pentaho is committed to open source BI, why not release it open source? Second, what will happen to Pentaho Analysis Tool (PAT), the successor to JPivot being developed by the Pentaho community? I'd like to take the opportunity to answer these concerns, because I think this is news that everyone should be celebrating.

Why is the new Analyzer not open source?

There's been a lot of talk about open source business models, 'open core', good and evil, and all that. Releasing ClearView as part of Enterprise Edition is perfectly in sync with Pentaho's business model and with my intuitions about what makes sense for open source. Here's my rationale.

If you release a piece of software open source out of sheer, 'I love the world!' altruism, you won't necessarily see much benefit. Pentaho is a for-profit business, and they are savvy about leveraging the benefits of open source software. And let's not kid ourselves, there are considerable downsides to releasing something open source. Your competitors can pick up the software and incorporate your hard work into their suite. And your customers may decide that the free version is so good that they aren't going to give you any of their money.

Open source allows you to bring a component to a wider audience, an audience that will test, document and improve the component, and will support each other on the forums. Only the Community Edition (CE) components get that boost. Therefore, Pentaho's strategy is to release the core functionality in CE. That means the high-performance core of the system, the code paths that get run trillions of times an hour, and that means all the components that are necessary to build a functional and useful BI application.

In particular, people ask me whether there is a high-performance 'Mondrian on steroids' in EE. No, there isn't. None of us wants to maintain alternative code paths, because the extra complexity would slow down future development. If I were to create a performance optimization in EE, the community would probably replicate that optimization in CE within a few weeks. Improving the core Mondrian system for everyone brings more people into the community, and that brings more people to EE.

And by the way, this doesn't just apply to the Pentaho Analysis part of the suite. Pentaho adds major new functionality to the suite each release, and most of that goes into open source components.

So, what's left to go into EE? Bells and whistles, things that make the product easier to use, easier to manage, and things that make your boss want to reach for his or her checkbook. And of course support, and releases that are certified, indemnified, and more regular. I don't think that's a bad deal, however you look at it.

It also helps if the components are delivered under a business-friendly license like LGPL or EPL. Otherwise you will not attract contributions from OEM vendors, who are the companies with the skills to extend components as complex as Mondrian or Pentaho Data Integration (Kettle). Once again, Pentaho is taking a risk by using business-friendly licenses, because there is always a chance that Pentaho's competitors will scoop up the fruits of its labors. (As in fact they do.)

But Pentaho's faith in the open source process pays off. ClearView is proof of that. If Mondrian had not been available under a business-friendly open source license, LucidEra would probably have written it on top of another vendor's engine, and Pentaho would not have been able to use it. Incidentally, LucidEra has contributed many important enhancements to Mondrian in areas of both performance and functionality over the past three years. This has improved Mondrian for everyone, and we know that ClearView performs very well against Mondrian.

What will happen to PAT?

To restate what I said above, there is a network effect when you make a component open source. The more people that use a component, the more people are going to contribute to it. We want as many people to use Mondrian as possible, and in particular we want the right people to use it (the people who are going to make major improvements).

So, for Mondrian's continuing health as an open source component, we need the Community Edition of Mondrian to be good enough to build business applications on. For that, we need to make PAT successful.

I personally have been laying the groundwork for PAT for a number of years. I spearheaded the olap4j API, knowing that the community would be more likely to write the next-generation OLAP viewer if it was guaranteed to be portable across OLAP engines. Then I kicked off the halogen project, a collaboration between Pentaho developers and the community to build a viewer using olap4j and GWT. Pentaho developers contributed code and user interface design to that project, even working in their spare time when the current Pentaho sprint used up all of their 'official' cycles. And the PAT project used the halogen code, and the knowledge of the halogen developers, as a starting point.

It's not healthy to have too close a relationship between an OLAP server and viewer. There should always be room for competition, an opportunity to use a new viewer or (gasp!) a different OLAP server if the 'standard' one isn't ideal. I created olap4j with competition in mind, and the experiment seems to be working: PAT can run against Mondrian's native interface, Mondrian's XMLA server, and against SQL Server Analysis Services via XMLA.

I want to make it easier to build alternative front-ends on top of olap4j, so I have been encouraging PAT developers to contribute to olap4j's query model and library of transforms. I would like to see Analyzer move to olap4j internally (it currently uses Mondrian's native API), and perhaps migrate some of the logic in Analyzer to olap4j so that we can share the costs of maintaining it with the community.

Lastly, as I realized at the recent community meetup in Barcelona, we have a great team, and we need to harness their energy. After a beer or two with PAT developers Tom and Paul, and some inspiring demos from Pedro and Daniel, we hatched ideas of incorporating sparklines and writeback into PAT, and I'm sure the ideas will keep on flowing. With this much inspiration and hard work coming from the community, how can we possibly fail?

Monday, August 17, 2009

What API should Facebook and FriendFeed use to publish the social stream?

Ars Technica reports that "social networking giant Facebook has acquired FriendFeed. This deal reflects Facebook's growing fixation on the social stream, but it's hard to see how the two services will be merged. [...]

[Facebook's] powerful but esoteric SQL-like query system all add up to a steep learning curve. By comparison, FriendFeed has a simple and elegant API that exposes a lot of information and is much more accommodating to developers."

It seems to me that streaming SQL is the correct solution to this problem. Not a SQL-like language, not an API (although you of course have to use an API to execute queries and get the results), and not just traditional SQL on finite relations, but SQL where streams are a first-class construct.

I'm not a big believer in 'SQL-like' languages; they give SQL a bad name. Someone once said that the C programming language combines all the power of assembly language with all the ease-of-use of assembly language. The same could be said for 'SQL-like' languages: they tend to offer only the limited capabilities of a fixed API, but you have to learn a new language to use them.

Full SQL is difficult to implement because it must be possible to combine the relational operators (join, filter, union, project, and so forth), and other language features such as types and built-in operators, in any combination. Implementors often give up on this (what language designers call orthogonality), and what they get to is termed a SQL-like language. The full power of SQL only accrues when the implementor has implemented the whole language, and achieved orthogonality.

Nor can the problem be solved particularly easily or efficiently using regular SQL, because every query is going to be of the form 'tell me what has changed since I last ran the query'. That kind of activity throws a conventional database into a tailspin.

So, streaming SQL could solve this problem. Has anyone tried it?
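To make the contrast concrete, here is a minimal Python sketch (not SQL, and not any vendor's API) of the two styles: a client that repeatedly asks a finite table "what has changed since I last ran the query", and a standing query registered once on a stream. The class names `PolledTable` and `Stream`, the row layout, and the work counters are all hypothetical, invented for this illustration.

```python
class PolledTable:
    """A finite relation; clients must re-query it to discover new rows."""

    def __init__(self):
        self.rows = []
        self.rows_scanned = 0  # how much work the polls have cost so far

    def insert(self, row):
        self.rows.append(row)

    def changed_since(self, last_seen_id, user):
        """Naive 'what is new since my last poll' query: scans every row."""
        fresh = []
        for row in self.rows:
            self.rows_scanned += 1
            if row["id"] > last_seen_id and row["user"] == user:
                fresh.append(row)
        return fresh


class Stream:
    """A stream with standing queries: each row is tested once on arrival."""

    def __init__(self):
        self.subscribers = []
        self.rows_evaluated = 0

    def subscribe(self, predicate, callback):
        self.subscribers.append((predicate, callback))

    def publish(self, row):
        for predicate, callback in self.subscribers:
            self.rows_evaluated += 1
            if predicate(row):
                callback(row)


def demo():
    posts = [{"id": i, "user": "alice" if i % 2 else "bob"} for i in range(1, 7)]

    # Polling style: ask again after every insert.
    table = PolledTable()
    last_seen, polled = 0, []
    for post in posts:
        table.insert(post)
        for row in table.changed_since(last_seen, "alice"):
            polled.append(row["id"])
            last_seen = row["id"]

    # Streaming style: register the query once, then let rows flow past it.
    stream = Stream()
    pushed = []
    stream.subscribe(lambda r: r["user"] == "alice", lambda r: pushed.append(r["id"]))
    for post in posts:
        stream.publish(post)

    return polled, pushed, table.rows_scanned, stream.rows_evaluated


if __name__ == "__main__":
    print(demo())
```

Both styles deliver the same rows, but the polling client re-reads the table on every poll while the standing query touches each row once. A real database would use an index rather than a full scan, so treat the counters as an illustration of the shape of the problem, not a benchmark.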

Wednesday, July 29, 2009

Twitter makes the realtime web look more like the old web

Twitter has a new home page, in the time-honored style of a search engine home page. Claire Cain Miller writes in the New York Times:
It has become a cliché that first-time visitors to Twitter respond with some version of: "I don’t get it." [The new home page] tries to solve that problem.
That problem is worth solving, but the home page is also an interesting sign of the melding of the old and the new.

Old web, new web

The old web is that vast repository of content, ranked by how many people reference that content, and navigated by search engines such as Google. The new web is populated by dynamic content, where what happened in the last minute is much more important than what happened yesterday.

The new, real-time web has been a wild frontier. There's a cachet to being a Twitter user; you're among pioneers, one of the elite who 'get it', not one of the ordinary folks. That's a problem for Twitter, because they need those millions of ordinary folks first to 'get it', then to get something useful out of it, come back, and start spending their click-through dollars.

But harnessing the power of the real-time web is no easy problem. First of all, streaming content is a new paradigm. Facebook are doing well at introducing a lot of people to that idea of the ever-changing home page; Twitter's minimalist concept needs more getting used to, but the search engine front-end to the stream of chatter will surely help.

Second, you need different tools to convert the noise in Twitter and other social media feeds into information. A search engine is not going to cut it. The new tools cannot work on the streaming data alone; they have to combine the new data with old, organize the data, and cluster the data with other data that is similar based on subject matter, geographical proximity, or proximity of users in the social network. The stream of content hurtling past our eyes looks like chatter, just noise, until we rank it, look for trends, and put it into historical context.

Old analytics, new analytics

I find it interesting because at SQLstream we are dealing with very similar problems for enterprise data. Business users would like to see the full spectrum of data, from right now to the distant past, but when making decisions, they want more recent data to carry more weight; they also want to take into account similarity of subject matter, geographical proximity, and the structure of social networks.
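One simple way to let recent data carry more weight is exponential decay. The sketch below is only an illustration of that idea in Python, not a description of how SQLstream or any other product ranks data; the function names, the half-life parameter, and the sample events are all assumptions made up for the example.

```python
def recency_weight(age_seconds, half_life_seconds):
    """Weight that halves every half_life_seconds: 1.0 now, 0.5 one half-life ago."""
    return 0.5 ** (age_seconds / half_life_seconds)


def rank_topics(events, now, half_life_seconds):
    """Sum decayed weights per topic and return topics, highest score first.

    events is an iterable of (timestamp_seconds, topic) pairs.
    """
    scores = {}
    for timestamp, topic in events:
        weight = recency_weight(now - timestamp, half_life_seconds)
        scores[topic] = scores.get(topic, 0.0) + weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    events = [
        (0, "olap"), (0, "olap"), (0, "olap"),  # three old mentions
        (3600, "streams"), (7200, "streams"),   # two recent mentions
    ]
    print(rank_topics(events, now=7200, half_life_seconds=3600))
```

With a one-hour half-life, two recent mentions outrank three mentions from two hours earlier, yet the older data still contributes to the score rather than being discarded.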

Traditional analytic solutions use data warehouses, analogous to the old, static, web and its search engine guardians. A data warehouse treats all data equally, regardless of its age. There is so much data that it has to be stored on disk, and it takes several hours to organize that data, so while a typical data warehouse will contain data from five years ago until close of business yesterday, the most important data — what happened today — hasn't reached the data warehouse yet.

SQLstream melds the old (the data warehouse) with the new (streaming events and transactions arriving over the wire), presenting a unified view via the SQL query language. We say that we "query the future", meaning that you can place standing queries that react when events of interest occur. These queries cache their working sets in memory, so the response time is a few milliseconds, and throughput is tens or hundreds of thousands of records per second.
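For readers who have not met a standing query before, the following Python sketch shows the general idea: the query is registered once, keeps only a small working set in memory (here, a trailing time window), and reacts as each event arrives. It is a toy under stated assumptions, not SQLstream's implementation or syntax; the class `StandingQuery`, its parameters, and the sample events are hypothetical.

```python
from collections import deque


class StandingQuery:
    """Alert whenever the sum of amounts in the trailing window exceeds a threshold."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.window = deque()  # in-memory working set: (timestamp, amount)
        self.total = 0

    def on_event(self, timestamp, amount):
        """Process one event; return (timestamp, total) if the alert fires, else None."""
        self.window.append((timestamp, amount))
        self.total += amount
        # Evict events that have fallen out of the trailing window.
        while self.window and self.window[0][0] <= timestamp - self.window_seconds:
            _, old_amount = self.window.popleft()
            self.total -= old_amount
        if self.total > self.threshold:
            return (timestamp, self.total)
        return None


def run(events, window_seconds=60, threshold=100):
    query = StandingQuery(window_seconds, threshold)
    alerts = []
    for timestamp, amount in events:
        alert = query.on_event(timestamp, amount)
        if alert is not None:
            alerts.append(alert)
    return alerts


if __name__ == "__main__":
    events = [(0, 40), (10, 30), (20, 50), (70, 10), (75, 100)]
    print(run(events))
```

Each event costs a constant amount of work on average, because the running total is adjusted incrementally instead of being recomputed from stored history. That is the property that makes millisecond response times plausible for this style of query.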

The data sources that SQLstream can handle are diverse. Some of the data comes from traditional sources, like corporate transaction processing systems. Some sources are often considered too high-volume to process in a data warehouse, such as click-stream and system monitoring data. And there are new sources like Twitter, social media, Atom and RSS feeds.

The problems of the real-time web and the real-time enterprise are surprisingly similar. Without tools to filter, aggregate, rank, and provide historical context, all of these data sources just look like noise and have little apparent value. At SQLstream, we are providing the tools to convert streams into valuable information.

Sunday, July 26, 2009

An unfortunate fellow named Hyde

A limerick featuring my family name and ending with a pun. What could be better?

An unfortunate fellow named Hyde
fell down an outhouse and died.
By mischance, his brother
fell down another.
And now they're interred side-by-side.

(Reportedly due to Johnny Carson.)

Thursday, July 23, 2009

Introduction to Pentaho Analysis

Joshua Tolley has written a nice step-by-step guide to using Pentaho Analysis. The tutorial is from the end-user's point of view, which of course is the most important perspective.

If you're interested in the back-end stuff, a couple of weeks ago Joshua also wrote a nice MDX primer.

Friday, July 10, 2009

Functional dependency optimizations in Mondrian

Eric McDermid just checked a nice new feature into Mondrian which optimizes the SQL that Mondrian generates for MySQL. It takes advantage of the fact that in MySQL, if some of your GROUP BY columns are unique, you can leave the other columns out of the GROUP BY clause, and MySQL does less work.

In some cases, a lot less work. MySQL implements GROUP BY by sorting, and since this reduces the volume of data being sorted, Eric reports significant performance improvements. Unfortunately it only works on MySQL, since MySQL is the only database I know of that has this feature.
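The equivalence that the optimization relies on can be checked with a small experiment. The Python sketch below uses SQLite purely as a convenient stand-in, because it also accepts non-aggregated columns that are missing from the GROUP BY clause; it is not MySQL, it is not the SQL that Mondrian actually generates, and it demonstrates only that the results match, not the performance gain. The tables `customer` and `sales` and their columns are made up for the example.

```python
import sqlite3

SETUP = """
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE sales (customer_id INTEGER, amount INTEGER);
INSERT INTO customer VALUES (1, 'Portland', 'OR'), (2, 'Seattle', 'WA'), (3, 'Portland', 'OR');
INSERT INTO sales VALUES (1, 10), (1, 20), (2, 5), (3, 7), (3, 3);
"""

# Standard form: every non-aggregated column appears in GROUP BY.
FULL_GROUP_BY = """
SELECT c.customer_id, c.city, c.state, SUM(s.amount)
FROM customer c JOIN sales s ON s.customer_id = c.customer_id
GROUP BY c.customer_id, c.city, c.state
ORDER BY c.customer_id
"""

# Reduced form: customer_id is unique, so city and state are functionally
# dependent on it and can be left out of GROUP BY without changing the groups.
REDUCED_GROUP_BY = """
SELECT c.customer_id, c.city, c.state, SUM(s.amount)
FROM customer c JOIN sales s ON s.customer_id = c.customer_id
GROUP BY c.customer_id
ORDER BY c.customer_id
"""


def group_by_results():
    """Run both queries on an in-memory database and return their rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SETUP)
        full = conn.execute(FULL_GROUP_BY).fetchall()
        reduced = conn.execute(REDUCED_GROUP_BY).fetchall()
    finally:
        conn.close()
    return full, reduced


if __name__ == "__main__":
    full, reduced = group_by_results()
    print(full == reduced, reduced)
```

The shortcut is safe only because every row in a group shares the same city and state. If the grouping column were not unique, the omitted columns would take their values from an arbitrary row in each group.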

See the latest schema documentation for more details.

I'll note that we reserve the right to change the syntax a little in future versions. In mondrian-4.0 we're adding physical schemas, which will include much more information about tables and relationships, so it would make sense to declare unique keys along with that. But rest assured that even if we do change the syntax, the feature will still be present.

Tuesday, June 30, 2009

SQLstream powers Firefox 3.5 realtime downloads monitor

Mozilla launched Firefox 3.5 today, and with it, a neat applet, powered by SQLstream, to monitor downloads in real time.

You can see the results at Mozilla's download stats page.

A few weeks ago, Apple's Hyperwall was awe-inspiring as a piece of visual art, but it was less impressive as a piece of real-time data integration, because the data was delayed five minutes from the app store.

SQLstream gathers data from Mozilla's download centers around the world, assigns each record a latitude and longitude, and summarizes the information in a continuously executing SQL query. Data is read with sub-second latencies, and then aggregated (using SQLstream's streaming GROUP BY operator) into summary records each describing a second of activity.

A server-side Java program reads the data using JDBC, serializes it as JSON, and transmits it to all connected web clients. Clients render the charts using the Canvas tag, newly introduced in HTML 5. The results are very impressive visually, but to a back-end guy like myself, the plumbing is impressive too.
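The shape of that pipeline can be sketched in a few lines of Python. This is an illustration only: the real system uses a continuously executing SQL query and a Java program reading over JDBC, and it tags records with latitude and longitude, whereas this toy groups by a hypothetical `region` label. The function names and sample records are assumptions.

```python
import json
from collections import Counter


def summarize_by_second(records):
    """Group time-ordered (timestamp_seconds, region) records into one summary per second.

    A summary is emitted as soon as a record from a later second arrives, which
    is roughly what a streaming GROUP BY on a time column does. The final flush
    exists only because this demo input is finite.
    """
    current_second = None
    counts = Counter()
    for timestamp, region in records:
        second = int(timestamp)
        if current_second is not None and second != current_second:
            yield {"second": current_second, "downloads": dict(counts)}
            counts = Counter()
        current_second = second
        counts[region] += 1
    if current_second is not None:
        yield {"second": current_second, "downloads": dict(counts)}


def to_json(summary):
    """Serialize one summary record for transmission to web clients."""
    return json.dumps(summary, sort_keys=True)


if __name__ == "__main__":
    records = [(0.1, "EU"), (0.5, "US"), (0.9, "EU"), (1.2, "US"), (3.0, "EU")]
    for summary in summarize_by_second(records):
        print(to_json(summary))
```

Because the summaries are small and arrive once per second, the server can send the same JSON message to every connected client without touching the raw download records again.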

The amazing thing is that SQLstream makes this so easy. Our official company blurb talks about "shortening data integration projects from months to weeks", but this project took just a couple of days of work.

By the way, don't try to view the page in Microsoft's Internet Explorer. Ten years ago, Internet Explorer led the charge to enhance the capabilities of the web browser, introducing dynamic HTML (DHTML), XML handling in the browser, ActiveX controls and other capabilities, but those days are over. With HTML 5 there is a renaissance in web standards; Firefox is leading the pack, with other 'modern' browsers such as Safari, Opera and Chrome not far behind.