First some background, for those of you who don't fall asleep every night reading the SQL standard. There are two kinds of aggregation in standard SQL: windowed aggregation, of the form
function(arg {, arg}) OVER window-specificationand grouped aggregation, which is of the form
function(arg {, arg})and requires a GROUP BY clause. (If the GROUP BY clause is not present, 'GROUP BY ()' is assumed.)
According to the standard, these two forms should never meet. It is illegal to use a windowed aggregation in a SELECT that has a GROUP BY, or to mix grouped aggregation and windowed aggregation in the same SELECT. (It's OK to use one in a sub-query and another in an enclosing query.)
However, here is a very reasonable query that is difficult to express in standard SQL: Given a record of every trade on a stock exchange, give me the volume and closing price of each ticker symbol. You might try
SELECTbut this is illegal SQL. Why is it illegal? Because the LAST_VALUE function (like FIRST_VALUE and RANK) is a windowed aggregate function and is only meaningful on an ordered set.
day,
ticker,
SUM(shares) AS volume,
LAST_VALUE(price) AS closingPrice
FROM Trades
GROUP BY day, ticker
To introduce the notion of ordering, I propose that the following query should be valid:
SELECTWith the OVER clause, LAST_VALUE is now a windowed aggregate function within the context of a GROUP BY query, which was previously illegal. Every windowed aggregate is applied to a window, so what is the window in this case? We want the window to contain all of the rows with the same day and ticker value, and to be sorted by timeOfDay. In other words, the window inherits the GROUP BY columns as its implicit PARTITION BY clause. It is as if they had written
day,
ticker,
SUM(shares) AS volume,
LAST_VALUE(price) OVER (ORDER BY timeOfDay) AS closingPrice
FROM Trades
GROUP BY day, ticker
LAST_VALUE(price) OVER (PARTITION BY day, ticker ORDER BY timeOfDay)Now, if you know that I work for SQLstream, you will guess that I am motivated to make this work for streaming queries. A streaming aggregation query over the Trades stream would look like this:
SELECT STREAMThis is identical to the traditional, non-streaming SQL above, except for the STREAM keyword that tells SQLstream that the result of the query should be a stream.
day,
ticker,
SUM(shares) AS volume,
LAST_VALUE(price) OVER (ORDER BY timeOfDay) AS closingPrice
FROM Trades
GROUP BY day, ticker
In idiomatic SQLstream SQL, we would typically express the query as follows:
SELECT STREAMThis form uses SQLstream's system ROWTIME column and the 'FLOOR(datetime expression TO time unit)' operator, and so can dispense with the day and timeOfDay columns. Also, streams are ordered by ROWTIME by default, so we can abbreviate 'OVER (ORDER BY ROWTIME)' to 'OVER ()'. This form is more terse, and more typical of how the query would be written in a SQLstream application, but the previous form works also.
FLOOR(t.ROWTIME TO DAY),
ticker,
SUM(shares) AS volume,
LAST_VALUE(price) OVER () AS closingPrice
FROM Trades AS t
GROUP BY FLOOR(t.ROWTIME TO DAY), ticker
The end result is powerful and, I think, consistent with the spirit of standard SQL.
3 comments:
I couldn't agree more that this should be standard part of SQL. I find myself creating work-arounds to achieve this all the time, and my work arounds make me feel like a hack. For example, I might have solved you last query with a chain of views. In these views would include a query that got the max trade time for each day, and then another view to join the day and last trade time back up with price ..... yuck .... I sincerely would appreciate a better way to do this.
Oracle would allows this as
SELECT DISTINCT
day,
ticker,
SUM(shares) OVER (PARTITION BY day, ticker) AS volume,
LAST_VALUE(price) OVER (PARTITION BY day, ticker ORDER BY timeOfDay) AS closingPrice
FROM Trades
I tried your query on Oracle 10g, and it doesn't work. You get multiple rows with the same hour and ticker.
SQL> create table trades0 (ts date, ticker varchar(4), shares int, price number);
Table created.
SQL> insert into trades0 values (to_date('2008-12-30 10:12:00', 'YYYY-MM-DD HH:MI:SS'), 'ORCL', 10, 25);
1 row created.
SQL> insert into trades0 values (to_date('2008-12-30 10:14:00', 'YYYY-MM-DD HH:MI:SS'), 'MSFT', 15, 30);
1 row created.
SQL> insert into trades0 values (to_date('2008-12-30 10:17:00', 'YYYY-MM-DD HH:MI:SS'), 'ORCL', 20, 27);
1 row created.
SQL> create view trades as select trunc(ts) as day, ts - trunc(ts) as timeOfDay, ticker, shares, price from trades0;
View created.
SQL> SELECT DISTINCT
day,
ticker,
SUM(shares) OVER (PARTITION BY day, ticker) AS volume,
LAST_VALUE(price) OVER (PARTITION BY day, ticker ORDER BY timeOfDay) AS closingPrice
FROM Trades;
DAY TICKER VOLUME CLOSINGPRICE
------------------ ------------ ---------- ------------
30-DEC-08 ORCL 30 27
30-DEC-08 ORCL 30 25
30-DEC-08 MSFT 15 30
Post a Comment