Window functions ๐
Success
Windows functions allow us to summarise rows without having to use the GROUP BY
clause.
There are two ways to use window functions, and we'll see that both are very similar to row aggregations that we do in Excel!
Warning
Window functions are widely regarded as an advanced concept in SQL. Hopefully this page will show that they're not as scary as people make them seem!
Window functions are another way to summarise rows
We saw in the aggregations section that we can use the GROUP BY
clause to summarise rows. An important result of this is that the number of rows is reduced.
However, there are times when we want to summarise rows without reducing the number of rows; that is, without having to use GROUP BY
.
This is where window functions come in: they summarise rows without reducing the row count.
There are three extremely common use cases for window functions:
- Summarising entire columns
- Calculating running totals
- Calculating moving averages
I'll bet you've done these in Excel before! ๐
There's some terminology to be aware of
Info
Although we've been throwing around the term "window function", the actual SQL feature that we use to achieve this is the OVER
clause.
Before going into an example, there's a bit of terminology to clarify.
Window functions are functions that can be used with the OVER
clause, such as:
The "aggregate functions" we've seen so far (like SUM
, AVG
, COUNT
, etc.) are designed to be used with the GROUP BY
clause, but many of them can also be used with the OVER
clause. This means that these aggregate functions are also window functions!
The OVER
clause is the key to window functions
So... how do we use window functions?
Since window functions don't reduce the number of rows, we use them in the SELECT
clause to add a new column to the result set just like we would if we were doing any other kind of column calculation.
Specifically, we write the window function followed by the OVER
clause, which is followed by a set of parentheses. Inside the parentheses, we can specify the window over which we want to perform the calculation.
The following example shows how to get the count of rows in the Person.Person
table alongside showing the first few rows and columns:
SELECT TOP 5
BusinessEntityID,
FirstName,
LastName,
COUNT(*) OVER () AS TotalRows
FROM Person.Person
ORDER BY BusinessEntityID
;
BusinessEntityID | FirstName | LastName | TotalRows |
---|---|---|---|
1 | Ken | Sรกnchez | 19972 |
2 | Terri | Duffy | 19972 |
3 | Roberto | Tamburello | 19972 |
4 | Rob | Walters | 19972 |
5 | Gail | Erickson | 19972 |
The parentheses after OVER
are important (even if they're empty). We'll go through what goes inside the parentheses shortly.
To confirm that the TotalRows
number is correct, we can verify it with a separate query:
SELECT COUNT(*) AS TotalRows
FROM Person.Person
;
TotalRows |
---|
19972 |
Looks good!
Warning
We used the TOP 5
in the query with the window function to limit the number of rows returned, so why did we get 19972 instead of 5?
This is to do with the order that SQL processes the query, which is (briefly) covered in the logical processing order section -- TOP
is processed after OVER
!
Converting a subquery to a window function
Remember the example below from the subqueries section?
SELECT TOP 5
SalesOrderID,
OrderDate,
TotalDue,
(
SELECT AVG(TotalDue)
FROM Sales.SalesOrderHeader
) AS AverageTotalDue
FROM Sales.SalesOrderHeader
;
We can rewrite this example to use AVG
with OVER
instead of the subquery:
SELECT TOP 5
SalesOrderID,
OrderDate,
TotalDue,
AVG(TotalDue) OVER () AS AverageTotalDue
FROM Sales.SalesOrderHeader
;
Tip
We've just seen that two very different features -- subqueries and window functions -- can be used to achieve the same result. The natural question is: which one should you use?
The answer is: it depends on the situation. However, my personal preference is to use window functions over subqueries until I have a reason not to.
Defining the partition of a window function
The examples above just specify OVER
with empty parentheses. This is the simplest form of the OVER
clause, and it means that the window function is applied to all rows in the result set.
The output that this produces is equivalent to summarising the entire column, which was proven in the first example by comparing the TotalRows
column to the count of rows in the table.
However, just like how we can GROUP BY
specific columns, we can also specify a window over specific columns. We do this by writing PARTITION BY
inside the parentheses of the OVER
clause, followed by the column(s) that we want to partition by. Although we write PARTITION BY
, this is just like doing GROUP BY
!
The following example shows how to get the count of rows in the Person.Person
table partitioned by the EmailPromotion
column:
SELECT TOP 10
BusinessEntityID,
FirstName,
LastName,
EmailPromotion,
COUNT(*) OVER (PARTITION BY EmailPromotion) AS TotalRowsPerEmailPromotion
FROM Person.Person
ORDER BY BusinessEntityID
;
BusinessEntityID | FirstName | LastName | EmailPromotion | TotalRowsPerEmailPromotion |
---|---|---|---|---|
1 | Ken | Sรกnchez | 0 | 11158 |
2 | Terri | Duffy | 1 | 5044 |
3 | Roberto | Tamburello | 0 | 11158 |
4 | Rob | Walters | 0 | 11158 |
5 | Gail | Erickson | 0 | 11158 |
6 | Jossef | Goldberg | 0 | 11158 |
7 | Dylan | Miller | 2 | 3770 |
8 | Diane | Margheim | 0 | 11158 |
9 | Gigi | Matthew | 0 | 11158 |
10 | Michael | Raheem | 2 | 3770 |
Notice that the TotalRowsPerEmailPromotion
column shows the count of rows for each EmailPromotion
value, and it just repeats that count for each row with the same EmailPromotion
value. This is precisely why using PARTITION BY
is like using GROUP BY
-- it does the grouping just like it would if we were using GROUP BY
, but it doesn't reduce the number of rows!
Tip
When we use PARTITION BY
in a window function, we're splitting the result set into partitions based on the column(s) that we specify. The window function is then applied to each partition separately.
Success
When we create partitions, this is like using Excel's "IF" summary functions over an entire column, such as =SUMIF(A:A, A2)
.
There are two other ways to use window functions
The examples above show how to use window functions over different partitions (including a single partition, the entire table).
Other window functions are designed to be used with the rows considered in some order, so we can also specify ORDER BY
inside the parentheses of the OVER
clause.
A great example of this is the LAG
function which looks "one row up" based on the order that we specify. For example, the following query shows the first few people in the Person.Person
table with the first names of the person who came before them (based on the BusinessEntityID
column):
SELECT TOP 5
BusinessEntityID,
FirstName,
LastName,
LAG(FirstName) OVER (ORDER BY BusinessEntityID) AS PreviousFirstName
FROM Person.Person
ORDER BY BusinessEntityID
;
BusinessEntityID | FirstName | LastName | PreviousFirstName |
---|---|---|---|
1 | Ken | Sรกnchez | null |
2 | Terri | Duffy | Ken |
3 | Roberto | Tamburello | Terri |
4 | Rob | Walters | Roberto |
5 | Gail | Erickson | Rob |
This is just like using a relative cell reference in Excel!
Although this isn't a super helpful example, this demonstrates a case where we need to tell SQL which order to use to understand "previous" and "next" rows. There are technical reasons for it that we won't go into, but SQL will not assume any order unless we specify it.
With this in mind, there are two types of windows that we can specify:
- Cumulative windows, which are things like "all rows up to this one" and are useful for calculating running totals
- Sliding windows, which are things like "the last three rows" and are useful for calculating moving averages
Cumulative windows are great for running totals
One of the most common things to do in Excel is to calculate running totals.
We can do this in SQL too by using SUM
with the OVER
clause, making sure that we specify a row order:
SELECT TOP 10
SalesOrderID,
TotalDue,
SUM(TotalDue) OVER (ORDER BY SalesOrderID) AS RunningTotal
FROM Sales.SalesOrderHeader
ORDER BY SalesOrderID
;
SalesOrderID | TotalDue | RunningTotal |
---|---|---|
43659 | 23153.2339 | 23153.2339 |
43660 | 1457.3288 | 24610.5627 |
43661 | 36865.8012 | 61476.3639 |
43662 | 32474.9324 | 93951.2963 |
43663 | 472.3108 | 94423.6071 |
43664 | 27510.4109 | 121934.0180 |
43665 | 16158.6961 | 138092.7141 |
43666 | 5694.8564 | 143787.5705 |
43667 | 6876.3649 | 150663.9354 |
43668 | 40487.7233 | 191151.6587 |
There are a few ways that you could do this in Excel. The SQL approach is just like using the formula =SUM($B$2:B2)
in cell C2
and dragging it down (assuming that the TotalDue
column is in column B
and the RunningTotal
column is in column C
).
In contrast, if we didn't specify ORDER BY
inside the parentheses of the OVER
clause, the SUM
would be calculated over the entire table:
SELECT TOP 10
SalesOrderID,
TotalDue,
SUM(TotalDue) OVER () AS RunningTotal /* This is now a misnomer */
FROM Sales.SalesOrderHeader
ORDER BY SalesOrderID
;
SalesOrderID | TotalDue | RunningTotal |
---|---|---|
43659 | 23153.2339 | 123216786.1159 |
43660 | 1457.3288 | 123216786.1159 |
43661 | 36865.8012 | 123216786.1159 |
43662 | 32474.9324 | 123216786.1159 |
43663 | 472.3108 | 123216786.1159 |
43664 | 27510.4109 | 123216786.1159 |
43665 | 16158.6961 | 123216786.1159 |
43666 | 5694.8564 | 123216786.1159 |
43667 | 6876.3649 | 123216786.1159 |
43668 | 40487.7233 | 123216786.1159 |
This is quite a different result!
Sliding windows are great for moving averages
Another common thing to do in Excel is to calculate moving averages.
We can do this in SQL too by using AVG
with the OVER
clause, making sure that we specify a row order and a window size. The "window size" is some more new syntax, so let's go through an example to see how it works.
The following query shows the first few orders in the Sales.SalesOrderHeader
table with the moving average of the TotalDue
column over the last three orders (based on the SalesOrderID
column):
SELECT TOP 10
SalesOrderID,
TotalDue,
AVG(TotalDue) OVER (
ORDER BY SalesOrderID
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS MovingAverage
FROM Sales.SalesOrderHeader
ORDER BY SalesOrderID
;
SalesOrderID | TotalDue | MovingAverage |
---|---|---|
43659 | 23153.2339 | 23153.2339 |
43660 | 1457.3288 | 12305.2813 |
43661 | 36865.8012 | 20492.1213 |
43662 | 32474.9324 | 23599.3541 |
43663 | 472.3108 | 23271.0148 |
43664 | 27510.4109 | 20152.5513 |
43665 | 16158.6961 | 14713.8059 |
43666 | 5694.8564 | 16454.6544 |
43667 | 6876.3649 | 9576.6391 |
43668 | 40487.7233 | 17686.3148 |
To compare this to Excel, this is just like using the formula =AVERAGE(B2:B4)
in cell C4
and dragging it down (assuming that the TotalDue
column is in column B
and the MovingAverage
column is in column C
).
To define the sliding window (the window size), we wrote:
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
This sounds a bit funny, but should be fairly intuitive:
- The
ROWS BETWEEN
tells SQL that we're going to define a window based on the number of rows - The
2 PRECEDING
means "two rows before the current row" - The
CURRENT ROW
part is clear and means "the current row" ๐
Altogether, this means "the last three rows".
We could also look a few rows after the current row by using FOLLOWING
instead of PRECEDING
; the following window would be "the current row and the next two rows":
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
In fact, we can also specify the cumulative window that we saw earlier by using UNBOUNDED PRECEDING
(this is the default window size):
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Warning
The syntax for defining window functions is actually very flexible, and there are a few different ways to specify the window. The examples above are just the most common ways to do it to keep things simple for this course.
Define common windows with the WINDOW
clause
If you find yourself using the same window definition in multiple places of the same statement, you can define it once in the WINDOW
clause and then reference it by name:
SELECT TOP 10
SalesOrderID,
AVG(SubTotal) OVER LastThreeSalesOrderIDs AS SubTotalMovingAverage,
AVG(TaxAmt) OVER LastThreeSalesOrderIDs AS TaxAmtMovingAverage,
AVG(Freight) OVER LastThreeSalesOrderIDs AS FreightMovingAverage,
AVG(TotalDue) OVER LastThreeSalesOrderIDs AS TotalDueMovingAverage
FROM Sales.SalesOrderHeader
WINDOW LastThreeSalesOrderIDs AS (
ORDER BY SalesOrderID
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
)
SalesOrderID | SubTotalMovingAverage | TaxAmtMovingAverage | FreightMovingAverage | TotalDueMovingAverage |
---|---|---|---|---|
43659 | 20565.6206 | 1971.5149 | 616.0984 | 23153.2339 |
43660 | 10929.9367 | 1047.8816 | 327.4630 | 12305.2813 |
43661 | 18195.4507 | 1749.8442 | 546.8263 | 20492.1213 |
43662 | 20951.0868 | 2017.7275 | 630.5398 | 23599.3541 |
43663 | 20659.4888 | 1989.7341 | 621.7919 | 23271.0148 |
43664 | 17894.8655 | 1720.1416 | 537.5442 | 20152.5513 |
43665 | 13068.2796 | 1253.7343 | 391.7919 | 14713.8059 |
43666 | 14613.9565 | 1402.4365 | 438.2614 | 16454.6544 |
43667 | 8505.4476 | 816.1459 | 255.0456 | 9576.6391 |
43668 | 15702.5759 | 1511.4201 | 472.3188 | 17686.3148 |
You can use PARTITION BY
and ORDER BY
together
Wherever you're defining a window, you can use PARTITION BY
and ORDER BY
together to define the window.
There are a few use-cases for this, but a great one is to compute within-year running totals.
Consider the following query which shows the quarterly sales totals for 2012 and 2013:
SELECT
YEAR(OrderDate) AS OrderYear,
DATEPART(QUARTER, OrderDate) AS OrderQuarter,
SUM(TotalDue) AS TotalSales
FROM Sales.SalesOrderHeader
WHERE YEAR(OrderDate) IN (2012, 2013)
GROUP BY
YEAR(OrderDate),
DATEPART(QUARTER, OrderDate)
ORDER BY
OrderYear,
OrderQuarter
;
OrderYear | OrderQuarter | TotalSales |
---|---|---|
2012 | 1 | 9443736.8161 |
2012 | 2 | 9935495.1729 |
2012 | 3 | 10164406.8281 |
2012 | 4 | 8132061.4949 |
2013 | 1 | 8771886.3577 |
2013 | 2 | 12225061.3830 |
2013 | 3 | 14339319.1851 |
2013 | 4 | 13629621.0374 |
We'll whack this into a CTE to make using it easier, and then we can use the ORDER BY
and PARTITION BY
clauses to calculate the running total for each year:
WITH YearlySales AS (
SELECT
YEAR(OrderDate) AS OrderYear,
DATEPART(QUARTER, OrderDate) AS OrderQuarter,
SUM(TotalDue) AS TotalSales
FROM Sales.SalesOrderHeader
WHERE YEAR(OrderDate) IN (2012, 2013)
GROUP BY
YEAR(OrderDate),
DATEPART(QUARTER, OrderDate)
)
SELECT
OrderYear,
OrderQuarter,
TotalSales,
SUM(TotalSales) OVER (
PARTITION BY OrderYear
ORDER BY OrderQuarter
) AS WithinYearRunningTotal
FROM YearlySales
ORDER BY
OrderYear,
OrderQuarter
;
OrderYear | OrderQuarter | TotalSales | WithinYearRunningTotal |
---|---|---|---|
2012 | 1 | 9443736.8161 | 9443736.8161 |
2012 | 2 | 9935495.1729 | 19379231.9890 |
2012 | 3 | 10164406.8281 | 29543638.8171 |
2012 | 4 | 8132061.4949 | 37675700.3120 |
2013 | 1 | 8771886.3577 | 8771886.3577 |
2013 | 2 | 12225061.3830 | 20996947.7407 |
2013 | 3 | 14339319.1851 | 35336266.9258 |
2013 | 4 | 13629621.0374 | 48965887.9632 |
Super simple!
Further reading
Check out the official Microsoft documentation for more information on the OVER
clause at:
Microsoft SQL Server splits the window functions into two categories (other than aggregate functions), ranking functions and analytic functions:
- https://learn.microsoft.com/en-us/sql/t-sql/functions/ranking-functions-transact-sql
- https://learn.microsoft.com/en-us/sql/t-sql/functions/analytic-functions-transact-sql
The video version of this content is also available at: