First of all, we should mention that this topic is intended to be used by an advanced SQL user. Frankly speaking, without basic SQL knowledge, a lot of things that are discussed here can be confusing and complicated. Apart from that, it would be fair to say that all operators that are described here, are oriented to optimize and increase the productivity of your work. It means you can get the same result with the basic SQL commands.
Let's state it. You are more than an experienced SQL user who can write difficult queries with subqueries and groupings, multiple aggregations, you name it. You can always improve your skills and queries, and make them more readable and efficient.
In this topic, you will be taught how to optimize your queries with GROUPING SETS/CUBE/ROLLUP functions.
CUBE
To begin with, imagine that you want to make several queries to the same table, differing only by the condition in GROUP BY. For instance, you are curious about who is the most "violent" director of the 21st century and how many killings were committed in movies in the 21st century. So we will be working with this source of data, the sample of which is looks like this:
Film | Year | Body_Count | ... | Director |
24 Hour Party People | 2002 | 7 | ... | Michael Winterbottom |
Apocalypse Now | 1979 | 62 | ... | Francis Ford Coppola |
Fight Club | 1999 | 14 | ... | David Fincher |
Pulp Fiction | 1994 | 10 | ... | Quentin Tarantino |
Our table will be called death_count_table. And normally, you would write something like this:
WITH movies as (SELECT * FROM death_count_table WHERE Year >= 2000)
SELECT
Director,
SUM(Body_Count) FROM movies
GROUP BY Director
UNION
SELECT
'ALL',
SUM(Body_Count)
FROM movies
ORDER BY 2 descDirector | SUM(Body_Count) |
ALL | 11808 |
Peter Jackson | 1445 |
Edward Zwick | 745 |
Ridley Scott | 717 |
... | ... |
What if I told you, that there is a way to get the same result with fewer lines of code and less execution time?
Look at the code below:
SELECT
COALESCE(Director, 'ALL'),
SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY CUBE(Director)
ORDER BY 2 DESC;What we did is that we used a CUBE operator.
The CUBE operator in SQL allows for data grouping by multiple dimensions. It generates a result set that represents data aggregation on all possible combinations of specified columns. In our case there was only one combination, but if there were two or three of them, our result query would be different.
For instance, for the case of CUBE(a1, a2, a3) (where a1, a2, a3 are column names), the following combinations would be returned:
(a1, NULL, NULL)
(NULL, a2, NULL)
(NULL, NULL, a3)
(a1, a2, NULL)
(a1, NULL, a3)
(NULL, a2, a3)
(a1, a2, a3)
(NULL, NULL, NULL)
PS: That is why we used COALESCE(Director,'ALL')
Let's modify our previous query and add "Year":
SELECT
COALESCE(Director, 'All directors'),
COALESCE(CAST(Year as TEXT), 'All years'),
SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY CUBE(Director, Year)COALESCE(Director, 'All directors') | COALESCE(CAST(Year as TEXT), 'All years') | SUM(Body_Count) |
All directors | All years | 11808 |
Mel Gibson | 2006 | 114 |
Christophe Gans | 2001 | 40 |
Roland Emmerich | 2000 | 123 |
... | ... | ... |
Let's list all four combinations by which "Body_Count" is grouped in the result set:
1. Grouped by "Director" and "Year";
2. Grouped by "Director" and "All years";
3. Grouped by "Year" and "All directors";
4. Grouped by "All years" and "All directors", also known as the ground total.
ROLLUP
Rollup is another way of creating subtotals for groups of data in a table.
Let's write a query that returns the same result as our first query:
SELECT
COALESCE(Director, 'ALL'),
SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY ROLLUP (Director)
ORDER BY 2 DESC;Director | SUM(Body_Count) |
ALL | 11808 |
Peter Jackson | 1445 |
Edward Zwick | 745 |
Ridley Scott | 717 |
... | ... |
So, if the result is the same and, as you can see, syntax also is quite similar, why not just use CUBE instead of ROLLUP?
The ROLLUP operator in SQL is similar to the CUBE operator, but it generates a result set that represents data aggregation on a specified hierarchy of columns. It produces a hierarchy of subtotals and a grand total, based on the order of the specified columns. In our case there is only one grouping – so hierarchy is of only one parameter.
Look at the result table of this query:
SELECT
COALESCE(Director, 'All directors'),
COALESCE(CAST(Year as TEXT), 'All years'),
SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY rollup(Director, Year);COALESCE(Director, 'All directors') | COALESCE(CAST(Year as TEXT), 'All years') | SUM(Body_Count) |
All directors | All years | 11808 |
Mel Gibson | 2006 | 114 |
Christophe Gans | 2001 | 40 |
Roland Emmerich | 2000 | 123 |
... | ... | ... |
You see that it is just like our result set with CUBE operator, but a combination with (Grouped by "Year" and "All directors") is absent. This is because the order of grouping fields is vital in the convolution function that ROLLUP provides.
While the CUBE operator produces results by generating all combinations of columns specified in the GROUP BY CUBE clause, the ROLLUP operator generates aggregated results for the selected columns in a hierarchical way.
Let's analyze what is the average IMDB rating, broken down by "MPAA rating" and "Year".
Look at these two queries:
-- First query
SELECT
COALESCE(MPAA_Rating, 'All MPAA Ratings'),
COALESCE(CAST(Year as TEXT), 'All years'),
avg(IMDB_Rating)
FROM death_count_table
WHERE Year >= 2000
GROUP BY rollup(MPAA_Rating, Year)
Order BY 1, 2;COALESCE(MPAA_Rating, 'All MPAA Ratings') | COALESCE(CAST(Year as TEXT), 'All years') | avg(IMDB_Rating) |
All MPAA Ratings | 2006 | 6.80 |
All MPAA Ratings | 2008 | 6.65 |
All MPAA Ratings | All years | 6.70 |
All MPAA Ratings | All years | 6.81 |
PG | 2002 | 6.80 |
PG | 2005 | 6.57 |
... | ... | ... |
-- Second query
SELECT
COALESCE(MPAA_Rating, 'All MPAA Ratings'),
COALESCE(CAST(Year as TEXT), 'All years'),
avg(IMDB_Rating)
FROM death_count_table
WHERE Year >= 2000
GROUP BY rollup(MPAA_Rating), Year
Order BY 1, 2;COALESCE(MPAA_Rating, 'All MPAA Ratings') | COALESCE(CAST(Year as TEXT), 'All years') | avg(IMDB_Rating) |
All MPAA Ratings | 2000 | 6.81 |
All MPAA Ratings | 2001 | 6.44 |
All MPAA Ratings | 2002 | 6.73 |
All MPAA Ratings | 2003 | 6.44 |
All MPAA Ratings | 2004 | 7.06 |
All MPAA Ratings | 2005 | 6.73 |
... | ... | ... |
The first query works this way – it takes each unique value from the "MPAA_Rating", calculates the average IMDB rating for each existing year, and then calculates subtotal for "MPAA_Rating" for all years and gives the ground total in rows "All MPAA Ratings" for each of the unique value of the "MPAA rating".
The second query calculates the average IMDB value for all unique values from "MPAA_Rating" for each year, without counting the subtotal for each "MPAA_Rating" of the year. Then it calculates the ground total average for each year.
It may sound confusing, so I suggest you try to use the "grouping" function. You can use it to find out whether a particular column in a given row is involved in the grouping or not.
Let's try this:
SELECT
COALESCE(MPAA_Rating, 'All MPAA Ratings'),
COALESCE(CAST(Year as TEXT), 'All years'),
avg(IMDB_Rating),
grouping(MPAA_Rating)
FROM death_count_table
WHERE Year >= 2000
GROUP BY rollup (MPAA_Rating), Year
Order BY 1, 2;COALESCE(MPAA_Rating, 'All MPAA Ratings') | COALESCE(CAST(Year as TEXT), 'All years') | avg(IMDB_Rating) | grouping(MPAA_Rating) |
All MPAA Ratings | 2000 | 6.81 | 1 |
All MPAA Ratings | 2001 | 6.44 | 1 |
... | ... | ... | ... |
All MPAA Ratings | 2006 | 6.80 | 0 |
... | ... | ... | ... |
All MPAA Ratings | 2008 | 6.65 | 0 |
... | ... | ... | ... |
There are two rows with "All MPAA Ratings" and "Year": 2006 and 2008. This happened because there is a Null value of "MPAA_Rating", so it calculated the values for Null "All MPAA Ratings" and then calculated the subtotal for a whole year. The column grouping ("MPAA_Rating") with a value of 1/0 speaks for that.
Make a note that when you want to use ROLLUP operator in MySQL you should put ROLLUP at the end of the query after the name/names of the column/columns. Like this:
SELECT
COALESCE(MPAA_Rating, 'All MPAA Ratings'),
COALESCE(CAST(Year as TEXT), 'All years'),
avg(IMDB_Rating),
grouping(MPAA_Rating)
FROM death_count_table
WHERE Year >= 2000
GROUP BY (MPAA_Rating, Year) WITH ROLLUP
Order BY 1, 2;GROUPING SETS
The GROUPING SETS operator in SQL allows us to specify multiple grouping sets in a single query, instead of the necessity to write separate queries for each grouping. This comes in handy when you need to generate multiple levels of aggregation in a single report.
SELECT
Director,
Year,
SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY GROUPING SETS (Director, Year);Director | Year | SUM(Body_Count) |
Richard Loncraine | NULL | 5 |
Mark Mylod | NULL | 11 |
Danny Boyle|Loveleen Tandan | NULL | 8 |
Patrick Lussier | NULL | 42 |
... | ... | ... |
As you can see, GROUPING SETS returns data as a result of the query as follows: in each row, one of the columns listed in parentheses corresponds to a value, while the other columns (from the list in parentheses) are filled with NULL. Columns not listed in GROUPING SETS are calculated as usual.
Let's modify our query from ROLLUP:
SELECT
COALESCE(MPAA_Rating, 'All MPAA Ratings'),
COALESCE(CAST(Year as TEXT), 'All years'),
avg(IMDB_Rating),
grouping(MPAA_Rating)
FROM death_count_table
WHERE Year >= 2000
GROUP BY GROUPING SETs (MPAA_Rating), Year
Order BY 1, 2;COALESCE(MPAA_Rating, 'All MPAA Ratings') | COALESCE(CAST(Year as TEXT), 'All years') | avg(IMDB_Rating) | grouping(MPAA_Rating) |
All MPAA Ratings | 2006 | 6.80 | 0 |
All MPAA Ratings | 2008 | 6.65 | 0 |
PG | 2002 | 6.80 | 0 |
PG | 2005 | 6.57 | 0 |
... | ... | ... | ... |
As you can see, it is like our ROLLUP query except there are no ground totals. Simply put, it is like rollup but all grouping ("MPAA_Rating") equals zero.
Comparison of the functions
Let's get back to our first query and rewrite it with all three operators that we learned earlier:
-- CUBE
SELECT
COALESCE(Director, 'ALL'),
SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY CUBE(Director)
ORDER BY 2 DESC;-- ROLLUP
SELECT
COALESCE(Director, 'ALL'),
SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY ROLLUP(Director)
ORDER BY 2 DESC;Director | SUM(Body_Count) |
ALL | 11808 |
Peter Jackson | 1445 |
Edward Zwick | 745 |
Ridley Scott | 717 |
... | ... |
-- GROUPING SETS
SELECT
COALESCE(Director, 'ALL'),
SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY GROUPING SETS(Director)
ORDER BY 2 DESC;Director | SUM(Body_Count) |
Peter Jackson | 1445 |
Edward Zwick | 745 |
Ridley Scott | 717 |
Wolfgang Petersen | 572 |
... | ... |
As you can see ROLLUP and CUBE gave us the same result tables and GROUPING SETS query almost the same except ground total row.
To sum up, ROLLUP and CUBE produce subtotals and totals in a single query.
GROUPING SETS defines levels of grouping in the GROUP BY without computing every level of the dimensions.
Conclusion
Congratulation! You've reached the end of the topic! You not only figured out the principle of the work of 3 operators (CUBE, ROLLUP, GROUPING SETS), but also learned how you can optimize and improve your existing queries.
And, as a bonus, you now know that Peter Jackson's movies are much more violent than Quentin Tarantino's. I guess hobbits are more dangerous than a vindictive mother with katana.