Computer scienceSoftware development foundationsSQL and DatabasesBasics SQLRetrieving Data

CUBE, ROLLUP and GROUPING SETS

13 minutes read

First of all, we should mention that this topic is intended to be used by an advanced SQL user. Frankly speaking, without basic SQL knowledge, a lot of things that are discussed here can be confusing and complicated. Apart from that, it would be fair to say that all operators that are described here, are oriented to optimize and increase the productivity of your work. It means you can get the same result with the basic SQL commands.

Let's state it. You are more than an experienced SQL user who can write difficult queries with subqueries and groupings, multiple aggregations, you name it. You can always improve your skills and queries, and make them more readable and efficient.

In this topic, you will be taught how to optimize your queries with GROUPING SETS/CUBE/ROLLUP functions.

CUBE

To begin with, imagine that you want to make several queries to the same table, differing only by the condition in GROUP BY. For instance, you are curious about who is the most "violent" director of the 21st century and how many killings were committed in movies in the 21st century. So we will be working with this source of data, the sample of which is looks like this:

Film	Year	Body_Count	...	Director
24 Hour Party People	2002	7	...	Michael Winterbottom
Apocalypse Now	1979	62	...	Francis Ford Coppola
Fight Club	1999	14	...	David Fincher
Pulp Fiction	1994	10	...	Quentin Tarantino

Our table will be called death_count_table. And normally, you would write something like this:

WITH movies as (SELECT * FROM death_count_table WHERE Year >= 2000)
SELECT
    Director,
    SUM(Body_Count) FROM movies
GROUP BY Director
UNION
SELECT
    'ALL',
    SUM(Body_Count)
FROM movies
ORDER BY 2 desc

Director	SUM(Body_Count)
ALL	11808
Peter Jackson	1445
Edward Zwick	745
Ridley Scott	717
...	...

What if I told you, that there is a way to get the same result with fewer lines of code and less execution time?

Look at the code below:

SELECT
    COALESCE(Director, 'ALL'),
    SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY CUBE(Director)
ORDER BY 2 DESC;

What we did is that we used a CUBE operator.

The CUBE operator in SQL allows for data grouping by multiple dimensions. It generates a result set that represents data aggregation on all possible combinations of specified columns. In our case there was only one combination, but if there were two or three of them, our result query would be different.

For instance, for the case of CUBE(a1, a2, a3) (where a1, a2, a3 are column names), the following combinations would be returned:

(a1, NULL, NULL)

(NULL, a2, NULL)

(NULL, NULL, a3)

(a1, a2, NULL)

(a1, NULL, a3)

(NULL, a2, a3)

(a1, a2, a3)

(NULL, NULL, NULL)

PS: That is why we used COALESCE(Director,'ALL')

Let's modify our previous query and add "Year":

SELECT
    COALESCE(Director, 'All directors'),
    COALESCE(CAST(Year as TEXT), 'All years'),
    SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY CUBE(Director, Year)

COALESCE(Director, 'All directors')	COALESCE(CAST(Year as TEXT), 'All years')	SUM(Body_Count)
All directors	All years	11808
Mel Gibson	2006	114
Christophe Gans	2001	40
Roland Emmerich	2000	123
...	...	...

Let's list all four combinations by which "Body_Count" is grouped in the result set:

1. Grouped by "Director" and "Year";

2. Grouped by "Director" and "All years";

3. Grouped by "Year" and "All directors";

4. Grouped by "All years" and "All directors", also known as the ground total.

ROLLUP

Rollup is another way of creating subtotals for groups of data in a table.

Let's write a query that returns the same result as our first query:

SELECT
    COALESCE(Director, 'ALL'),
    SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY ROLLUP (Director)
ORDER BY 2 DESC;

Director	SUM(Body_Count)
ALL	11808
Peter Jackson	1445
Edward Zwick	745
Ridley Scott	717
...	...

So, if the result is the same and, as you can see, syntax also is quite similar, why not just use CUBE instead of ROLLUP?

The ROLLUP operator in SQL is similar to the CUBE operator, but it generates a result set that represents data aggregation on a specified hierarchy of columns. It produces a hierarchy of subtotals and a grand total, based on the order of the specified columns. In our case there is only one grouping – so hierarchy is of only one parameter.

Look at the result table of this query:

SELECT
    COALESCE(Director, 'All directors'),
    COALESCE(CAST(Year as TEXT), 'All years'),
    SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY rollup(Director, Year);

COALESCE(Director, 'All directors')	COALESCE(CAST(Year as TEXT), 'All years')	SUM(Body_Count)
All directors	All years	11808
Mel Gibson	2006	114
Christophe Gans	2001	40
Roland Emmerich	2000	123
...	...	...

You see that it is just like our result set with CUBE operator, but a combination with (Grouped by "Year" and "All directors") is absent. This is because the order of grouping fields is vital in the convolution function that ROLLUP provides.

While the CUBE operator produces results by generating all combinations of columns specified in the GROUP BY CUBE clause, the ROLLUP operator generates aggregated results for the selected columns in a hierarchical way.

Let's analyze what is the average IMDB rating, broken down by "MPAA rating" and "Year".

Look at these two queries:

-- First query
SELECT
    COALESCE(MPAA_Rating, 'All MPAA Ratings'),
    COALESCE(CAST(Year as TEXT), 'All years'),
    avg(IMDB_Rating)
FROM death_count_table
WHERE Year >= 2000
GROUP BY rollup(MPAA_Rating, Year)
Order BY 1, 2;

COALESCE(MPAA_Rating, 'All MPAA Ratings')	COALESCE(CAST(Year as TEXT), 'All years')	avg(IMDB_Rating)
All MPAA Ratings	2006	6.80
All MPAA Ratings	2008	6.65
All MPAA Ratings	All years	6.70
All MPAA Ratings	All years	6.81
PG	2002	6.80
PG	2005	6.57
...	...	...

-- Second query
SELECT
    COALESCE(MPAA_Rating, 'All MPAA Ratings'),
    COALESCE(CAST(Year as TEXT), 'All years'),
    avg(IMDB_Rating)
FROM death_count_table
WHERE Year >= 2000
GROUP BY rollup(MPAA_Rating), Year
Order BY 1, 2;

COALESCE(MPAA_Rating, 'All MPAA Ratings')	COALESCE(CAST(Year as TEXT), 'All years')	avg(IMDB_Rating)
All MPAA Ratings	2000	6.81
All MPAA Ratings	2001	6.44
All MPAA Ratings	2002	6.73
All MPAA Ratings	2003	6.44
All MPAA Ratings	2004	7.06
All MPAA Ratings	2005	6.73
...	...	...

The first query works this way – it takes each unique value from the "MPAA_Rating", calculates the average IMDB rating for each existing year, and then calculates subtotal for "MPAA_Rating" for all years and gives the ground total in rows "All MPAA Ratings" for each of the unique value of the "MPAA rating".

The second query calculates the average IMDB value for all unique values from "MPAA_Rating" for each year, without counting the subtotal for each "MPAA_Rating" of the year. Then it calculates the ground total average for each year.

It may sound confusing, so I suggest you try to use the "grouping" function. You can use it to find out whether a particular column in a given row is involved in the grouping or not.

Let's try this:

SELECT
    COALESCE(MPAA_Rating, 'All MPAA Ratings'),
    COALESCE(CAST(Year as TEXT), 'All years'),
    avg(IMDB_Rating),
    grouping(MPAA_Rating)
FROM death_count_table
WHERE Year >= 2000
GROUP BY rollup (MPAA_Rating), Year
Order BY 1, 2;

COALESCE(MPAA_Rating, 'All MPAA Ratings')	COALESCE(CAST(Year as TEXT), 'All years')	avg(IMDB_Rating)	grouping(MPAA_Rating)
All MPAA Ratings	2000	6.81	1
All MPAA Ratings	2001	6.44	1
...	...	...	...
All MPAA Ratings	2006	6.80	0
...	...	...	...
All MPAA Ratings	2008	6.65	0
...	...	...	...

There are two rows with "All MPAA Ratings" and "Year": 2006 and 2008. This happened because there is a Null value of "MPAA_Rating", so it calculated the values for Null "All MPAA Ratings" and then calculated the subtotal for a whole year. The column grouping ("MPAA_Rating") with a value of 1/0 speaks for that.

Make a note that when you want to use ROLLUP operator in MySQL you should put ROLLUP at the end of the query after the name/names of the column/columns. Like this:

SELECT
    COALESCE(MPAA_Rating, 'All MPAA Ratings'),
    COALESCE(CAST(Year as TEXT), 'All years'),
    avg(IMDB_Rating),
    grouping(MPAA_Rating)
FROM death_count_table
WHERE Year >= 2000
GROUP BY (MPAA_Rating, Year) WITH ROLLUP
Order BY 1, 2;

GROUPING SETS

The GROUPING SETS operator in SQL allows us to specify multiple grouping sets in a single query, instead of the necessity to write separate queries for each grouping. This comes in handy when you need to generate multiple levels of aggregation in a single report.

SELECT
    Director,
    Year,
    SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY GROUPING SETS (Director, Year);

Director	Year	SUM(Body_Count)
Richard Loncraine	NULL	5
Mark Mylod	NULL	11
Danny Boyle\|Loveleen Tandan	NULL	8
Patrick Lussier	NULL	42
...	...	...

As you can see, GROUPING SETS returns data as a result of the query as follows: in each row, one of the columns listed in parentheses corresponds to a value, while the other columns (from the list in parentheses) are filled with NULL. Columns not listed in GROUPING SETS are calculated as usual.

Let's modify our query from ROLLUP:

SELECT
    COALESCE(MPAA_Rating, 'All MPAA Ratings'),
    COALESCE(CAST(Year as TEXT), 'All years'),
    avg(IMDB_Rating),
    grouping(MPAA_Rating)
FROM death_count_table
WHERE Year >= 2000
GROUP BY GROUPING SETs (MPAA_Rating), Year
Order BY 1, 2;

COALESCE(MPAA_Rating, 'All MPAA Ratings')	COALESCE(CAST(Year as TEXT), 'All years')	avg(IMDB_Rating)	grouping(MPAA_Rating)
All MPAA Ratings	2006	6.80	0
All MPAA Ratings	2008	6.65	0
PG	2002	6.80	0
PG	2005	6.57	0
...	...	...	...

As you can see, it is like our ROLLUP query except there are no ground totals. Simply put, it is like rollup but all grouping ("MPAA_Rating") equals zero.

Comparison of the functions

Let's get back to our first query and rewrite it with all three operators that we learned earlier:

-- CUBE
SELECT
    COALESCE(Director, 'ALL'),
    SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY CUBE(Director)
ORDER BY 2 DESC;

-- ROLLUP
SELECT
    COALESCE(Director, 'ALL'),
    SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY ROLLUP(Director)
ORDER BY 2 DESC;

Director	SUM(Body_Count)
ALL	11808
Peter Jackson	1445
Edward Zwick	745
Ridley Scott	717
...	...

-- GROUPING SETS
SELECT
    COALESCE(Director, 'ALL'),
    SUM(Body_Count)
FROM death_count_table
WHERE Year >= 2000
GROUP BY GROUPING SETS(Director)
ORDER BY 2 DESC;

Director	SUM(Body_Count)
Peter Jackson	1445
Edward Zwick	745
Ridley Scott	717
Wolfgang Petersen	572
...	...

As you can see ROLLUP and CUBE gave us the same result tables and GROUPING SETS query almost the same except ground total row.

To sum up, ROLLUP and CUBE produce subtotals and totals in a single query.

GROUPING SETS defines levels of grouping in the GROUP BY without computing every level of the dimensions.

Conclusion

Congratulation! You've reached the end of the topic! You not only figured out the principle of the work of 3 operators (CUBE, ROLLUP, GROUPING SETS), but also learned how you can optimize and improve your existing queries.

And, as a bonus, you now know that Peter Jackson's movies are much more violent than Quentin Tarantino's. I guess hobbits are more dangerous than a vindictive mother with katana.

12 learners liked this piece of theory. 18 didn't like it. What about you?

Report a typo