Earlier we've learned how useful aggregation functions can be for solving simple analytical tasks. Now imagine that we need to compute an aggregate value not for the whole table or one specific category but rather for many separate categories. For example, there is a table named stocks with historical data on prices.
stock_name | price | datetime |
|---|---|---|
WTI | 89.8 | 2020-05-17 11:00 |
FB | 26.3 | 2020-04-11 10:23 |
WTI | 18.9 | 2019-01-18 23:02 |
WTI | 20.9 | 2019-01-18 23:02 |
FB | 15.6 | NULL |
DJI | 52.7 | 2004-05-28 21:09 |
FB | 63.7 | 1998-03-31 04:18 |
It is feasible but quite inconvenient to execute a query like this for each category:
SELECT MAX(price)
FROM stocks
WHERE stock_name = 'WTI';Even more than that, what if we want to get corresponding results for all stocks in the form of a table and use it for further computations?
In order to simplify such a workload, GROUP BY statement was introduced in SQL. In this topic, you will get a perspective on how it works.
General form
Here's how we can use GROUP BY for the task:
SELECT
stock_name,
MAX(price) AS maximum
FROM
stocks
GROUP BY
stock_name;The output would look like this:
stock_name | maximum |
|---|---|
WTI | 89.8 |
FB | 63.7 |
DJI | 52.7 |
What's so special about this query? In the GROUP BY clause, we specify the name of a column from the table. Every unique value from this column will get its own result of every utilized aggregate function from SELECT block. Rows corresponding to this value will be taken as an input. It's possible to group things by a computed value that is not currently present in a table; for example, we can introduce groups based on length of stock_name using GROUP BY LENGTH(stock_name).
Nothing forbids us to use more than one aggregate function. Returned values can be completely independent of each other:
SELECT
stock_name,
MIN(datetime) AS moment,
MAX(price) AS maximum
FROM
stocks
GROUP BY
stock_name;stock_name | moment | maximum |
|---|---|---|
WTI | 2019-01-18 23:02 | 89.8 |
FB | 1998-03-31 04:18 | 63.7 |
DJI | 2004-05-28 21:09 | 52.7 |
If there are several columns in the GROUP BY clause, each unique combination of values from these columns will be aggregated separately.
SELECT
stock_name,
datetime,
MAX(price)
FROM
stocks
GROUP BY
stock_name,
datetime;stock_name | datetime | price |
|---|---|---|
WTI | 2020-05-17 11:00 | 89.8 |
FB | 2020-04-11 10:23 | 26.3 |
WTI | 2019-01-18 23:02 | 20.9 |
FB | NULL | 15.6 |
DJI | 2004-05-28 21:09 | 52.7 |
FB | 1998-03-31 04:18 | 63.7 |
Here we see that NULL value forms a separate category because it is considered to be a unique value. Two rows for WTI, 2019-01-18 23:02 got merged into one with the maximum price of 20.9. In terms of result, grouping query without any aggregate functions equals to:
SELECT DISTINCT
stock_name,
datetime
FROM
stocks;
If a column is not mentioned in GROUP BY clause and there is at least one aggregate function being used in SELECT, this column can not be used in SELECT portion of the query without being wrapped in an aggregate function.
HAVING keyword
The GROUP BY statement is usually used with WHERE statement to filter the rows and the ORDER BY to order them. There is another clause that is especially helpful with grouping tasks – HAVING. If WHERE accepts conditions on values that certain cells have, HAVING does the same but for values of already computed aggregations. For example, let's select "stock-datetime" pairs with maximum price above 50:
SELECT
stock_name,
datetime,
MAX(price) AS maximum
FROM
stocks
GROUP BY
stock_name,
datetime
HAVING
MAX(price) > 50;stock_name | datetime | maximum |
|---|---|---|
WTI | 2020-05-17 11:00 | 89.8 |
DJI | 2004-05-28 21:09 | 52.7 |
FB | 1998-03-31 04:18 | 63.7 |
A fair question would be "why can't we utilize WHERE for the filtering"? The reason for that is the order of evaluation of the statements:
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
All the conditions that you put in HAVING have to relate to aggregation functions. However, besides that there are no other special restrictions.
Conclusion
Summing up, the template for queries with grouping is the following:
SELECT column_name [, list_of_other_columns]
, aggregation [, list_of_aggregations]
FROM table_name
[WHERE list_of_conditions]
GROUP BY column_name [, list_of_other_columns]
[HAVING list_of_aggregate_conditions]
[ORDER BY list_of_columns/aliases];GROUP BY is arguably one of the most common SQL statements. It would be quite hard to find a relatively serious SQL code base without grouping queries. Now let's go straight to the tasks to check how you've comprehended this topic.