Computer scienceSoftware development foundationsSQL and DatabasesDBMSMySQLOptimization

Leveraging Built-in SQL Functions and Features

12 minutes read

Introduction

SQL provides a rich set of built-in functions and features that can be used to improve the efficiency and performance of user queries. This topic will discuss three specific ways to do this: using window functions, temporary tables or table variables, and built-in analytical functions.

Window functions

Window functions are a effectual tool for performing advanced analytical queries in SQL. They allow you to perform calculations across a set of rows related to the current row. Window functions can be used to calculate running totals, moving averages, and other advanced calculations. By using window functions, you can eliminate the need for subqueries and other complex SQL constructions, improving query performance.

The window functions are implemented using the OVER() clause, which specifies the window or subset of data that the function should operate on. The OVER() clause takes two optional parameters: PARTITION BY and ORDER BY.

The PARTITION BY clause is used to group your data into partitions or subsets, based on one or more columns.
The ORDER BY clause is used to sort your data within each partition. This is important because many window functions, such as running totals or moving averages, rely on the order of the data.

Suppose we want to find the top 3 marks by total quantity for each subject. We can use the ROW_NUMBER() window function to accomplish this. Here's an example of a query:

Use this data to create tables.

Scheme of the database

WITH a AS
    (
SELECT
       sb.name AS subject_name,
       m.mark,
       count(*) AS total_quantity
FROM mark m
INNER JOIN students s ON m.student_id=s.id
INNER JOIN subject_type sb on sb.id=m.subject_id
GROUP BY subject_name,m.mark
ORDER BY 1,2)

SELECT
    subject_name,
    mark
FROM (SELECT subject_name,
             mark,
             ROW_NUMBER() OVER (PARTITION BY subject_name ORDER BY total_quantity DESC) as row_num
      FROM a
     )t
WHERE row_num<=3

In this query, the inner subquery selects the subject_name, mark columns from CTE (where we calculated the number of marks grouped by subject_nameand mark itself), and also calculates a row number for each row within a partition of subject_name, ordered by total_quantity in descending order.

*Common Table Expressions (CTEs) is a function in SQL that allows you to define a named temporary result set that can be used within the context of a single query.

The outer query selects the same columns from the subquery, but it returns only the rows where the row_num column is less than or equal to 3. This gives us the top 3 marks by total quantity for each subject effectively.

By using the ROW_NUMBER() window function this way, we were able to perform an advanced analytical query with relatively simple SQL code.

Let's get back a little bit and talk more about our row_number expression:

ROW_NUMBER() OVER (PARTITION BY subject_name ORDER BY total_quantity DESC) as row_num

We number each row by grouping it by the name of the subject (for this operation we use PARTITION BY) and sort by a total_quantity in descending order (for this operation we use ORDER BY and DESC).

Temporary tables or table variables

Temporary tables or table variables can be used to store the intermediate results of a query. By storing intermediate results, you can reduce the amount of data that needs to be processed and improve query performance. Temporary tables are created in the tempdb database and automatically dropped when the session that created them ends. Table variables are similar to temporary tables, but stored in memory instead of on disk.

Let's return to our query. Let's imagine a table mark with millions of rows. To avoid using CTE or subquery, we can create a temporary table to store the aggregated data, and then use it to get the result set that we need:

-- create a temporary table to store aggregated data
CREATE TEMPORARY TABLE mark_subject_quantity AS
SELECT
       sb.name AS subject_name,
       m.mark,
       count(*) AS total_quantity
FROM mark m
INNER JOIN students s ON m.student_id=s.id
INNER JOIN subject_type sb on sb.id=m.subject_id
GROUP BY subject_name,m.mark
ORDER BY 1,2;

-- write our query
SELECT
    subject_name,
    mark
FROM (SELECT 
          subject_name,
          mark,
          ROW_NUMBER() OVER (PARTITION BY subject_name ORDER BY total_quantity DESC) as row_num
      FROM mark_subject_quantity 
     )t
WHERE row_num<=3

In this example, we create a temporary table mark_subject_quantityusing the CREATE TEMPORARY TABLE statement, which is a built-in feature of SQL that allows us to create a table that is only visible within the current session. We use the SELECT statement with the GROUP BY clause to aggregate the quantity data by mark and subject and then we store it in the temporary table.

Next, we rewrite the previous query but with the temporary table. By using the temporary table, we avoid having to aggregate the data repeatedly for each row in the result set, which can be very demanding for large tables.

Note that the syntax for creating and using temporary tables may vary depending on the SQL database you use, so be sure to consult the documentation for your specific database.

Built-in analytical functions

SQL provides several built-in analytical functions that can be used to perform complex calculations. These functions include ROW_NUMBER(), RANK(), DENSE_RANK(), and NTILE(). These functions can be used to rank and partition data, calculate running totals, and perform other advanced calculations. By using built-in analytical functions, you can reduce the amount of code needed to perform complex calculations, which can improve query performance.

Imagine we need to rank the students based on the total A-marks they have got. We can use the ROW_NUMBER() function to achieve this:

SELECT
       s.name AS students_name,
       ROW_NUMBER() OVER (ORDER BY sum(case when m.mark=1 then 1 else 0 end) DESC) AS students_rank
FROM mark m
INNER JOIN students s ON m.student_id=s.id
INNER JOIN subject_type sb on sb.id=m.subject_id
GROUP BY students_name

In this query, we group A-marks data by students_name using the GROUP BY clause. Then we use the SUM() function to calculate the total A-marks amount for each student. Finally, we use the ROW_NUMBER() function with the OVER clause to rank the students based on their total A-marks amount in descending order.

Other built-in analytical functions like RANK(), DENSE_RANK(), and NTILE() can also be used for similar purposes, depending on the specific requirements of the query. It will also be helpful to use aggregated functions and operators. Here is a great topic on this topic.

Suppose that we need to rank the students by their marks, and divide them into 4 groups (excellent students, good students, triples, twos) based on the ranking.

SELECT
       s.name AS students_name,
       f.name AS faculty_name,
       NTILE(4) OVER(PARTITION BY avg(m.mark) ORDER BY s.name DESC) AS quartile
FROM mark m
INNER JOIN students s ON m.student_id=s.id
INNER JOIN faculty_name f ON f.id=s.faculty_id
INNER JOIN subject_type sb on sb.id=m.subject_id
GROUP BY students_name, faculty_name
ORDER BY 3;

This will partition the data by average mark, order the rows by students_name in descending order, and divide them into 4 groups using NTILE(). The result set will contain four quartile groups, where each will represent the ranking of each student by the average mark. By using NTILE(), you can easily divide your data into equal groups based on a ranking or some other criteria, which is useful for various analysis and reporting tasks.

But what if we don't know how many groups there are (4 or 5 or more)? In this case, we can use RANK() or DENSE_RANK().

Note the difference between these two functions. RANK() assigns the same rank to rows with the same values, leaving gaps in the ranking sequence when there are ties, while the DENSE_RANK() assigns the same rank to rows with the same values, but does not leave gaps in the ranking sequence when there are ties.

Since we do not want any blanks we will use DENSE_RANK(), so the query will look like this:

SELECT
       s.name AS students_name,
       f.name AS faculty_name,
       DENSE_RANK() OVER(PARTITION BY avg(m.mark) ORDER BY s.name DESC) AS dense_avg_rank
FROM mark m
INNER JOIN students s ON m.student_id=s.id
INNER JOIN faculty_name f ON f.id=s.faculty_id
INNER JOIN subject_type sb on sb.id=m.subject_id
GROUP BY students_name, faculty_name
ORDER BY 3;

By using DENSE_RANK(), we divided our data into several groups based on average marks.

Conclusion

Congratulation! You've reached the end of the topic! Now you know that leveraging built-in SQL functions and features can be a nice way to improve query performance.

You are also aware of how to use window functions, temporary tables or table variables, and built-in analytical functions to reduce the amount of code needed to perform complex calculations and eliminate the need for subqueries and other complex SQL constructs.

13 learners liked this piece of theory. 2 didn't like it. What about you?

Report a typo