
At some point in query writing, you may face performance problems: you write a query, but it takes much longer to execute than you expected. In this case, it's important to understand how the database actually executes the query. Fortunately, most DBMSs provide a built-in tool for exploring the particular steps that will be performed for a query.

Analyzing a sample database

We will be using the staff sample database published by Michal Juhas. In particular, we will analyze two tables: employee and salary. Here are the approximate definitions of these tables:

CREATE TABLE employee (
  id int unsigned NOT NULL AUTO_INCREMENT,
  first_name varchar(14) NOT NULL,
  last_name varchar(16) NOT NULL,
  birth_date date NOT NULL,

  PRIMARY KEY (id)

  -- some fields and constraints are skipped
);

CREATE TABLE salary (
  id int unsigned NOT NULL AUTO_INCREMENT,
  employee_id int unsigned NOT NULL,
  salary_amount decimal(11,2) NOT NULL,
  from_date date NULL,
  to_date date NULL,

  PRIMARY KEY (id),
  UNIQUE KEY ak_salary (employee_id,from_date,to_date),
  KEY idx_salary_amount (salary_amount),
  KEY idx_employee_id (employee_id),
  CONSTRAINT salary_ibfk_1 FOREIGN KEY (employee_id) REFERENCES employee (id)

  -- some fields and constraints are skipped
);

The employee table has a surrogate id column and contains information about a person: their name and birth date. The salary table references employee entries by the employee_id identifier. It stores an employee's salary_amount for a period that starts at from_date and lasts until to_date. A NULL to_date means that the period has not ended yet.

Suppose we need to find employees who were born in the 1960s and have above-average salaries. The query might look like this:

SELECT *
FROM employee e,
     salary s
WHERE birth_date BETWEEN '1960-01-01' AND '1969-12-31'
    AND e.id = s.employee_id
    AND to_date IS NULL
    AND salary_amount >
        (SELECT avg(salary_amount)
         FROM salary
         WHERE to_date IS NULL);

The EXPLAIN keyword

The list of steps for a query is called a query execution plan. It can be displayed for most SQL queries with the EXPLAIN statement. By default, the output is formatted as a table and omits a lot of information. The real structure of the plan is a tree, so we will add FORMAT=TREE to see the complete information. Let's look at our query plan:

EXPLAIN FORMAT=TREE
SELECT *
FROM employee e,
     salary s
WHERE birth_date BETWEEN '1960-01-01' AND '1969-12-31'
    AND e.id = s.employee_id
    AND to_date IS NULL
    AND salary_amount >
        (SELECT avg(salary_amount)
         FROM salary
         WHERE to_date IS NULL);
 -> Nested loop inner join  (cost=273786.16 rows=15525)
    -> Filter: (e.birth_date between '1960-01-01' and '1969-12-31')  (cost=32498.80 rows=33009)
        -> Table scan on e  (cost=32498.80 rows=297108)
    -> Filter: (s.salary_amount > (select #2))  (cost=6.37 rows=0)
        -> Index lookup on s using ak_salary (employee_id=e.id), with index condition: (s.to_date is null)  (cost=6.37 rows=9)
        -> Select #2 (subquery in condition; run only once)
            -> Aggregate: avg(salary.salary_amount)
                -> Filter: (salary.to_date is null)  (cost=268862.68 rows=259645)
                    -> Table scan on salary  (cost=268862.68 rows=2596450)

Our query forms a fairly complex tree where every line represents a single node. In the execution phase, this tree is traversed to produce the result for the user.

Execution starts at the bottom of the tree, i.e. with a table scan on either the employee or the salary table. That kind of step reads the table row by row; the filter node above it then removes the rows that do not match the predicate. The sub-tree computing the average salary amount is evaluated once, and its result is used in the filter on salary. Notice that the index ak_salary is used for the salary lookup. Finally, the topmost step performs the join with the naive nested loop algorithm.
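If you need even more detail than the tree rendering provides, MySQL also supports a JSON format that exposes extra fields, such as per-table cost breakdowns. A minimal sketch, reusing the employee filter from our query:

```sql
-- FORMAT=JSON includes additional details (e.g. cost_info per table)
-- that FORMAT=TREE does not show
EXPLAIN FORMAT=JSON
SELECT *
FROM employee e
WHERE e.birth_date BETWEEN '1960-01-01' AND '1969-12-31';
```

The JSON output is more verbose but can be handy when you need to inspect the optimizer's per-step cost attribution programmatically.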

Plan details

In the parentheses of each step, you may have noticed additional estimates for the (sub-)query. The first is the execution cost, an abstract value that can be used for comparing the performance of different nodes: the larger the value, the longer the node takes to execute. The cost calculation is based on table statistics (e.g. the number of rows and the distribution of values) and the weights of low-level operations (e.g. a disk block read or a numeric comparison). The second value, rows, is the estimated number of rows returned by the node.

The plan is the DBMS's best guess for the query and might be inaccurate, so the actual number of rows returned may be quite different. It's important to understand that the plan is an estimate rather than data derived from a real execution. It may deviate significantly from the real execution because it relies on statistics, which do not fully describe the stored data. For example, without an appropriate index, it's impossible to guess the exact number of rows matching the predicate WHERE x * x = 10.
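Estimates can also degrade when table statistics become stale, for example after bulk inserts or deletes. In MySQL, the statistics the optimizer relies on can be refreshed explicitly:

```sql
-- Recompute the key distribution statistics that feed
-- the optimizer's cost and row estimates
ANALYZE TABLE employee, salary;
```

Re-running EXPLAIN after refreshing statistics may yield noticeably different (and usually more accurate) estimates.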

Using the plan for query optimization

The query execution plan can give insights into further optimizations. In particular, the most common way of speeding up a query is with indexes. A query plan helps you understand which indexes are used and which steps might be improved.

In our query, the DBMS has chosen a sequential scan for the birth_date filter. Notice that this filter is expected to reduce the row set significantly: assuming the birth year distribution is roughly uniform, many employees were born outside the selected range. With deeper knowledge of our data, we can direct the DBMS toward a more efficient execution. One solution is a B-tree index on the column in question, so let's try it out:

CREATE INDEX birth_day_index ON employee(birth_date);
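To confirm the index was created and to inspect its cardinality estimate, we can list the indexes on the table:

```sql
-- Shows all indexes on employee, including the new birth_day_index,
-- along with the optimizer's cardinality estimate for each
SHOW INDEX FROM employee;
```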

Now the DBMS has an additional way to perform the birth_date filtering. As one may guess, it chose an index scan instead of a sequential one and, moreover, produced more accurate estimates:

 -> Nested loop inner join  (cost=398274.09 rows=64911)
    -> Filter: ((s.to_date is null) and (s.salary_amount > (select #2)))  (cost=268352.49 rows=129823)
        -> Table scan on s  (cost=268352.49 rows=2596450)
        -> Select #2 (subquery in condition; run only once)
            -> Aggregate: avg(salary.salary_amount)
                -> Filter: (salary.to_date is null)  (cost=268352.49 rows=259645)
                    -> Table scan on salary  (cost=268352.49 rows=2596450)
    -> Filter: (e.birth_date between '1960-01-01' and '1969-12-31')  (cost=0.90 rows=0)
        -> Single-row index lookup on e using PRIMARY (id=s.employee_id)  (cost=0.90 rows=1)

Notice that both cost and row estimates are reduced significantly.

Comparing prediction with real execution results

As mentioned earlier, the EXPLAIN statement only prepares the plan and predicts the execution, and these predictions can vary significantly from the real execution numbers. Fortunately, MySQL provides the EXPLAIN ANALYZE statement for comparing actual data with the predicted values. Its default output format is a tree, so there's no need to specify it manually:

EXPLAIN ANALYZE
SELECT *
FROM employee e,
     salary s
WHERE birth_date BETWEEN '1960-01-01' AND '1969-12-31'
    AND e.id = s.employee_id
    AND to_date IS NULL
    AND salary_amount >
        (SELECT avg(salary_amount)
         FROM salary
         WHERE to_date IS NULL);
 -> Nested loop inner join  (cost=387620.33 rows=70028) (actual time=0.115..5081.163 rows=42086 loops=1)
    -> Filter: ((s.to_date is null) and (s.salary_amount > (select #2)))  (cost=294689.78 rows=140056) (actual time=0.067..3675.053 rows=107706 loops=1)
        -> Table scan on s  (cost=294689.78 rows=2829474) (actual time=0.061..3429.357 rows=2844047 loops=1)
        -> Select #2 (subquery in condition; run only once)
            -> Aggregate: avg(salary.salary_amount)  (actual time=826.113..826.114 rows=1 loops=1)
                -> Filter: (salary.to_date is null)  (cost=294689.78 rows=282947) (actual time=0.187..797.753 rows=240124 loops=1)
                    -> Table scan on salary  (cost=294689.78 rows=2829474) (actual time=0.176..682.934 rows=2844047 loops=1)
    -> Filter: (e.birth_date between '1960-01-01' and '1969-12-31')  (cost=0.56 rows=0) (actual time=0.013..0.013 rows=0 loops=107706)
        -> Single-row index lookup on e using PRIMARY (id=s.employee_id)  (cost=0.56 rows=1) (actual time=0.012..0.012 rows=1 loops=107706)

You may notice that the output is now extended with actual data: execution times and row counts. Here we can see a slowdown in the salary table scan. One option is to force an index with the USE INDEX hint:

EXPLAIN ANALYZE
SELECT *
FROM employee e USE INDEX(birth_day_index),
     salary s USE INDEX(ak_salary)
WHERE birth_date BETWEEN '1960-01-01' AND '1969-12-31'
    AND e.id = s.employee_id
    AND to_date IS NULL
    AND salary_amount >
        (SELECT avg(salary_amount)
         FROM salary
         WHERE to_date IS NULL);
 -> Nested loop inner join  (cost=1100331.05 rows=47230) (actual time=828.994..2296.875 rows=42086 loops=1)
    -> Filter: (e.birth_date between '1960-01-01' and '1969-12-31')  (cost=32437.23 rows=148554) (actual time=0.351..408.640 rows=117138 loops=1)
        -> Table scan on e  (cost=32437.23 rows=297108) (actual time=0.335..264.547 rows=300024 loops=1)
    -> Filter: (s.salary_amount > (select #2))  (cost=6.23 rows=0) (actual time=0.016..0.016 rows=0 loops=117138)
        -> Index lookup on s using ak_salary (employee_id=e.id), with index condition: (s.to_date is null)  (cost=6.23 rows=10) (actual time=0.008..0.008 rows=1 loops=117138)
        -> Select #2 (subquery in condition; run only once)
            -> Aggregate: avg(salary.salary_amount)  (actual time=828.390..828.391 rows=1 loops=1)
                -> Filter: (salary.to_date is null)  (cost=292648.46 rows=282947) (actual time=0.184..799.661 rows=240124 loops=1)
                    -> Table scan on salary  (cost=292648.46 rows=2829474) (actual time=0.172..681.830 rows=2844047 loops=1)

While we reduced the overall query execution time (the second number in actual time at the Nested loop node), the time to produce the first row has increased significantly. So the final query is not necessarily better but offers a meaningful trade-off.
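If the trade-off turns out to be unfavorable for a particular workload, the experiment is easy to revert: remove the USE INDEX hint from the query and drop the index we added earlier:

```sql
-- Revert the experiment: remove the index created for birth_date
DROP INDEX birth_day_index ON employee;
```

Comparing EXPLAIN ANALYZE output before and after such changes is a reliable way to decide which variant to keep.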

Conclusion

Query performance depends on many factors. One source of insight from the DBMS is the query execution plan. Plans can be retrieved with the EXPLAIN keyword, which outputs the plan along with cost and row estimates. By trying an alternative query or creating indexes, you can compare execution plans and choose the better approach. The EXPLAIN ANALYZE statement can be even more helpful, as it provides statistics from an actual query execution.
