MySQL Subqueries

Monday Nov 15th 2004 by Ian Gilfillan
Share:

Until now, MySQL has not supported subqueries, and this lack caused many to write off MySQL as not being a serious DBMS. While the lack was certainly a problem, many developers do not know that subqueries can often be rewritten as a join, sometimes giving a performance benefit in the process. This month Ian Gilfillan looks at subqueries, and how they can be rewritten in a more optimal way.

MYSQL 4.1 has been released as a production version, and with all the new features it is likely a new generation of developers will soon be using MySQL. This month I revisit an old area of contention - subqueries. Until now, MySQL has not supported subqueries, and this lack caused many to write off MySQL as not being a serious DBMS. While the lack was certainly a problem, many developers do not know that subqueries can often be rewritten as a join, sometimes giving a performance benefit in the process. This month I look at subqueries, and how they can be rewritten in a more optimal way.

Rewriting subqueries as joins

First, let's create some sample tables and data:


Click for full CREATE TABLE code

Rewriting subqueries as Inner Joins

Let's look at a simple subquery to return a list of all animals that have been assigned food:

mysql> SELECT name 
  FROM animal 
  WHERE id 
  IN (SELECT animal_id FROM animal_food);
+---------------------+
| name                |
+---------------------+
| Chacma Baboon       |
| Small Grey Mongoose |
+---------------------+
2 rows in set (0.00 sec)

Now, the same query as an inner join.

mysql> SELECT DISTINCT name 
  FROM animal,animal_food 
  WHERE animal.id=animal_id;
+---------------------+
| name                |
+---------------------+
| Chacma Baboon       |
| Small Grey Mongoose |
+---------------------+
2 rows in set (0.00 sec)

Note the DISTINCT keyword. Without it, we will get duplicate records, one animal returned for each record in the animal_food table:

mysql> SELECT name FROM animal,animal_food WHERE animal.id=animal_id;
+---------------------+
| name                |
+---------------------+
| Chacma Baboon       |
| Chacma Baboon       |
| Small Grey Mongoose |
| Small Grey Mongoose |
+---------------------+
4 rows in set (0.01 sec)

Which query is better? Some may argue that the former query is more readable, but I think that to anyone vaguely competent in SQL they should both be clear. However, one is definitely more efficient than the other is. Using EXPLAIN, let's see which. First, the join:


Click for full EXPLAIN SELECT DISTINCT code

MySQL would have to examine 12 rows to return the result set (4x3, as you multiply the result - for more on EXPLAIN see Optimizing MySQL: Queries and Indexes. Now the second query:


Click for full EXPLAIN SELECT code

This time MySQL has to examine 16 rows, a marginal difference in such a small example, but this difference can add up in larger datasets.

Rewriting subqueries as Outer Joins

A common kind of query is finding all records that do not have an associated record in another table--in this case, finding all animals that do not have an associated food. Here is how you would do it with a subquery:

 
mysql> SELECT name FROM animal WHERE id NOT IN (SELECT animal_id FROM animal_food);
+-------------+
| name        |
+-------------+
| Cape Gerbil |
| Porcupine   |
+-------------+
2 rows in set (0.00 sec)

Writing this sort of query without a subquery often taxes novice MySQL developers, but it is actually quite easy, simply requiring a LEFT JOIN and an IS NULL clause:

mysql> SELECT name FROM animal LEFT JOIN  animal_food ON animal.id=animal_id WHERE animal_id IS NULL;
+-------------+
| name        |
+-------------+
| Cape Gerbil |
| Porcupine   |
+-------------+
2 rows in set (0.00 sec)

This time the former query is certainly easier to understand, but does the extra complexity of the second query deliver a performance gain?


Click for full EXPLAIN SELECT (2) code

I can hear some sighs of relief - the subquery in this case is as efficient as the outer join. Outer joins can often deliver performance gains though, so benchmark your queries. Note that an alternative way to write the same subquery is as follows (this is as efficient in this case)

mysql> SELECT name FROM animal 
  WHERE NOT EXISTS (SELECT animal_id FROM animal_food WHERE animal.id=animal_id);
+-------------+
| name        |
+-------------+
| Cape Gerbil |
| Porcupine   |
+-------------+

Subqueries that can't be rewritten as joins

Standard SQL does not support joins in DELETE statements. If you wanted to delete all records in animal that did not contain an associated food, you would be able to do it quite simply with a subquery, as follows:

mysql> DELETE FROM animal 
 WHERE id NOT IN (SELECT animal_id 
   FROM animal_food 
   WHERE animal_id=animal.id);
Query OK, 2 rows affected (0.00 sec)

Only the two records remain:

mysql> SELECT * FROM animal;
+----+---------------------+
| id | name                |
+----+---------------------+
|  2 | Chacma Baboon       |
|  4 | Small Grey Mongoose |
+----+---------------------+

However, the subquery inside a DELETE statement syntax is not supported in MySQL 4.0. To achieve the same result, you would need to pass the burden to your application. You would first have to SELECT the records you want to delete, then place the results into a variable, and use this variable to create a DELETE statement. Your application logic would be as follows:

SELECT id FROM animal 
  WHERE NOT EXISTS (SELECT animal_id 
    FROM animal_food 
	WHERE animal.id=animal_id);
 

Create a string variable called $results containing the comma-separated results from the former query

 
DELETE FROM animal WHERE ID IN ($results)

However, it is not only DELETE statement subqueries that cannot be rewritten. Some SELECT subqueries are also impossible to rewrite. Here is another example:

mysql> SELECT name FROM animal WHERE id=(SELECT MAX(id) FROM animal_food);
+---------------------+
| name                |
+---------------------+
| Small Grey Mongoose |
+---------------------+

The MAX() function prevents this from being written as a join. MySQL 4.1 is a necessity for doing these sorts of queries.

Can subqueries be better?

Let's look at a more complex example:

mysql> SELECT animal.id,name FROM animal 
  WHERE animal.id IN (1,2,3) AND EXISTS (SELECT NULL FROM animal_food 
  WHERE animal_food.id>2 AND animal.id=animal_food.id ) ORDER BY name;
+----+-----------+
| id | name      |
+----+-----------+
|  3 | Porcupine |
+----+-----------+

This can also be rewritten as a join, as follows:

mysql> SELECT DISTINCT animal.id,name FROM animal,animal_food 
  WHERE animal.id IN (1,2,3) AND animal_food.id > 2 AND animal.id=animal_food.id ORDER BY name;
+----+-----------+
| id | name      |
+----+-----------+
|  3 | Porcupine |
+----+-----------+

Which is better? EXPLAIN shows us the following:


Click for full EXPLAIN SELECT(3) code

With a subquery, MySQL has to examine 4 rows. In addition, it does not use an index. Without the subquery, MySQL needs to examine 3 rows and uses an index. However, it also uses a temporary table, normally an indication of a sub-optimal query. The question of which is better is then quite complex, and highly dependant on the dataset, the tuning parameters (and the DBMS - do not assume MySQL optimizes the same way as other DBMS'). Does the temporary table have to be created on the disk or in memory? How large is the table that is going to be scanned without an index? I suggest doing some thorough benchmarking in your environment to see which suits you best.

Uncorrelated and Correlated subqueries

Let's further complicate matters, and rewrite the correlated subquery as an uncorrelated subquery. The main difference between the two kinds is that, in a correlated subquery, the inner query is dependant upon the outer query, while, with an uncorrelated query, the inner query does not depend upon the outer query. Instead, it runs just once. Whether this is better or not again depends on your data set. If the number of rows returned by the inner part of the uncorrelated query is relatively large, it can counteract the benefit of only running once. The inner part of the correlated query needs to be performed a number of times - again the efficiency of this depends on your situation. The advice of some older texts to always use a correlated query is flawed. Here is the query as an uncorrelated query:

mysql> SELECT animal.id,name FROM animal 
  WHERE animal.id IN (1,2,3) AND animal.id 
  IN (SELECT animal_id FROM animal_food WHERE animal_food.id>2 ) ORDER BY name;
+----+---------------+
| id | name          |
+----+---------------+
|  2 | Chacma Baboon |
+----+---------------+

And using EXPLAIN:


Click for full EXPLAIN SELECT (4) code

Sixteen rows need to be examined. Even though there is no temporary table, in our limited data set this is not the most efficient query. Nevertheless, results will vary. There is no easy answer as to which is best. In this case "it depends," "sometimes one, sometimes the other" really is the best answer! So benchmark your results in your environment, and you could see significant improvements. Good luck!

» See All Articles by Columnist Ian Gilfillan

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved