One of MySQL 5.6's many optimizations apply to the processing of queries that contain subqueries. These involve transforming a subquery into a semi-join operation, and then treating it like just another join operation for evaluation. The optimizer can recognize that an IN clause requires the subquery to return only one value from the subquery table. In those cases, the query is executed as a semi-join, which is an operation that returns only one instance of each row in subquery that is matched by rows in the outer one. In today's article, we're going to take a closer look at the strategies utilized by the MySQL 5.6 optimizer.
A Subquery Refresher
A subquery is one that follows the IN or equals (=) operators to join the results of an inner query with the outer one. For instance, while all three of the following queries will produce the same records, many database professionals would consider the last one to be more natural in terms of language:
1A - Without Subquery:
SELECT DISTINCT n.* FROM nation as n, region AS r WHERE n.region_key = r.region_key;
SELECT nation.* FROM nation INNER JOIN region WHERE nation.region_key = region.region_key;
2 - Using Subquery:
SELECT * FROM nation WHERE region_key IN (SELECT region_key FROM region WHERE name = 'North America');
In the first query, the use of SELECT DISTINCT is also inefficient because it generates all matching rows first only to eliminate duplicates later.
Semi-Join Transformation Execution Strategies
As mentioned above, the optimizer recognizes that the IN clause requires the subquery to return only one instance of each region_key from the region table. This results in MySQL executing the SELECT statement as a semi-join, so that only one instance of each region is matched to rows in the nation table.
There are two important differences between a semi-join and a regular join:
- In a semi-join, the inner tables do not cause duplicates in the result.
- No columns from the inner tables are added to the result of the operation.
Hence, the result of the semi-join is always a subset of the rows from the outer tables. Effectively, the optimization of semi-joins is about efficiently eliminating duplicates from the inner tables.
MySQL implements four different semi-join execution strategies in order to remove duplicates:
- Convert the subquery to a join, or use table pullout and run the query as an inner join between subquery tables and outer tables. Table pullout pulls a table out from the subquery to the outer query.
- Duplicate Weedout: Run the semi-join as if it was a join and remove duplicate records using a temporary table.
- FirstMatch: When scanning the inner tables for row combinations and there are multiple instances of a given value group, choose one rather than returning them all. This provides an early exit from table scanning and eliminates production of unnecessary rows.
- LooseScan: Scan a subquery table using an index that enables a single value to be chosen from each subquery's value group.
- Materialization: Store the results of the subquery into a temporary table with an index and use the temporary table to perform a join. The index is then used to remove duplicates as well as for lookups when joining the temporary table with the outer tables.
Convert the Subquery to a Join, or use Table Pullout
One common bit of advice for optimizing MySQL has been "If possible, rewrite your subqueries as joins". Now you don't have to because the optimizer does it for us any time that columns in the subquery are related to the main SELECT and contain unique values. The optimizer will try to pull out the parent table for each column in the subquery that satisfy these requirements; those who do not may have to remain as a subquery.
In the Duplicate Weedout strategy, MySQL runs the semi-join as if it were a regular inner join, and then eliminates the duplicate records using a temporary table. While this strategy requires performing lookups in the temporary table, they are typically significantly cheaper than those of regular tables because the temporary table is in-memory.
The semi-join FirstMatch strategy executes a subquery in very similar fashion to how the IN-TO-EXISTS strategy from earlier versions of MySQL. For each matching row in the outer table, MySQL checks for a match in the inner table. When a match is found, it returns the row from the outer table. Only when no matches are found does the engine fall back to scanning the entire inner table.
The LooseScan strategy employs grouping where the column in the subquery is an index and the outer SELECT could match against many inner SELECT rows. This has the effect of grouping rows by that index. For instance, the following query lists all of the Countries that have records in the city table (hopefully all do!):
select * from Country where Country.code in (select country_code from City)
All cities of each country would be grouped together, so that none are mixed with cities of other countries. That makes it easy for the query engine to select the first city from each group, which is then joined with its country to produce a list of unique countries.
This strategy is used for a subquery that is uncorrelated, that is, which can be run independently of the main query. Materialization populates a temporary table with possible values of the subquery values, and then performs a join with those of the outer query.
The join can be done in two directions, either from the materialized table to the outer query values or from the outer query values to the materialized table.
The first direction is called a Materialization-scan and involves doing a full scan on the materialized table. The latter is called a Materialization-lookup because the quickest way to find a match in the materialized table is to perform a lookup on its primary key.
This article was only meant to provide a general overview of the MySQL 5.6 Subquery Optimizations. For more detailed information on each strategy, visit the following pages:
In today's article, we explored the strategies utilized by the MySQL 5.6 optimizer in the processing of queries that contain subqueries. In all of the strategies discussed, SELECT statements containing subqueries are executed as a semi-join, so that the fewest instances possible of each row in the subquery is returned. Doing so minimizes table scanning and facilitates the matching of the inner SELECT records to those of the outer one.
Rob Gravelle resides in Ottawa, Canada, and is the founder of GravelleConsulting.com. Rob has built systems for Intelligence-related organizations such as Canada Border Services, CSIS as well as for numerous commercial businesses. Email Rob to receive a free estimate on your software project. Should you hire Rob and his firm, you'll receive 15% off for mentioning that you heard about it here!
In his spare time, Rob has become an accomplished guitar player, and has released several CDs. His former band, Ivory Knight, was rated as one Canada's top hard rock and metal groups by Brave Words magazine (issue #92).