The GROUP BY clause is ideal for breaking up aggregate functions into groups. However, it’s not so great for displaying multiple rows within each group that meet a given criteria. A common requirement is to fetch the top N rows of each category, for example, largest cities for each country. A quick Google search will affirm its popularity. Problem is, when one looks at typical forum responses, very few answers are satisfactory, or utilize non-MySQL vendor-specific solutions. For those reasons, I’ve compiled a couple of answers that do hit the mark. Hopefully they’ll benefit you as well.
The Problem
Take a look at the following sample of city data, pulled from the World Database:
country_code city population id
________________________________________________________
ad Andorra la Vella 20430 7
ad Canillo 3292 21
ad Encamp 11224 33
ad La Massana 7211 50
ad Les Escaldes 15854 54
//etc---------------------------------------------------
pa Achutupo 1753 2053433
pa Agua Buena 1069 2053450
pa Aguadulce 17238 2053470
pa Ailigand 1652 2053491
pa Alanje 1361 2053494
//etc...
I wanted to produce a data set containing the Top 10 cities for the US and Canada based on population. Grouping by country_code is easy enough; all that’s required is an ORDER BY clause:
SELECT city, population, country_code
FROM cities
WHERE (country_code = 'us' OR country_code = 'ca')
ORDER BY country_code, population DESC;
The problem with this statement is that it does nothing to limit the number of rows returned for each country code. On the other hand, the LIMIT clause would apply to the entire result set, thereby giving unfair preference to the first country – Canada.
Almost, but not Quite...
We can remedy the above issue by first selecting the largest cities, without regards to country, and limit those results to the top N rows. The outer SELECT then sorts by country_code and population (in descending order).
SELECT city, population, country_code
FROM
(
SELECT city, population, country_code
FROM cities
WHERE (country_code = 'us' OR country_code = 'ca')
ORDER BY population DESC
Limit 20
) largest_cities
ORDER BY country_code, population DESC;
city population country_code
________________________________________________________
Toronto 4612187 ca
Montreal 3268513 ca
Vancouver 1837970 ca
Calgary 968475 ca
Ottawa 874433 ca
Edmonton 822319 ca
New York 8107916 us
Los Angeles 3877129 us
Chicago 2841952 us
Houston 2027712 us
Philadelphia 1453268 us
Phoenix 1428509 us
San Diego 1287050 us
San Antonio 1256810 us
Dallas 1211704 us
San Jose 897460 us
Detroit 884941 us
Jacksonville 797557 us
Indianapolis 773283 us
Columbus 736836 us
While a step in the right direction, the drawback of these results is that they skew towards the more populous country – the US – with only six of twenty rows being allocated to Canada.
One Solution that Works: UNION with LIMIT
The issue that we’re having with LIMIT is that it applies to the entire result set. In that case, what if we were to amalgamate one or more result sets using a UNION? That would allow us to LIMIT each country’s cities to the Top N:
(
SELECT city, population, country_code
FROM cities
WHERE country_code = 'ca'
ORDER BY population DESC
LIMIT 10
)
UNION
(
SELECT city, population, country_code
FROM cities
WHERE country_code = 'us'
ORDER BY population DESC
LIMIT 10
);
Sure enough, we now have a list of each country’s top 10 cities according to population:
city population country_code
________________________________________________________
Toronto 4612187 ca
Montreal 3268513 ca
Vancouver 1837970 ca
Calgary 968475 ca
Ottawa 874433 ca
Edmonton 822319 ca
Hamilton 653637 ca
Quebec 645623 ca
Winnipeg 632069 ca
Kitchener 409111 ca
New York 8107916 us
Los Angeles 3877129 us
Chicago 2841952 us
Houston 2027712 us
Philadelphia 1453268 us
Phoenix 1428509 us
San Diego 1287050 us
San Antonio 1256810 us
Dallas 1211704 us
San Jose 897460 us
Note that the limiting can either be symmetrical so that each city has the same number of rows, OR asymmetrical, so that different cities receive different numbers of rows, such as 10/20.
Of course, this is not a silver bullet. It doesn’t take much imagination to envision the nightmare of applying this statement to hundreds of cities! We need something that has built-in scalability so that it works with two or two hundred countries.
A Better Solution: Ranking
SQL Server has the ROW_NUMBER() function, which returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition:
ROW_NUMBER ()
OVER ( [ PARTITION BY value_expression , ... [ n ] ] order_by_clause ) D
Using the above function, we could assign a sequence number to each city so that we could select those that are ranked above a given number:
SELECT city, population, country_code
FROM
(
SELECT city, population, country_code,
ROW_NUMBER() over(partition by countries.country_code
order by countries.country_code,
cities.population DESC) RowNum
FROM countries
INNER JOIN cities city on countries.country_code = cities.country_code
) tbl
WHERE RowNum <= 10;
Now that’s IF we were using SQL Server. Currently MySQL does not support the ROW_NUMBER() function, but as a workaround we can use MySQL session variables. Session variables do not require declaration, and can be used in a query to perform calculations and to store intermediate results.
The following code is executed for each row. If @current_country is the same, we increment the @country_rank, otherwise, we re-initialize it to 1. For the first row @current_country is NULL, so the rank is set to 1 on the first iteration. The current country_code is then stored in the @current_country variable for the next evaluation:
@country_rank := IF(@current_country = country_code, @country_rank + 1, 1)
@current_country := country_code
For correct ranking, we need to include the “ORDER BY country_code, population DESC” clause.
Once we’ve assigned a rank number to each city within its country, we can retrieve the required range of 1 to 10:
SELECT city, population, country_code
FROM
(
SELECT city, population, country_code,
@country_rank := IF(@current_country = country_code,
@country_rank + 1,
1
) AS country_rank,
@current_country := country_code
FROM cities
WHERE country_code = 'us' OR country_code = 'ca'
ORDER BY country_code, population DESC
) ranked
WHERE country_rank <= 10;
Like the previous statement, this solution also supports asymmetrical limiting by altering the WHERE clause:
WHERE country_code = 'ca' AND country_rank <= 10
OR country_code = 'us' AND country_rank <= 20;
Another benefit of this approach is that we can now retrieve cities by their ranking. For example, we can fetch the fifth largest city in Canada by setting the country_rank in the WHERE filter to 5:
WHERE country_code = 'ca' AND country_rank = 5
Conclusion
If there’s one lesson that I’ve learned from working with MySQL over the years it’s that when it comes to queries, if there’s a will, there’s usually a way. Do you have another solution to offer? By all means, put it in a comment. I’d love to see what you come up with!