The GROUP BY clause is ideal for breaking up aggregate functions into groups. However, it’s not so great for displaying multiple rows within each group that meet a given criteria. A common requirement is to fetch the top N rows of each category, for example, largest cities for each country. A quick Google search will affirm its popularity. Problem is, when one looks at typical forum responses, very few answers are satisfactory, or utilize non-MySQL vendor-specific solutions. For those reasons, I’ve compiled a couple of answers that do hit the mark. Hopefully they’ll benefit you as well.
Take a look at the following sample of city data, pulled from the World Database:
country_code city population id ________________________________________________________
ad Andorra la Vella 20430 7 ad Canillo 3292 21 ad Encamp 11224 33 ad La Massana 7211 50 ad Les Escaldes 15854 54
pa Achutupo 1753 2053433 pa Agua Buena 1069 2053450 pa Aguadulce 17238 2053470 pa Ailigand 1652 2053491 pa Alanje 1361 2053494
I wanted to produce a data set containing the Top 10 cities for the US and Canada based on population. Grouping by country_code is easy enough; all that’s required is an ORDER BY clause:
SELECT city, population, country_code FROM cities WHERE (country_code = 'us' OR country_code = 'ca') ORDER BY country_code, population DESC;
The problem with this statement is that it does nothing to limit the number of rows returned for each country code. On the other hand, the LIMIT clause would apply to the entire result set, thereby giving unfair preference to the first country – Canada.
Almost, but not Quite...
We can remedy the above issue by first selecting the largest cities, without regards to country, and limit those results to the top N rows. The outer SELECT then sorts by country_code and population (in descending order).
SELECT city, population, country_code FROM ( SELECT city, population, country_code FROM cities WHERE (country_code = 'us' OR country_code = 'ca') ORDER BY population DESC Limit 20 ) largest_cities ORDER BY country_code, population DESC; city population country_code ________________________________________________________ Toronto 4612187 ca Montreal 3268513 ca Vancouver 1837970 ca Calgary 968475 ca Ottawa 874433 ca Edmonton 822319 ca New York 8107916 us Los Angeles 3877129 us Chicago 2841952 us Houston 2027712 us Philadelphia 1453268 us Phoenix 1428509 us San Diego 1287050 us San Antonio 1256810 us Dallas 1211704 us San Jose 897460 us Detroit 884941 us Jacksonville 797557 us Indianapolis 773283 us Columbus 736836 us
While a step in the right direction, the drawback of these results is that they skew towards the more populous country – the US – with only six of twenty rows being allocated to Canada.
One Solution that Works: UNION with LIMIT
The issue that we’re having with LIMIT is that it applies to the entire result set. In that case, what if we were to amalgamate one or more result sets using a UNION? That would allow us to LIMIT each country’s cities to the Top N:
( SELECT city, population, country_code FROM cities WHERE country_code = 'ca' ORDER BY population DESC LIMIT 10 ) UNION ( SELECT city, population, country_code FROM cities WHERE country_code = 'us' ORDER BY population DESC LIMIT 10 );
Sure enough, we now have a list of each country’s top 10 cities according to population:
city population country_code ________________________________________________________ Toronto 4612187 ca Montreal 3268513 ca Vancouver 1837970 ca Calgary 968475 ca Ottawa 874433 ca Edmonton 822319 ca Hamilton 653637 ca Quebec 645623 ca Winnipeg 632069 ca Kitchener 409111 ca New York 8107916 us Los Angeles 3877129 us Chicago 2841952 us Houston 2027712 us Philadelphia 1453268 us Phoenix 1428509 us San Diego 1287050 us San Antonio 1256810 us Dallas 1211704 us San Jose 897460 us
Note that the limiting can either be symmetrical so that each city has the same number of rows, OR asymmetrical, so that different cities receive different numbers of rows, such as 10/20.
Of course, this is not a silver bullet. It doesn’t take much imagination to envision the nightmare of applying this statement to hundreds of cities! We need something that has built-in scalability so that it works with two or two hundred countries.
A Better Solution: Ranking
SQL Server has the ROW_NUMBER() function, which returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition:
ROW_NUMBER () OVER ( [ PARTITION BY value_expression , ... [ n ] ] order_by_clause ) D
Using the above function, we could assign a sequence number to each city so that we could select those that are ranked above a given number:
SELECT city, population, country_code FROM ( SELECT city, population, country_code, ROW_NUMBER() over(partition by countries.country_code order by countries.country_code, cities.population DESC) RowNum FROM countries INNER JOIN cities city on countries.country_code = cities.country_code ) tbl WHERE RowNum <= 10;
Now that’s IF we were using SQL Server. Currently MySQL does not support the ROW_NUMBER() function, but as a workaround we can use MySQL session variables. Session variables do not require declaration, and can be used in a query to perform calculations and to store intermediate results.
The following code is executed for each row. If @current_country is the same, we increment the @country_rank, otherwise, we re-initialize it to 1. For the first row @current_country is NULL, so the rank is set to 1 on the first iteration. The current country_code is then stored in the @current_country variable for the next evaluation:
@country_rank := IF(@current_country = country_code, @country_rank + 1, 1) @current_country := country_code
For correct ranking, we need to include the “ORDER BY country_code, population DESC” clause.
Once we’ve assigned a rank number to each city within its country, we can retrieve the required range of 1 to 10:
SELECT city, population, country_code FROM ( SELECT city, population, country_code, @country_rank := IF(@current_country = country_code, @country_rank + 1, 1 ) AS country_rank, @current_country := country_code FROM cities WHERE country_code = 'us' OR country_code = 'ca' ORDER BY country_code, population DESC ) ranked WHERE country_rank <= 10;
Like the previous statement, this solution also supports asymmetrical limiting by altering the WHERE clause:
WHERE country_code = 'ca' AND country_rank <= 10 OR country_code = 'us' AND country_rank <= 20;
Another benefit of this approach is that we can now retrieve cities by their ranking. For example, we can fetch the fifth largest city in Canada by setting the country_rank in the WHERE filter to 5:
WHERE country_code = 'ca' AND country_rank = 5
If there’s one lesson that I’ve learned from working with MySQL over the years it’s that when it comes to queries, if there’s a will, there’s usually a way. Do you have another solution to offer? By all means, put it in a comment. I’d love to see what you come up with!