SQLAntipatterns: Avoiding the Pitfalls of Database Programming

Wednesday Jul 21st 2010 by DatabaseJournal.com Staff
Share:

Learn how you too can stamp out common database errors and learn a lot about relational databases along the way. 'Ambiguous Groups' is extracted from 'SQL Antipatterns', published by the Pragmatic Bookshelf.

Learn how you too can stamp out common database errors from an SQL Expert's playbook and learn a lot about relational databases along the way. 'Ambiguous Groups' is extracted from 'SQL Antipatterns', published by the Pragmatic Bookshelf.

Title: SQLAntipatterns: Avoiding the Pitfalls of Database Programming
Author: Bill Karwin
Publisher: The Pragmatic Bookshelf
ISBN-10: 1-934356-55-7
ISBN-13: 978-1-934356-55-5
Printed on acid-free paper.
P1.0 printing, May 2010
Version: 2010-7-8

Ambiguous Groups

Suppose your boss needs to know which projects in the bugs database are still active and which projects have been abandoned. One report he asks you to generate is the latest bug reported per product. You write a query using the MySQL database to calculate the greatest value in the date_reported column per group of bugs sharing a given product_id. The report looks like this:

product_name

date_reported

bug_id

 

Open RoundFile

2010-06-01

1234

 

Visual TurboBuilder

2010-02-16

3456

 

ReConsider

2010-01-01

5678

 

Your boss is a detail-oriented person, and he spends some time looking up each bug listed in the report. He notices that the row listed as the most recent for “Open RoundFile” shows a bug_id that isn’t the latest bug. The full data shows the discrepancy:

product_name

date_reported

bug_id

 

Open RoundFile

2009-12-19

1234

This bug_id...

Open RoundFile

2010-06-01

2248

doesn’t match this date

Visual TurboBuilder

2010-02-16

3456

 

Visual TurboBuilder

2010-02-10

4077

 

Visual TurboBuilder

2010-02-16

5150

 

ReConsider

2010-01-01

5678

 

ReConsider

2009-11-09

8063

 

How can you explain this problem? Why does it affect one product but not the others? How can you get the desired report?

Objective: Get Row with Greatest Value per Group

Most programmers who learn SQL get to the stage of using GROUP BY in a query, applying some aggregate function to groups of rows, and getting a result with one row per group. This is a powerful feature that makes it easy to get a wide variety of complex reports using relatively little code.

For example, a query to get the latest bug reported for each product in the bugs database looks like this:


SELECT product_id, MAX(date_reported) AS latest 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

A natural extension to this query is to request the ID of the specific bug with the latest date reported:


SELECT product_id, MAX(date_reported) AS latest, bug_id 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

However, this query results in either an error or an unreliable answer. This is a common source of confusion for programmers using SQL. The objective is to run a query that not only reports the greatest value in a group (or the least value or the average value) but also includes other attributes of the row where that value is found.

Antipattern: Reference Nongrouped Columns

The root cause of this antipattern is simple, and it reveals a common misconception that many programmers have about how grouping queries work in SQL.

The Single-Value Rule

The rows in each group are those rows with the same value in the column or columns you name after GROUP BY. For example, in the following query, there is one row group for each distinct value in product_id.



SELECT product_id, MAX(date_reported) AS latest 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

Every column in the select-list of a query must have a single value row per row group. This is called the Single-Value Rule. Columns named in the GROUP BY clause are guaranteed to be exactly one value per group, no matter how many rows the group matches.

The MAX( ) expression is also guaranteed to result in a single value for each group: the highest value found in the argument of MAX( ) over all the rows in the group.

However, the database server can’t be so sure about any other column named in the select-list. It can’t always guarantee that the same value occurs on every row in a group for those other columns.

Text Box: GROUP BY and DISTINCT
SQL supports a query modifier called DISTINCT that reduces the rows of the query result so that every row is unique. For example, the following query reports who reported bugs and which days they reported bugs, but only one row per date and person:
SELECT DISTINCT date_reported, reported_by FROM Bugs;
A grouping query can achieve the same result by omitting any aggregate function. The query result is reduced to one row for each distinct pair of values in the column named in the GROUP BY clause:
SELECT date_reported, reported_by FROM Bugs GROUP BY date_reported, reported_by;
Both queries produce the same result and should be optimized and executed similarly, so the difference in this example is only a matter of preference.

SELECT product_id, MAX(date_reported) AS latest, bug_id 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

In this example, there are many distinct values for bug_id for a given product_id, because the BugsProducts table associates multiple bugs to a given product. In a grouping query that reduces to a single row per product, there’s no way to represent all the values of bug_id.

Since there is no guarantee of a single value per group in the “extra” columns, the database assumes that they violate the Single-Value Rule. Most brands of database report an error if you try to run any query that tries to return a column other than those columns named in the GROUP BY clause or as arguments to aggregate functions.

MySQL and SQLite have different behavior from other brands of database, which we’ll explore in Legitimate Uses of the Antipattern.

Do-What-I-Mean Queries

The common misconception that programmers have is that SQL can guess which bug_id you want in the report, based on the fact that MAX( ) is used in another column. Most people assume that if the query fetches the greatest value, then other columns named will naturally take their value from the same row where that greatest value occurs.

Unfortunately, SQL can’t make this inference in several cases:

  • If two bugs have the exact same value for date_reported and that is the greatest value in the group, which value of bug_id should the query report?
  • If you query for two different aggregate functions, for example MAX( ) and MIN( ), these probably correspond to two different rows in the group. Which bug_id should the query return for this group?

SELECT product_id, MAX(date_reported) AS latest,
   MIN(date_reported) AS earliest, bug_id 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;
  • If none of the rows in the table matches the value returned by the aggregate function, what is the value of bug_id? This is commonly true for the functions AVG( ), COUNT( ), and SUM( ).

SELECT product_id, SUM(hours) AS total_project_estimate, bug_id 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

These are examples of why the Single-Value Rule is important. Not every query that fails to follow this rule would produce an ambiguous result, but many do. It would be clever if the database could tell an ambiguous query from an unambiguous one and produce an error only when the data contains ambiguity. But that would not be good for application reliability; it would mean that the same query might be valid or invalid, depending on the state of data.

How to Recognize the Antipattern

In most brands of database, writing a query that violates the Single-Value Rule should elicit an error immediately as you prepare the query.

The following are examples of error messages given by some brands of database:

  • Firebird 2.1:
    Invalid expression in the select list (not contained in either 
    an aggregate function or the GROUP BY clause)
    
  • IBM DB2 9.5:
    
    An expression starting with "BUG_ID" specified in a SELECT 
    clause, HAVING clause, or ORDER BY clause is not specified in 
    the GROUP BY clause or it is in a SELECT clause, HAVING clause, 
    or ORDER BY clause with a column function and no GROUP BY clause
    is specified.
  • Microsoft SQL Server 2008:
    Column 'Bugs.bug_id' is invalid in the
    select list because it is not contained in either an aggregate function or the
    GROUP BY clause.
    
  • MySQL 5.1, after setting the ONLY_FULL_GROUP SQL mode to disallow ambiguous queries.
    'bugs.b.bug_id' isn't in GROUP BY
  • Oracle 10.2:
    not a GROUP BY expression
  • PostgreSQL 8.3:
    column "bp.bug_id" must
    appear in the GROUP BY clause or be used in an aggregate function

In SQLite and in MySQL, ambiguous columns may contain unexpected and unreliable values. In MySQL, the value returned is from the first row in the group, where first corresponds to physical storage. SQLite gives the opposite result: the value is from the last row in the group. In both cases, the behavior is not documented, and these databases aren’t obligated to work the same in future versions. It’s your responsibility to notice these cases and to design your queries to avoid ambiguity.

Legitimate Uses of the Antipattern

As we’ve seen, MySQL and SQLite can’t guarantee a reliable result for a column that doesn’t fit the Single-Value Rule. There are cases when you can take advantage of the fact that these databases enforce the rule less strictly than other brands.


SELECT b.reported_by, a.account_name 
FROM Bugs b JOIN Accounts a ON (b.reported_by = a.account_id) 
GROUP BY b.reported_by;

In the previous query, the account_name column technically violates the Single-Value Rule, since it’s named neither in the GROUP BY clause nor in an aggregate function. Nevertheless, there is only one value possible for account_name in each group; the groups are based on Bugs.reported_by, which is a foreign key to the Accounts table. Therefore, the groups correspond one-to-one with rows in the Accounts table.

In other words, if you know the value of reported_by, then you know the value of account_name unambiguously, like if you had queried by the primary key of the Accounts table.

This kind of unambiguous relationship is called a functional dependency. The most common example of this is between the primary key of a table and the table’s attributes: account_name is a functional dependency of its primary key, account_id. If you group a query by a table’s primary key column(s), then the groups correspond to a single row of that table, and therefore all other columns of the same table must have a single value per group.

Bugs.reported_by has a similar relationship with the dependent attributes of the Accounts table, because it references the primary key of the Accounts table. When the query groups by the reported_by column, which is a foreign key, the attributes of the Accounts table are functionally dependent, and the query result contains no ambiguity.

However, most brands of database still return an error. Not only is this the behavior required by the SQL standard, but it’s not too expensive to figure out functional dependencies on the fly.[1] But if you use MySQL or SQLite and you’re careful to query only functionally dependent columns, you can use this kind of grouping query and still avoid problems of ambiguity.

Solution: Use Columns Unambiguously

The sections that follow describe several ways you can resolve this antipattern and write unambiguous queries.

Query Only Functionally Dependent Columns

The most straightforward solution is to eliminate ambiguous columns from the query.


SELECT product_id, MAX(date_reported) AS latest 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

The query reveals the date of the latest bug per product, even though it doesn’t report the bug_id corresponding to that latest bug. Sometimes this is enough, so don’t overlook a simple solution.

Using a Correlated Subquery

A correlated subquery contains a reference to the outer query and so produces different results for each row of the outer query. We can use this to find the latest bug per product by running a subquery to search for bugs with the same product and a greater date. When the subquery finds none, the bug in the outer query is the latest.


SELECT bp1.product_id, b1.date_reported AS latest, b1.bug_id 
FROM Bugs b1 JOIN BugsProducts bp1 USING (bug_id) 
WHERE NOT EXISTS
   (SELECT * FROM Bugs b2 JOIN BugsProducts bp2 USING (bug_id)
    WHERE bp1.product_id = bp2.product_id      
	  AND b1.date_reported < b2.date_reported);

Use this solution as a simple solution that is readable and easy to code. However, keep in mind that this solution isn’t likely to be the best for performance, because correlated subqueries are executed once for each row of the outer query.

Using a Derived Table

You can use a subquery as a derived table, producing an interim result that contains only the product_id and the corresponding greatest bug report date for each product. Then use this result to join against the tables so that the query result contains only bugs with the latest date per product.


SELECT m.product_id, m.latest, b1.bug_id 
FROM Bugs b1 JOIN BugsProducts bp1 USING (bug_id) 
JOIN (SELECT bp2.product_id, MAX(b2.date_reported) AS latest
       FROM Bugs b2 JOIN BugsProducts bp2 USING (bug_id)
	    GROUP BY bp2.product_id) m
   ON (bp1.product_id = m.product_id
       AND b1.date_reported = m.latest);

product_id

latest

bug_id

1

2010-06-01

2248

2

2010-02-16

3456

2

2010-02-16

5150

3

2010-01-01

5678

Notice that you can get multiple rows per product if the latest date returned by the subquery matches multiple rows. If you need to ensure a single row per product_id, you can use another grouping function in the outer query:


SELECT m.product_id, m.latest, MAX(b1.bug_id) AS latest_bug_id 
FROM Bugs b1  
JOIN (SELECT product_id, MAX(date_reported) AS latest
       FROM Bugs b2 JOIN BugsProducts USING (bug_id)       
	   GROUP BY product_id) m
   ON (b1.date_reported = m.latest) 
GROUP BY m.product_id, m.latest;

product_id

latest

bug_id

1

2010-06-01

2248

2

2010-02-16

5150

3

2010-01-01

5678

Use the derived table solution as a more scalable alternative to the correlated subquery. The derived table is noncorrelated, so most database brands should be able to execute the subquery once. However, the database must store the interim result set in a temporary table, so this still isn’t the best for performance.

Using a JOIN

You can create a join that tries to match against a set of rows that may not exist. This type of join is called an outer join. Where the matching rows don’t exist, null is used for all columns in that nonexistent row. So, where the query finds null, we know no such row was found.


SELECT bp1.product_id, b1.date_reported AS latest, b1.bug_id 
FROM Bugs b1 JOIN BugsProducts bp1 ON (b1.bug_id = bp1.bug_id) 
LEFT OUTER JOIN (Bugs AS b2 JOIN BugsProducts AS bp2 ON
     (b2.bug_id = bp2.bug_id))
   ON (bp1.product_id = bp2.product_id AND
      (b1.date_reported < b2.date_reported OR      
	  b1.date_reported = b2.date_reported
          AND b1.bug_id < b2.bug_id)) 
WHERE b2.bug_id IS NULL;

product_id

latest

bug_id

1

2010-06-01

2248

2

2010-02-16

5150

3

2010-01-01

5678

It takes a few minutes of gazing at this query, and perhaps some doodles on notepaper, for most people to see how it works. But once you do, this technique can be an important tool.

Use the JOIN solution when the scalability of the query over large sets of data is important. Although it’s a tougher concept to grasp and therefore more difficult to maintain, it often scales better than a subquery-based solution. Remember to measure the performance of several query forms, instead of assuming that one performs better than the other.

Using an Aggregate Function for Extra Columns

You can make the extra column comply with the Single-Value Rule by applying another aggregate function to it.


SELECT product_id, MAX(date_reported) AS latest,
   MAX(bug_id) AS latest_bug_id 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

Use this solution only when you can rely on the latest bug_id being the bug with the latest date, in other words, if bugs are guaranteed to be reported in chronological order.

Concatenating All Values per Group

Finally, you can use another aggregate function on bug_id to avoid violating the Single-Value Rule. MySQL and SQLite support a function GROUP_CONCAT( ) that concatenates all the values in the group into one value. By default, this is a comma-separated string.


SELECT product_id, MAX(date_reported) AS latest,
   GROUP_CONCAT(bug_id) AS bug_id_list 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

product_id

latest

bug_id

1

2010-06-01

1234, 2248

2

2010-02-16

3456, 4077, 5150

3

2010-01-01

5678, 8063

This query doesn’t reveal which bug_id corresponds to the latest date; the bug_id_list includes all bug_id values in each group.

Another disadvantage of this solution is that it isn’t standard SQL. and other brands of database don’t support this function. Some brands of database support custom functions and custom aggregate functions. For example, here’s the solution for PostgreSQL:


CREATE AGGREGATE GROUP_ARRAY (
   BASETYPE = ANYELEMENT,   
   SFUNC = ARRAY_APPEND,   
   STYPE = ANYARRAY,   INITCOND = '{}' 
);

SELECT product_id, MAX(date_reported) AS latest,
   ARRAY_TO_STRING(GROUP_ARRAY(bug_id), ',' ) AS bug_id_list, 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

Some other brands of database don’t support custom functions, so the solution may require writing a stored procedure to loop over an nongrouped query result, concatenating values manually.

Use this solution when you expect the extra column to have a single value per group but the column still violates the Single-Value Rule.

Title: SQLAntipatterns: Avoiding the Pitfalls of Database Programming
Author: Bill Karwin
Publisher: The Pragmatic Bookshelf
ISBN-10: 1-934356-55-7
ISBN-13: 978-1-934356-55-5
Printed on acid-free paper.
P1.0 printing, May 2010
Version: 2010-7-8

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved