Deleting Duplicate Rows in a MySQL Database

Tuesday May 6th 2003 by Ian Gilfillan
Share:

Your beautifully designed application is behaving oddly; you discover your database table contains duplicate rows... and now you have to delete them. Ian Gilfillan explains why there are duplicate rows, how to fix the cause of the duplicates and finally, how to delete them.

The problem

Something is not right. Your beautifully designed application is behaving oddly. After some investigation, you discover a problem in the code logic, and realise the database table contains duplicate rows. And now you have to delete them. Before I show you how to do this in MySQL, take a step back. Why does your table contain duplicates? The answer in the vast majority of cases is that your table has been badly designed. There is no one else to blame but the database designer. So before you fix the symptom (and I know it's urgent and I'm getting there), you need to think about fixing the cause of the problem. All tables should have a unique primary key. Not having one is an exception that is usually not wise. I suggest you read some more about the topic of database normalization, and follow the steps to correct the structure of your database. A normalized database almost never has this kind of data integrity problems. All too many books and tutorials dive straight into providing novices with the tools needed to create database tables without teaching the fundamentals of good database design. However, I'll get off my soapbox now and get to the topic at hand. What to do about it when it happens to you. First, let's create our sample badly-designed table, and populate it with some data.

CREATE TABLE bad_table (
id INT NOT NULL,
name VARCHAR(20) NOT NULL 
);

Notice the table has no primary key. Alarm bells should be ringing. You should either have defined the id field as a PRIMARY KEY (perhaps even as an AUTO_INCREMENT as well), or simply as UNIQUE. Here is a suggestion for how the table should have looked:

CREATE TABLE good_table (
id INT NOT NULL,
name VARCHAR(20) NOT NULL,
PRIMARY KEY (id) 
);

Back to the bad table - let's insert some values, with duplicates:

INSERT INTO bad_table(id,name) VALUES 
(1,'Things Fall Apart'),
(1,'Things Fall Apart'),
(2,'The Famished Road');

Deleting a single record when duplicates exist

I hope that you would not have done such a thing using raw SQL, and can at least blame the application, but nevertheless, you're stuck with duplicate records and need to remove them. In some cases there may only be one duplicate, as in the sample data we have created. In this case, the simplest way is just to delete that one record. Many people do not know about a handy way to do this. Of course the statement:

DELETE FROM bad_table WHERE id=1;

would remove both records with an id of 1, and leave you having to INSERT the record again. But, do you remember the LIMIT clause? It is not only useful in SELECT statements, but in any kind of statement too, including DELETEs. Therefore, the most efficient way to solve the problem would be to use the following:

DELETE FROM bad_table WHERE id=1 LIMIT 1;

You can see now that you are still left with two records:

mysql> SELECT * FROM bad_table;
+----+-------------------+
| id | name              |
+----+-------------------+
|  1 | Things Fall Apart |
|  2 | The Famished Road |
+----+-------------------+

Simple and efficient.

Deleting duplicate records when multiple duplicates exist

Of course, not all of you are going to be so lucky as to only have one duplicate, and manually removing them one by one is hardly an answer. First add some more duplicates to the table, and then we will see what we can do:

INSERT INTO bad_table(id,name) VALUES
(1,'Things Fall Apart'),
(2,'The Famished Road'),
(3,'Thirteen cents'),
(3,'Thirteen cents');

Some of you may have read the article on Database Journal about removing duplicates, specifically for SQL Server databases (if you've stumbled across this article as a SQL Server user, you can find it here. Unfortunately none of the techniques mentioned in that article work with MySQL. You cannot use cursors, correlated subqueries or derived tables. But you can use temporary tables, although the way they work in MySQL is different to that which is described in that article. So, here's how you do it - first create a temporary table with the same structure as your existing table (MySQL 4.1 has a great feature - the LIKE keyword that allows you to create a new table based upon the structure of an existing table, but since it's only an alpha release most of you won't be running that yet).

CREATE TEMPORARY TABLE 
bad_temp(id INT,name VARCHAR(20))
TYPE=HEAP;

Note that a temporary table only exists for as long as the connection. Therefore, if you are using a web application such as PHPMyAdmin to test these SQL statements, you'll need to bundle the above statement with the next few, to ensure the temporary table is available all of the time (each web page in a browser-based tool such as PHPMyAdmin uses its own connection). Also, see the cautionary after the example for some things you need to be aware of when using this technique on a production server.

INSERT INTO bad_temp(id,name) SELECT DISTINCT id,name FROM bad_table;

[[Authors note: A reader has pointed out that I went out about this in a rather long-winded way. You can just as easily have used a single statement:

CREATE TEMPORARY TABLE bad_temp AS SELECT DISTINCT * FROM bad_table
Thanks to P-A Fredriksson for pointing this out!]]

Now the temporary table contains a set of data with no duplicates. The DISTINCT keyword takes care of that. Now, delete all of the data from the original (and read the cautionary below first when trying this on a production server!), and populate it from the temporary table, as follows:

DELETE FROM bad_table;

INSERT INTO bad_table(id,name) SELECT id,name FROM bad_temp;

Care when using Temporary Tables

Be careful when using temporary tables in this way. With our tiny data set, there is not much of a problem, but huge tables could cause problems with memory. Temporary tables can be of type HEAP (as in the example), MyISAM, Merge or InnoDB (or even the old ISAM format). HEAP tables are placed in memory, in which case you need to make sure you have enough memory available to store the entire table, which may not always be the case with large production tables. Other table types are placed on disk, in which case you may as well not make the table temporary. There iss always the risk of losing your connection at the vital point, just after you've deleted the original data, and therefore losing your temporary table as well (and assuming you weren't using transactions, the changes are permanent). Simply remove the word TEMPORARY above, and you create a duplicate 'permanent' table. Of course, you'll want to delete it afterwards. Here is the full set of statements to achieve this:


## Necessary if you're using the same connection as the previous example
DROP TABLE bad_temp;

CREATE TABLE 
bad_temp(id INT, name VARCHAR(20));

INSERT INTO bad_temp(id,name) SELECT DISTINCT id,name FROM bad_table;

DELETE FROM bad_table;

INSERT INTO bad_table(id,name) SELECT id,name FROM bad_temp;

DROP TABLE bad_temp;

Duplicates with Unique Keys

The example above assumed the entire record was a duplicate. But there's another kind of duplicate, where there is a unique key, perhaps with an auto_increment field, yet the data itself is duplicated. As far as MySQL is concerned, there are no duplicates, yet logically there are. Let's look at an example, and add some more test data:

CREATE TABLE bad_table2 (
id INT NOT NULL UNIQUE AUTO_INCREMENT,
name VARCHAR(20) NOT NULL 
);

INSERT INTO bad_table2(id,name) VALUES
(1,'Things Fall Apart'),
(2,'Things Fall Apart'),
(3,'The Famished Road'),
(4,'Things Fall Apart'),
(5,'The Famished Road'),
(6,'Thirteen cents'),
(7,'Thirteen cents');

Here you need to make a decision. Is the id field of any value to you here? If it is, (because you've got links to other tables, with, for example, both id 1 and id 2, then you've got quite a nightmare on your hands - you can't lose either of the values, because you still need the link. In this case, you are going to have to write some code to resolve the mess, checking for duplicates, running update statements and the like. I will assume your data is not quite that much of a disaster, and that the id does not matter. In this case, you are simply going to let MySQL's auto increment field take care of the problem with the id field. The only difference in this example is that you cannot select a DISTINCT id-name combination; rather you only use the DISTINCT on the name.

CREATE TEMPORARY TABLE 
bad_temp2(id VARCHAR(10), name VARCHAR(20))
TYPE=HEAP;

INSERT INTO bad_temp2(name) SELECT DISTINCT name FROM bad_table2;

DELETE FROM bad_table2;

INSERT INTO bad_table2(id,name) SELECT id,name FROM bad_temp2;

Now that you know how to extricate yourself from the mess, don't think it's OK to get into this situation! With good planning and careful database design, you should never have to do this sort of thing. But now you know, just in case 'someone else' mangles your data. Good luck!

» See All Articles by Columnist Ian Gilfillan

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved