Dealing with MS SQL Tables that contain Duplicate Rows

Wednesday Jul 16th 2003 by Gregory A. Larsen

Every so often, you might have to deal with tables that contain duplicate rows. Greg Larsen shares techniques for identifying and deleting those rows.

Every so often, you might have to deal with tables that contain duplicate rows. In one case, you might only need to identify the duplicate rows. In other cases, you might need to remove the duplicate rows. This article will show you some different techniques for dealing with duplicate rows.

Your interpretation of a duplicate row might differ from someone else's. A duplicate row could mean that the data that makes up the key for each row is duplicated on more than one row. Another definition might be that every column in a row matches every column in another row in the same table. I will show you how to find the duplicate rows in both of these cases.

Identifying Duplicate Key Values

One of the first things you will need to consider when dealing with duplicates is how to identify the rows that contain duplicate values. To do that, you first need to decide what constitutes a duplicate record.

For the purpose of my first example, I will be using the sales table in the pubs database. The sales table has the following columns: stor_id, ord_num, ord_date, qty, payterms, and title_id. For this example, let's say a duplicate record is any record that has the same stor_id and ord_num. To identify duplicate sales records, I would use the following code:

select stor_id, ord_num , count(*) as 'Num of Dups' 
from pubs.dbo.sales
group by stor_id, ord_num 
having count(*) > 1

When this code is run I get the following results:

stor_id ord_num              Num of Dups 
------- -------------------- ----------- 
7067    P2121                3
7131    P3087a               4

In this example, I used the HAVING clause in conjunction with the GROUP BY clause to identify the duplicate records. In this case, I only showed the key values (stor_id and ord_num) associated with the duplicate records, along with the number of duplicates. In reality, these are not duplicate records if you look at all the columns in the table, but since we are only looking at stor_id and ord_num, they are considered duplicates.

Displaying Rows for Records that have Duplicate Key Values

Now my previous example only displayed the duplicate key values. Some of the time, you may want to display all the rows associated with those duplicate key values. Here is some code that will display all of the rows for each duplicate key value.

select cast(a.stor_id as char(4)) as stor_id, 
       cast(a.ord_num as char(6)) as ord_num, 
       convert(char(10),a.ord_date,110) as ord_date,
       a.qty, 
       a.payterms, 
       a.title_id 
from pubs.dbo.sales a
join 
(select stor_id, ord_num  
from pubs.dbo.sales
group by stor_id, ord_num 
having count(*) > 1) b  
on a.stor_id = b.stor_id and
   a.ord_num = b.ord_num
order by a.stor_id, a.ord_num

For this example, I took my previous TSQL command that returned the duplicate keys (stor_id and ord_num) and joined it with the pubs.dbo.sales table. The join criteria allow TSQL to return only the rows in the pubs.dbo.sales table that match the duplicate key values. The following output is displayed in the Query Analyzer results pane when I run this query:

stor_id ord_num ord_date   qty    payterms     title_id 
------- ------- ---------- ------ ------------ -------- 
7067    P2121   06-15-1992 40     Net 30       TC3218
7067    P2121   06-15-1992 20     Net 30       TC4203
7067    P2121   06-15-1992 20     Net 30       TC7777
7131    P3087a  05-29-1993 20     Net 60       PS1372
7131    P3087a  05-29-1993 25     Net 60       PS2106
7131    P3087a  05-29-1993 15     Net 60       PS3333
7131    P3087a  05-29-1993 25     Net 60       PS7777

Identifying Truly Duplicate Rows

My definition of a truly duplicate row is a row whose columns contain exactly the same values as those of another record in the same table. To identify true duplicate records, you will need to include every column in the table in the GROUP BY clause. Let me run through an example that shows true duplicates.

Say you have an auto sales company. The sales people always want to know how many vehicles you have on your lot that are exactly the same. Basically, duplicate records in your vehicles file. Let's say your vehicles file looks like this:

make       model      color      
---------- ---------- ---------- 
Ford       Ranger     Red       
Ford       Ranger     Green     
Toyota     Altima     Red       
Toyota     Altima     Green     
Toyota     Altima     Red       
Toyota     Altima     Blue      
Toyota     Altima     Blue      
Plymouth   Voyager    Blue      
Plymouth   Voyager    White     
Plymouth   Voyager    Green     
Plymouth   Voyager    Red       
Plymouth   Voyager    Gold      
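Before looking at a generic solution, note that for this specific three-column table the duplicate check is just a GROUP BY over all of its columns, following the same pattern used for the sales table above:

```sql
-- Static duplicate check for the three-column vehicles table
select make, model, color, count(*) as 'Num of Dups'
from vehicles
group by make, model, color
having count(*) > 1
```

This works fine as long as you are willing to hand-code the column list for each table you check.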

To find the duplicate records, I could use the following code. This code is a little more complicated than it needs to be for this example. This complexity is because I built the code to be dynamic, so it will work on any table you specify. The code queries the information_schema.columns view to return the column names for all columns in the specified table, and uses those names to dynamically build the select statement. By changing the value of the @table variable, this code will find true duplicates in whichever table you specify.

declare @cmd varchar(4000)
declare @table varchar(100)
declare @curr_col varchar(100)
declare @old_col varchar(100)
declare @column_names varchar(4000)
-- Set the table to look for duplicates
set @table = 'vehicles'
set @curr_col = ''
-- Get name of first column 
select top 1 @curr_col=column_name   
  from information_schema.columns 
  where table_name = @table order by column_name
set @column_names = @curr_col
set @old_col = @curr_col
-- Get name of second column
select top 1 @curr_col=column_name 
  from information_schema.columns 
  where table_name = @table 
                  and 
        column_name > @old_col 
  order by column_name
-- Process all columns
while @curr_col <> @old_col
begin
  set @column_names = rtrim(@column_names) + ',' + rtrim(@curr_col)
  set @old_col = @curr_col
  -- Get next column 
  select top 1 @curr_col=column_name 
    from information_schema.columns 
    where table_name = @table 
                    and 
          column_name > @old_col 
    order by column_name
end
-- build the command to search for duplicates
set @cmd = 'select * from ' + rtrim(@table) +
           ' group by ' + rtrim(@column_names) + 
           ' having count(*) > 1'
-- Find duplicates
exec (@cmd)

The results of the query show the salespeople that the following vehicles are identical:

make       model      color      
---------- ---------- ---------- 
Toyota     Altima     Blue      
Toyota     Altima     Red  
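As an aside, the column-name loop above can be written more compactly using the T-SQL idiom of concatenating into a variable during a SELECT. Note that SQL Server does not guarantee the order of this kind of concatenation, though in practice it is a widely used technique; this sketch assumes the same vehicles table:

```sql
declare @column_names varchar(4000)
-- Append each column name in turn; coalesce handles the initial NULL
select @column_names = coalesce(@column_names + ',', '') + column_name
  from information_schema.columns 
  where table_name = 'vehicles'
  order by ordinal_position
print @column_names
```

The resulting comma-separated list can then be plugged into the dynamic GROUP BY command exactly as in the loop version.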

What to do with Duplicate Records

Now that you know there are duplicates in your table, the question is what to do with them. What caused the duplicates may dictate what should be done with them. In my vehicles example, the business logic allowed for duplicate records. In other situations, duplicates should not be allowed, and you will need to decide what to do with them. I will take you through a few examples of how to eliminate duplicated records.

In order to delete duplicate records, you need to determine which duplicate to delete. In the case where we have true duplicates, like my vehicles example above, you really don't have to determine which record to delete, since all duplicate rows are exactly the same. In my other example, where the duplication is only in the key values, you will need to determine which one to keep. Do you keep the first one or the last one? How do you identify the first and the last one if there is no datetime column helping you determine which is which? Remember, ordinal position in SQL Server can change as records are deleted and added. Let's review a few examples that might help you when deleting duplicates in your environment.

Deleting All But the Most Current Record

In this example, I will show you some code that deletes all but the most recent record. Let's say you have an application with a business rule that requires that you keep no more than a single transaction for each customer in the cust_tran table. When the original application was built, this was not a business rule, but due to the volume of customer transactions, this single-transaction rule is now required. Rather than go back to every program that appends customer transactions to the cust_tran table, it was decided that a nightly process would be run to remove the duplicate transactions for each customer, so that only the most recent transaction remains in the table.

Now say your current cust_tran table has the following records.

cust_id     tran_date        prod_id     qty         unit_price 
----------- ---------------- ----------- ----------- ---------- 
1           2003-07-01 00:00 98          1           9.99
2           2003-07-01 00:00 22          2           19.99
3           2003-07-01 00:00 18          1           5.99
1           2003-07-02 00:00 7           1           89.99
2           2003-07-01 01:01 15          1           2.99
3           2003-07-02 00:00 1           10          .99
4           2003-07-01 00:00 18          4           5.99
5           2003-07-02 00:00 98          1           9.99

The tran_date column will be used to determine which record is the most recent entry in the table for each cust_id. In this example, cust_id 1, 2, and 3 each have multiple records. To remove the duplicate records for cust_id 1, 2, and 3, the following logic can be used.

delete from a
from cust_tran a join 
     (select cust_id, max(tran_date) max_tran_date from cust_tran 
             group by cust_id 
             having count(*) > 1) b
      on a.cust_id = b.cust_id and a.tran_date < b.max_tran_date

This code uses a sub-query and a join to identify the records that need to be deleted. The sub-query identifies the key value and the most recent tran_date for each customer that has duplicate records. The rows returned from this sub-query are then joined back to the cust_tran table, matching each cust_tran row whose tran_date is less than the most recently entered transaction for that cust_id. Upon completion of this delete statement, there will be no more than a single cust_tran record for each customer, even if a customer had more than two duplicate records to start with.

Deleting All the True Duplicates

As stated earlier, deleting true duplicates is the process of removing all but one of the duplicate records. No decision will be needed to determine which one to delete, since all records contain exactly the same information. I will show you a couple of ways to accomplish deleting true duplicates.

In the first delete example, I will add an identity column to my table containing duplicates. The identity column will allow me to assign a unique integer value to each row, thus allowing me to distinguish one duplicate row from the next, and therefore allowing me to delete all but one of the duplicates. Here is an example on how I used the identity column to remove duplicates from our earlier vehicles table example.

alter table vehicles 
  add seq_num int identity
go
delete from a
from vehicles a join 
     (select make, model, color, max(seq_num) max_seq_num from vehicles 
             group by make, model, color
             having count(*) > 1) b
      on a.make = b.make and 
         a.model = b.model and
         a.color = b.color and 
         a.seq_num < b.max_seq_num
go 
alter table vehicles
 drop column seq_num
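On SQL Server 2005 and later (features not available when this article was written), a common alternative is to number the duplicates with the ROW_NUMBER() function inside a common table expression and delete through it, which avoids adding and dropping the temporary identity column entirely. This sketch assumes the same vehicles table:

```sql
;with numbered as
(
  -- Number the rows within each group of identical rows
  select row_number() over
           (partition by make, model, color
            order by (select 0)) as row_num
  from vehicles
)
-- Keep the first row of each group, delete the rest
delete from numbered where row_num > 1
```

Deleting through the CTE removes the underlying rows in vehicles, leaving exactly one row per unique make, model, and color combination.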

Another method of deleting true duplicates is to create a temporary table to hold all of the original table's rows. Next, delete all the rows in the original table and re-populate it using an INSERT INTO with a SELECT DISTINCT statement. Here is how I would remove the true duplicates from the vehicles table using this method.

create table #vehicles (make  char(10), 
                        model char(10), 
                        color char(10))
insert into #vehicles select * from vehicles
truncate table vehicles
insert into vehicles select distinct * from #vehicles
drop table #vehicles

Making Each Record Unique by Adding Another Column to the Duplicate Key

In our vehicles file example, we have duplicate records because our duplicate key, which consists of make, model, and color, does not uniquely identify a single record. Four of the records share just two unique key values. If we alter the table to add a dupnum column and then populate it with a numeric value from 1 to N for each key value, where N is the number of rows per key value, then we will be able to uniquely identify each record. Here is the code necessary to add the dupnum column and populate it with the proper dupnum value for each key value.

alter table vehicles add seq_num int identity
alter table vehicles add dupnum int
go

update vehicles 
set dupnum = rowno 
from vehicles d, 
     (select (select count(*) 
                from vehicles a
                where a.make = b.make and
                      a.model = b.model and 
                      a.color = b.color and 
                      a.seq_num <= b.seq_num
             ) as rowno, 
             b.make, b.model, b.color, b.seq_num
        from vehicles b) c 
where c.make = d.make and 
      c.model = d.model and 
      c.color = d.color and 
      c.seq_num = d.seq_num
go
alter table vehicles drop column seq_num

If you review this code, you will notice that I first add an identity column called seq_num to my vehicles table. This allows me to tell one duplicate row from another. In addition, I added a new column called dupnum to the vehicles table. The dupnum column is populated by numbering the rows within each key value. If there is no duplication for a key, this column is set to 1. If there are duplicate records for a given key value, then this column is populated with 1 for the first duplicate record, 2 for the second, 3 for the third, and so on. To calculate these sequential numbers, I used a correlated sub-query that counts (count(*)) the records with the same key value that have a seq_num less than or equal to that of the current record being processed.

Conclusion

Duplicate records do come up occasionally, and knowing how to find and deal with them is important. The examples I have shown for dealing with duplicates are simple ones. Most likely, your environment will have duplicate situations that are a little more complex. I hope that these examples can be tailored to your environment to help you resolve most of your duplicate record situations.

» See All Articles by Columnist Gregory A. Larsen
