Pages - Menu

TSQL - Delete Duplicate Rows Basic Techniques

Scope

From time to time, I will be asked to delete duplicate rows from the database. This is a very easy job but if you don't know the trick, it might get you off-guarded.

In this exercise, we are trying to find out if there are any products with the same SKU in the system and remove all the duplicates but leave one alone.

Steps

First we will check if any duplicates by using group by and having.

select Sku, count(Sku) as Cnt
from Product
group by Sku
having count(Sku) > 1

We found 245 duplicate products.

We will verify the above by checking the product id.

select P.[Id], Duplicates.Sku
from Product P
join (select Sku 
      from Product
      group by Sku
      having count(Sku) > 1) Duplicates on (Duplicates.Sku = P.Sku)

The same SKU shown with different product ids.

The trick we use is use aggregation to group them together and pick the odd one we want. Showing all the duplicate products with only 1 row per sku by their max id, and leave the duplicates out of this result set.

select max(P.[Id]) as Id, Duplicates.Sku
from Product P
join (select Sku 
      from Product
      group by Sku
      having count(Sku) > 1) Duplicates on (Duplicates.Sku = P.Sku)
group by Duplicates.Sku

We picked our 245 products that we will keep.

The last thing to do is to remove the duplicates except the ones we picked. This is by using delete from. Notice the deived table in the query are from previous checking code.

delete Product
from (select max(P.[Id]) as Id
      from Product P
      join (select Sku 
            from Product
            group by Sku
            having count(Sku) > 1) Duplicates 
                                          on (Duplicates.Sku = P.Sku)
      group by Duplicates.Sku) ProductToDelete
where Product.Id = ProductToDelete.Id

245 rows with duplicate SKUs deleted.

What if?

Sounds too good to be truth? What if the following scenario arise, can we still do it?
  • A legacy SQL table doesn't have ID nor primary key for us to use group by on.
  • Need to find out duplicate values of whole row or multiple fields of a row and not just 1 field (in my case SKU)
The common fallency is that we need to use cursor, multiple sql scripts or a console app to loop through something. In fact not, it is achievable with T-SQL only and these are typical job interview questions that I would ask (as an interviewer) in the SQL area. :)

No comments:

Post a Comment