Scope
From time to time, I will be asked to delete duplicate rows from the database. This is a very easy job but if you don't know the trick, it might get you off-guarded.
In this exercise, we are trying to find out if there are any products with the same SKU in the system and remove all the duplicates but leave one alone.
In this exercise, we are trying to find out if there are any products with the same SKU in the system and remove all the duplicates but leave one alone.
Steps
First we will check if any duplicates by using group by and having.
select Sku, count(Sku) as Cnt from Product group by Sku having count(Sku) > 1
We found 245 duplicate products. |
We will verify the above by checking the product id.
select P.[Id], Duplicates.Sku from Product P join (select Sku from Product group by Sku having count(Sku) > 1) Duplicates on (Duplicates.Sku = P.Sku)
The same SKU shown with different product ids. |
The trick we use is use aggregation to group them together and pick the odd one we want. Showing all the duplicate products with only 1 row per sku by their max id, and leave the duplicates out of this result set.
select max(P.[Id]) as Id, Duplicates.Sku from Product P join (select Sku from Product group by Sku having count(Sku) > 1) Duplicates on (Duplicates.Sku = P.Sku) group by Duplicates.Sku
We picked our 245 products that we will keep. |
The last thing to do is to remove the duplicates except the ones we picked. This is by using delete from. Notice the deived table in the query are from previous checking code.
delete Product from (select max(P.[Id]) as Id from Product P join (select Sku from Product group by Sku having count(Sku) > 1) Duplicates on (Duplicates.Sku = P.Sku) group by Duplicates.Sku) ProductToDelete where Product.Id = ProductToDelete.Id
245 rows with duplicate SKUs deleted. |
What if?
Sounds too good to be truth? What if the following scenario arise, can we still do it?
- A legacy SQL table doesn't have ID nor primary key for us to use group by on.
- Need to find out duplicate values of whole row or multiple fields of a row and not just 1 field (in my case SKU)
The common fallency is that we need to use cursor, multiple sql scripts or a console app to loop through something. In fact not, it is achievable with T-SQL only and these are typical job interview questions that I would ask (as an interviewer) in the SQL area. :)
No comments:
Post a Comment