How to remove duplicate rows from a table in SQL
There are a couple of ways to remove duplicate rows from a table in SQL e.g. you can use a temp tables or a window function like row_number() to generate artificial ranking and remove the duplicates. By using a temp table, you can first copy all unique records into a temp table and then delete all data from the original table and then copy unique records again to the original table. This way, all duplicate rows will be removed, but with large tables, this solution will require additional space of the same magnitude of the original table. The second approach doesn’t require extra space as it removes duplicate rows directly from the table. It uses a ranking function like row_number() to assign a row number to each row.
By using partition by clause you can reset the row numbers on a particular column. In this approach, all unique rows will have row number = 1 and duplicate row will have row_number > 1, which gives you an easy option to remove those duplicate rows. You can do that by using a common table expression (see T-SQL Fundamentals) or without it on Microsoft SQL Server.
No doubt that SQL queries are the integral part of any programming job interview which requires database and SQL knowledge. The queries are also very interesting to check candidate’s logical reasoning ability.
Earlier, I have shared a list of frequently asked SQL queries from interviews and this article is an extension of that. I have shared lot of good SQL based problem on that article and users have also shared some excellent problems in the comments, which you should look.
Btw, this is the follow-up question of another popular SQL interview question, how do you find duplicate records in a table, which we have discussed earlier. This is an interesting question because many candidates confuse themselves easily.
Some candidate says that they will find duplicate by using group by and printing name which has counted more than 1, but when it comes to deleting this approach doesn’t work, because if you delete using this logic both duplicate and unique row will get deleted.
This little bit of extra detail e.g. row_number makes this problem challenging for many programmers who don’t use SQL on a daily basis. Now, let’s see our solution to delete duplicate rows from a table in SQL Serve
Setup
Before exploring a solution, let’s first create the table and populate with test data to understand both problem and solution better. I am using a temp table to avoid leaving test data into the database once we are done. Since temp tables are cleaned up once you close the connection to the database, they are best suited for testing.
In our table, I have just one column for simplicity, if you have multiple columns then the definition of duplicate depends whether all columns should be equal or some key columns e.g. name and city can be same for two unique persons. In such cases, you need to extend the solution by using those columns on key places e.g. on a distinct clause in the first solution and on the partition by in second solution.
Anyway, here is our temp table with test data, it is carefully constructed to have duplicates, you can see that C++ is repeated thrice while Java is repeated twice in the table.
-- create a temp table for testing create table #programming (name varchar(10)); -- insert data with duplicate, C++ is repeated 3 times, while Java 2 times insert into #programming values ('Java'); insert into #programming values ('C++'); insert into #programming values ('JavaScript'); insert into #programming values ('Python'); insert into #programming values ('C++'); insert into #programming values ('Java'); insert into #programming values ('C++'); -- cleanup drop table #programming
Solution 1 – Use temp table
Yes, this is the most simple but logical way to remove duplicate elements from a table and it will work across database e.g. MySQL, Oracle or SQL Server. The idea is to copy unique rows into a temp table. You can find unique rows by using distinct clause. Once unique rows are copied, delete everything from the original table and then copy unique rows again. This way, all the duplicate rows have been removed as shown below.
-- removing duplicate using copy, delete and copy select distinct name into #unique from #programming delete from #programming; insert into #programming select * from #unique -- check after select * from #programming name Java C++ JavaScript Python
You can see the duplicate occurrences of Java and C++ have been removed from the #programming temp table. This is by far the simplest solution and also quite easy to understand but it doesn’t come to your mind without practicing. I suggest solving some SQL puzzles from Joe Celko’s classic book, SQL Puzzles and Answers, Second Edition to develop your SQL sense. It’s a great practice book to learn and master SQL logic.
Solution 2 – Using row_number() and derived table
The row_number() is one of several ranking functions provided by SQL Server, It also exists in Oracle database. You can use this function provide ranking to rows. You can further use partition by to tell SQL server that what would be the window. This way row number will restart as soon as a different name comes up but for the same name, all rows will get sequential numbers e.g. 1, 2, 3 etc. Now, it’s easy to spot the duplicates in the derived table as shown in the following example:
select * from (select *, row_number() OVER ( partition by name order by name) as rn from #programming) dups name rn C++ 1 C++ 2 C++ 3 Java 1 Java 2 JavaScript 1 Python 1
Now, you can remove all the duplicates which are nothing but rows with rn > 1 , as done by following SQL query:
delete dups from (select *, row_number() OVER ( partition by name order by name) as rn from #programming) dups WHERE rn > 1 (3 row(s) affected)
now, if you check the
#programming table again there won’t be any duplicates.
select * from #programming name Java C++ JavaScript Python
Here is a nice summary of all three ways to remove duplicates from a table using SQL:
Solution 3 – using CTE
The CTE stands for common table expression, which is similar to derived table and used to the temporary result set that is defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement. Similar to derived table, CTE is also not stored as an object and lasts only for the duration of the query. You can rewrite the previous solution using CTE as shown below:
;with cte as (select row_number() over (partition by name order by(select 0)) rn from #programming) delete from cte where rn > 1
The logic is exactly similar to the previous example and I am using select 0 because it’s arbitrary which rows to preserve in the event of a tie as both contents same data. If you are new to CTE then I suggest reading T-SQL Fundamentals, one of the best books to learn SQL Server fundamentals.
That’s all about how to remove duplicate rows from a table in SQL. As I said, this is one of the frequently asked SQL queries, so be prepare for that when you go for your programming job interview. I have tested the query in SQL Server 2008 and they work fine and you might need to tweak them a little bit depending upon the database you are going to use e.g. MySQL, Oracle or PostgreSQL. Feel free to post, if you face any issue while removing duplicates in Oracle, MySQL or any other database.
Other Frequently asked SQL queries from Interviews
- How to find the 2nd highest salary of an employee in SQL? (answer)
- How to join three tables in one SQL query? (solution)
- How to find all table names in a database? (query)
- How do you create backup of table or copy of table using SQL? (answer)
- How do you find all customers who have never ordered? (solution)
- Can you write pagination query for Oracle using row_number? (query)
- How do you find Nth highest salary of an employee using the correlated query? (solution)
- SQL Puzzles and Answers by Joe Celko (read)
Reference: | How to remove duplicate rows from a table in SQL from our JCG partner Javin Paul at the Javarevisited blog. |
Suppose you have billions of records in the table and out of them thousands are getting duplicate then I think following method will be more optimized way to go for it. step1: First only go for those records which are getting duplicates SELECT DISTINCT C1 INTO #OnlyUniqueFromDuplicates FROM [Table Having Duplicates] GROUP BY C1 HAVING COUNT(C1) > 1 Step2: Delete only the Duplicate records from the Master Table [Table Having Duplicates] DELETE FROM [Table Having Duplicates] as E WHERE EXISTS ( SELECT C1 FROM #OnlyUniqueFromDuplicates as I Where I.C1 = E.C1 ) Step3: Now Insert the Unique Records into Master… Read more »