Deleting duplicate rows in a table?

How do I delete duplicate rows in a table? Thank you in advance for any answer.

asked Jan 29 '10 at 11:28 AM by OracleApprender


6 answers

The answer is, "it depends."

How big is the table? Do you have a candidate primary key? Do you have a sequence number as primary key and the records are otherwise duplicates? Do you have a particular query that allows you to identify the records by something other than the fact that they are duplicates?

I'll presume the rows are true duplicates. Let's say this is your problem table:

CREATE TABLE mytable
(col1 VARCHAR(20),
col2 VARCHAR(20))
;

INSERT INTO mytable VALUES ('a','a');
INSERT INTO mytable VALUES ('b','b');
INSERT INTO mytable VALUES ('b','b');
INSERT INTO mytable VALUES ('c','c');
INSERT INTO mytable VALUES ('d','d');
INSERT INTO mytable VALUES ('d','e');
INSERT INTO mytable VALUES ('d','e');

You have 7 records but only want 5 uniques.

This first technique for deduping is relatively portable (substitute any other platform-specific temp table syntax for the Oracle-specific temp table used here). First, we'll make a temp table with copies of the duplicated rows from the table you wish to fix.

CREATE GLOBAL TEMPORARY TABLE mytabledups
(col1 VARCHAR2(20),
col2 VARCHAR2(20))
ON COMMIT PRESERVE ROWS
;

INSERT INTO mytabledups
SELECT
  col1,
  col2
FROM mytable
GROUP BY
  col1,
  col2
HAVING
  (Count(*)>1)  -- this gets only the records that are dups
;

The column list in the temp table, and in all select lists and GROUP BY clauses, should of course match the column list in your original table. You should now have 2 rows in the temp table, one per duplicate group: ('b','b') and ('d','e'). Next, we'll delete the original duplicated rows and reinsert the deduped rows into the original table from the temp table:

DELETE FROM mytable mt
WHERE EXISTS (SELECT * FROM mytabledups mtd
    	WHERE mt.col1 = mtd.col1
    	AND mt.col2 = mtd.col2)
;

INSERT INTO mytable
SELECT * FROM mytabledups
;

The delete should process 4 rows and the insert should process two.

Added: You should probably do the delete and insert as a single transaction.
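The whole round trip can be sketched end to end in Python, with SQLite standing in for Oracle (a SQLite temp table in place of the global temporary table, one transaction covering the delete and reinsert); table and column names follow the example above. This is an illustration of the technique, not Oracle syntax:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE mytable (col1 TEXT, col2 TEXT)")
cur.executemany("INSERT INTO mytable VALUES (?, ?)",
                [('a', 'a'), ('b', 'b'), ('b', 'b'), ('c', 'c'),
                 ('d', 'd'), ('d', 'e'), ('d', 'e')])

# Copy each duplicated group aside (one row per group), delete every
# copy from the base table, then reinsert the single kept copy.
cur.execute("""CREATE TEMP TABLE mytabledups AS
               SELECT col1, col2 FROM mytable
               GROUP BY col1, col2 HAVING COUNT(*) > 1""")
cur.execute("""DELETE FROM mytable
               WHERE EXISTS (SELECT * FROM mytabledups d
                             WHERE mytable.col1 = d.col1
                               AND mytable.col2 = d.col2)""")
cur.execute("INSERT INTO mytable SELECT * FROM mytabledups")
con.commit()

print(cur.execute("SELECT COUNT(*) FROM mytable").fetchone()[0])  # 5
```

The delete removes 4 rows (both copies of ('b','b') and ('d','e')), the insert puts back 2, leaving the 5 unique rows.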

Now, if you want a more Oracle-specific method, you could do something with ROWIDs. First, collect the rows in question and a ROWID to delete.

CREATE TABLE mytabledups
AS
SELECT
    col1,
    col2,
    -- and the rest of the columns in the table
    Min(ROWID) AS myrowid
FROM mytable
GROUP BY
    col1,
    col2
HAVING (Count(*)>1)
;

You could also use Max(ROWID); it doesn't matter which. You could populate the dups table with just the ROWID but, for clarity, I'm keeping the original values too. Now, delete the duplicate rows from the original table:

DELETE FROM mytable
WHERE ROWID IN (SELECT myrowid FROM mytabledups)
;

If no row had more than a single duplicate, you're done. If some rows have more than one duplicate, then you can either write a more sophisticated procedure or drop the dups table, recreate it, and repeat the delete until you run out of rows to delete.
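That repeat-until-done loop can be sketched like this (Python with SQLite standing in for Oracle; SQLite's rowid plays the role of Oracle's ROWID, and I've added extra copies of ('b','b') and ('d','e') so that one pass isn't enough):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE mytable (col1 TEXT, col2 TEXT)")
# ('b','b') appears 3 times and ('d','e') 4 times, so a single
# delete-the-min-rowid pass cannot remove every extra copy.
cur.executemany("INSERT INTO mytable VALUES (?, ?)",
                [('a', 'a')] + [('b', 'b')] * 3 + [('d', 'e')] * 4)

# Each pass deletes one copy per still-duplicated group; repeat
# until a pass deletes nothing.
while True:
    cur.execute("""DELETE FROM mytable WHERE rowid IN
                   (SELECT MIN(rowid) FROM mytable
                    GROUP BY col1, col2 HAVING COUNT(*) > 1)""")
    if cur.rowcount == 0:
        break
con.commit()

print(cur.execute("SELECT COUNT(*) FROM mytable").fetchone()[0])  # 3
```

Three passes run here before the loop terminates, leaving one copy of each of the 3 distinct rows.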

With either technique, you might give some thought to locking the table before you start. I'm unsure of all of the implications of using ROWID; it's possible that rowids can change underneath you. LOCK TABLE mytable IN EXCLUSIVE MODE might be a good idea.

Note: I tested these under Oracle 11. I believe everything I used is valid under 10g, too.

answered Jan 29 '10 at 02:23 PM by KillerDBA

I'm afraid this is a rather long post; it both answers the question and goes into some detail about query tuning in Oracle.

Let's first set up a reasonable size table, with some duplicates and some double-duplicates:

SQL> CREATE TABLE t AS SELECT * FROM all_objects;

Table created.

SQL> INSERT INTO t SELECT * FROM all_objects WHERE rownum <= 20;

20 rows created.

SQL> INSERT INTO t SELECT * FROM all_objects WHERE rownum <= 10;

10 rows created.

In this case, we know that object_id alone is sufficient to identify a duplicate row (as it's unique in all_objects), which just makes the query easier to read. In more complex cases, you'll just have to include more columns in the queries.

To find the duplicates in a single pass, we can use the RANK analytic function for brevity:

SELECT object_id,RANK () OVER (PARTITION BY object_id ORDER BY rowid) AS rank FROM t;

This partitions the rows by object_id and orders each partition by the (unique) rowid. The first row in each partition gets a rank of 1, so any truly unique row has rank 1; the extra copies of a duplicated row get a rank greater than 1.

This then allows us to pull out the rowid of any duplicates, and use it in a DELETE:

DELETE FROM t WHERE rowid IN
    (SELECT rowid FROM 
        (SELECT rowid,rank() OVER (PARTITION BY object_id ORDER BY rowid) AS rank FROM t)
     WHERE rank > 1);
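The same analytic delete can be exercised outside Oracle, since SQLite (3.25+) also supports RANK() and exposes a rowid; here is a small runnable sketch in Python with made-up data (the object_name column and the values are illustrative, not from all_objects):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (object_id INTEGER, object_name TEXT)")
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [(1, 'x'), (2, 'y'), (2, 'y'),
                 (3, 'z'), (3, 'z'), (3, 'z')])

# Rank rows within each object_id by rowid; anything ranked past 1
# is a duplicate copy and gets deleted by rowid.
cur.execute("""DELETE FROM t WHERE rowid IN
               (SELECT rid FROM
                 (SELECT rowid AS rid,
                         RANK() OVER (PARTITION BY object_id
                                      ORDER BY rowid) AS rnk
                  FROM t)
                WHERE rnk > 1)""")
con.commit()

print(cur.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 3
```

Unlike the min/max-ROWID approach, this removes every extra copy in a single pass, however many duplicates a group has.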

In this test, this actually gives an efficient execution plan:

SQL> EXPLAIN PLAN FOR DELETE FROM t WHERE rowid IN
  2  (SELECT rowid FROM 
  3  (SELECT rowid,rank() OVER (PARTITION BY object_id ORDER BY rowid) AS rank
  4  FROM t)
  5  WHERE rank > 1);

Explained.

SQL> select * from table(dbms_xplan.display());
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2005107446

-------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | DELETE STATEMENT             |          |     1 |    24 |       |  1047   (1)| 00:00:13 |
|   1 |  DELETE                      | T        |       |       |       |            |          |
|   2 |   NESTED LOOPS               |          |     1 |    24 |       |  1047   (1)| 00:00:13 |
|   3 |    VIEW                      | VW_NSO_1 | 67584 |   792K|       |   734   (1)| 00:00:09 |
|   4 |     SORT UNIQUE              |          |     1 |  1650K|       |            |          |
|*  5 |      VIEW                    |          | 67584 |  1650K|       |   734   (1)| 00:00:09 |
|   6 |       WINDOW SORT            |          | 67584 |  1650K|  2408K|   734   (1)| 00:00:09 |
|   7 |        TABLE ACCESS FULL     | T        | 67584 |  1650K|       |   245   (1)| 00:00:03 |
|   8 |    TABLE ACCESS BY USER ROWID| T        |     1 |    12 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - filter("RANK">1)

Note
-----
   - dynamic sampling used for this statement (level=2)

24 rows selected.

It looks pretty good: we're getting all duplicates in a single scan, then deleting individual rows by rowid. However, a careful examination of the execution plan shows that the optimizer isn't getting the cardinality right, and if the table we're trying to de-duplicate has an index, a worse plan can emerge:

SQL> CREATE INDEX t_ix ON t(object_id);

Index created.

SQL> exec dbms_stats.gather_table_stats(ownname=>'AJM',tabname=>'T',cascade=>TRUE,method_opt=>'FOR ALL COLUMNS SIZE AUTO',estimate_percent=>100);

PL/SQL procedure successfully completed.

SQL> EXPLAIN PLAN FOR DELETE FROM t WHERE rowid IN
  2  (SELECT rowid FROM
  3  (SELECT rowid,rank() OVER (PARTITION BY object_id ORDER BY rowid) AS rank FROM t)
  4  WHERE rank > 1);
Explained.

SQL> select * from table(dbms_xplan.display());
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1394499395

----------------------------------------------------------------------------------------------
| Id  | Operation                 | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | DELETE STATEMENT          |          | 55673 |  1576K|       |   470   (1)| 00:00:06 |
|   1 |  DELETE                   | T        |       |       |       |            |          |
|*  2 |   HASH JOIN RIGHT SEMI    |          | 55673 |  1576K|       |   470   (1)| 00:00:06 |
|   3 |    VIEW                   | VW_NSO_1 | 55673 |   652K|       |   344   (1)| 00:00:05 |
|*  4 |     VIEW                  |          | 55673 |  1359K|       |   344   (1)| 00:00:05 |
|   5 |      WINDOW SORT          |          | 55673 |   924K|  1544K|   344   (1)| 00:00:05 |
|   6 |       INDEX FAST FULL SCAN| T_IX     | 55673 |   924K|       |    35   (0)| 00:00:01 |
|   7 |    INDEX FULL SCAN        | T_IX     | 55673 |   924K|       |   125   (0)| 00:00:02 |
----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(ROWID="$kkqu_col_1")
   4 - filter("RANK">1)

20 rows selected.

After adding an index and collecting full statistics, it's doing two full scans (albeit only index scans this time) and a hash join, all to remove a tiny number of rows. The second scan is clearly unnecessary, since the subquery yields exactly the rowids to remove. The problem is that the optimizer doesn't know the subquery will return only 30 rows, so I think it is going for the safer option and assuming the worst case of it returning every row in the table. If we know roughly how many rows the subquery will return, we can help the optimizer with a CARDINALITY hint:

SQL> EXPLAIN PLAN FOR DELETE FROM t WHERE rowid IN
  2  (SELECT rowid FROM
  3  (SELECT /*+ CARDINALITY (t 30) */ rowid,rank() OVER (PARTITION BY object_id ORDER BY rowid) AS rank FROM t)
  4  WHERE rank > 1);

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3094618761

-----------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | DELETE STATEMENT             |          |     1 |    29 |    38   (6)| 00:00:01 |
|   1 |  DELETE                      | T        |       |       |            |          |
|   2 |   NESTED LOOPS               |          |     1 |    29 |    38   (6)| 00:00:01 |
|   3 |    VIEW                      | VW_NSO_1 |    30 |   360 |    36   (3)| 00:00:01 |
|   4 |     SORT UNIQUE              |          |     1 |   750 |            |          |
|*  5 |      VIEW                    |          |    30 |   750 |    36   (3)| 00:00:01 |
|   6 |       WINDOW SORT            |          |    30 |   510 |    36   (3)| 00:00:01 |
|   7 |        INDEX FAST FULL SCAN  | T_IX     |    30 |   510 |    35   (0)| 00:00:01 |
|   8 |    TABLE ACCESS BY USER ROWID| T        |     1 |    17 |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - filter("RANK">1)

20 rows selected.

Now we're back to a nested loop join, with the advantage of a fast full index scan to identify the duplicates. That's about as efficient as it'll get for a small number of duplicates.

To confirm the hypothesis that the nested loop join is preferable in the small duplicate count case, we can test these queries with AUTOTRACE (i.e. SET AUTOTRACE ON in SQL*Plus). This shows the hash join plan using 294 logical IOs (37 db block gets and 257 consistent gets) and the nested loop plan using 229 logical IOs (96 db block gets and 133 consistent gets). Optimization should always be targeted at producing a query that does the job in the fewest possible logical IO operations.

As always, I wouldn't recommend hinting unless it's the only way to get a query that's efficient enough for the requirements. Different table sizes and shapes will have different optimal query plans. Test the execution plan first then decide whether a hint is absolutely necessary.

answered Feb 03 '10 at 07:01 AM by Andrew Mobbs

Nice. I had just gotten the rank/rowid solution to work and I was perfectly happy with that but your performance info is most helpful. Thanks.
Feb 03 '10 at 11:52 AM KillerDBA
Very good explanation, Andrew!
Feb 04 '10 at 08:34 PM BI DWH BALA

If you can ascertain the ROWID, you can issue a DELETE referencing that address. Toad has a Data Duplicates feature that makes handling this extremely easy, point-and-click style.

answered Feb 02 '10 at 02:54 PM by HillbillyToad

I was giving some thought to a more elaborate query that did the delete in one pass, using ROWIDs. I hadn't heard about Data Duplicates. I'll look it up, thanks.
Feb 02 '10 at 04:55 PM KillerDBA
Thanks for the link.
Feb 03 '10 at 11:52 AM KillerDBA

delete from <table> where rowid not in (select max(rowid) from <table> group by <columns>);

For the example table from the first answer, that would be:

DELETE FROM mytable
WHERE ROWID NOT IN (SELECT Max(ROWID) FROM mytable GROUP BY col1, col2);
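Applied to sample data, the idea runs like this (sketched in Python with SQLite, whose rowid mimics Oracle's ROWID; the table and values echo the first answer's example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE mytable (col1 TEXT, col2 TEXT)")
cur.executemany("INSERT INTO mytable VALUES (?, ?)",
                [('a', 'a'), ('b', 'b'), ('b', 'b'),
                 ('d', 'e'), ('d', 'e'), ('d', 'e')])

# Keep only the highest rowid in each group; every other copy goes,
# no matter how many duplicates a group has.
cur.execute("""DELETE FROM mytable WHERE rowid NOT IN
               (SELECT MAX(rowid) FROM mytable GROUP BY col1, col2)""")
con.commit()

print(cur.execute("SELECT COUNT(*) FROM mytable").fetchone()[0])  # 3
```

Note that the NOT IN form is single-pass: it handles groups with any number of duplicates, without the repeated deletes the dups-table approach may need.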

answered Feb 26 '10 at 05:40 AM by murali 1

Hi,

Another way to do this is to use the HashBytes() function to generate a common key for each distinct record, then delete the rows that share a key.

A detailed how-to is available at http://www.nixtips.com/2014/02/21/sql-how-to-identify-duplicates-from-a-table/

answered Feb 21 at 02:39 PM by niccoo

Seen: 3010 times

Last Updated: Feb 21 at 02:40 PM