|
How to delete duplicate rows in the table ? Thank you for the answer in advance.
(comments are locked)
|
|
I'm afraid this is a rather long post, it both answers the question and attempts to go into some detail about query tuning in Oracle. Let's first set up a reasonable size table, with some duplicates and some double-duplicates:
In this case, we know that object_id alone is sufficient to identify a duplicate row (as it's unique in all_objects), which just makes the query easier to read. In more complex cases, you'll just have to include more columns in the queries. To find the duplicates in a single pass, we can use the RANK analytic function for brevity:
This groups by the object_id, and orders it by the unique rowid. So, any unique rows just have a rank of 1, any duplicate rows will have a rank greater than 1. This then allows us to pull out the rowid of any duplicates, and use it in a DELETE:
In this test, this actually gives an efficient execution plan:
It looks pretty good, we're getting all duplicates in a single scan, then deleting individual rows by rowid. However, a careful examination of the execution plan shows that the optimizer isn't getting the cardinality correct, and if the table we're trying to de-duplicate has an index, it can cause a worse plan to emerge:
After adding an index and collecting full stats, it's looking at two full scans (all be it this time only index scans) and a hash join to remove a tiny number of rows. The second scan is obviously unnecessary since the subquery is giving exactly the rowids to remove. The problem is that the optimizer doesn't know the subquery will only return 30 rows, so I think that is going for the safer option of assuming the worst case of it returning all the rows in the table. If we know roughly how many rows the subquery will return, we can help the optimizer with a CARDINALITY hint:
Now we're back to a nested loop join, with the advantage of a fast full index scan to identify the duplicates. That's about as efficient as it'll get for a small number of duplicates. To confirm the hypothesis that the nested loop join is preferable in the small duplicate count case, we can test these queries with AUTOTRACE (i.e. SET AUTOTRACE ON in SQL*Plus). This shows the hash join plan using 294 logical IOs (37 db block gets and 257 consistent gets) and the nested loop plan using 229 logical IOs (96 db block gets and 133 consistent gets). Optimization should always be targeted at producing a query that does the job in the fewest possible logical IO operations. As always, I wouldn't recommend hinting unless it's the only way to get a query that's efficient enough for the requirements. Different table sizes and shapes will have different optimal query plans. Test the execution plan first then decide whether a hint is absolutely necessary. Nice. I had just gotten the rank/rowid solution to work and I was perfectly happy with that but your performance info is most helpful. Thanks.
Feb 03 '10 at 11:52 AM
KillerDBA
Very well explanation Andrew!!
Feb 04 '10 at 08:34 PM
BI DWH BALA
(comments are locked)
|
|
The answer is, "it depends." How big is the table? Do you have a candidate primary key? Do you have a sequence number as primary key and the records are otherwise duplicates? Do you have a particular query that allows you to identify the records by something other than the fact that they are duplicates? I'll presume the rows are true duplicates. Let's say this is your problem table:
You have 7 records but only want 5 uniques. This first technique for deduping is relatively portable (substitute any other platform-specific temp table syntax for the Oracle-specific temp table used here). First, we'll make a temp table with copies of the duplicated rows from the table you wish to fix.
The column list in the temp table and all select lists, group bys, etc, should match the column list in your original table, of course. You should now have 4 rows in the temp table. Then, we'll delete the original rows and reinsert the deduped rows to the original table from the temp table:
The delete should process 4 rows and the insert should process two. Added: You should probably do the delete and insert as a single transaction. Now, if you want a more Oracle-specific method, you could do something with ROWIDs. First, collect the rows in question and a ROWID to delete.
You could also use Max(ROWID), it doesn't matter which. You could just populate the dups table with the ROWID but, for clarity, I'm keeping the original values, too. Now, delete duplicate rows from the original table:
If there was no more than a single dup for each row, you're done. IF there's more than one dup for each row, then you can either write a more sophisticated procedure or you can drop the dups table and recreate it multiple times until you run out of rows to delete. With either technique, you might give some thought to locking the table before you start. I'm unsure of all of the implications of using ROWID. It could be that they can be changed on you. LOCK mytable IN EXCLUSIVE might be a good idea. Note: I tested these under Oracle 11. I believe everything I used is valid under 10g, too.
(comments are locked)
|
|
If you can ascertain the ROWID, you can issue a DELETE referencing that address. Toad has a Data Duplicates feature which makes handling this extremely easy, point and click style. I was giving some thought to a more elaborate query that did the delete in one pass, using ROWIDs. I hadn't heard about Data Duplicates. I'll look it up, thanks.
Feb 02 '10 at 04:55 PM
KillerDBA
Feb 02 '10 at 06:13 PM
HillbillyToad
Thanks for the link.
Feb 03 '10 at 11:52 AM
KillerDBA
(comments are locked)
|
|
delete from where rowid not in (select max(rowid) from group by );
(comments are locked)
|

