question

SQL_Student avatar image
SQL_Student asked

Removing Duplicates

Hello, I am trying to remove duplicate records from a query and would like to remove them within a LEFT JOIN. I have two tables in the JOIN, one with unique records (table A) and I am joining it to table B using the primary key of table A where that key may exist in table B on more than one record. All records in Table B a unique record created date and I need to link to table B on the record that has the MAX created date. From here I have to join table C to table B using the primary key of table B (and the record of table B that had the MAX created date) as table C has data I need in my result set. I have tried may different methods (except a temp table as I am trying to do this without creating one), all with no luck. Does anyone have any suggestions?
t-sqldeduplicationwindowing-functions
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

WilliamD avatar image
WilliamD answered
You didn't mention the version of sql server. As of 2005, you can use windowing functions to achieve what you want. You would assign a [ROW_NUMBER()][1] partitioning by the primary key, ordering by the date column descending. This will give the newest record a rownumber of 1. You put all this into a CTE/sub-query and join to table c on the PK and rownumber 1. I have kept this theoretical, as judging by your username and question, I am guessing this is a homework question. If you want some more heelp, supply us with what you have in t-sql so far and we can take it from there. [1]: http://msdn.microsoft.com/en-us/library/ms186734.aspx
2 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Oleg avatar image Oleg commented ·
@SQL_Student Now that you have tried, but keep getting errors, it might be OK to show the query which will put you on the right track. The idea behind using row_number() function partitioning by the primary key of the TableA is that it works as the typical sub-numbering, i.e. the B records ordered by CreatedDate desc are numbered so the record with greatest date gets to be #1 and older records for the same key get higher #s. The # gets reset the moment that ID changes. If you pick only those records from the A and B join which have their "record #" equal to 1, you will get one-to-one results. After that, assuming that your TableC has one-to-one relationship with TableB, you can just join the 2. Here is the script:
;with records (KeyOfTableA, KeyOfTableB, CreatedDate) as
(
    select
        a.KeyOfTableA, b.KeyOfTableB, b.CreatedDate,
        row_number() over (partition by a.KeyOfTableA 
            order by b.CreatedDate desc) Number
        from TableA a inner join TableB b
            on a.KeyOfTableA = a.KeyOfTableA
)
    select records.* 
        from records inner join TableC c
            on records.KeyOfTableB = c.KeyOfTableB
        where records.Number = 1;
2 Likes 2 ·
SQL_Student avatar image SQL_Student commented ·
Thank you for your quick response. I am using SQL 2005 and have tried to do what you suggested but I must be missing something as I keep getting various errors, it is due to my limited knowledge but your response lets me know I am on the right track. If I can't get it figured out I will provide the details of what I am trying to accomplish.
0 Likes 0 ·
TimothyAWiseman avatar image
TimothyAWiseman answered
WilliamD probably has the best answer, but I often find it useful to consider more than one solution even when confident you have the best one. Another avenue would be select in the join only where there does not exist another matching record with a greater date. IN a very general sense it would look like select [colList] from TableA as a left join TableB as b on a.pk = b.fk and not exists (select * from tableB b2 where b2.fk = b.fk and b2.date > b.date) left join TableC as c on b.pk = c.fk Of course, this assumes that the combination of fk and date in tableB is unique, otherwise you may still get multiple rows, but from your description this sounds like a safe assumption to make
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Oleg avatar image
Oleg answered
William's answer is clearly the best way to accomplish the question objective, but just for the sake of examining all available options, you can consider another way of getting only one record per Key in TableA. This is not as elegant as the method with using the row_number() windowing funcition though: select a.KeyOfTableA, b.KeyOfTableB, b.CreatedDate, c.SomeColumn from TableA a outer apply ( select top 1 KeyOfTableB from TableB where KeyOfTableA = a.KeyOfTableA order by CreatedDate desc ) b left join TableC c on b.KeyOfTableB = c.KeyOfTableB; Due to the usage of the top with order by in the ITVF, it is very possible that this method emits a very strong and offensive odor as far as the query performance is concerned, but the reality is that there is no single best way to write T-SQL queries, and sometimes it does take to explore different possibilities and test, test, test in order to find the best one. Oleg
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

WilliamD avatar image WilliamD commented ·
+1 for the great explanation, +infinity for the strong and offensive odor! :D The windowing functions really are the best thing since the best thing since sliced bread.
0 Likes 0 ·
Håkan Winther avatar image
Håkan Winther answered
If you want, you can actually find and remove the duplicates in one single statement (and even store the result in an archive table: ;WITH cte AS( SELECT *, ROW_NUMBER() OVER(PARTITION BY col1, col2 ORDER BY colN) AS rowcnt FROM yourtable ) DELETE FROM cte /* -- remove this comment mark if you want to insert the duplicates in an archive table OUTPUT DELETED.col1, DELETED.col2, DELETED.colN INTO myArchive(col1,col2,colN ) */ WHERE rowcnt >1 [ http://ask.sqlservercentral.com/questions/48287/identify-and-remove-duplicates-help.html][1] [1]: http://ask.sqlservercentral.com/questions/48287/identify-and-remove-duplicates-help.html
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.