question

Avi asked

finding ........................

Can anyone help me find the duplicates in every column of a table?

sql-server-2008 t-sql data-cleansing
2 comments

Can you provide a sample of your data?
Can you provide the name of the column you want to check the DUPE?
Fatherjack answered

This is a very vague question that could lead to a really complicated solution. If you simply want to locate duplicate values in each column, then you will need to use something like:

USE [adventureworks]
GO
CREATE TABLE Myduplicates
    (
      IDCol INT IDENTITY,
      ColA varchar(20),
      ColB VARCHAR(10),
      ColC int
    )
GO
INSERT  INTO [dbo].[Myduplicates] ( [ColA], [ColB], [ColC] )
        SELECT  'Larry', -- ColA - varchar(20)
                'Curly', -- ColB - varchar(10)
                10  -- ColC - int
        UNION
        SELECT  'Larry', -- ColA - varchar(20)
                'Moe', -- ColB - varchar(10)
                20  -- ColC - int
        UNION
        SELECT  'Curly', -- ColA - varchar(20)
                'Larry', -- ColB - varchar(10)
                30  -- ColC - int
        UNION
        SELECT  'Moe', -- ColA - varchar(20)
                'Curly', -- ColB - varchar(10)
                10  -- ColC - int
        UNION
        SELECT  'Zeppo', -- ColA - varchar(20)
                'Harpo', -- ColB - varchar(10)
                10  -- ColC - int
        UNION
        SELECT  'Chico', -- ColA - varchar(20)
                'Zeppo', -- ColB - varchar(10)
                30  -- ColC - int
        UNION
        SELECT  'Groucho', -- ColA - varchar(20)
                'Zeppo', -- ColB - varchar(10)
                20  -- ColC - int
GO

-- Values that occur more than once in ColA
SELECT  [m].[ColA],
        COUNT([m].[IDCol]) AS [duplicate count]
FROM    [dbo].[Myduplicates] AS m
GROUP BY [m].[ColA]
HAVING  COUNT([m].[IDCol]) > 1
ORDER BY [duplicate count] DESC ;

-- Values that occur more than once in ColB
SELECT  [m].[ColB],
        COUNT([m].[IDCol]) AS [duplicate count]
FROM    [dbo].[Myduplicates] AS m
GROUP BY [m].[ColB]
HAVING  COUNT([m].[IDCol]) > 1
ORDER BY [duplicate count] DESC ;

-- Values that occur more than once in ColC
SELECT  [m].[ColC],
        COUNT([m].[IDCol]) AS [duplicate count]
FROM    [dbo].[Myduplicates] AS m
GROUP BY [m].[ColC]
HAVING  COUNT([m].[IDCol]) > 1
ORDER BY [duplicate count] DESC ;

-- IDs of every row whose ColA value occurs more than once
SELECT  [m].[IDCol] AS [IDs that need review for ColA duplicates]
FROM    [dbo].[Myduplicates] AS m
        INNER JOIN ( SELECT [d].[ColA],
                            COUNT([d].[IDCol]) AS [duplicate count]
                     FROM   [dbo].[Myduplicates] AS d
                     GROUP BY [d].[ColA]
                     HAVING COUNT([d].[IDCol]) > 1
                   ) AS s1 ON [m].[ColA] = [s1].[ColA];

-- IDs of every row whose ColB value occurs more than once
SELECT  [m].[IDCol] AS [IDs that need review for ColB duplicates]
FROM    [dbo].[Myduplicates] AS m
        INNER JOIN ( SELECT [d].[ColB],
                            COUNT([d].[IDCol]) AS [duplicate count]
                     FROM   [dbo].[Myduplicates] AS d
                     GROUP BY [d].[ColB]
                     HAVING COUNT([d].[IDCol]) > 1
                   ) AS s1 ON [m].[ColB] = [s1].[ColB];

GO
DROP TABLE Myduplicates

Resolving the duplicates will be a whole new piece of work.
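
If by "duplicate" you mean a row that repeats across every data column (ignoring the identity), a minimal sketch would be to group by all of those columns at once. The temp table #RowDupes and its sample rows below are made up purely for illustration; the multi-row VALUES list assumes SQL Server 2008 or later.

```sql
-- Hypothetical sample table; names and data are for illustration only.
CREATE TABLE #RowDupes
    (
      IDCol INT IDENTITY,
      ColA VARCHAR(20),
      ColB VARCHAR(10),
      ColC INT
    );

INSERT  INTO #RowDupes ( ColA, ColB, ColC )
VALUES  ( 'Larry', 'Curly', 10 ),
        ( 'Larry', 'Curly', 10 ),  -- matches the row above in every data column
        ( 'Larry', 'Moe', 20 ),
        ( 'Moe', 'Curly', 10 );

-- One output row per combination of values that occurs more than once.
SELECT  ColA,
        ColB,
        ColC,
        COUNT(*) AS [duplicate count]
FROM    #RowDupes
GROUP BY ColA, ColB, ColC
HAVING  COUNT(*) > 1;

DROP TABLE #RowDupes;
```

With this sample data only the ('Larry', 'Curly', 10) combination comes back, with a duplicate count of 2.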

1 comment

+1 : Stellar effort, considering...
Grant Fritchey answered

There are a number of ways to solve this using T-SQL. The best ones these days seem to revolve around ROW_NUMBER(). The key is to understand the basic concept: you need a method to uniquely identify each row, then a way to mark duplicate values against that unique identifier, and then a mechanism to remove those duplicates. While this sounds like three steps, you should be able to do all of it in a single query.
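
As a rough sketch of how those three steps can collapse into one statement (an illustration only, not necessarily the exact query meant here): the temp table #Contacts, its columns, and the rule "keep the lowest Id" are all assumptions made up for the example.

```sql
-- Hypothetical sample table; names and data are for illustration only.
CREATE TABLE #Contacts
    (
      Id INT IDENTITY PRIMARY KEY,   -- step 1: something that uniquely identifies the row
      Email VARCHAR(50)
    );

INSERT  INTO #Contacts ( Email )
VALUES  ( 'a@example.com' ),
        ( 'b@example.com' ),
        ( 'a@example.com' ),
        ( 'a@example.com' );

WITH numbered AS
(
    SELECT  Id,
            Email,
            -- step 2: number the rows within each group of equal Email values;
            -- ordering by Id makes the lowest Id in each group number 1
            ROW_NUMBER() OVER ( PARTITION BY Email ORDER BY Id ) AS rn
    FROM    #Contacts
)
-- step 3: remove everything except the first row of each group
DELETE FROM numbered
WHERE   rn > 1;

SELECT  Id, Email
FROM    #Contacts
ORDER BY Id;

DROP TABLE #Contacts;
```

After the delete, the table holds Id 1 (a@example.com) and Id 2 (b@example.com). Changing the ORDER BY inside the OVER clause changes which row in each group survives.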

4 comments

If you have to get into making judgement calls, there's really no way to automate. Generally you have to define a mechanism for identifying what is a duplicate and then eliminate the extras. If that mechanism is "let me look at it" then...
Grant, have you got any examples of the Row_Number() option please? I started off thinking that way but then decided I would want all rows, to see which row I wanted to call the duplicate, i.e. rows where the Row_Number values are 1-n, not 2-n. That led me to the nested query solution.
This is from Simple-Talk...
WITH numbered AS
(
    SELECT  data,
            row_number() OVER ( PARTITION BY data ORDER BY data ) AS nr
    FROM    @duplicateTable4
)
DELETE FROM numbered
WHERE   nr > 1
Right, I see. I was thinking that the values in other columns might justify the row where nr=3 as the one to keep, so rows 1, 2 and 4 get deleted via the application... Thanks.
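
For that "show me everything and let me pick" case, one possible sketch is to keep the ROW_NUMBER() numbering but add a windowed COUNT(*) so that every row of a duplicated group is returned for review, including the one numbered 1. The temp table #Review and its columns are hypothetical.

```sql
-- Hypothetical sample table; names and data are for illustration only.
CREATE TABLE #Review
    (
      Id INT IDENTITY PRIMARY KEY,
      Data VARCHAR(20),
      Notes VARCHAR(20)
    );

INSERT  INTO #Review ( Data, Notes )
VALUES  ( 'Larry', 'first entry' ),
        ( 'Moe', 'only entry' ),
        ( 'Larry', 'second entry' ),
        ( 'Larry', 'best entry' );

WITH numbered AS
(
    SELECT  Id,
            Data,
            Notes,
            ROW_NUMBER() OVER ( PARTITION BY Data ORDER BY Id ) AS nr,
            COUNT(*) OVER ( PARTITION BY Data ) AS copies   -- size of the group
    FROM    #Review
)
-- Return rows 1-n of every group that has more than one member.
SELECT  Id, Data, Notes, nr, copies
FROM    numbered
WHERE   copies > 1
ORDER BY Data, nr;

DROP TABLE #Review;
```

Here the three 'Larry' rows are listed with nr 1 to 3 and copies = 3, while the single 'Moe' row is left out. Nothing is deleted, so a person or the application can decide which Id to keep.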
Oleg answered

I hope that I understand the question correctly: the task is to find the records that are duplicated across all columns in the table. I will also show how to quickly delete all such duplicates. Let's create a heap table and insert some records into it (including some duplicates):

create table #t (a int, b int);
go

insert into #t values (1, 1);
insert into #t values (1, 1);
insert into #t values (1, 1);
insert into #t values (1, 1);
insert into #t values (2, 5);
insert into #t values (2, 5);
insert into #t values (3, 1);
insert into #t values (4, 6);
insert into #t values (4, 6);
go

Now we have 4 occurrences of (1, 1) and 2 occurrences of (2, 5); (3, 1) does not have any duplicates, and we also have 2 occurrences of (4, 6). Here is the script to quickly identify all the duplicates:

-- partition by every column so that only rows matching in full are numbered together
select
        row_number() over (partition by a, b order by a) PartitionedNumber, *
        from #t;

Here is the result of the query above:

PartitionedNumber    a           b
-------------------- ----------- -----------
1                    1           1
2                    1           1
3                    1           1
4                    1           1
1                    2           5
2                    2           5
1                    3           1
1                    4           6
2                    4           6

Suppose we want to get rid of all dups while preserving one copy of each row. In other words, the end result is expected to have #t with one (1, 1) record, one (2, 5) record, one (3, 1) record, and one (4, 6) record. The statement to do this can look like this:

with records (PartitionedNumber, a, b) as
(
        select
                -- partition by every column so only full-row matches count as dups
                row_number() over (partition by a, b order by a) PartitionedNumber, *
                from #t
)
        delete from records where PartitionedNumber > 1;

The above deletes the extra copies, leaving exactly one record for each distinct combination of values.
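
If you would rather see what is about to go before committing to it, one option (a sketch only, using a made-up temp table #t2 so that it does not touch #t) is to run the delete inside a transaction and add an OUTPUT clause that echoes the removed rows:

```sql
-- Hypothetical sample table; names and data are for illustration only.
create table #t2 (a int, b int);

insert into #t2 values (1, 1), (1, 1), (2, 5);

begin transaction;

with records as
(
        select
                row_number() over (partition by a, b order by a) PartitionedNumber, a, b
                from #t2
)
        delete from records
        output deleted.a, deleted.b          -- echoes each row being removed
        where PartitionedNumber > 1;

select a, b from #t2;                        -- what would be left

rollback transaction;                        -- swap for commit once you are happy

drop table #t2;
```

With this sample the OUTPUT clause returns a single (1, 1) row, and the select shows one (1, 1) and one (2, 5) remaining until the rollback restores the original three rows.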
