x
login about faq Site discussion (meta-askssc)

Query to concatenate and remove common prefix

A bit of history to this : I posted this question elsewhere about 12 months ago, and got some very interesting and useful answers, but now I want to see what the wonderful community here can come up with.

--- In SQL 2005, I have some data

id    ref
==   ==========
1    3536757616
1    3536757617
1    3536757618

and want to get the result

1    3536757616/7/8

so essentially the data is aggregated on id, with the refs concatenated together, separated by a slash '/', but with any common prefix removed so if the data was like

id    ref
==   ==========
2    3536757628
2    3536757629
2    3536757630

I would want to get the result

2    3536757629/28/30

I know I can simply concatenate the refs by using

SELECT distinct
    id,
    stuff ( ( SELECT
                  '/ ' + ref 
              FROM
                  tableA tableA_1
              where tableA_1.id = tableA_2.id
    FOR XML PATH ( '' ) ) , 1 , 2 , '' )
from TableA tableA_2

to give

1   3536757616/ 3536757617/ 3536757618
2   3536757628/ 3536757629/ 3536757630

but it's the bit that removes the common element that I'm after.....

expected result

1   3536757616/7/8
2   3536757628/29/30

--- Code for test data :

create table tableA (id int, ref varchar(50))

insert into tableA
      select 1, 3536757616
union select 1, 3536757617
union select 1, 3536757618

union select 2, 3536757628
union select 2, 3536757629
union select 2, 3536757630

union select 3, 3536754000
union select 3, 3536755000
union select 3, 3536756000
union select 3, 3536757000

union select 4, 1999999999
union select 4, 1199999999
union select 4, 1119999999
union select 4, 1111999999

expected results

1   3536757616/7/8
2   3536757628/29/30
3   3536754000/5000/6000/7000
4   1111999999/119999999/199999999/999999999
more ▼

asked Sep 23 '10 at 03:46 AM in Default

Kev Riley gravatar image

Kev Riley ♦♦
46k 38 43 69

what should be the expected result for 1 ? is it 3536757616/7/8 or 3536757616/17/18?

Sep 23 '10 at 04:06 AM Cyborg

yeah, kind of with @Oleg here - is it always the last character that you want to consider for concatenation of is it possible to have a root of less than 9 characters

Sep 23 '10 at 04:34 AM Fatherjack ♦♦

the result for 1 is 3536757616/7/8 as the root is common all the way up to the '1'. Yes the root can be less than 9 chars, in fact at least it has to be 1 char

Sep 23 '10 at 04:39 AM Kev Riley ♦♦

... but match it as far as possible. meaning that an unbroken series from 353675616 to 353675639 would be returned in 3 groups, starting as 35367561x, 35367562x and 35367563x respectively .. ? if they had the same ID

Sep 23 '10 at 04:42 AM Fatherjack ♦♦

no, I want one row per id, so an unbroken series would have a common root of 3536756, and would go 353675616/17/18..../38/39. Like in the example for id 2, it crosses over from 29 to 30....

Sep 23 '10 at 04:55 AM Kev Riley ♦♦
(comments are locked)
10|1200 characters needed characters left

5 answers: sort voted first

This is a bit clunky... :)

Edited to include reference to a static tally table...
Edit 2 -> to include ORDER BY clauses
Edit 3 -> now simplified quite a lot
Edit 4 -> simplified even more - now 33.6 seconds on 8192 sets

DECLARE @tableA TABLE (
id  INT         ,
ref VARCHAR (50));

INSERT  INTO @tableA
    select 1, 3536757616
union select 1, 3536757617
union select 1, 3536757618

union select 2, 3536757628
union select 2, 3536757629
union select 2, 3536757630

union select 3, 3536754000
union select 3, 3536755000
union select 3, 3536756000
union select 3, 3536757000

union select 4, 1999999999
union select 4, 1199999999
union select 4, 1119999999
union SELECT 4, 1111999999
;

WITH   -- get the maximum length, min ref and max ref per group
maxlengths
AS     (SELECT   [id],
                 MAX(LEN(ref)) AS maxreflen,
                 MIN([ref]) AS firstref,
                 MAX([ref]) AS lastref
        FROM     @tableA AS [ta]
        GROUP BY [id]),
       -- get the maximums and find all matching roots for first and last ref
       roots
AS     (SELECT [m].[id],
               [m].[firstref],
               [N]
        FROM   [maxlengths] [m]
               INNER JOIN [dbo].[tally] ON [dbo].[tally].[n] < [m].[maxreflen]
        WHERE  LEFT([m].[firstref], [N]) = LEFT([m].[lastref], [N])),
       -- now get for each id the index and root text
       optimalCombinations
AS     (SELECT id,
               MAX([N]) + 1 AS ind,
               LEFT(MAX([firstref]), MAX([N])) AS root
        FROM   roots
        GROUP BY id)
        -- and select out
SELECT id,
       [mcl].[root] + stuff((SELECT '/' + SUBSTRING([ref], mcl.ind, 1000)
                         FROM   @tableA AS tableA_1                                
                         WHERE  tableA_1.id = mcl.id
                         ORDER BY [tableA_1].[ref]
                         FOR    XML PATH ('')), 1, 1, '')
FROM   optimalCombinations AS mcl
       ORDER BY [mcl].[id];
more ▼

answered Sep 23 '10 at 06:24 AM

Matt Whitfield gravatar image

Matt Whitfield ♦♦
29.2k 56 63 87

@Matt - Wow! I like it. There is an evil worktable on my machine for the ROW_NUMBER() for the cte AllLengthCounts - you use sys.allobjects and that is a killer (for me at least)

Sep 23 '10 at 06:42 AM WilliamD

Yeah I wasn't going for optimal, particularly, just flexible :) I think it would suck over a large amount of data...

Sep 23 '10 at 07:08 AM Matt Whitfield ♦♦

I didn't want to be so blunt :o)

I realised that my solution can be improved by dumping the result of string breaking into a "real" table as opposed to keeping it as a CTE.

Sep 23 '10 at 07:10 AM WilliamD

Hmm, actually, performance wise they both suck. Over 256 rows, mine ran in 9.884 seconds and yours ran in 38.781... Now that's poor performance! I'm sure we could squeeze them to be a lot faster if we put some thought in :)

Sep 23 '10 at 07:53 AM Matt Whitfield ♦♦

...although I had to make a tally table to get yours to run, so I changed mine to reference the tally table, and then it took 0.649 seconds.

Sep 23 '10 at 07:56 AM Matt Whitfield ♦♦
(comments are locked)
10|1200 characters needed characters left
-- Query
; with 
-- Input Data, include row_no for identifying the 1st (min) ref of the id
data as 
(
    select id, ref, row_no = row_number() over (partition by id order by ref)
    from   @tableA
),
-- cross join to tally table (numbers) to find possible the common length
cte as
(
    select id, num
    from   data a
       cross join numbers n
    where  n.num    between 1 and len(a.ref)
    group by id, num
    having min(left(ref, n.num)) = max(left(ref, n.num))
),
-- get the max common length
cte2 as
(
    select id, c = max(num)
    from   cte
    group by id
)
-- product the result using for xml to concatenate
select  id, ref = stuff(ref, 1, 1, '')
from    cte2 c
    cross apply
    (
       select    '/' + case when x.row_no = 1 then x.ref else stuff(x.ref, 1, c.c, '') end
       from  data x
       where x.id    = c.id
       for xml path('')
    ) r (ref)

/*
id          ref                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
----------- -----------------------------------------
1           3536757616/7/8
2           3536757628/29/30
3           3536754000/5000/6000/7000
4           1111999999/119999999/199999999/999999999

(4 row(s) affected)
*/
more ▼

answered Sep 24 '10 at 04:15 AM

Squirrel gravatar image

Squirrel
440 1 4

can you give the definition of the numbers table?

Sep 24 '10 at 06:10 AM Kev Riley ♦♦

numbers table is just a simple tally table.

create table numbers (num int, primary key (num))

Sep 24 '10 at 07:40 AM Squirrel
(comments are locked)
10|1200 characters needed characters left

As promised here's the answer I got originally, thanks to Quassnoi over at StackOverflow

Matt assures me that his solution is faster over a large enough data set, although I still marvel at the simplicity if this solution!

;WITH hier(cnt) AS
        (
        SELECT  1
        UNION ALL
        SELECT  cnt + 1
        FROM    hier
        WHERE   cnt <= 100
        )
SELECT  id,(
        SELECT  CASE WHEN ROW_NUMBER() OVER (ORDER BY a2.ref) = 1 THEN ref ELSE '/' + SUBSTRING(ref, mc + 1, LEN(ref)) END 
        FROM    tableA a2
        WHERE   a2.id = q2.id
        FOR XML PATH('')
        )
FROM    (
        SELECT  id, MIN(common) AS mc
        FROM    (
                SELECT  a.id,
                        (
                        SELECT  MAX(cnt)
                        FROM    hier
                        WHERE   SUBSTRING(i.initref, 1, cnt) = SUBSTRING(a.ref, 1, cnt)
                                AND cnt <= LEN(ref)
                        ) AS common
                FROM    (
                        SELECT  id, MIN(ref) AS initref
                        FROM    tableA
                        GROUP BY
                                id
                        ) i
                JOIN    tableA a
                ON      i.id = a.id
                ) q
        GROUP BY
                id
        ) q2
more ▼

answered Sep 30 '10 at 06:16 AM

Kev Riley gravatar image

Kev Riley ♦♦
46k 38 43 69

I was wondering about this since I suspected that someone would just use math and subtract the numbers and sequentially tack on the remainders. (The numbers may be in text format, but they're numbers nevertheless.) But no one tried that, although it I think it would have been easier that way.

Sep 30 '10 at 07:39 AM Mark

@Kev Riley +1 This is a very good solution, but why spoil it with such a strange way (100-deep recursion) it uses to generate 101 tally records? if the hier definition can be replaced with something else, like

;with hier (cnt) as
(
    select top 101 
        row_number() over (order by [object_id]) cnt
        from sys.objects -- or something
)

then the soluition will look even better.

Of course, with just 100-deep recursion into a single column set there is no big performance hit, but the plan for recursive cte is somewhat more complex :)

Sep 30 '10 at 08:03 AM Oleg
(comments are locked)
10|1200 characters needed characters left

I hope I understood you right, so here is my go at it. (v1.0 Lunchtime hack done in about 10 minutes and only tested on your example data).

Edit 14:37 CET - Fix added to comply with Kev's comment (extra CTE called MatchPos)

Edit 15:25 CET - Final fix so I really comply with Kev ;o)

Edit 15:39 CET - Kev really is playing with us! :o) Those repeating numbers made me reverse the position finding in the CTE LeftMostMatch, but it works now.

Edit 24-09-2010 09:26 - Made a further change to make sure that the results come out in the right order, even for permanent tables as opposed to table variables (order by added to the XML concatenation). The solution as a whole can be sped up by moving SplitRef into a permanent table (Is that allowed?).

DECLARE @TableA TABLE (id int, ref varchar(50))

INSERT  INTO @tableA
    select 1, 3536757616
union select 1, 3536757617
union select 1, 3536757618

union select 2, 3536757628
union select 2, 3536757629
union select 2, 3536757630

union select 3, 3536754000
union select 3, 3536755000
union select 3, 3536756000
union select 3, 3536757000

union select 4, 1999999999
union select 4, 1199999999
union select 4, 1119999999
union select 4, 1111999999
;
WITH    
        SplitRef
          /* Breaks the string down for comparisons across an id */ 
          AS (SELECT    id,
                        ref,
                        SUBSTRING(ref, n, 1) refChar,
                        N refPos
              FROM      @TableA A
              CROSS JOIN SqlAdministration.dbo.Tally T
              WHERE     t.N < LEN(ref) + 1)
              ,
        LeftMostMatch
          /* Finds the left-most position where the digits are still the same */ 
          AS (SELECT    baseref = MIN(cur.ref),
                        rest.ref,
                        MIN(rest.refpos) matchpos
              FROM      SplitRef cur
              LEFT JOIN SplitRef rest ON cur.id = rest.id
                                         AND cur.refPos = rest.refPos
                                         AND cur.ref < rest.ref
                                         AND cur.refChar <> rest.refChar
              WHERE     rest.ref IS NOT NULL
              GROUP BY  rest.ref)

              ,
        ExtendedMatch
        /* Poor hack to force the grouping - otherwise a grouping higher than 
           1 digit on the right won't work */
          AS (SELECT    baseref,
                        MIN(matchpos) matchpos
              FROM      LeftMostMatch
              GROUP BY  baseref),
        Concatenation
          /* Does the concatenation */ 
          AS (SELECT    id,
                        MIN(a.ref) + 
                        (SELECT '/' + SUBSTRING(hm.ref,em.matchpos,(LEN(hm.ref)))
                         FROM      LeftMostMatch hm
                         INNER JOIN ExtendedMatch em ON hm.baseref = em.baseref
                         WHERE     hm.baseref = MIN(a.ref)
                        FOR           XML PATH('')) Result
              FROM      @TableA a
              GROUP BY  id)
    SELECT  *
    FROM    Concatenation ;
more ▼

answered Sep 23 '10 at 05:07 AM

WilliamD gravatar image

WilliamD
25.3k 16 18 41

@WilliamD close, but not quite right on the concat for id=2, should be 3536757628/29/30, but you have 3536757628/9/30

Sep 23 '10 at 05:13 AM Kev Riley ♦♦

@WilliamD, sorry but now you have made id=1 wrong....should be 3536757616/7/8, but you have 3536757616/17/18. The common root can be of any length >1

Sep 23 '10 at 06:25 AM Kev Riley ♦♦

@Kev - I noticed that after posting. I got it fixed now (I hope!)

Sep 23 '10 at 06:27 AM WilliamD

@WilliamD, can I be picky? the order of the concatenated parts is slightly out, and for id=4 the result isn't correct

Sep 23 '10 at 07:51 AM Kev Riley ♦♦

@Kev - I beg to differ. I just ran it on my side and the version that I posted last of all spits out the same answer as you posted in the question. am i missing something here?

Sep 23 '10 at 07:56 AM WilliamD
(comments are locked)
10|1200 characters needed characters left


create table #temp ( id int, ref varchar(50) )

insert  into #temp
        select  1,
                3536757616
        union
        select  1,
                3536757617
        union
        select  1,
                3536757618
        union
        select  2,
                3536757628
        union
        select  2,
                3536757629
        union
        select  2,
                3536757630
                



SELECT  ID,
        REF + '/' + REPLACE(C3,
                            SUBSTRING(REF, 1,
                                      LEN(REf) - StrLen - CHARINDEX('0', REVERSE(C3))),
                            '')
FROM    ( SELECT    *,
                    ( SELECT TOP 1
                                LEN(CONVERT(VARCHAR, CONVERT(BIGINT, t2.Ref)
                                    - CONVERT(BIGINT, t1.ref))) t
                      FROM      #Temp t2
                      WHERE     t2.ID = t1.ID
                                AND t2.Ref > t1.ref
                      ORDER BY  LEN(CONVERT(VARCHAR, CONVERT(BIGINT, t2.Ref)
                                    - CONVERT(BIGINT, t1.ref))) DESC
                    ) StrLen,
                    STUFF(( SELECT  '/' + ref AS 'data()'
                            FROM    #Temp t2
                            WHERE   t2.ID = t1.ID
                                    AND t2.Ref > t1.ref
                          FOR
                            XML PATH('')
                          ), 1, 1, '') C3
          FROM      ( SELECT    ID,
                                MIN(Ref) Ref
                      FROM      #Temp
                      GROUP BY  ID
                    ) T1
        ) FT




more ▼

answered Sep 23 '10 at 05:38 AM

Cyborg gravatar image

Cyborg
10.1k 29 39 44

Cyborg, if the grouping goes round a 2 digit ending it breaks.

Try adding:

    SELECT  1,
            3536757609
    UNION
    SELECT  1,
            3536757610
                    UNION
    SELECT  1,
            3536757611
                    UNION
    SELECT  1,
            3536757612
                    UNION
    SELECT  1,
            3536757613
                    UNION
    SELECT  1,
            3536757614
                    UNION
    SELECT  1,
            3536757615
    UNION

to the test data, and you get an error with your solution.

Sep 23 '10 at 06:10 AM WilliamD
(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments



Facebook logo Follow Ask SSC on Facebook
Find Ask SSC on Google+
linkedin logo Find us on LinkedIn

Topics:

x1834
x912
x20

asked: Sep 23 '10 at 03:46 AM

Seen: 1361 times

Last Updated: Sep 23 '10 at 06:29 AM

Copyright © 2002-2012 Simple Talk Publishing. All Rights Reserved. If you have any queries, please contact the site administrators.
Ask SQL Server Central is a community service provided by Red Gate.