question

Kev Riley avatar image
Kev Riley asked

Query to concatenate and remove common prefix

A bit of history to this : I posted this question elsewhere about 12 months ago, and got some very interesting and useful answers, but now I want to see what the wonderful community here can come up with. --- In SQL 2005, I have some data id ref == ========== 1 3536757616 1 3536757617 1 3536757618 and want to get the result 1 3536757616/7/8 so essentially the data is aggregated on id, with the refs concatenated together, separated by a slash '/', but with any common prefix removed so if the data was like id ref == ========== 2 3536757628 2 3536757629 2 3536757630 I would want to get the result 2 3536757629/28/30 I know I can simply concatenate the refs by using SELECT distinct id, stuff ( ( SELECT '/ ' + ref FROM tableA tableA_1 where tableA_1.id = tableA_2.id FOR XML PATH ( '' ) ) , 1 , 2 , '' ) from TableA tableA_2 to give 1 3536757616/ 3536757617/ 3536757618 2 3536757628/ 3536757629/ 3536757630 but it's the bit that removes the common element that I'm after..... expected result 1 3536757616/7/8 2 3536757628/29/30 --- Code for test data : create table tableA (id int, ref varchar(50)) insert into tableA select 1, 3536757616 union select 1, 3536757617 union select 1, 3536757618 union select 2, 3536757628 union select 2, 3536757629 union select 2, 3536757630 union select 3, 3536754000 union select 3, 3536755000 union select 3, 3536756000 union select 3, 3536757000 union select 4, 1999999999 union select 4, 1199999999 union select 4, 1119999999 union select 4, 1111999999 expected results 1 3536757616/7/8 2 3536757628/29/30 3 3536754000/5000/6000/7000 4 1111999999/119999999/199999999/999999999
sql-server-2005t-sqlconcatenation
8 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

what should be the expected result for 1 ? is it 3536757616/7/8 or 3536757616/17/18?
0 Likes 0 ·
yeah, kind of with @Oleg here - is it always the last character that you want to consider for concatenation of is it possible to have a root of less than 9 characters
0 Likes 0 ·
the result for 1 is 3536757616/7/8 as the root is common all the way up to the '1'. Yes the root can be less than 9 chars, in fact at least it has to be 1 char
0 Likes 0 ·
... but match it as far as possible. meaning that an unbroken series from 353675616 to 353675639 would be returned in 3 groups, starting as 35367561x, 35367562x and 35367563x respectively .. ? if they had the same ID
0 Likes 0 ·
no, I want one row per id, so an unbroken series would have a common root of 3536756, and would go 353675616/17/18..../38/39. Like in the example for id 2, it crosses over from 29 to 30....
0 Likes 0 ·
Show more comments
Matt Whitfield avatar image
Matt Whitfield answered
This is a bit clunky... :) Edited to include reference to a static tally table... Edit 2 -> to include ORDER BY clauses Edit 3 -> now simplified quite a lot Edit 4 -> simplified even more - now 33.6 seconds on 8192 sets DECLARE @tableA TABLE ( id INT , ref VARCHAR (50)); INSERT INTO @tableA select 1, 3536757616 union select 1, 3536757617 union select 1, 3536757618 union select 2, 3536757628 union select 2, 3536757629 union select 2, 3536757630 union select 3, 3536754000 union select 3, 3536755000 union select 3, 3536756000 union select 3, 3536757000 union select 4, 1999999999 union select 4, 1199999999 union select 4, 1119999999 union SELECT 4, 1111999999 ; WITH -- get the maximum length, min ref and max ref per group maxlengths AS (SELECT [id], MAX(LEN(ref)) AS maxreflen, MIN([ref]) AS firstref, MAX([ref]) AS lastref FROM @tableA AS [ta] GROUP BY [id]), -- get the maximums and find all matching roots for first and last ref roots AS (SELECT [m].[id], [m].[firstref], [N] FROM [maxlengths] [m] INNER JOIN [dbo].[tally] ON [dbo].[tally].[n] < [m].[maxreflen] WHERE LEFT([m].[firstref], [N]) = LEFT([m].[lastref], [N])), -- now get for each id the index and root text optimalCombinations AS (SELECT id, MAX([N]) + 1 AS ind, LEFT(MAX([firstref]), MAX([N])) AS root FROM roots GROUP BY id) -- and select out SELECT id, [mcl].[root] + stuff((SELECT '/' + SUBSTRING([ref], mcl.ind, 1000) FROM @tableA AS tableA_1 WHERE tableA_1.id = mcl.id ORDER BY [tableA_1].[ref] FOR XML PATH ('')), 1, 1, '') FROM optimalCombinations AS mcl ORDER BY [mcl].[id];
12 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

@Matt - Wow! I like it. There is an evil worktable on my machine for the ROW_NUMBER() for the cte AllLengthCounts - you use sys.allobjects and that is a killer (for me at least)
0 Likes 0 ·
Yeah I wasn't going for optimal, particularly, just flexible :) I think it would suck over a large amount of data...
0 Likes 0 ·
I didn't want to be so blunt :o) I realised that my solution can be improved by dumping the result of string breaking into a "real" table as opposed to keeping it as a CTE.
0 Likes 0 ·
Hmm, actually, performance wise they both suck. Over 256 rows, mine ran in 9.884 seconds and yours ran in 38.781... Now that's poor performance! I'm sure we could squeeze them to be a lot faster if we put some thought in :)
0 Likes 0 ·
...although I had to make a tally table to get yours to run, so I changed mine to reference the tally table, and then it took 0.649 seconds.
0 Likes 0 ·
Show more comments
WilliamD avatar image
WilliamD answered
I hope I understood you right, so here is my go at it. *(v1.0 Lunchtime hack done in about 10 minutes and only tested on your example data).* *Edit 14:37 CET - Fix added to comply with Kev's comment (extra CTE called MatchPos)* *Edit 15:25 CET - Final fix so I really comply with Kev* ;o) *Edit 15:39 CET - Kev really is playing with us! :o) Those repeating numbers made me reverse the position finding in the CTE LeftMostMatch, but it works now.* *Edit 24-09-2010 09:26 - Made a further change to make sure that the results come out in the right order, even for permanent tables as opposed to table variables (order by added to the XML concatenation). The solution as a whole can be sped up by moving SplitRef into a permanent table (**Is that allowed?**).* DECLARE @TableA TABLE (id int, ref varchar(50)) INSERT INTO @tableA select 1, 3536757616 union select 1, 3536757617 union select 1, 3536757618 union select 2, 3536757628 union select 2, 3536757629 union select 2, 3536757630 union select 3, 3536754000 union select 3, 3536755000 union select 3, 3536756000 union select 3, 3536757000 union select 4, 1999999999 union select 4, 1199999999 union select 4, 1119999999 union select 4, 1111999999 ; WITH SplitRef /* Breaks the string down for comparisons across an id */ AS (SELECT id, ref, SUBSTRING(ref, n, 1) refChar, N refPos FROM @TableA A CROSS JOIN SqlAdministration.dbo.Tally T WHERE t.N < LEN(ref) + 1) , LeftMostMatch /* Finds the left-most position where the digits are still the same */ AS (SELECT baseref = MIN(cur.ref), rest.ref, MIN(rest.refpos) matchpos FROM SplitRef cur LEFT JOIN SplitRef rest ON cur.id = rest.id AND cur.refPos = rest.refPos AND cur.ref < rest.ref AND cur.refChar <> rest.refChar WHERE rest.ref IS NOT NULL GROUP BY rest.ref) , ExtendedMatch /* Poor hack to force the grouping - otherwise a grouping higher than 1 digit on the right won't work */ AS (SELECT baseref, MIN(matchpos) matchpos FROM LeftMostMatch GROUP BY baseref), Concatenation /* Does the concatenation */ AS (SELECT id, MIN(a.ref) + (SELECT '/' + SUBSTRING(hm.ref,em.matchpos,(LEN(hm.ref))) FROM LeftMostMatch hm INNER JOIN ExtendedMatch em ON hm.baseref = em.baseref WHERE hm.baseref = MIN(a.ref) FOR XML PATH('')) Result FROM @TableA a GROUP BY id) SELECT * FROM Concatenation ;
13 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

@WilliamD close, but not quite right on the concat for id=2, should be 3536757628/29/30, but you have 3536757628/9/30
0 Likes 0 ·
@WilliamD, sorry but now you have made id=1 wrong....should be 3536757616/7/8, but you have 3536757616/17/18. The common root can be of any length >1
0 Likes 0 ·
@Kev - I noticed that after posting. I got it fixed now (I hope!)
0 Likes 0 ·
@WilliamD, can I be picky? the order of the concatenated parts is slightly out, and for id=4 the result isn't correct
0 Likes 0 ·
@Kev - I beg to differ. I just ran it on my side and the version that I posted last of all spits out the same answer as you posted in the question. am i missing something here?
0 Likes 0 ·
Show more comments
Cyborg avatar image
Cyborg answered


create table #temp ( id int, ref varchar(50) )

insert  into #temp
        select  1,
                3536757616
        union
        select  1,
                3536757617
        union
        select  1,
                3536757618
        union
        select  2,
                3536757628
        union
        select  2,
                3536757629
        union
        select  2,
                3536757630
                



SELECT  ID,
        REF + '/' + REPLACE(C3,
                            SUBSTRING(REF, 1,
                                      LEN(REf) - StrLen - CHARINDEX('0', REVERSE(C3))),
                            '')
FROM    ( SELECT    *,
                    ( SELECT TOP 1
                                LEN(CONVERT(VARCHAR, CONVERT(BIGINT, t2.Ref)
                                    - CONVERT(BIGINT, t1.ref))) t
                      FROM      #Temp t2
                      WHERE     t2.ID = t1.ID
                                AND t2.Ref > t1.ref
                      ORDER BY  LEN(CONVERT(VARCHAR, CONVERT(BIGINT, t2.Ref)
                                    - CONVERT(BIGINT, t1.ref))) DESC
                    ) StrLen,
                    STUFF(( SELECT  '/' + ref AS 'data()'
                            FROM    #Temp t2
                            WHERE   t2.ID = t1.ID
                                    AND t2.Ref > t1.ref
                          FOR
                            XML PATH('')
                          ), 1, 1, '') C3
          FROM      ( SELECT    ID,
                                MIN(Ref) Ref
                      FROM      #Temp
                      GROUP BY  ID
                    ) T1
        ) FT




1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Cyborg, if the grouping goes round a 2 digit ending it breaks. Try adding: SELECT 1, 3536757609 UNION SELECT 1, 3536757610 UNION SELECT 1, 3536757611 UNION SELECT 1, 3536757612 UNION SELECT 1, 3536757613 UNION SELECT 1, 3536757614 UNION SELECT 1, 3536757615 UNION to the test data, and you get an error with your solution.
0 Likes 0 ·
Squirrel avatar image
Squirrel answered
-- Query ; with -- Input Data, include row_no for identifying the 1st (min) ref of the id data as ( select id, ref, row_no = row_number() over (partition by id order by ref) from @tableA ), -- cross join to tally table (numbers) to find possible the common length cte as ( select id, num from data a cross join numbers n where n.num between 1 and len(a.ref) group by id, num having min(left(ref, n.num)) = max(left(ref, n.num)) ), -- get the max common length cte2 as ( select id, c = max(num) from cte group by id ) -- product the result using for xml to concatenate select id, ref = stuff(ref, 1, 1, '') from cte2 c cross apply ( select '/' + case when x.row_no = 1 then x.ref else stuff(x.ref, 1, c.c, '') end from data x where x.id = c.id for xml path('') ) r (ref) /* id ref ----------- ----------------------------------------- 1 3536757616/7/8 2 3536757628/29/30 3 3536754000/5000/6000/7000 4 1111999999/119999999/199999999/999999999 (4 row(s) affected) */
2 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

can you give the definition of the `numbers` table?
0 Likes 0 ·
numbers table is just a simple tally table. create table numbers (num int, primary key (num))
0 Likes 0 ·
Kev Riley avatar image
Kev Riley answered
As promised here's the answer I got originally, thanks to [Quassnoi over at StackOverflow][1] Matt assures me that his solution is faster over a large enough data set, although I still marvel at the simplicity if this solution! ;WITH hier(cnt) AS ( SELECT 1 UNION ALL SELECT cnt + 1 FROM hier WHERE cnt <= 100 ) SELECT id,( SELECT CASE WHEN ROW_NUMBER() OVER (ORDER BY a2.ref) = 1 THEN ref ELSE '/' + SUBSTRING(ref, mc + 1, LEN(ref)) END FROM tableA a2 WHERE a2.id = q2.id FOR XML PATH('') ) FROM ( SELECT id, MIN(common) AS mc FROM ( SELECT a.id, ( SELECT MAX(cnt) FROM hier WHERE SUBSTRING(i.initref, 1, cnt) = SUBSTRING(a.ref, 1, cnt) AND cnt <= LEN(ref) ) AS common FROM ( SELECT id, MIN(ref) AS initref FROM tableA GROUP BY id ) i JOIN tableA a ON i.id = a.id ) q GROUP BY id ) q2 [1]: http://stackoverflow.com/questions/944160/tsql-query-to-concatenate-and-remove-common-prefix/944276#944276
2 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

I was wondering about this since I suspected that someone would just use math and subtract the numbers and sequentially tack on the remainders. (The numbers may be in text format, but they're numbers nevertheless.) But no one tried that, although it I think it would have been easier that way.
0 Likes 0 ·
@Kev Riley +1 This is a very good solution, but why spoil it with such a strange way (100-deep recursion) it uses to generate 101 tally records? if the **hier** definition can be replaced with something else, like
;with hier (cnt) as
(
    select top 101 
        row_number() over (order by [object_id]) cnt
        from sys.objects -- or something
)
then the soluition will look even better. Of course, with just 100-deep recursion into a single column set there is no big performance hit, but the plan for recursive cte is somewhat more complex :)
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.