x

Recursive hierarchy?

I have a set of data which are directly or indirectly connected, example data set shown below in MT. id e, f, g, h, i, and j are connected through i since they are all in the subgroup 2 and/or 3; a, b, c, d and k are connected since they are in either subgroup 1 or/and 2. Because ids a to k are all directly or indirectly connected to each other, these need to be in one group A. Id i in subgroup 4 and 5, so i is connected to m and n in group 5, should be in a new group B, and so on. Recursive CTE might be a solution, but I couldn't get my head around it. I'd really appreciate if someone here can help me with this. Thank you.

 Drop table MT
 create table MT ( id nvarchar(10), subgroup int)
 insert into MT Values('a', 1)
 insert into MT Values('b', 1)
 insert into MT Values('c', 1)
 insert into MT Values('d', 1)
 insert into MT Values('e', 2)
 insert into MT Values('f', 2)
 insert into MT Values('g', 2)
 insert into MT Values('h', 3)
 insert into MT Values('i', 3)
 insert into MT Values('i', 2)
 insert into MT Values('j', 3)
 insert into MT Values('k', 1)
 insert into MT Values('k', 2)
 insert into MT Values('l', 4)
 insert into MT Values('l', 5)
 insert into MT Values('m', 5)
 insert into MT Values('n', 5)


 I need to have a result as below, grouping to a top level, the rule as I have explained above.
     id    subgroup    supergroup
     a    1    Group1
     b    1    Group1
     c    1    Group1
     d    1    Group1
     e    2    Group1
     f    2    Group1
     g    2    Group1
     h    3    Group1
     i    3    Group1
     i    2    Group1
     j    3    Group1
     k    1    Group1
     k    2    Group1
     l    4    Group2
     l    5    Group2
     m    5    Group2
     n    5    Group2
more ▼

asked Nov 12, 2017 at 09:53 PM in Default

avatar image

Jason111
11 1

(comments are locked)
10|1200 characters needed characters left

2 answers: sort voted first

I'm posting this as a separate answer as it's entirely unrelated.

There was a timely publication of an article by Itzik Ben-Gan (the man who excels at these kinds of puzzles) that presented a T-SQL solution to the 'transitive closure' problem - which is essentially what this problem here is. I urge you to read the article: http://www.itprotoday.com/microsoft-sql-server/t-sql-puzzle-challenge-grouping-connected-items

So now we can take Itzik's solution and apply it to your data.

The first thing we need to do is generate the data in a way that fits into the solution - so we need a table of edges. This is easily constructed as

 select MT1.id, a.id
 from MT MT1
 cross apply (select top 1 MT2.id from MT MT2 where MT1.id <> MT2.id 
             and MT1.subgroup = MT2.subgroup
             and Mt2.id > MT1.id) a

and gives us

 id         id
 ---------- ----------
 a          b
 b          c
 c          d
 d          k
 e          f
 f          g
 g          i
 h          i
 i          j
 i          k
 l          m
 m          n
 
 (12 rows affected)

which represents enough necessary edges that could be constructed between ids in the same subgroups, but without addition of noise that could be introduced by adding all possible edges - this is done by limiting the edges so that the first node is less than the second (Itzik talks about why that is important in the article).

We then need to tweak Itzik's code as he was using integer datatypes for his ids, and you have nvarchar(10), but we can still use the same ideas.

After that we have the output as

 id         grp
 ---------- ----------
 a          a
 b          a
 c          a
 d          a
 k          a
 e          a
 f          a
 g          a
 i          a
 h          a
 j          a
 l          l
 m          l
 n          l
 
 (14 rows affected)

Now we need to slightly amend the output to get your desired output - so use a partitioned row_number() function or dense_rank() function to get a numeric supergroup, and join back to the original table to get the subgroup detail

 SELECT MT.id, MT.subgroup, dense_rank()over(order by grp) as SuperGroup
 FROM #G
 join MT on #g.id = MT.id;

 id         subgroup    SuperGroup
 ---------- ----------- --------------------
 a          1           1
 b          1           1
 c          1           1
 d          1           1
 k          1           1
 k          2           1
 e          2           1
 f          2           1
 g          2           1
 i          3           1
 i          2           1
 h          3           1
 j          3           1
 l          4           2
 l          5           2
 m          5           2
 n          5           2
 
 (17 rows affected)


Hopefully this performs well enough on your large MT table


full code

 with undirected_edges(id1, id2) as
 (
 select MT1.id, a.id
 from MT MT1
 cross apply (select top 1 MT2.id from MT MT2 where MT1.id <> MT2.id 
             and MT1.subgroup = MT2.subgroup
             and Mt2.id > MT1.id) a
 )
  
 SELECT id1, id2
 INTO #T1
 from undirected_edges;
  
 CREATE UNIQUE CLUSTERED INDEX idx_id1_id2 ON #T1(id1, id2);
 CREATE UNIQUE NONCLUSTERED INDEX idx_id2_id1 ON #T1(id2, id1);
  
 CREATE TABLE #G
 (
   id nvarchar(10) NOT NULL,
   grp nvarchar(10) NOT NULL,
   lvl INT NOT NULL,
   PRIMARY KEY NONCLUSTERED (id),
   UNIQUE CLUSTERED(lvl, id)
 );
  
 DECLARE @lvl AS INT = 1, @added AS INT, @id1 AS nvarchar(10), @id2 AS nvarchar(10);
 DECLARE @CurIds AS TABLE(id nvarchar(10) NOT NULL);
  
 SELECT TOP (1) @id1 = id1, @id2 = id2
 FROM #T1
 ORDER BY id1, id2;
  
 SET @added = @@ROWCOUNT;
  
 WHILE @added > 0
 BEGIN
   INSERT INTO #G(id, grp, lvl) VALUES
     (@id1, @id1, @lvl),
     (@id2, @id1, @lvl);
  
   DELETE FROM #T1 WHERE id1 = @id1 AND id2 = @id2;
  
   WHILE @added > 0
   BEGIN
     SET @lvl += 1;
  
     DELETE FROM @CurIds;
  
     DELETE FROM T1
       OUTPUT deleted.id2 AS id INTO @CurIds(id)
     FROM #G AS G
       INNER JOIN #T1 AS T1
         ON G.id = T1.id1
     WHERE lvl = @lvl - 1;
      
     INSERT INTO #G(id, grp, lvl)
       SELECT DISTINCT id, @id1 AS grp, @lvl AS lvl
       FROM @CurIds AS C
       WHERE NOT EXISTS
           ( SELECT * FROM #G AS G
             WHERE G.id = C.id );
  
     SET @added = @@ROWCOUNT;
  
     DELETE FROM @CurIds;
  
     DELETE FROM T1
       OUTPUT deleted.id1 AS id INTO @CurIds(id)
     FROM #G AS G
       INNER JOIN #T1 AS T1
         ON G.id = T1.id2
     WHERE lvl = @lvl - 1;           
    
     INSERT INTO #G(id, grp, lvl)
       SELECT DISTINCT id, @id1 AS grp, @lvl AS lvl
       FROM @CurIds AS C
       WHERE NOT EXISTS
           ( SELECT * FROM #G AS G
             WHERE G.id = C.id );
  
     SET @added += @@ROWCOUNT;
  
   END;
  
   SELECT TOP (1) @id1 = id1, @id2 = id2
   FROM #T1
   ORDER BY id1, id2;
  
   SET @added = @@ROWCOUNT;
 END;
  
 SELECT id, grp
 FROM #G;
  
 SELECT MT.id, MT.subgroup, dense_rank()over(order by grp) as SuperGroup
 FROM #G
 join MT on #g.id = MT.id;
 
 DROP TABLE #G, #T1;
more ▼

answered Nov 28, 2017 at 01:47 PM

avatar image

Kev Riley ♦♦
66.8k 48 65 81

(comments are locked)
10|1200 characters needed characters left

This answer isn't my preferred way of doing this as I've had to resort to a loop, but it might help you, and I'll keep playing to see if there is a better option.

I had to do some pre-processing on the data, so I used CTEs to do this The first CTE links gets all the id-pairs that are related

  id         id
  ---------- ----------
  a          b
  a          c
  a          d
  a          k
  b          a
  b          c
  b          d
  ....etc

from those I can then work out which sub-groups are related, so for example the pair a,k tells me that sub-group 1 is related to sub-group 2, and I get those in CTE distinctlinks.

  subgroup    subgroup
  ----------- -----------
  1           2
  1           3
  2           1
  2           3
  3           1
  3           2
  4           5
  5           4
  

Then I materialize this data in an ordered fashion into #orderedlinks as I have to loop over this result set to determine the supergroups, which after running the while loop look like

  sg1         sg2         rn                   group
  ----------- ----------- -------------------- -----------
  1           2           1                    1
  1           3           2                    1
  2           1           3                    1
  2           3           4                    1
  3           1           5                    1
  3           2           6                    1
  4           5           7                    2
  5           4           8                    2

which is then simply joined back onto the original data to determine a supergroupid

  id         subgroup    SuperGroupID
  ---------- ----------- ------------
  a          1           1
  b          1           1
  c          1           1
  d          1           1
  e          2           1
  f          2           1
  g          2           1
  h          3           1
  i          3           1
  i          2           1
  j          3           1
  k          1           1
  k          2           1
  l          4           2
  l          5           2
  m          5           2
  n          5           2
  
  

The full code:

  if object_id('tempdb..#orderedlinks') is not null drop table #orderedlinks;
  
  with links(id1, id2) as (
  select MT1.id, a.id
  from MT MT1
  cross apply (select MT2.id, MT2.subgroup from MT MT2 
              where MT1.id <> MT2.id 
              and MT1.subgroup = MT2.subgroup
              ) a
  )
  , distinctlinks (sg1, sg2) as (
  select distinct
      MT1.subgroup,
      MT2.subgroup
  from links
  join MT MT1 on links.id1 = MT1.id
  join MT MT2 on links.id2 = MT2.id
  where 
      MT1.subgroup <> MT2.subgroup
  )
  
  select sg1, sg2, row_number()over(order by sg1, sg2) as rn, 0 as [group]
  into #orderedlinks
  from distinctlinks;
  
  
  while exists (select top 1 rn from #orderedlinks where [group] = 0)
   begin
       update o
           set [group] = 
               case 
              when (select min([group]) from #orderedlinks o2 where [group] > 0 and (o.sg1 = o2.sg1 or o.sg1 = o2.sg2)) is not null
              then (select min([group]) from #orderedlinks o2 where [group] > 0 and (o.sg1 = o2.sg1 or o.sg1 = o2.sg2))
                  else (select max([group]) from #orderedlinks)+1
               end
       from #orderedlinks o
       where rn in (select top 1 rn from #orderedlinks  where [group] = 0 order by rn);
   end;
  
   select
      MT.id,
      MT.subgroup,
      SuperGroups.SuperGroupID
   from MT
   cross apply (select min([group]) as SuperGroupID 
                        from #orderedlinks 
                        where MT.subgroup = #orderedlinks.sg1 or MT.subgroup = #orderedlinks.sg1) SuperGroups;
more ▼

answered Nov 13, 2017 at 02:46 PM

avatar image

Kev Riley ♦♦
66.8k 48 65 81

Thank you so much for the quick response. It is working for the sample data. My MT table is big, the #orderedlinks is more than 2 millions. I applied your code, it's been over 1 hour and it is still in the while begin... end.

Nov 13, 2017 at 05:08 PM Jason111

ouch.... yeah, row-by-row processing sucks! There may be some improvements that can be made to remove the links that don't cross sub-groups, but ultimately a set-based approach would be better. I'll keep trying....

Nov 13, 2017 at 07:47 PM Kev Riley ♦♦
(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

SQL Server Central

Need long-form SQL discussion? SQLserverCentral.com is the place.

Topics:

x258
x25

asked: Nov 12, 2017 at 09:53 PM

Seen: 77 times

Last Updated: Nov 28, 2017 at 01:54 PM

Copyright 2018 Redgate Software. Privacy Policy