question

philnolan avatar image
philnolan asked

Query over status movement

I have a table that captures status movement against an entity. Like so... Id Date Status Sequence 1 2010-10-12 10:56:00.000 SomeStatusX 1 1 2010-10-12 17:08:00.000 SomeStatusB 2 1 2010-10-13 11:45:00.000 SomeStatusB 3 1 2010-10-13 12:09:00.000 SomeStatusD 4 1 2010-10-13 15:36:00.000 SomeStatusX 5 1 2010-10-14 07:33:00.000 SomeStatusX 6 1 2010-10-14 12:51:00.000 SomeStatusB 7 As you can see, the entity has can enter the same status more than once and can have multiple entries in the table for that same status (same episode). I want to identify each status movement noting the start and end date of that episode. Return this for the above data:- Id StartDate EndDate Status 1 2010-10-12 10:56:00.000 2010-10-12 17:08:00.000 SomeStatusX 1 2010-10-12 17:08:00.000 2010-10-13 12:09:00.000 SomeStatusB 1 2010-10-13 12:09:00.000 2010-10-13 15:36:00.000 SomeStatusD 1 2010-10-13 15:36:00.000 2010-10-14 12:51:00.000 SomeStatusX 1 2010-10-14 12:51:00.000 2010-10-14 12:51:00.000 SomeStatusB The query I have at the moment works but is proving slow. SELECT st.Id, Startdate = COALESCE(x.Date, st.Date), EndDate = st.Date, st.Status FROM StatusTable st CROSS APPY (SELECT TOP 1 crst2.Date FROM StatusTable crst INNER JOIN StatusTable crst2 ON crst2.Id = crst.Id AND crst2.Date > crst.Date AND crst2.Status = crst.Status WHERE crst.Id = st.Id AND crst.Date < crst.Date AND crst.Status <> st.Status ORDER BY crst.Date DESC, crst2.Date ASC ) x For each row I am basically finding the last row where status was different and from there finding the next row where the status is the same. Is there a better way to do this that would provide better performance?
t-sqlperformance
7 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

WilliamD avatar image WilliamD commented ·
@Fatherjack - the results your version delivers is different from the expected results in the original question. Philnolan doesn't just need the startdate of the next record, he needs the values consolidated to show contiguous data per status change. The results from @philnolan, @Matt and me are correct according to the test data and expected output
1 Like 1 ·
Scot Hauder avatar image Scot Hauder commented ·
What do you mean by slow? How long and how many rows in the movement table and what indexes do you currently have?
0 Likes 0 ·
ozamora avatar image ozamora commented ·
Might be several rows. The query is doing a cross apply and looping through a query as many records that need to be matched. Possibly a full table scan on each call.
0 Likes 0 ·
Fatherjack avatar image Fatherjack ♦♦ commented ·
@Phil, can you check your query please? I cant get it to return any rows from the data sample you provided ... :-/
0 Likes 0 ·
philnolan avatar image philnolan commented ·
I need to get it pre second if possible but at the moment it's over that. The table has been obtimised and is using index seeks - no scans. It's the TOP 1 sort that's hurting it.
0 Likes 0 ·
Show more comments
ozamora avatar image
ozamora answered
This might be a starting point Original: WITH CTE_seq AS SELECT id, date, status, ROW_NUMBER() OVER(PARTITION by id ORDER BY date) as sequence FROM dbo.table ) SELECT t1.id, t1.date AS StartDate, t2.date AS EndDate, t1.Status FROM CTE_seq t1 INNER JOIN CTE_seq t2 ON ( t1.id = t2.id AND t1.sequence + 1 = t2.sequence) ========================================================================================= Crispy (this approaches what you want, in a very fast way): -- I am borrowing @WilliamD insert statement CREATE TABLE dbo.sampletable ( ID INT , date DATETIME , status CHAR(11) , sequence INT ) go INSERT INTO dbo.sampletable VALUES ( 1, '2010-10-12 10:56:00.000', 'SomeStatusX', 1 ), ( 1, '2010-10-12 17:08:00.000', 'SomeStatusB', 2 ), ( 1, '2010-10-13 11:45:00.000', 'SomeStatusB', 3 ), ( 1, '2010-10-13 12:09:00.000', 'SomeStatusD', 4 ), ( 1, '2010-10-13 15:36:00.000', 'SomeStatusX', 5 ), ( 1, '2010-10-14 07:33:00.000', 'SomeStatusX', 6 ), ( 1, '2010-10-14 12:51:00.000', 'SomeStatusB', 7 ) go -- 2 CTEs make this work WITH CTE_seq AS ( SELECT id, date, status, ROW_NUMBER() OVER(PARTITION by id ORDER BY date) as sequence FROM dbo.sampletable ), CTE_final AS ( SELECT t1.id, t1.date AS StartDate, ISNULL(t2.date, t1.date) AS EndDate, t1.Status, CASE WHEN t1.Status = t2.status THEN 1 ELSE 0 END as is_same_status FROM CTE_seq t1 LEFT OUTER JOIN CTE_seq t2 ON ( t1.id = t2.id AND t1.sequence + 1 = t2.sequence) ) SELECT id, StartDate, EndDate, status FROM CTE_final WHERE is_same_status = 0; Result: ![Query Result][1] [1]: http://img685.imageshack.us/img685/7132/queryresult1.png
8 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

philnolan avatar image philnolan commented ·
@ozamora - thanks for your input and thanks for the stat info below. your query does not quite achieve the desired results as the table can have 2 or more repeating statuses in an episode. imagine the following... INSERT INTO dbo.sampletable VALUES ( 1, '2010-10-12 10:56:00.000', 'SomeStatusX', 1 ), ( 1, '2010-10-12 17:08:00.000', 'SomeStatusB', 2 ), ( 1, '2010-10-13 11:45:00.000', 'SomeStatusB', 3 ), ( 1, '2010-10-13 11:50:00.000', 'SomeStatusB', 4 ), ( 1, '2010-10-13 12:01:00.000', 'SomeStatusB', 5 ), ( 1, '2010-10-13 12:09:00.000', 'SomeStatusD', 6 ), ( 1, '2010-10-13 15:36:00.000', 'SomeStatusX', 7 ), ( 1, '2010-10-14 07:33:00.000', 'SomeStatusX', 8 ), ( 1, '2010-10-14 12:51:00.000', 'SomeStatusB', 9 ) go for the first occurence of 'SomeStatusB i'm looking for a start date of '2010-10-12 17:08:00.000' and an end date of '2010-10-13 12:09:00.000'.
1 Like 1 ·
Fatherjack avatar image Fatherjack ♦♦ commented ·
@ozamora - I am interested to know why you have created a CTE that replicates the table data. Can you explain your thinking please?
0 Likes 0 ·
ozamora avatar image ozamora commented ·
I use CTE to embed a sequencer. That way I can get a pointer that tells me the incremental dates for "id". You cannot evaluate ROW_NUMBER() outpusts within the query so a CTE works around the issue. Also, I am looking at the same table again, and performing a LEFT OUTER to match sequencers "t1.sequence + 1 = t2.sequence". OUTER because the last row wont have a match, hence the ISNULL evaluation "ISNULL(t2.date, t1.date) AS EndDate,". Hope this helps
0 Likes 0 ·
ozamora avatar image ozamora commented ·
2 logical reads: SQL Server parse and compile time: CPU time = 6 ms, elapsed time = 6 ms. (5 row(s) affected) Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'sampletable'. Scan count 2, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
0 Likes 0 ·
WilliamD avatar image WilliamD commented ·
@ozamora - what @Fatherjack was meaning was the column sequence already does the job of ROW_NUMBER(). This was not very clear in the original question (my first attempt at an answer did the same as you).
0 Likes 0 ·
Show more comments
Fatherjack avatar image
Fatherjack answered
I would simplly join the table to its self with the sequence shifted by one as: CREATE TABLE EntityMovement ( ID INT , MoveDate DATETIME , EntStatus CHAR(11) , sequence INT ) go INSERT INTO EntityMovement VALUES ( 1, '2010-10-12 10:56:00.000', 'SomeStatusX', 1 ), ( 1, '2010-10-12 17:08:00.000', 'SomeStatusB', 2 ), ( 1, '2010-10-13 11:45:00.000', 'SomeStatusB', 3 ), ( 1, '2010-10-13 12:09:00.000', 'SomeStatusD', 4 ), ( 1, '2010-10-13 15:36:00.000', 'SomeStatusX', 5 ), ( 1, '2010-10-14 07:33:00.000', 'SomeStatusX', 6 ), ( 1, '2010-10-14 12:51:00.000', 'SomeStatusB', 7 ) go SELECT [em].[ID] , [em].[MoveDate] AS [start_date] , [em2].[MoveDate] AS [end_date] , [em].[EntStatus] , [em].[sequence] FROM [dbo].[EntityMovement] AS em INNER JOIN [dbo].[EntityMovement] AS em2 ON [em].[ID] = [em2].[ID] AND [em2].[sequence] = [em].[sequence] + 1 If you want to see the current status of an entity, then change the Inner join to Left and they get included
2 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

philnolan avatar image philnolan commented ·
Sorry. There was a number of typos in the query (I typed it up last night from memory as was away from my dev box). Have fixed and placed over the table you created. SELECT st.Id,Startdate = COALESCE(x.MoveDate, st.MoveDate), EndDate = st.MoveDate, st.EntStatus FROM EntityMovement st OUTER APPLY -- changed form cross as always want a row ( SELECT TOP 1 st3.MoveDate FROM EntityMovement AS st2 INNER JOIN EntityMovement AS st3 ON st2.Id = st3.Id AND st3.MoveDate > st2.MoveDate AND st3.EntStatus = st.EntStatus WHERE st2.Id = st.Id AND st2.MoveDate < st.MoveDate AND st2.EntStatus <> st.EntStatus ORDER BY st2.MoveDate DESC, st3.MoveDate ASC) x ORDER BY 2 Note the start date of rows "SomeStatusB" and "SomeStatusX" is the same. I plan to aggregate over this by status and startdate, maxing over endate. This will give the single episodes of status change with a true start to end date. If you look at the plan created for this you get a sort operator (for me detailing 37%). When ran against a much larger dataset this really crunches the performance. Thanks for your time all...
0 Likes 0 ·
Fatherjack avatar image Fatherjack ♦♦ commented ·
yeah, I got that too but would have added ALTER TABLE [dbo].[EntityMovement] ADD CONSTRAINT ID_Seq PRIMARY KEY (ID,sequence) if it was my schema. I dont know what other indexes you have ...
0 Likes 0 ·
Matt Whitfield avatar image
Matt Whitfield answered
This seems to work - but it's difficult to know whether it would perform any better than what you are using. Some indexing would help. Using @Fatherjack's table def... WITH breaks AS (SELECT [em1].[sequence] AS seq1, [em1].[ID] AS seq1ID, [em2].[sequence] AS seq2, [em2].[ID] AS seq2ID, ROW_NUMBER() OVER ( ORDER BY [em1].[ID], [em1].[sequence]) AS __RN FROM [dbo].[EntityMovement] AS [em1] LEFT OUTER JOIN [dbo].[EntityMovement] AS [em2] ON [em2].[sequence] = [em1].[sequence] + 1 AND [em1].[ID] = [em2].[ID] WHERE [em1].[EntStatus] != ISNULL([em2].[EntStatus], 0)), islands AS (SELECT ISNULL([b1].[seq2], [b].[seq1]) AS island_start, ISNULL([b1].[seq2id], [b].[seq1id]) AS island_start_id, ISNULL([b].[seq2], [b].[seq1]) AS island_end, ISNULL([b].[seq2id], [b].[seq1id]) AS island_end_id FROM [breaks] AS [b] LEFT OUTER JOIN [breaks] AS [b1] ON [b].[__RN] = [b1].[__RN] + 1) SELECT [em].[ID], [em].[MoveDate] AS [StartDate], [em1].[MoveDate] AS [EndDate], [em].[EntStatus] AS [Status] FROM [islands] AS [i] INNER JOIN [dbo].[EntityMovement] AS [em] ON [em].[sequence] = [i].[island_start] AND [em].[ID] = [i].[island_start_id] INNER JOIN [dbo].[EntityMovement] AS [em1] ON [em1].[sequence] = [i].[island_end] AND [em1].[ID] = [i].[island_end_id];
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Matt Whitfield avatar image Matt Whitfield ♦♦ commented ·
It's also possible that, over a large amount of data, then materialising breaks and islands into temp tables would help - as this would stop them from being materialised more than once (gotta love CTEs) and you could also index them...
0 Likes 0 ·
philnolan avatar image
philnolan answered
I've changed tack and come up with a far simpler approach. Instead of trying to find the starting date of the episode (by finding the first time it entered that status) it works far better to focus on the end date of the episode as the end date of the episode is when the status has changed! My query now looks like this and is way quicker. SELECT st.Id , Startdate = st.MoveDate , EndDate = x.MoveDate , st.EntStatus FROM EntityMovement st OUTER APPLY ( SELECT TOP 1 st2.MoveDate FROM EntityMovement AS st2 WHERE st2.Id = st.Id AND st2.MoveDate > st.MoveDate AND st2.EntStatus <> st.EntStatus ORDER BY st2.MoveDate ASC) x ORDER BY 2 Thanks again for all your input :)
4 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

ozamora avatar image ozamora commented ·
While the method is effective, it is performing a nested loop against EntityMovement for each row that needs to be probed. 8 logical reads: SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 5 ms. (7 row(s) affected) Table 'EntityMovement'. Scan count 8, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
1 Like 1 ·
Matt Whitfield avatar image Matt Whitfield ♦♦ commented ·
What is the indexing like on this table? Because if you have an index on ID, sequence - and sequence is in the same order as date, then this query would eliminate a sort: SELECT st.Id, Startdate = st.MoveDate, EndDate = ( SELECT TOP 1 st2.MoveDate FROM EntityMovement AS st2 WHERE st2.Id = st.Id AND st2.MoveDate > st.MoveDate AND st2.EntStatus <> st.EntStatus ORDER BY st2.[sequence]) , st.EntStatus FROM EntityMovement st
0 Likes 0 ·
philnolan avatar image philnolan commented ·
Yup, with the correct indexing we can eliminate the sort - thanks. Is it preferrable to use a correlated subquery ahead of the OUTER APPLY?
0 Likes 0 ·
Matt Whitfield avatar image Matt Whitfield ♦♦ commented ·
@philnolan - I would be surprised if there was a huge difference, actually...
0 Likes 0 ·
WilliamD avatar image
WilliamD answered
Well, I have spent enough time on mine now to not want to throw it away! DECLARE @EntityMovement AS TABLE (Id int, [MoveDate] datetime, [EntStatus] varchar(20), Sequence int, PRIMARY KEY CLUSTERED (id, sequence)) INSERT INTO @EntityMovement (Id, [MoveDate], [EntStatus], Sequence) VALUES (1, '2010-10-12 10:56:00.000', 'SomeStatusX', 1), (1, '2010-10-12 17:08:00.000', 'SomeStatusB', 2), (1, '2010-10-13 11:45:00.000', 'SomeStatusB', 3), (1, '2010-10-13 12:09:00.000', 'SomeStatusD', 4), (1, '2010-10-13 15:36:00.000', 'SomeStatusX', 5), (1, '2010-10-14 07:33:00.000', 'SomeStatusX', 6), (1, '2010-10-14 12:51:00.000', 'SomeStatusB', 7) ; WITH Chronology AS (SELECT [em].[ID], [em].[MoveDate] AS [start_date], ISNULL([em2].[MoveDate], em.MoveDate) AS [end_date], [em].[EntStatus], [em].[sequence], StatusGrouping = em.Sequence - ROW_NUMBER() OVER (PARTITION BY em.id, em.EntStatus ORDER BY em.MoveDate) FROM @EntityMovement AS em LEFT JOIN @EntityMovement AS em2 ON [em].[ID] = [em2].[ID] AND [em2].[sequence] = [em].[sequence] + 1), StartsAndFinishes AS (SELECT *, finish = ROW_NUMBER() OVER (PARTITION BY id, StatusGrouping, EntStatus ORDER BY start_date DESC), start = ROW_NUMBER() OVER (PARTITION BY id, StatusGrouping, EntStatus ORDER BY start_date) FROM Chronology) SELECT start.id, start.start_date, finish.end_date, start.EntStatus FROM StartsAndFinishes start INNER JOIN StartsAndFinishes finish ON START.id = finish.id AND START.StatusGrouping = finish.StatusGrouping AND start.EntStatus = finish.EntStatus AND finish.finish = 1 WHERE START.START = 1 ; I nicked the Chronology CTE off Fatherjack - thanks :)
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Scot Hauder avatar image
Scot Hauder answered
I know you said you've changed tack, but this solves your original question CREATE TABLE #EntityMovement( [Id] [int] NULL, [MoveDate] [datetime] NULL, [EntStatus] [varchar](50) NULL, [Sequence] [int] NULL) CREATE CLUSTERED INDEX [IX_EntityMovement] ON #EntityMovement( [Id] ASC, [Sequence] ASC) CREATE NONCLUSTERED INDEX [IX_EntityMovement_1] ON #EntityMovement( [Id] ASC, [EntStatus] ASC, [Sequence] DESC) INCLUDE ( [MoveDate]) INSERT #EntityMovement (Id, MoveDate, EntStatus, Sequence) VALUES (1, '2010-10-12 10:56:00.000','SomeStatusX',1),(1, '2010-10-12 17:08:00.000','SomeStatusB',2), (1, '2010-10-13 11:45:00.000','SomeStatusB',3),(1, '2010-10-13 12:09:00.000','SomeStatusD',4), (1, '2010-10-13 15:36:00.000','SomeStatusX',5),(1, '2010-10-14 07:33:00.000','SomeStatusX',6), (1, '2010-10-14 12:51:00.000','SomeStatusB',7),(2, '2010-10-15 02:45:00.000','SomeStatusD',1), (2, '2010-10-15 03:45:00.000','SomeStatusD',2),(2, '2010-10-15 04:45:00.000','SomeStatusD',3), (2, '2010-10-15 05:45:00.000','SomeStatusD',4),(2, '2010-10-15 06:45:00.000','SomeStatusB',5), (2, '2010-10-15 07:45:00.000','SomeStatusB',6),(2, '2010-10-15 08:45:00.000','SomeStatusC',7), (2, '2010-10-15 09:45:00.000','SomeStatusC',8),(2, '2010-10-15 10:45:00.000','SomeStatusD',9), (3, '2010-10-16 01:45:00.000','SomeStatusD',1),(4,'2010-10-17 03:45:00.000','SomeStatusA',1), (4, '2010-10-17 04:45:00.000','SomeStatusA',2) SELECT Id, MIN([E1MoveDate])[StartDate], ISNULL(MAX([E2MoveDate]), MIN([E1MoveDate]))[EndDate], E1EntStatus [Status] FROM( SELECT e1.Id, e1.[MoveDate][E1MoveDate], e1.[EntStatus][E1EntStatus], e2.[MoveDATE][E2MoveDate], e1.Sequence + (ROW_NUMBER() OVER(PARTITION BY e1.ID, e1.EntStatus ORDER BY e1.Sequence DESC))[GroupPartition] FROM #EntityMovement e1 LEFT JOIN #EntityMovement e2 ON ( e1.Id = e2.Id AND e1.Sequence = e2.Sequence - 1) )d GROUP BY Id, E1EntStatus,GroupPartition ORDER BY Id, StartDate DROP TABLE #EntityMovement /* Table 'EntityMovement'. Scan count 2, logical reads 20, (41 with indices which you will physical reads 0, want if you have a lot of rows, read-ahead reads 0, if you only have several rows lob logical reads 0, you may get better performance lob physical reads 0, without them lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms. */
3 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

philnolan avatar image philnolan commented ·
+1 for continued support :)
0 Likes 0 ·
Scot Hauder avatar image Scot Hauder commented ·
The solution is deceptively simple, I say that because finding a simple solution to this type of problem has eluded me for months. I know how to do it with recursion but that proves to be slow. I was always trying to find a convoluted way to create a single column that could serve as a true group partition. Alas, it seems that all along what I really needed was 2 columns, in your case EntStatus *and* my GroupPartition
0 Likes 0 ·
WilliamD avatar image WilliamD commented ·
+1 - Scot, I can understand what you mean with the deceptively simple. If this is what eluded you a few months ago, then I am at about that point myself. I realise now, I have almost the same solution as you. I missed the relatively simple MIN(), MAX() part. I was stuck in ROW_NUMBER() mode on this one. That would have reduced the complexity of mine to be identical to yours!
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.