question

RSuabshini asked

SQL Server: Replace consecutive identical row occurrences with a single row based on Id column

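Hello Everyone,

I am trying to reduce the **consecutive** identical rows within the same Id to one single row. I tried de-duplication, but that also collapses all non-consecutive occurrences within the same Id into one single row. The order of the messages is important as well. The input and the desired output are shown below. Is there any way to achieve this result?

Thanks

Input data

| Id | Result | Message |
|----|--------|---------|
| 1 | 0 | a |
| 1 | 0 | p |
| 1 | 0 | p |
| 1 | 0 | p |
| 1 | 0 | d |
| 1 | 0 | p |
| 1 | 0 | p |
| 1 | 0 | f |
| 1 | 0 | p |
| 2 | 1 | a |
| 2 | 1 | a |
| 2 | 1 | a |
| 2 | 1 | f |
| 2 | 1 | h |
| 2 | 1 | b |
| 2 | 1 | b |
| 3 | 0 | d |
| 3 | 0 | d |
| 3 | 0 | d |
| 3 | 0 | c |
| 3 | 0 | c |

Desired output

| Id | Result | Message |
|----|--------|---------|
| 1 | 0 | a |
| 1 | 0 | p |
| 1 | 0 | d |
| 1 | 0 | p |
| 1 | 0 | f |
| 1 | 0 | p |
| 2 | 1 | a |
| 2 | 1 | f |
| 2 | 1 | h |
| 2 | 1 | b |
| 3 | 0 | d |
| 3 | 0 | c |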
sql
6 comments

Oleg commented ·
@RSuabshini Could you please re-post the image? There is no way to view it now. Alternatively, you can post the sample input and desired output as text.
Oleg commented ·
@RSuabshini You mentioned that the order of the message is important. What drives this order? From the sample data in question, there is no way to figure out why the row with id = 1 and message = p should be before the row with id = 1 and message = d. Also, do you want to write the script to physically delete the duplicates or just write a select statement which outputs distinct rows?
RSuabshini commented ·
Every letter in the Message column has its own internal meaning. In some places 'p' comes before 'd', and in others 'p' comes after 'd'. This is how I get the source data, so retaining this order is necessary. (I would welcome suggestions, such as workarounds to structure the table properly, if it cannot work the way it is now.)

Either deletion or selection is OK. In the case of deletion, for **consecutive** identical rows, all rows except the first can be deleted. In the case of selection, for **consecutive** identical rows, only the first row should be selected.
Oleg commented ·
@RSuabshini There has to be a way to identify the sort order. What I mean is this: suppose you execute the select statement `select * from YourTable`. The output is all rows. They are displayed in a specific order, i.e. the row with message **a** followed by 3 rows with message **p**, followed by a row with message **d**, followed by 2 rows with message **p**, etc. There is something driving this specific order. It could be an identity column or a datetime column, or whatever else, but there has to be something driving the rows to appear in their order. In this case, it is possible to delete consecutive duplicates. In the example above the final output would be a, p, d, p, etc.

If the table has only 3 columns without any specification about the order, then it is impossible to tell which order is correct. For example, suppose you execute the select statement `select * from YourTable order by id, [message]`. The output is now different.

Please let me know what drives the order of rows in the table and I can come up with the solution. It would be nice to know the version of SQL Server too, because if it is SQL Server 2012 or newer then the solution is very easy.
RSuabshini commented ·
Thank you so much for your explanation. Now I understand your question. The ordering is by the **id** column. I am using SQL Server 2012.
Oleg commented ·
@RSuabshini No, this is not enough. There are multiple rows with the same id, and within them the data is in some specific order, which is what drives your decision to remove the consecutive duplicates. What makes the rows appear in this specific order?

If the table does not have any other columns besides the 3 columns in question, then it probably does not have a clustered index. In this case, the order of the rows might be the order in which they were inserted. Unfortunately, there is no way (in this case) to guarantee that every select (without order by) will output the rows exactly in the order they were inserted into the table. Without this guarantee, any solution removing the duplicates might work, but it would be unreliable. The only way to make it reliable is for the table to have an ever-increasing, never-changing unique key (ideally the clustered index) that the query can sort by. Then the order of the rows is guaranteed to be consistent.

I will add the solution now, but please let me know how the table is structured so that the solution can be modified accordingly.
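As a rough sketch of what such a structure could look like: the identity column `RowSeq`, the constraint name, and the column types below are all assumptions made for illustration, since the actual table definition was not posted.

```sql
-- Sketch only: RowSeq, PK_YourTable, and the column types are assumptions;
-- the original table definition was not posted.
create table dbo.YourTable
(
    RowSeq    int identity(1, 1) not null,  -- ever-increasing, never updated
    id        int     not null,
    result    int     not null,
    [message] char(1) not null,
    constraint PK_YourTable primary key clustered (RowSeq)
);

-- With such a column, the row order is stated explicitly instead of assumed:
select id, result, [message]
from dbo.YourTable
order by RowSeq;
```

When rows are inserted one statement at a time, the identity values increase with each insert, so sorting by `RowSeq` reproduces the order of insertion.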

1 Answer

Oleg answered
From the question, it appears that the table does not have a clustered index. In this case, there is no guarantee that a select statement will reliably output the rows in the ***order in which they were inserted***. Without such a guarantee, the solution below is unreliable.

Still, the idea of the solution is as follows. Because it is essential to retain the order in which the rows were inserted in order to decide which ones are consecutive duplicates, sorting by any of the existing columns inside a windowing function would break that order and should therefore be avoided. In pre-SQL Server 2012 days, one way to compare the current row with the previous row from the same set was a self-join on the off-by-one basis. In SQL Server 2012 it can be done via the [Lead][1] and [Lag][2] analytical functions.

When reading the values from the "previous row", it is possible to compare the column values: if both id and message match, the row is a consecutive duplicate and needs to be excluded. The very first row does not have a "previous row" and should be spared. Every other row has a "previous row" counterpart, so the column values can be compared. Again, the key here is the order of the rows, and without a reliable way to determine this order the solution cannot guarantee correct results.

Because there is no information about the order of the rows in question, I use `order by (select null)` as an attempt to preserve the order in which the rows were inserted. Here is the solution which should produce the desired outcome:

```sql
;with records as (
    select
        *,
        -- values from the "previous row"; the order is NOT guaranteed here
        lag(id) over (order by (select null)) PrevID,
        lag([message]) over (order by (select null)) PrevMessage
    from YourTable
)
select id, result, [message]
from records
where not (isnull(PrevID, 0) = id and isnull(PrevMessage, '') = [message]);
```

Using the sample data in the question as an example, this query produces the desired output.
The final select can be restated as a delete statement (just remove the **NOT** part and replace the line reading **select id, result, [message]** with the line reading **delete**).

**Edit: added the script which works in earlier versions of SQL Server, 2005+.**

SQL Server versions earlier than 2012 do not have the lead and lag functions, but it is possible to mimic this functionality by joining the table with itself on the off-by-one basis. I cannot stress enough that with the sample data in question, the solution is unreliable because the table does not appear to have a proper clustered index (narrow, ever-increasing, never-changing). It was mentioned in one of the comments by the OP that "the ordering is by **id** column", but such ordering is not sufficient because there are multiple rows for the same id. This raises the question of how the data is sorted within the rows with the same id. Without a reliable method to sort the data, the solution below is bound to be unreliable, much like the original solution which works in SQL Server 2012+. Nevertheless, here is the script which should produce the desired output, without any guarantee whatsoever:

```sql
-- SQL Server 2005, 2008, and 2008 R2 do not have lead and lag, but
-- they have windowing functions which can be utilized in order to mimic the
-- lag functionality by joining the table with itself on the off-by-one basis
;with records as (
    select
        *,
        row_number() over (order by (select null)) RecordID
    from YourTable
)
select a.id, a.result, a.[message]
from records a
    left join records b
        on a.RecordID = b.RecordID + 1  -- b is the row just before a
where not (isnull(b.id, 0) = a.id and isnull(b.[message], '') = a.[message]);
go
```

If these scripts work now but stop working in the future (as the table gets more and more rows inserted), then a caveman attempt to force them to continue to work would be to add the maxdop 1 option (the line reading ***option (maxdop 1)***) to the end of the script.
The main reason for the solution to stop working is a high enough number of rows in the table, which will force parallelism to kick in at some point. Once that happens, the rows can come back in a different order, rendering the solution useless.

Hope this helps.

Oleg

[1]: https://docs.microsoft.com/en-us/sql/t-sql/functions/lead-transact-sql
[2]: https://docs.microsoft.com/en-us/sql/t-sql/functions/lag-transact-sql
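For comparison, here is a minimal sketch of how the same idea could look if the table had a column recording insertion order. The column `RowSeq` is hypothetical (it is not part of the table in the question), and the table variable simply holds the sample data from the question with `RowSeq` values assigned by hand. Because the rows are grouped by id, the sketch partitions `lag` by id and orders it by `RowSeq`, rather than comparing both id and message as the scripts above do.

```sql
-- Sketch only: RowSeq is a hypothetical ordering column, not part of the
-- table in the question. Requires SQL Server 2012 or newer (lag).
declare @YourTable table
(
    RowSeq    int     not null primary key,  -- assumed insertion order
    id        int     not null,
    result    int     not null,
    [message] char(1) not null
);

insert into @YourTable (RowSeq, id, result, [message])
values
    ( 1, 1, 0, 'a'), ( 2, 1, 0, 'p'), ( 3, 1, 0, 'p'), ( 4, 1, 0, 'p'),
    ( 5, 1, 0, 'd'), ( 6, 1, 0, 'p'), ( 7, 1, 0, 'p'), ( 8, 1, 0, 'f'),
    ( 9, 1, 0, 'p'),
    (10, 2, 1, 'a'), (11, 2, 1, 'a'), (12, 2, 1, 'a'), (13, 2, 1, 'f'),
    (14, 2, 1, 'h'), (15, 2, 1, 'b'), (16, 2, 1, 'b'),
    (17, 3, 0, 'd'), (18, 3, 0, 'd'), (19, 3, 0, 'd'), (20, 3, 0, 'c'),
    (21, 3, 0, 'c');

;with records as (
    select
        RowSeq, id, result, [message],
        -- previous message within the same id, in RowSeq order
        lag([message]) over (partition by id order by RowSeq) as PrevMessage
    from @YourTable
)
select id, result, [message]
from records
where PrevMessage is null        -- first row of each id
   or PrevMessage <> [message]   -- message differs from the previous row
order by RowSeq;
```

With the sample data, this returns the 12 rows listed as the desired output. Since the ordering is explicit, the result does not depend on physical row order or on parallelism. The same CTE should also be able to drive a delete, by replacing the final select with `delete from records where PrevMessage = [message];`.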
4 comments

RSuabshini commented ·
Thanks a lot for the clear explanation. This method worked out.
RSuabshini commented ·
@Oleg Can you tell me how the same can be done using the self-join method in SQL Server 2008?
Oleg commented ·
@RSuabshini I added the script which should work in SQL Server 2008 and even 2005. It produces the same results, but uses the off-by-one join in place of LAG, which is not available in these versions.
RSuabshini commented ·
@Oleg Thank you so much once again for your explanation. It was very useful.