question

GPO avatar image
GPO asked

Split a column based on a pattern

I have a column of semi-structured text (datatype varchar(7000)). Within that text there are certain patterns and I want to be able to split based on the presence of the pattern for instance: Row 1 Age: 78 Height: 178cm Comments: Likes the following: Ice cream, pickles, artichokes. Row 2 Age: 12 E-mail: 123234345@mildew.com Comments: Visited the monkeys at the zoo this week. Row 3 Height: 173cm Weight: 85kg I want the data (initially at least) to be split into two columns [question] and [answer] which would look like so: row_id question answer 1 Age 78 1 Height 178cm 1 Comments Likes the following: Ice cream, pickles, artichokes. 2 Age 12 2 E-mail 123234345@mildew.com 2 Comments Visited the monkeys at the zoo this week. 3 Height 173cm 3 Weight 85kg So as I see it the delimiter also becomes a value in the question column. The pattern for the delimiter is: char(10) + [any string of characters excluding char(10)] + ':' + char(10) One thing I'm struggling with, is how to express [any string of characters excluding char(10)] in TSQL. Notes: CLR/regex are not an option. The same questions don't necessarily repeat in each row, and order of questions may not be predictable. If it works, it may be applied to millions of rows. By way of background, the data comes into the database via a "message" format that could easily have been converted into relatively normalised tables/columns. For whatever reason the vendor decided to mangle the incoming messages into a format suitable for presentation instead of storage/retrieval... and once this is done, the incoming message is destroyed. Arrgh!!!! So now, in order to get meaning out of the "data" (I use the word loosely) we have to try and reverse engineer it. And it takes up at least 100 times more storage than it otherwise would.
tsqlparsingstring-splitting
2 comments
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

I think you also mentioned on another question that CLR is not an option. I am curious as to why not.
0 Likes 0 ·
You have some answers below. This web site runs by you voting and marking the best answers. For each helpful answer below, indicate this by clicking on the thumbs up symbol next to those answers. If any one answer lead to a solution, indicate this by clicking on the check mark next to that answer. This helps indicate your solution for anyone else with the same question.
0 Likes 0 ·
sdoubleday avatar image
sdoubleday answered
I think recursion could be your answer (though I'd test on increasingly large numbers of rows before turning it loose on millions). Something like this: declare @message NVARCHAR(max) = 'age: 78 Height: 178cm Comments: blah:blah blah blah blah ' /*The trailing line feed is important, or you will need to find another way of finishing the recursion.*/ ;with mycte AS ( SELECT @message as mymessage ) ,myRecursion AS ( SELECT substring(mymessage,0,IndexOfNextLineFeed) AS Element ,substring(mymessage,IndexOfNextLineFeed+1,LEN(mymessage)) AS mymessage ,CAST('Key' AS VARCHAR(10)) As KeyOrValue FROM mycte CROSS APPLY ( SELECT CHARINDEX(CHAR(10),mymessage,0) AS IndexOfNextLineFeed )AS a WHERE mymessage NOT LIKE '' UNION ALL SELECT substring(mymessage,0,IndexOfNextLineFeed) AS Element ,substring(mymessage,IndexOfNextLineFeed+1,LEN(mymessage)) AS mymessage ,CAST(CASE WHEN KeyOrValue LIKE 'Key' THEN 'Value' ELSE 'Key' END AS VARCHAR(10)) As KeyOrValue FROM myRecursion CROSS APPLY ( SELECT CHARINDEX(CHAR(10),mymessage,0) AS IndexOfNextLineFeed )AS a WHERE mymessage NOT LIKE '' ) select * FROM myRecursion OPTION (MAXRECURSION 0)
4 comments
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Full text of your comment appears to be: Interesting (thank you!) but not quite what I'm after. This assumes that every char(10) is a delimiter. but it isn't. The delimiter has to be: char(10) + [any string of characters excluding char(10)] + ':' + char(10) Having said that I'm now experimenting with just using ':' + char(10) as a first pass delimiter. This puts the [question] text at the end of the previous [answer text] but that might not be an insurmountable problem because I think THAT can then be reliably split on the last char(10) of the previous split item. Early days yet. Stay tuned... Thanks again.
0 Likes 0 ·
@GPO - I suspect that it's more about the frequent use of the word "char" - I think that you triggered the spam filter because you had "char" in your comment 6 times...
0 Likes 0 ·
Interesting (thank you!) but not quite what I'm after. This assumes that every char(10) is a delimiter. but it isn't. The delimiter has to be: char(10) + [any string of characters excluding char(10)] + ':' + char(10) Having said that I'm now experimenting with just using ':' + char(10) as a first pass delimiter. This puts the [question] text at the end of the previous [answer text] but that might not be an insurmountable problem because I think THAT can then be reliably split on the last char(10) of the previous split item. Early days yet. Stay tuned... Thanks again.
0 Likes 0 ·
Thanks Thomas for restoring my attempts to post comments! I might now delete a few :-)
0 Likes 0 ·
sdoubleday avatar image
sdoubleday answered
Hi GPO, I see now that my first answer did not resolve the delimiter issue. Also, in my efforts, I could not get the ':' + char(10) delimiter to work without adding char(13) between them. I think that what we need is two delimiters: a row delimiter and a column delimiter, like so: declare @message NVARCHAR(max) = 'age: 78 Height: 178cm Comments: blah:blah blah blah blah ' DECLARE @RowDelim NVARCHAR(10) = '%'+ CHAR(10) + '%' declare @ColumnDelim NVARCHAR(10) = '%:'+ CHAR(13) + CHAR(10) + '%' /*The trailing line feed is important, or you will need to find another way of finishing the recursion.*/ ;with mycte AS ( SELECT @message as mymessage ) ,myRecursion AS ( SELECT myKey.KeyData ,myValue.ValueData ,substring(mymessage,rdlm.IndexOfNextDelimiter+rdlm.DelimiterLength,LEN(mymessage)) AS mymessage ,1 AS RowNumber FROM mycte CROSS APPLY ( SELECT PATINDEX(@ColumnDelim,mycte.mymessage) AS IndexOfNextDelimiter , LEN(@ColumnDelim) - 2 AS DelimiterLength )AS cdlm CROSS APPLY ( SELECT substring(mycte.mymessage,0,cdlm.IndexOfNextDelimiter) AS KeyData )AS myKey CROSS APPLY ( SELECT PATINDEX(@RowDelim, substring(mycte.mymessage,cdlm.IndexOfNextDelimiter+cdlm.DelimiterLength ,LEN(mycte.mymessage)) ) + LEN(KeyData) + cdlm.DelimiterLength AS IndexOfNextDelimiter ,PATINDEX(@RowDelim, substring(mycte.mymessage,cdlm.IndexOfNextDelimiter+cdlm.DelimiterLength ,LEN(mycte.mymessage)) ) AS EndOfValue , LEN(@RowDelim) - 2 AS DelimiterLength )AS rdlm CROSS APPLY ( SELECT substring(mycte.mymessage,cdlm.IndexOfNextDelimiter+cdlm.DelimiterLength ,rdlm.EndOfValue) AS ValueData )AS myValue WHERE mymessage NOT LIKE '' UNION ALL SELECT myKey.KeyData ,myValue.ValueData ,substring(mymessage,rdlm.IndexOfNextDelimiter+rdlm.DelimiterLength,LEN(mymessage)) AS mymessage ,RowNumber + 1 AS RowNumber FROM myRecursion CROSS APPLY ( SELECT PATINDEX(@ColumnDelim,myRecursion.mymessage) AS IndexOfNextDelimiter , LEN(@ColumnDelim) - 2 AS DelimiterLength )AS cdlm CROSS APPLY ( SELECT substring(myRecursion.mymessage,0,cdlm.IndexOfNextDelimiter) AS KeyData )AS myKey CROSS APPLY ( SELECT PATINDEX(@RowDelim, substring(myRecursion.mymessage,cdlm.IndexOfNextDelimiter+cdlm.DelimiterLength ,LEN(myRecursion.mymessage)) ) + LEN(myKey.KeyData) + cdlm.DelimiterLength AS IndexOfNextDelimiter ,PATINDEX(@RowDelim, substring(myRecursion.mymessage,cdlm.IndexOfNextDelimiter+cdlm.DelimiterLength ,LEN(myRecursion.mymessage)) ) AS EndOfValue , LEN(@RowDelim) - 2 AS DelimiterLength )AS rdlm CROSS APPLY ( SELECT substring(myRecursion.mymessage,cdlm.IndexOfNextDelimiter+cdlm.DelimiterLength ,rdlm.EndOfValue) AS ValueData )AS myValue WHERE mymessage NOT LIKE '' ) select * FROM myRecursion OPTION (MAXRECURSION 0) The recursion adds rows, which are split by finding the next column delimiter after the first row delimiter in the remaining message. Within each row, the message is split into a key and a value. I hope this helps!
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.