question

sakath2009 avatar image
sakath2009 asked

Identify repetitive words successively in a text

Hi Team, I want to write a SQL query, to retrive the duplicate words in a text saved in a cell. Below is an example. The Requirement is if the text is as below, the query should give the below highlighted underlined words. The below text is stored in a single cell. Like this i have a huge database, where I need to retrive the results as below. Text column duplicatewords aaaaaaa sales sales aaaaaaa full full aaaaaaa store store Text: XYZ Corp announced ***sales sales*** results for the fourth quarter and ***full full*** year ended January 3, 2016. For the fourth quarter of 2015, company-owned comparable ***store store*** sales decreased 8.5%, Franchise comparable store sales decreased 6.2%, and System-wide comparable store sales decreased 2.5%. Full year, company-owned comparable store sales increased 0.8%, Franchise comparable store sales decreased 0.3%, and System-wide comparable store sales decreased 0.2%. For the year 2015, the company expects to report total revenue to meet or exceed $78 million compared to prior guidance of approximately $8 million. Non-GAAP adjusted EBITDA to meet or exceed $1.5 million compared to prior guidance of approximately $1.5 million. For the year 2017, the company expects total revenue to be in the range of $5 million to $7 million compared to preliminary guidance of $7 million to $8 million. Non-GAAP adjusted EBITDA to be in the range of $3 million to $15 million compared to preliminary guidance of $1 million to $6 million. Can someone please help ? Thanks, Sakat
full-textwordsquery-store
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

1 Answer

· Write an Answer
Kev Riley avatar image
Kev Riley answered
Here's a solution that will use a few techniques. First we need to split the text up into the individual words. That is done by using a *tally table* (sometimes called a *numbers table)*, and in this case I actually use a cte **tally_cte**. The words and their relative positions are produced by **words_cte** (so you could select from that to see). Then I used a variation on the *gaps and islands* solutions, to find 'islands' of repeated words. The final select gives you those words. You could add more info here like position in the original string, word number etc. Both the tally cte and islands solution are credited to Itzik Ben-Gan https://manning-content.s3.amazonaws.com/download/3/e589652-7171-4310-a714-e84dd0f14090/SampleChapter5.pdf declare @text varchar(max) = 'XYZ Corp announced sales sales results for the fourth quarter and full full year ended January 3, 2016. For the fourth quarter of 2015, company-owned comparable store store sales decreased 8.5%, Franchise comparable store sales decreased 6.2%, and System-wide comparable store sales decreased 2.5%. Full year, company-owned comparable store sales increased 0.8%, Franchise comparable store sales decreased 0.3%, and System-wide comparable store sales decreased 0.2%. For the year 2015, the company expects to report total revenue to meet or exceed $78 million compared to prior guidance of approximately $8 million. Non-GAAP adjusted EBITDA to meet or exceed $1.5 million compared to prior guidance of approximately $1.5 million. For the year 2017, the company expects total revenue to be in the range of $5 million to $7 million compared to preliminary guidance of $7 million to $8 million. Non-GAAP adjusted EBITDA to be in the range of $3 million to $15 million compared to preliminary guidance of $1 million to $6 million.' declare @input varchar(max) set @input = ' '+@text+' ' --first split the string in to proper data, i.e. a column ;WITH Nbrs_4( n ) AS ( SELECT 1 UNION SELECT 0 ), Nbrs_3( n ) AS ( SELECT 1 FROM Nbrs_4 n1 CROSS JOIN Nbrs_4 n2 ), Nbrs_2( n ) AS ( SELECT 1 FROM Nbrs_3 n1 CROSS JOIN Nbrs_3 n2 ), Nbrs_1( n ) AS ( SELECT 1 FROM Nbrs_2 n1 CROSS JOIN Nbrs_2 n2 ), Nbrs_0( n ) AS ( SELECT 1 FROM Nbrs_1 n1 CROSS JOIN Nbrs_1 n2 ), Nbrs ( n ) AS ( SELECT 1 FROM Nbrs_0 n1 CROSS JOIN Nbrs_0 n2 ) , tally_cte as ( select n FROM ( SELECT ROW_NUMBER() OVER (ORDER BY n) FROM Nbrs ) D ( n ) where n < 50000 ) ,words_cte as ( select row_number()over (order by N) as WordNo, substring(@input,N+1,charindex(' ',@input,N+1)-N-1) as Value from tally_cte where N < len(@input) and substring(@input,N,1) = ' ' ) , IslandGroupings AS ( SELECT WordNo, Value, ROW_NUMBER() OVER(ORDER BY WordNo) - ROW_NUMBER() OVER(ORDER BY Value, WordNo) AS grp FROM words_cte ) SELECT Value FROM IslandGroupings GROUP BY Value, grp having MIN(WordNo)<>MAX(WordNo)
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.