question

Phil Factor avatar image
Phil Factor asked

Best way of splitting text into its individual words in SQL

I know plenty of ways of splitting strings, but what is the best/fastest way of splitting text into its individual words to produce a table of words, and their sequence? Naturally I'd want to take off trailing punctuation but preserve it if it was within the word. The reason I'm interested is that I want to run some timings to check performance. What I'm writing is a system for doing an inversion index for searching and gauging text similarity. The last time I did it, it was a bit slow.
stringsplitstring-splittingwords
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

1 Answer

·
SirSQL avatar image
SirSQL answered
6 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

AaronBertrand avatar image AaronBertrand commented ·
@Phil that makes things quite tricky, as you can imagine. I can show you how to split the string in T-SQL and maintain order in the output, but I would say you are never going to be able to solve this problem in T-SQL as easily *or* to run as efficiently as you could do it in CLR, where you have the added benefit of native RegEx to deal with punctuation etc.
2 Likes 2 ·
Phil Factor avatar image Phil Factor commented ·
Oh Dear. I can't quite see how this relates to the problem of splitting out words from text. Text is delimited by whitespace, which includes one or more of a range of characters, and may have trailing punctuation which is treated as whitespace. The text I've just typed in should return a table that begins.. Word --------- Oh Dear I Can't Quite See -etc-
0 Likes 0 ·
SirSQL avatar image SirSQL commented ·
Could you not take the string, remove any non-alpha, replace the commas and run it through one of those, or have them run with spaces instead of commas (I've not done this, so spitballing a little here).
0 Likes 0 ·
AaronBertrand avatar image AaronBertrand commented ·
@Phil did you intentionally remove the ' from "can't", e.g. you don't care about punctuation even if it's in the middle of a word or name (like O'Shea)?
0 Likes 0 ·
Phil Factor avatar image Phil Factor commented ·
@Aaron It was a mistake. I've corrected it
0 Likes 0 ·
Phil Factor avatar image Phil Factor commented ·
@SirSQL Well several methods turn the whitespace and punctuation into a single delimiter and then turn it into a string-splitting problem. You're part-way there, but what about full-stops, semicolons and colons? It is beginning to sound like a regex problem!
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.