question

paws27284 avatar image
paws27284 asked

How do you replace characters you don't know about in a string?

I know that sounds funny, but the examples i am finding on the web are from the point of view of knowing what characters need replacing. I know only what is allowed and there can be a multitude of unwanted characters. I can only accept letters, numbers and these 3 special characters / - \. I have also read that cleansing should be at the source and while I agree, that will never happen in this case. I have thought of creating a function containing the valid characters, but then how do I generate the code to see if the character is not in and to replace it. Thanks! Paws27284
tsqlreplacefunction
5 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Arcanas avatar image Arcanas commented ·
Can you provide a couple examples of a string, before and after? That would help visualize what you are looking for. Thanks!
0 Likes 0 ·
paws27284 avatar image paws27284 commented ·
Sure... Before:24" SQ RECESS FIXTURE AR24 After: 24 SQ RECESS FIXTURE AR24 Before:CABLE CLEAN UP invoice # 6159 After:CABLE CLEAN UP invoice 6159 Before: HANDWASH 0.3% TRICLOSAN 1250ML After: HANDWASH 0 3 TRICLOSAN 1250ML The request is to replace the unwanted characters with a space.
0 Likes 0 ·
Arcanas avatar image Arcanas commented ·
Which three special characters are allowed?
0 Likes 0 ·
paws27284 avatar image paws27284 commented ·
Forward slash, Back slash and dash
0 Likes 0 ·
Arcanas avatar image Arcanas commented ·
I'm not able to think of a set based approach that would work, but I think I've got a RBAR method that would do it. Is this a common thing, or a one-time thing? If it's one time, I think I can do a WHILE loop that will clean it right up.
0 Likes 0 ·
paws27284 avatar image
paws27284 answered
Ok I found a shorter way to "git er done"...works great! CREATE Function [dbo].[udf_RemoveBannedCharacters](@Temp VarChar(1000)) Returns VarChar(1000) AS Begin Declare @KeepValues as varchar(50) Set @KeepValues = '%[^a-z0-9/\\s-]%' While PatIndex(@KeepValues, @Temp) > 0 Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '') Return @Temp End
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Arcanas avatar image
Arcanas answered
This is a RBAR approach. I looked at a couple of set based approaches, but couldn't find one that would work. This is likely a "one time only" or a "on insert/update" approach, and shouldn't be used to update a million rows ten times a day. You should be able to easily modify this into a function for single strings. No guarantee on performance. DECLARE @InitialString NVARCHAR(MAX) DECLARE @FinalString NVARCHAR(MAX) DECLARE @LoopID INT = 1 IF ( OBJECT_ID('tempdb..#tmpTestData') IS NOT NULL ) BEGIN DROP TABLE #tmpTestData END CREATE TABLE #tmpTestData ( ID INT IDENTITY(1, 1), OldStringValue NVARCHAR(MAX), NewStringValue NVARCHAR(MAX) ) INSERT INTO #tmpTestData ( OldStringValue ) VALUES ( '24" SQ RECESS FIXTURE# AR24' ), ( 'CABLE CLEAN UP invoice # 6159' ), ( 'HANDWASH 0.3% TRICLOSAN 1250ML' ), ( 'A VALID VALUE with the number 82847' ) WHILE ( @LoopID <= ( SELECT COUNT(*) FROM #tmpTestData ) ) BEGIN -- Grab the original string, and set the @FinalString to be an empty string. SELECT @InitialString = OldStringValue, @FinalString = '' FROM #tmpTestData TTD WHERE TTD.ID = @LoopID; -- The first CTE will split the string into individual rows, one character per row. -- Full credit to http://stackoverflow.com/questions/8517816/t-sql-split-word-into-characters -- for this piece of code. WITH CTE_Split AS ( SELECT SUBSTRING(a.b, v.number + 1, 1) AS SingleChar FROM ( SELECT @InitialString b ) a JOIN master.dbo.spt_values v ON v.number < LEN(a.b) WHERE v.type = 'P' ), -- Second CTE will look at the ASCII value of each character, and only allow those -- that are a-z, A-Z, 0-9, /, \ or -/ All others are replaced with a space, which -- is the ELSE clause in the CASE statement CTE_CleanUp AS ( SELECT CASE WHEN ASCII(SP.SingleChar) BETWEEN 97 AND 122 THEN SP.SingleChar -- a-z WHEN ASCII(SP.SingleChar) BETWEEN 65 AND 90 THEN SP.SingleChar -- A-Z WHEN ASCII(SP.SingleChar) BETWEEN 48 AND 57 THEN SP.SingleChar -- 0-9 WHEN ASCII(SP.SingleChar) = 47 THEN SP.SingleChar -- / WHEN ASCII(SP.SingleChar) = 45 THEN SP.SingleChar -- \ WHEN ASCII(SP.SingleChar) = 92 THEN SP.SingleChar -- - ELSE ' ' END AS SingleChar FROM CTE_Split SP ) -- Set the @FinalString to be the "sum" of the single characters. This is a -- quick trick to concatenate all of the rows in a table to a single string SELECT @FinalString += CU.SingleChar FROM CTE_CleanUp CU -- Update the original Temp table with the new string value UPDATE #tmpTestData SET NewStringValue = @FinalString WHERE ID = @LoopID SELECT @LoopID += 1 END SELECT TTD.ID, TTD.OldStringValue, TTD.NewStringValue FROM #tmpTestData TTD DROP TABLE #tmpTestData
4 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

paws27284 avatar image paws27284 commented ·
The code will be executed monthly on approximately 25k to 50k rows. The sql server is not transaction based but used to stage data for a warehouse. I will give these a try to check the performance on each. Thanks Much! Paws27284
0 Likes 0 ·
paws27284 avatar image paws27284 commented ·
Well that wasn't very fun. I tried the first solution. There were approximately 23K rows and after 1 hour 45 mins I decided that was enough. Guess I'll try the RBAR approach...
0 Likes 0 ·
Kev Riley avatar image Kev Riley ♦♦ commented ·
If by the first solution you mean mine, then something is wrong somewhere - I've just tested it against a 24k row table and it completed in 35 secs - not fast I admit, but better than 1hr45mins. Better still, if I use a permanent Numbers table rather than generating it as a CTE, performance comes down to 9s. Upping this to 51k rows, and starting with a cold cache, still came in under 40s
0 Likes 0 ·
paws27284 avatar image paws27284 commented ·
Yikes, I must have done something really wrong then. I did use a permanent allowed characters table in case the user wanted to add something later. I haven't used a CTE before so maybe the adjustments made for my columns and tables are wrong. If I cut and paste your exact code it works well. Is there a better explanation of CTE than in MSDN?
0 Likes 0 ·
Kev Riley avatar image
Kev Riley answered
Here's a fun solution that is set-based and makes use of a numbers (or tally) table. Numbers tables are simply a table of numbers from 1 to n, but make handling strings like this easy. If you don't have a permanent numbers table then below is a way to use CTEs just to spin one up quickly, but it's better to have one pre built somewhere for you to use. The first part is simply creating a table of allowed characters - you could just as easily type these in. declare @AllowedChars table (AllowedChar char(1)); WITH Nbrs_4( n ) AS ( SELECT 1 UNION SELECT 0 ), --2 Nbrs_3( n ) AS ( SELECT 1 FROM Nbrs_4 n1 CROSS JOIN Nbrs_4 n2 ), --2*2 = 4 Nbrs_2( n ) AS ( SELECT 1 FROM Nbrs_3 n1 CROSS JOIN Nbrs_3 n2 ), --4*4 = 16 Nbrs_1( n ) AS ( SELECT 1 FROM Nbrs_2 n1 CROSS JOIN Nbrs_2 n2 ), --16*16 = 256 Nbrs_0( n ) AS ( SELECT 1 FROM Nbrs_1 n1 CROSS JOIN Nbrs_1 n2 ), --256*256 = 65,536 Nbrs ( n ) AS ( SELECT 1 FROM Nbrs_0 n1 CROSS JOIN Nbrs_0 n2 ) --65536*65536 = 4,294,967,296 , Numbers_cte as ( select n FROM ( SELECT ROW_NUMBER() OVER (ORDER BY n) FROM Nbrs ) D ( n ) where n < 500 ) insert into @AllowedChars ( AllowedChar ) select cast(n-1 as char(1)) from Numbers_cte where n between 1 and 10 union select char(n) from Numbers_cte where n between 65 and 90 union select ' ' union select '\' union select '/' union select '-'; Then we use a numbers table to split the input strings into the individual characters, and check them against the set of allowed characters using a inner join - any invalid ones are removed. The final select uses the XML concatenation trick to pull the string back together from the individual characters, but the invalid chars are gone. declare @input table (id int, string varchar(max)); insert into @input select 1,'24" SQ RECESS FIXTURE AR24'; insert into @input select 2,'CABLE CLEAN UP invoice # 6159'; insert into @input select 3,'HANDWASH 0.3% TRICLOSAN 1250ML'; WITH Nbrs_4( n ) AS ( SELECT 1 UNION SELECT 0 ), --2 Nbrs_3( n ) AS ( SELECT 1 FROM Nbrs_4 n1 CROSS JOIN Nbrs_4 n2 ), --2*2 = 4 Nbrs_2( n ) AS ( SELECT 1 FROM Nbrs_3 n1 CROSS JOIN Nbrs_3 n2 ), --4*4 = 16 Nbrs_1( n ) AS ( SELECT 1 FROM Nbrs_2 n1 CROSS JOIN Nbrs_2 n2 ), --16*16 = 256 Nbrs_0( n ) AS ( SELECT 1 FROM Nbrs_1 n1 CROSS JOIN Nbrs_1 n2 ), --256*256 = 65,536 Nbrs ( n ) AS ( SELECT 1 FROM Nbrs_0 n1 CROSS JOIN Nbrs_0 n2 ) --65536*65536 = 4,294,967,296 , Numbers_cte as ( select n FROM ( SELECT ROW_NUMBER() OVER (ORDER BY n) FROM Nbrs ) D ( n ) where n < 500 ) , StringSplitandXrefAgainstAllowed as ( select id, n, substring(string,n,1) as ValidChar from @input join Numbers_cte on n <= len(string) join @AllowedChars on substring(string,n,1) = AllowedChar ) select distinct [@input].id , string, ( SELECT '' + ValidChar FROM StringSplitandXrefAgainstAllowed TABLE1 WHERE TABLE1.id = TABLE2.id order by n for xml path(''), type ).value('.','varchar(max)') as CLeanString FROM StringSplitandXrefAgainstAllowed TABLE2 join @input on [@input].id = TABLE2.id this returns id string CLeanString ----------- ------------------------------------ --------------------------------- 1 24" SQ RECESS FIXTURE AR24 24 SQ RECESS FIXTURE AR24 2 CABLE CLEAN UP invoice # 6159 CABLE CLEAN UP invoice 6159 3 HANDWASH 0.3% TRICLOSAN 1250ML HANDWASH 03 TRICLOSAN 1250ML (3 row(s) affected)
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Arcanas avatar image Arcanas commented ·
Sliiiick. I forgot about that XML trick.
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.