question

paws27284 avatar image
paws27284 asked

How do you replace characters you don't know about in a string?

I know that sounds funny, but the examples i am finding on the web are from the point of view of knowing what characters need replacing. I know only what is allowed and there can be a multitude of unwanted characters. I can only accept letters, numbers and these 3 special characters / - \. I have also read that cleansing should be at the source and while I agree, that will never happen in this case. I have thought of creating a function containing the valid characters, but then how do I generate the code to see if the character is not in and to replace it. Thanks! Paws27284
tsqlreplacefunction
5 comments
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Can you provide a couple examples of a string, before and after? That would help visualize what you are looking for. Thanks!
0 Likes 0 ·
Sure... Before:24" SQ RECESS FIXTURE AR24 After: 24 SQ RECESS FIXTURE AR24 Before:CABLE CLEAN UP invoice # 6159 After:CABLE CLEAN UP invoice 6159 Before: HANDWASH 0.3% TRICLOSAN 1250ML After: HANDWASH 0 3 TRICLOSAN 1250ML The request is to replace the unwanted characters with a space.
0 Likes 0 ·
Which three special characters are allowed?
0 Likes 0 ·
Forward slash, Back slash and dash
0 Likes 0 ·
I'm not able to think of a set based approach that would work, but I think I've got a RBAR method that would do it. Is this a common thing, or a one-time thing? If it's one time, I think I can do a WHILE loop that will clean it right up.
0 Likes 0 ·
paws27284 avatar image
paws27284 answered
Ok I found a shorter way to "git er done"...works great! CREATE Function [dbo].[udf_RemoveBannedCharacters](@Temp VarChar(1000)) Returns VarChar(1000) AS Begin Declare @KeepValues as varchar(50) Set @KeepValues = '%[^a-z0-9/\\s-]%' While PatIndex(@KeepValues, @Temp) > 0 Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '') Return @Temp End
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Arcanas avatar image
Arcanas answered
This is a RBAR approach. I looked at a couple of set based approaches, but couldn't find one that would work. This is likely a "one time only" or a "on insert/update" approach, and shouldn't be used to update a million rows ten times a day. You should be able to easily modify this into a function for single strings. No guarantee on performance. DECLARE @InitialString NVARCHAR(MAX) DECLARE @FinalString NVARCHAR(MAX) DECLARE @LoopID INT = 1 IF ( OBJECT_ID('tempdb..#tmpTestData') IS NOT NULL ) BEGIN DROP TABLE #tmpTestData END CREATE TABLE #tmpTestData ( ID INT IDENTITY(1, 1), OldStringValue NVARCHAR(MAX), NewStringValue NVARCHAR(MAX) ) INSERT INTO #tmpTestData ( OldStringValue ) VALUES ( '24" SQ RECESS FIXTURE# AR24' ), ( 'CABLE CLEAN UP invoice # 6159' ), ( 'HANDWASH 0.3% TRICLOSAN 1250ML' ), ( 'A VALID VALUE with the number 82847' ) WHILE ( @LoopID <= ( SELECT COUNT(*) FROM #tmpTestData ) ) BEGIN -- Grab the original string, and set the @FinalString to be an empty string. SELECT @InitialString = OldStringValue, @FinalString = '' FROM #tmpTestData TTD WHERE TTD.ID = @LoopID; -- The first CTE will split the string into individual rows, one character per row. -- Full credit to http://stackoverflow.com/questions/8517816/t-sql-split-word-into-characters -- for this piece of code. WITH CTE_Split AS ( SELECT SUBSTRING(a.b, v.number + 1, 1) AS SingleChar FROM ( SELECT @InitialString b ) a JOIN master.dbo.spt_values v ON v.number < LEN(a.b) WHERE v.type = 'P' ), -- Second CTE will look at the ASCII value of each character, and only allow those -- that are a-z, A-Z, 0-9, /, \ or -/ All others are replaced with a space, which -- is the ELSE clause in the CASE statement CTE_CleanUp AS ( SELECT CASE WHEN ASCII(SP.SingleChar) BETWEEN 97 AND 122 THEN SP.SingleChar -- a-z WHEN ASCII(SP.SingleChar) BETWEEN 65 AND 90 THEN SP.SingleChar -- A-Z WHEN ASCII(SP.SingleChar) BETWEEN 48 AND 57 THEN SP.SingleChar -- 0-9 WHEN ASCII(SP.SingleChar) = 47 THEN SP.SingleChar -- / WHEN ASCII(SP.SingleChar) = 45 THEN SP.SingleChar -- \ WHEN ASCII(SP.SingleChar) = 92 THEN SP.SingleChar -- - ELSE ' ' END AS SingleChar FROM CTE_Split SP ) -- Set the @FinalString to be the "sum" of the single characters. This is a -- quick trick to concatenate all of the rows in a table to a single string SELECT @FinalString += CU.SingleChar FROM CTE_CleanUp CU -- Update the original Temp table with the new string value UPDATE #tmpTestData SET NewStringValue = @FinalString WHERE ID = @LoopID SELECT @LoopID += 1 END SELECT TTD.ID, TTD.OldStringValue, TTD.NewStringValue FROM #tmpTestData TTD DROP TABLE #tmpTestData
4 comments
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

The code will be executed monthly on approximately 25k to 50k rows. The sql server is not transaction based but used to stage data for a warehouse. I will give these a try to check the performance on each. Thanks Much! Paws27284
0 Likes 0 ·
Well that wasn't very fun. I tried the first solution. There were approximately 23K rows and after 1 hour 45 mins I decided that was enough. Guess I'll try the RBAR approach...
0 Likes 0 ·
If by the first solution you mean mine, then something is wrong somewhere - I've just tested it against a 24k row table and it completed in 35 secs - not fast I admit, but better than 1hr45mins. Better still, if I use a permanent Numbers table rather than generating it as a CTE, performance comes down to 9s. Upping this to 51k rows, and starting with a cold cache, still came in under 40s
0 Likes 0 ·
Yikes, I must have done something really wrong then. I did use a permanent allowed characters table in case the user wanted to add something later. I haven't used a CTE before so maybe the adjustments made for my columns and tables are wrong. If I cut and paste your exact code it works well. Is there a better explanation of CTE than in MSDN?
0 Likes 0 ·
Kev Riley avatar image
Kev Riley answered
Here's a fun solution that is set-based and makes use of a numbers (or tally) table. Numbers tables are simply a table of numbers from 1 to n, but make handling strings like this easy. If you don't have a permanent numbers table then below is a way to use CTEs just to spin one up quickly, but it's better to have one pre built somewhere for you to use. The first part is simply creating a table of allowed characters - you could just as easily type these in. declare @AllowedChars table (AllowedChar char(1)); WITH Nbrs_4( n ) AS ( SELECT 1 UNION SELECT 0 ), --2 Nbrs_3( n ) AS ( SELECT 1 FROM Nbrs_4 n1 CROSS JOIN Nbrs_4 n2 ), --2*2 = 4 Nbrs_2( n ) AS ( SELECT 1 FROM Nbrs_3 n1 CROSS JOIN Nbrs_3 n2 ), --4*4 = 16 Nbrs_1( n ) AS ( SELECT 1 FROM Nbrs_2 n1 CROSS JOIN Nbrs_2 n2 ), --16*16 = 256 Nbrs_0( n ) AS ( SELECT 1 FROM Nbrs_1 n1 CROSS JOIN Nbrs_1 n2 ), --256*256 = 65,536 Nbrs ( n ) AS ( SELECT 1 FROM Nbrs_0 n1 CROSS JOIN Nbrs_0 n2 ) --65536*65536 = 4,294,967,296 , Numbers_cte as ( select n FROM ( SELECT ROW_NUMBER() OVER (ORDER BY n) FROM Nbrs ) D ( n ) where n < 500 ) insert into @AllowedChars ( AllowedChar ) select cast(n-1 as char(1)) from Numbers_cte where n between 1 and 10 union select char(n) from Numbers_cte where n between 65 and 90 union select ' ' union select '\' union select '/' union select '-'; Then we use a numbers table to split the input strings into the individual characters, and check them against the set of allowed characters using a inner join - any invalid ones are removed. The final select uses the XML concatenation trick to pull the string back together from the individual characters, but the invalid chars are gone. declare @input table (id int, string varchar(max)); insert into @input select 1,'24" SQ RECESS FIXTURE AR24'; insert into @input select 2,'CABLE CLEAN UP invoice # 6159'; insert into @input select 3,'HANDWASH 0.3% TRICLOSAN 1250ML'; WITH Nbrs_4( n ) AS ( SELECT 1 UNION SELECT 0 ), --2 Nbrs_3( n ) AS ( SELECT 1 FROM Nbrs_4 n1 CROSS JOIN Nbrs_4 n2 ), --2*2 = 4 Nbrs_2( n ) AS ( SELECT 1 FROM Nbrs_3 n1 CROSS JOIN Nbrs_3 n2 ), --4*4 = 16 Nbrs_1( n ) AS ( SELECT 1 FROM Nbrs_2 n1 CROSS JOIN Nbrs_2 n2 ), --16*16 = 256 Nbrs_0( n ) AS ( SELECT 1 FROM Nbrs_1 n1 CROSS JOIN Nbrs_1 n2 ), --256*256 = 65,536 Nbrs ( n ) AS ( SELECT 1 FROM Nbrs_0 n1 CROSS JOIN Nbrs_0 n2 ) --65536*65536 = 4,294,967,296 , Numbers_cte as ( select n FROM ( SELECT ROW_NUMBER() OVER (ORDER BY n) FROM Nbrs ) D ( n ) where n < 500 ) , StringSplitandXrefAgainstAllowed as ( select id, n, substring(string,n,1) as ValidChar from @input join Numbers_cte on n <= len(string) join @AllowedChars on substring(string,n,1) = AllowedChar ) select distinct [@input].id , string, ( SELECT '' + ValidChar FROM StringSplitandXrefAgainstAllowed TABLE1 WHERE TABLE1.id = TABLE2.id order by n for xml path(''), type ).value('.','varchar(max)') as CLeanString FROM StringSplitandXrefAgainstAllowed TABLE2 join @input on [@input].id = TABLE2.id this returns id string CLeanString ----------- ------------------------------------ --------------------------------- 1 24" SQ RECESS FIXTURE AR24 24 SQ RECESS FIXTURE AR24 2 CABLE CLEAN UP invoice # 6159 CABLE CLEAN UP invoice 6159 3 HANDWASH 0.3% TRICLOSAN 1250ML HANDWASH 03 TRICLOSAN 1250ML (3 row(s) affected)
1 comment
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Sliiiick. I forgot about that XML trick.
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.