Hi all, I have a account number field that gets supplied to us from one of our partners. The problem is they've dumped (what i think) is carriage returns in this field as REPLACE(field, ' ', '') doesn't seem to be picking them up. How do i identify and remove carraige returns, tabs, line feeds, whitespace, any garbage that isn't text or numbers?
Carriage returns (char(13)) line feeds (char(10)) and tabs (char(9)) can be easily replaced replace(replace(replace(field, char(9),''), char(10),''), char(13),'') but its the 'other garbage' that can be difficult to eradicate - you either have to know what it is up-front and deal with it, like above, or have a list of valid characters and try and trap for those... for example if only numbers where allowed where patindex(^[0-9], field) = 0
Just for fun, I thought of handling it by using the cteTally which would return a string with valid characters only i.e. [0-9] and [A-Z] ([a-z] are handled by using the UPPER function, but can be handled the same way as [A-Z]) DECLARE @pstring VARCHAR(200) SELECT @pstring = 'ABC 456 # & 8 ' ; WITH E1(N) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 ), --10E+1 or 10 rows E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front -- for both a performance gain and prevention of accidental "overruns" SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4 ) , cteValidCharacters(CharValue) AS (--==== This provides the list of the valid Characters i.e [0-9] AND [A-Z] SELECT CHAR(47+ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) FROM E1 UNION ALL SELECT TOP 26 CHAR(64+ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) FROM E2 ) , cteAllCharacters(Chars) AS ( SELECT SUBSTRING(@pString ,N,1) FROM cteTally WHERE UPPER(SUBSTRING(@pString ,N,1)) IN (SELECT [CharValue] FROM cteValidCharacters) ) SELECT (SELECT [Chars] [*] FROM cteAllCharacters FOR XML PATH('')) It can be turned into an inline function to scale up.
For whatever replacements and matching in case you can use CLR, the most siple thing is to use a CLR RegEx implementation. You can take a look on [SQL Server Regular Expressions – Replacements] for sample implementation. For your concrete needs you can use the `"(?:[^a-zA-Z0-9]|\s)"` regular expression which returns all chars, which are not alphanumeric or are white space. In the sample implementation you can simple replace by empty string and you are done. :
We get what appears to be "unvalidated" data from some of our vendors. Our database is based on the requirements of MS-Dynamics which is much more restrictive. Sometimes we get strange characters in and I use the following script to identify the data. By the way, I wrote this for SQL-2K so there are probably better ways of doing this now.
DECLARE @TestString VARCHAR(30), @Bin VARBINARY(255) SELECT @TestString = RTRIM(SomeCharColumn) FROM DB..Table (NOLOCK) WHERE LEN(SomeCharColumn) = 4 AND SomeCharColumn LIKE '%-DT' -- Sometimes this won't print (normally happens when there is null data) PRINT @TestString SET @Bin = CONVERT(VARBINARY(255), @TestString) PRINT @Bin