question

echoScout avatar image
echoScout asked

How to define a regular expression with multiple OR operators where each term includes a space prefix and suffix?

I am preparing for a data extraction task. I need to remove a set of terms; none, some or all may be present in each source record string. There are over 100,000 target records. I want to avoid performing single term match/replace actions, since (a) the list of terms-to-be-removed will likely grow, and (b) the time to perform the current match/replace action one term at a time is unacceptable. My question: how do I modify the regular expression to include each term within the OR separated list? *REGULAR EXPRESSION* ' and | and or | a o | company | co | c o | dba | d b a ' *DESIRED BEHAVIOR* Replace each found term (including the prefix and suffix spaces) with a single space. *ACTUAL BEHAVIOR* Every "even" (as opposed to "odd") term found is replaced (including the prefix and suffix spaces) with a single space. *EXAMPLE* **Source String** ' MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT ' **Result String (Desired Behavior)** ' MASHABLE INFORMATION EXPERTS COPYRIGHT ' **Result String (Actual Behavior)** ' MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT ' *ENVIRONMENT* SQL Server 2005 User Defined Function regexReplace relying on VBScript.RegExp (code available at end of post) *CODE* set nocount on declare @source [varchar](800) declare @regexp [varchar](400) declare @replace [char](1) declare @globalReplace [bit] declare @ignoreCase [bit] declare @result [varchar](800) set @globalReplace = 1 set @ignoreCase = 1 SET @source = ' MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT ' set @regexp = ' and | and or | a o | company | co | c o | dba | d b a ' set @replace = ' ' select @result = master.dbo.regexReplace(@source,@regexp,@replace,@globalReplace,@ignoreCase) print @result ... producing the result: MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT *** dbo.regexReplace user-defined function definition *** CREATE FUNCTION [dbo].[regexReplace] ( @source varchar(5000), @regexp varchar(1000), @replace varchar(1000), @globalReplace bit = 0, @ignoreCase bit = 0 ) RETURNS varchar(1000) AS BEGIN DECLARE @hr integer DECLARE @objRegExp integer DECLARE @result varchar(5000) EXECUTE @hr = sp_OACreate 'VBScript.RegExp', @objRegExp OUTPUT IF @hr <> 0 BEGIN EXEC @hr = sp_OADestroy @objRegExp RETURN NULL END EXECUTE @hr = sp_OASetProperty @objRegExp, 'Pattern', @regexp IF @hr <> 0 BEGIN EXEC @hr = sp_OADestroy @objRegExp RETURN NULL END EXECUTE @hr = sp_OASetProperty @objRegExp, 'Global', @globalReplace IF @hr <> 0 BEGIN EXEC @hr = sp_OADestroy @objRegExp RETURN NULL END EXECUTE @hr = sp_OASetProperty @objRegExp, 'IgnoreCase', @ignoreCase IF @hr <> 0 BEGIN EXEC @hr = sp_OADestroy @objRegExp RETURN NULL END EXECUTE @hr = sp_OAMethod @objRegExp, 'Replace', @result OUTPUT, @source, @replace IF @hr <> 0 BEGIN EXEC @hr = sp_OADestroy @objRegExp RETURN NULL END EXECUTE @hr = sp_OADestroy @objRegExp IF @hr <> 0 BEGIN RETURN NULL END RETURN @result END
sql-server-2005regex
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

inder avatar image
inder answered
I found the operation by regex function as correct 1. words "LTD" and "THE" are not mentioned in the regex function so those will not be replaced. 2. Two consective and words have only one space between them and so only one of them will be removed. 3. Similar is the behaviour of other words like company co. 4. One solution may be to have alist of words to replace and call them one by one in regex instead of once.That will also partially solve your problem as some words will be unnecessarily joined. 5. If you use above said technique with eiterh proceeding or following space.Your problem will be solved.
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Pavel Pawlowski avatar image
Pavel Pawlowski answered
The issue is caused by the spaces. I suggest you to make the RegEx replace in two passes. First remove the unwanted words and in second pass remove multiple spaces. I tested the RegEx and when you have no spaces in the choices the replacement works correctly. (probably the issue is cause when first of choices is replaced then a space is removed and the second choice which contains space is not matched correctly). So.. Apply: **First Pass:** RegEx: `"and or|and|a o|company|co|c o|dba|d b a"` Replacement: `""` Result: MASHABLE LTD DBA THE INFORMATION EXPERTS COPYRIGHT **Second Pass:** RegEx: `" {2,}"`: Replacement `" "` Result: MASHABLE LTD DBA THE INFORMATION EXPERTS COPYRIGHT
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

echoScout avatar image
echoScout answered
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.