x

How to define a regular expression with multiple OR operators where each term includes a space prefix and suffix?

I am preparing for a data extraction task. I need to remove a set of terms; none, some or all may be present in each source record string. There are over 100,000 target records. I want to avoid performing single term match/replace actions, since (a) the list of terms-to-be-removed will likely grow, and (b) the time to perform the current match/replace action one term at a time is unacceptable.

My question: how do I modify the regular expression to include each term within the OR separated list?

REGULAR EXPRESSION

' and | and or | a o | company | co | c o | dba | d b a '

DESIRED BEHAVIOR

Replace each found term (including the prefix and suffix spaces) with a single space.

ACTUAL BEHAVIOR

Every "even" (as opposed to "odd") term found is replaced (including the prefix and suffix spaces) with a single space.

EXAMPLE

Source String

' MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT '

Result String (Desired Behavior)

' MASHABLE INFORMATION EXPERTS COPYRIGHT '

Result String (Actual Behavior)

' MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT '

ENVIRONMENT

SQL Server 2005

User Defined Function regexReplace relying on VBScript.RegExp (code available at end of post)

CODE

 set nocount on
 
 declare @source [varchar](800)
 declare @regexp [varchar](400)
 declare @replace [char](1)
 declare @globalReplace [bit]
 declare @ignoreCase [bit]
 declare @result [varchar](800)
 
 set @globalReplace = 1
 set @ignoreCase = 1
 
 SET @source = '   MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT  '
 set @regexp = ' and | and or | a o | company | co | c o | dba | d b a '
 set @replace = ' '
 
 select @result = master.dbo.regexReplace(@source,@regexp,@replace,@globalReplace,@ignoreCase)
 
 print @result

... producing the result:

    MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT  



dbo.regexReplace user-defined function definition

 CREATE FUNCTION [dbo].[regexReplace]
     (
     @source varchar(5000),
     @regexp varchar(1000),
     @replace varchar(1000),
     @globalReplace bit = 0,
     @ignoreCase bit = 0
     )
     RETURNS varchar(1000) AS
         BEGIN
             DECLARE @hr integer
             DECLARE @objRegExp integer
             DECLARE @result varchar(5000)
 
             EXECUTE @hr = sp_OACreate 'VBScript.RegExp', @objRegExp OUTPUT
             IF @hr <> 0 
                 BEGIN
                     EXEC @hr = sp_OADestroy @objRegExp
                     RETURN NULL
                 END
             EXECUTE @hr = sp_OASetProperty @objRegExp, 'Pattern', @regexp
             IF @hr <> 0 
                 BEGIN
                     EXEC @hr = sp_OADestroy @objRegExp
                     RETURN NULL
                 END
             EXECUTE @hr = sp_OASetProperty @objRegExp, 'Global', @globalReplace
             IF @hr <> 0 
                 BEGIN
                     EXEC @hr = sp_OADestroy @objRegExp
                     RETURN NULL
                 END
             EXECUTE @hr = sp_OASetProperty @objRegExp, 'IgnoreCase', @ignoreCase
             IF @hr <> 0 
                 BEGIN
                     EXEC @hr = sp_OADestroy @objRegExp
                     RETURN NULL
                 END
 
             EXECUTE @hr = sp_OAMethod @objRegExp, 'Replace', @result OUTPUT, @source, @replace
             IF @hr <> 0 
                 BEGIN
                     EXEC @hr = sp_OADestroy @objRegExp
                     RETURN NULL
                 END
 
             EXECUTE @hr = sp_OADestroy @objRegExp
                 IF @hr <> 0 
                     BEGIN
                         RETURN NULL
                     END
 
             RETURN @result
         END

more ▼

asked Jan 19, 2012 at 12:02 PM in Default

avatar image

echoScout
11 2 2 3

(comments are locked)
10|1200 characters needed characters left

3 answers: sort voted first

The issue is caused by the spaces. I suggest you to make the RegEx replace in two passes. First remove the unwanted words and in second pass remove multiple spaces.

I tested the RegEx and when you have no spaces in the choices the replacement works correctly. (probably the issue is cause when first of choices is replaced then a space is removed and the second choice which contains space is not matched correctly).

So.. Apply:
First Pass:
RegEx: "and or|and|a o|company|co|c o|dba|d b a" Replacement: ""
Result:

 MASHABLE LTD DBA THE INFORMATION EXPERTS         COPYRIGHT  

Second Pass:
RegEx: " {2,}": Replacement " "
Result:

 MASHABLE LTD DBA THE INFORMATION EXPERTS COPYRIGHT
more ▼

answered Jan 19, 2012 at 01:07 PM

avatar image

Pavel Pawlowski
22.7k 10 15 26

(comments are locked)
10|1200 characters needed characters left

I found the operation by regex function as correct 1. words "LTD" and "THE" are not mentioned in the regex function so those will not be replaced. 2. Two consective and words have only one space between them and so only one of them will be removed. 3. Similar is the behaviour of other words like company co. 4. One solution may be to have alist of words to replace and call them one by one in regex instead of once.That will also partially solve your problem as some words will be unnecessarily joined. 5. If you use above said technique with eiterh proceeding or following space.Your problem will be solved.

more ▼

answered Jan 19, 2012 at 01:01 PM

avatar image

inder
211 18 19 25

(comments are locked)
10|1200 characters needed characters left
more ▼

answered Jan 20, 2012 at 09:52 AM

avatar image

echoScout
11 2 2 3

(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

SQL Server Central

Need long-form SQL discussion? SQLserverCentral.com is the place.

Topics:

x2016
x9

asked: Jan 19, 2012 at 12:02 PM

Seen: 1882 times

Last Updated: Jan 19, 2012 at 12:02 PM

Copyright 2016 Redgate Software. Privacy Policy