x

How to define a regular expression with multiple OR operators where each term includes a space prefix and suffix?

I am preparing for a data extraction task. I need to remove a set of terms; none, some or all may be present in each source record string. There are over 100,000 target records. I want to avoid performing single term match/replace actions, since (a) the list of terms-to-be-removed will likely grow, and (b) the time to perform the current match/replace action one term at a time is unacceptable.

My question: how do I modify the regular expression to include each term within the OR separated list?

REGULAR EXPRESSION

' and | and or | a o | company | co | c o | dba | d b a '

DESIRED BEHAVIOR

Replace each found term (including the prefix and suffix spaces) with a single space.

ACTUAL BEHAVIOR

Every "even" (as opposed to "odd") term found is replaced (including the prefix and suffix spaces) with a single space.

EXAMPLE

Source String

' MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT '

Result String (Desired Behavior)

' MASHABLE INFORMATION EXPERTS COPYRIGHT '

Result String (Actual Behavior)

' MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT '

ENVIRONMENT

SQL Server 2005

User Defined Function regexReplace relying on VBScript.RegExp (code available at end of post)

CODE

set nocount on

declare @source [varchar](800)
declare @regexp [varchar](400)
declare @replace [char](1)
declare @globalReplace [bit]
declare @ignoreCase [bit]
declare @result [varchar](800)

set @globalReplace = 1
set @ignoreCase = 1

SET @source = '   MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT  '
set @regexp = ' and | and or | a o | company | co | c o | dba | d b a '
set @replace = ' '

select @result = master.dbo.regexReplace(@source,@regexp,@replace,@globalReplace,@ignoreCase)

print @result

... producing the result:

   MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT  

*** dbo.regexReplace user-defined function definition ***

CREATE FUNCTION [dbo].[regexReplace]
    (
    @source varchar(5000),
    @regexp varchar(1000),
    @replace varchar(1000),
    @globalReplace bit = 0,
    @ignoreCase bit = 0
    )
    RETURNS varchar(1000) AS
       BEGIN
         DECLARE @hr integer
         DECLARE @objRegExp integer
         DECLARE @result varchar(5000)

         EXECUTE @hr = sp_OACreate 'VBScript.RegExp', @objRegExp OUTPUT
         IF @hr <> 0 
          BEGIN
              EXEC @hr = sp_OADestroy @objRegExp
              RETURN NULL
          END
         EXECUTE @hr = sp_OASetProperty @objRegExp, 'Pattern', @regexp
         IF @hr <> 0 
          BEGIN
              EXEC @hr = sp_OADestroy @objRegExp
              RETURN NULL
          END
         EXECUTE @hr = sp_OASetProperty @objRegExp, 'Global', @globalReplace
         IF @hr <> 0 
          BEGIN
              EXEC @hr = sp_OADestroy @objRegExp
              RETURN NULL
          END
         EXECUTE @hr = sp_OASetProperty @objRegExp, 'IgnoreCase', @ignoreCase
         IF @hr <> 0 
          BEGIN
              EXEC @hr = sp_OADestroy @objRegExp
              RETURN NULL
          END

         EXECUTE @hr = sp_OAMethod @objRegExp, 'Replace', @result OUTPUT, @source, @replace
         IF @hr <> 0 
          BEGIN
              EXEC @hr = sp_OADestroy @objRegExp
              RETURN NULL
          END

         EXECUTE @hr = sp_OADestroy @objRegExp
          IF @hr <> 0 
              BEGIN
                 RETURN NULL
              END

         RETURN @result
       END
more ▼

asked Jan 19, 2012 at 12:02 PM in Default

echoScout gravatar image

echoScout
11 2 2 3

(comments are locked)
10|1200 characters needed characters left

3 answers: sort voted first

The issue is caused by the spaces. I suggest you to make the RegEx replace in two passes. First remove the unwanted words and in second pass remove multiple spaces.

I tested the RegEx and when you have no spaces in the choices the replacement works correctly. (probably the issue is cause when first of choices is replaced then a space is removed and the second choice which contains space is not matched correctly).

So.. Apply:
First Pass:
RegEx: "and or|and|a o|company|co|c o|dba|d b a" Replacement: ""
Result:

MASHABLE LTD DBA THE INFORMATION EXPERTS         COPYRIGHT  

Second Pass:
RegEx: " {2,}": Replacement " "
Result:

MASHABLE LTD DBA THE INFORMATION EXPERTS COPYRIGHT
more ▼

answered Jan 19, 2012 at 01:07 PM

Pavel Pawlowski gravatar image

Pavel Pawlowski
22.3k 9 11 21

(comments are locked)
10|1200 characters needed characters left

I found the operation by regex function as correct 1. words "LTD" and "THE" are not mentioned in the regex function so those will not be replaced. 2. Two consective and words have only one space between them and so only one of them will be removed. 3. Similar is the behaviour of other words like company co. 4. One solution may be to have alist of words to replace and call them one by one in regex instead of once.That will also partially solve your problem as some words will be unnecessarily joined. 5. If you use above said technique with eiterh proceeding or following space.Your problem will be solved.

more ▼

answered Jan 19, 2012 at 01:01 PM

inder gravatar image

inder
211 18 19 22

(comments are locked)
10|1200 characters needed characters left
more ▼

answered Jan 20, 2012 at 09:52 AM

echoScout gravatar image

echoScout
11 2 2 3

(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

New code box

There's a new way to format code on the site - the red speech bubble logo will automatically format T-SQL for you. The original code box is still there for XML, etc. More details here.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

SQL Server Central

Need long-form SQL discussion? SQLserverCentral.com is the place.

Topics:

x1951
x7

asked: Jan 19, 2012 at 12:02 PM

Seen: 1507 times

Last Updated: Jan 19, 2012 at 12:02 PM