I am preparing for a data extraction task. I need to remove a set of terms; none, some or all may be present in each source record string. There are over 100,000 target records. I want to avoid performing single term match/replace actions, since (a) the list of terms-to-be-removed will likely grow, and (b) the time to perform the current match/replace action one term at a time is unacceptable.
My question: how do I modify the regular expression to include each term within the OR separated list?
' and | and or | a o | company | co | c o | dba | d b a '
Replace each found term (including the prefix and suffix spaces) with a single space.
Every "even" (as opposed to "odd") term found is replaced (including the prefix and suffix spaces) with a single space.
' MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT '
Result String (Desired Behavior)
' MASHABLE INFORMATION EXPERTS COPYRIGHT '
Result String (Actual Behavior)
' MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT '
SQL Server 2005
User Defined Function regexReplace relying on VBScript.RegExp (code available at end of post)
... producing the result:
*** dbo.regexReplace user-defined function definition ***
asked Jan 19, 2012 at 12:02 PM in Default
The issue is caused by the spaces. I suggest you to make the RegEx replace in two passes. First remove the unwanted words and in second pass remove multiple spaces.
I tested the RegEx and when you have no spaces in the choices the replacement works correctly. (probably the issue is cause when first of choices is replaced then a space is removed and the second choice which contains space is not matched correctly).
answered Jan 19, 2012 at 01:07 PM
I found the operation by regex function as correct 1. words "LTD" and "THE" are not mentioned in the regex function so those will not be replaced. 2. Two consective and words have only one space between them and so only one of them will be removed. 3. Similar is the behaviour of other words like company co. 4. One solution may be to have alist of words to replace and call them one by one in regex instead of once.That will also partially solve your problem as some words will be unnecessarily joined. 5. If you use above said technique with eiterh proceeding or following space.Your problem will be solved.
answered Jan 19, 2012 at 01:01 PM
Thank you for the responses.
[The question was answered on StackExchange.com]: http://stackoverflow.com/questions/8932453/how-to-define-a-regular-expression-with-multiple-or-operators-where-each-term-in