x

Fuzzy Grouping SSIS - How to Run Either/Or Logic Against Same Data Set,

I have a data set (approximately 4 million records) with the following fields:

-Id

-First Name

-Last Name

-Street

-City

-State

-Zip

-Email

The total confidence score needs to be greater than or equal to 75%, and the Last Name field must have a minimum confidence of 80%. The following rules also need to be applied to the match process:

-If First Name has a length less than or equal to 7 characters, then the minimum confidence must be 90%; if First Name has a length greater than 7 characters, then the minimum confidence can be 80%.

-Street and City must have a minimum confidence of 90% OR Email must be an exact match (100%).

I have seen numerous SSIS examples where a Conditional Split was used to break up a data set by State, for example, but then the separate Fuzzy Grouping tasks used the same overall match logic. Breaking up by State is straightforward, but the two rules above amount to a more complicated either/or logic (Street and City greater than or equal to 90% OR Email equals 100%).

What would be the best way in Fuzzy Grouping to process this kind of either/or logic against the same data set? Any advice would be greatly appreciated!

more ▼

asked Jul 18, 2012 at 09:44 PM in Default

Gawain41 gravatar image

Gawain41
0 1 1 1

(comments are locked)
10|1200 characters needed characters left

2 answers: sort voted first

I think the Conditional Split is still the way to go but in the Condition you will have to build it up to show what you need.

Or a simplified route may be to create views on the database based on your criteria and work with the views individually rather than as a whole. It's not as neat a solution but may help solve your problems.
more ▼

answered Jul 20, 2012 at 10:49 AM

Mrs_Fatherjack gravatar image

Mrs_Fatherjack
4.7k 60 62 67

(comments are locked)
10|1200 characters needed characters left

I got to thinking even after posting my question that a Conditional Split might still be the way to go. The problem is that some of the conditions depend on having already run Fuzzy Grouping (e.g., Email equals 100% confidence).

I'm wondering if the Fuzzy Grouping could be processed on the same data set sequentially? For example, I could create a task to process the Last Name according to the first criterion (minimum confidence of 80%), then take that output and create a task to process on the either/or variations of First Name, and so on, each time "adding" to the Fuzzy Grouping output. Once each task was processed, the next task could use a view, as you suggested, to extract the right data for the next run.

Thank you for the quick response!
more ▼

answered Jul 20, 2012 at 10:47 PM

Gawain41 gravatar image

Gawain41
0 1 1 1

(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

SQL Server Central

Need long-form SQL discussion? SQLserverCentral.com is the place.

Topics:

x946
x3

asked: Jul 18, 2012 at 09:44 PM

Seen: 1593 times

Last Updated: Jul 20, 2012 at 10:47 PM