I have a data set (approximately 4 million records) with the following fields:
The total confidence score needs to be greater than or equal to 75%, and the Last Name field must have a minimum confidence of 80%. The following rules also need to be applied to the match process:
-If First Name has a length less than or equal to 7 characters, then the minimum confidence must be 90%; if First Name has a length greater than 7 characters, then the minimum confidence can be 80%.
-Street and City must have a minimum confidence of 90% OR Email must be an exact match (100%).
I have seen numerous SSIS examples where a Conditional Split was used to break up a data set by State, for example, but then the separate Fuzzy Grouping tasks used the same overall match logic. Breaking up by State is straightforward, but the two rules above amount to a more complicated either/or logic (Street and City greater than or equal to 90% OR Email equals 100%).
What would be the best way in Fuzzy Grouping to process this kind of either/or logic against the same data set? Any advice would be greatly appreciated!
asked Jul 18, 2012 at 09:44 PM in Default
I think the Conditional Split is still the way to go but in the Condition you will have to build it up to show what you need.
Or a simplified route may be to create views on the database based on your criteria and work with the views individually rather than as a whole. It's not as neat a solution but may help solve your problems.
answered Jul 20, 2012 at 10:49 AM
I got to thinking even after posting my question that a Conditional Split might still be the way to go. The problem is that some of the conditions depend on having already run Fuzzy Grouping (e.g., Email equals 100% confidence).
I'm wondering if the Fuzzy Grouping could be processed on the same data set sequentially? For example, I could create a task to process the Last Name according to the first criterion (minimum confidence of 80%), then take that output and create a task to process on the either/or variations of First Name, and so on, each time "adding" to the Fuzzy Grouping output. Once each task was processed, the next task could use a view, as you suggested, to extract the right data for the next run.
Thank you for the quick response!
answered Jul 20, 2012 at 10:47 PM