question

harika avatar image
harika asked

Comparing tables

Hi Iam trying to join two tables with respect to their name and address. But the problem is that there is a mismatch in the spellings. like if its 1 E st in table A then the same is written as One E st in table B.,so its unable to return all possbile outcomes. Also there are problems with extra spaces,abbrevations etc, It is a file of almost 7000 records. Anyone can suggest me the best way to solve this porblem. Thank you very much.
joins
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Mark avatar image Mark commented ·
@harika, what version of SQL Server are you using?
1 Like 1 ·
Håkan Winther avatar image
Håkan Winther answered
The only options I can think of right now is "[full text search][1]",SSIS with [Fuzzy lookup][2] and maybe some features of master data services. I haven't used any of them myself, but it should be possible, and i think master data services in Denali is trying to target issues like this ( I saw a quick demo in Seattle where they used master data services to create "clean" data). [1]: http://msdn.microsoft.com/en-us/library/ms142571.aspx [2]: http://msdn.microsoft.com/en-us/library/ms137786.aspx
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Mark avatar image
Mark answered
You can try to use [SOUNDEX][1], but I'd be careful with it and verify its results (which makes it less useful). But mainly I have sympathy for you. [1]: http://msdn.microsoft.com/en-us/library/ms187384.aspx
3 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

TimothyAWiseman avatar image TimothyAWiseman commented ·
This is a good suggestion to use in conjunction with other techniques, but in particular the soundex for all numbers written as numerals is '0000', so I do not think it will help in the particular situation where one is a number written as numerals and the other is a number spelled out.
0 Likes 0 ·
Mark avatar image Mark commented ·
@TimothyAWiseman, yes, I didn't mean to imply that this was the only solution and not to use it with others. Also, I've worked with data like that and imagined that he would also run into mis-spellings where SOUNDEX could be helpful.
0 Likes 0 ·
TimothyAWiseman avatar image TimothyAWiseman commented ·
Oh certainly, I think it would be a great piece of the overall solution and a very good suggestion. I just want to make it obvious to people without much experience with soundex that it could only be a piece of the solution and that it won't handle the specific example ('1' vs 'one') in the question.
0 Likes 0 ·
Kevin Feasel avatar image
Kevin Feasel answered
I've had some success with SSIS's Fuzzy Lookup functionality. It's not perfect and you'll probably have to play around with the cutoff range, but it will eliminate a large percentage of the problem records. Unfortunately, you'll probably need to check the results afterward, but it will eliminate a large chunk of just-off-by-a-little results.
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Håkan Winther avatar image Håkan Winther commented ·
+1 I should look into the Fuzzy lookup feature someday. :)
0 Likes 0 ·
ThomasRushton avatar image
ThomasRushton answered
Sounds like the sort of stuff that some of the [Datenfabrik][1] tools might be able to help with... [1]: http://www.datenfabrik.com
3 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Mark avatar image Mark commented ·
Wow, I'll have to pick up German right quick to read that web page... :0)
0 Likes 0 ·
sp_lock avatar image sp_lock commented ·
Click on the english flag... ;-)
0 Likes 0 ·
Håkan Winther avatar image Håkan Winther commented ·
Or google translate :)
0 Likes 0 ·
TimothyAWiseman avatar image
TimothyAWiseman answered
The first question is can you you change the schema? If you can, then it may make sense to create a table C that has an ID number and the addresses, and then have both table A and table B link to table C. When you need to find where the address in table A and table B matches, then you can do it on the key number from table C. If you can't do that, can you create and import a CLR assembly? If you can, then you can write the matching function in C# where you can invoke regex and invoke conditionals to do things like try the spelled out version of numbers when present (or vice versa). It would liekly be slow, but it could be more effective then anything you could do in SQL. If neight of those are options, then I think Hakan's suggestions are the way to go.
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Håkan Winther avatar image Håkan Winther commented ·
+1 I think thats an excellent idea, string manipulation in C# is much faster, but you have to come up with your Fuzzy logic algorithm by yourself, and that can take more time then exporting everything to Excel, do some smart sorting, and alter the records by hand. 7000 records isn't that much. :)
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.