question

Ian Roke avatar image
Ian Roke asked

Join tables based on similar terms

I have a table of part numbers of many million. I would like to join it to another reference table where the part numbers can be slightly different. For example I could have `ABC123` and `ABC321` which (believe it or not!) can be the same thing! More specifically I want to be able to join a part number similar to `H7C-2275Q-VT 6CE` and `H7C-2275Q-VT EC6` where the last three characters are different. I would also like to be able to match on differing percentages of match so `ABCD E123 45` would match with `ABCD E123` for example. I don't mind using a routine and producing a static table. Any ideas? I must admit this has got me a bit stumped!
sql-server-2005t-sqljoins
2 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

KenJ avatar image KenJ commented ·
Going for some matching rules here: I see that 6CE would match EC6 which contains the same characters. Would it still match if it was 6CE and 4DA? How should it match if you have 6CE, EC6, 6EC and E6C? Can you go into more detail on the percentage piece? It looks like they can still match even if one of the numbers doesn't have the final characters at all. If so, how would that weigh against another number that had the final characters (maybe the same, maybe in a different order, or with some different characters)?
0 Likes 0 ·
Grant Fritchey avatar image Grant Fritchey ♦♦ commented ·
Remember to vote up the questions that were helpful and mark any that solved the problem. If you need clarification or additional information use the comments.
0 Likes 0 ·
GPO avatar image
GPO answered
I've used the tsql impementation of the Jaro Winkler algorithm for stuff like this before. It's very slow - the implementation that I've used is full of nested loops, but it can be useful. It compares two combinations of characters and comes up with a score between (I think) 0.5 being completely random and 1 being identical. It, from memory takes into account missing characters, and transpositions amongst other things. so comparing "123-AGEG-GHT" to "123-AGED-GHR" would score very close to 1 and comparing "123-AGEG-GHT" to "gi464oeern" would score near 0.5. See [ http://www.sqlservercentral.com/articles/Fuzzy+Match/65702/][1] [1]: http://www.sqlservercentral.com/articles/Fuzzy+Match/65702/
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

tanglesoft avatar image
tanglesoft answered
You could look at Soundex function to see if matches your matching rules or possibly a regular expression if you define under which conditions a match could occur. An alternative and possible more efficient option would be to create a third table that had a 1 to many link from the main parts table with every match that you expect then join this to the other table.
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Grant Fritchey avatar image
Grant Fritchey answered
Honestly, nothing is going to work well and consistently except structural changes. If it's only the last three characters that are different, but you have a mechanism for mapping them, one to another, you could create a mapping table. But a better approach would be some sort of matching table that links directly between your two key values. But, that's going to require additional joins in order to retrieve your data which could be a little bit of a performance hit. You could also put in an additional column one of the tables, but any calculations, hash, soundex, are more than likely going to fail a fairly large percentage of the time making a structural change the right way to go.
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.