I have a table of part numbers of many million. I would like to join it to another reference table where the part numbers can be slightly different. For example I could have `ABC123` and `ABC321` which (believe it or not!) can be the same thing! More specifically I want to be able to join a part number similar to `H7C-2275Q-VT 6CE` and `H7C-2275Q-VT EC6` where the last three characters are different. I would also like to be able to match on differing percentages of match so `ABCD E123 45` would match with `ABCD E123` for example. I don't mind using a routine and producing a static table. Any ideas? I must admit this has got me a bit stumped!
I've used the tsql impementation of the Jaro Winkler algorithm for stuff like this before. It's very slow - the implementation that I've used is full of nested loops, but it can be useful. It compares two combinations of characters and comes up with a score between (I think) 0.5 being completely random and 1 being identical. It, from memory takes into account missing characters, and transpositions amongst other things. so comparing "123-AGEG-GHT" to "123-AGED-GHR" would score very close to 1 and comparing "123-AGEG-GHT" to "gi464oeern" would score near 0.5. See [
You could look at Soundex function to see if matches your matching rules or possibly a regular expression if you define under which conditions a match could occur. An alternative and possible more efficient option would be to create a third table that had a 1 to many link from the main parts table with every match that you expect then join this to the other table.
Honestly, nothing is going to work well and consistently except structural changes. If it's only the last three characters that are different, but you have a mechanism for mapping them, one to another, you could create a mapping table. But a better approach would be some sort of matching table that links directly between your two key values. But, that's going to require additional joins in order to retrieve your data which could be a little bit of a performance hit. You could also put in an additional column one of the tables, but any calculations, hash, soundex, are more than likely going to fail a fairly large percentage of the time making a structural change the right way to go.