After we have fixed the performance problems with the codes written by Robert Bar and his brother Reggie Bar, now has the time come to their cousin, Fu Bar.
She was facing a not-so-simple problem by matching a list of user-entered SSN's (from a web page) with a list of validated SSN's. So far so good, right, but as we know, users aren't that particular careful when entering their personal information, so the SSN's have a number of errors in them.
The good thing is that all user-entered SSN's are checksum-validated and have duplicates removed.
All files (setup, sample data population and Fu Bar's code) can be downloaded from here. Fu's code is updated 10am Sunday 16th of May 2010 UTC. If you have downloaded the code before this time, please re-download.
The rules are as follow (in addition to the standard rules here)
1) A perfect match is valid
2) One single digit wrong is valid
3) Two digits wrong are valid, as long as they are not adjacent
4) Three or more digits wrong is invalid
5) If a user-entered SSN has a perfect match with a validated SSN, this user-entered SSN cannot be used in a combination with another validated SSN.
6) You are not allowed to change the schema of the two tables, with indexes as the only exception.
To clarify the non-adjacent rule:
a) A digit wrong in an odd position is allowed with a digit wrong on another odd position.
b) A digit wrong in an even position is allowed with a digit wrong on another even position.
c) A digit wrong in an odd position is not allowed with a digit wrong for any even position. This is because there is one flaw in the checksum algorithm used for the user-entered SSN's.
d) For very same reason as "c)", a digit wrong in an even position is not allowed with a digit wrong for any odd position.
Your task is to rewrite Fu's code (her code takes about 70-75 minutes to run) to produce a list in following style:
vSSN, ueSSN, Status
vSSN is the validated SSN from dbo.Validated table.
ueSSN is the user-entered SSN from the dbo.UserEntered table.
Status column is a tinyint describing how many characters that are wrong. 0 is a perfect match, 1 is one digit wrong match and 2 is a two digit wrong match.
The sorting is vital, because helpdesk will receive this output later for manual correction so an output with order by vSSN, ueSSN and Status is mandatory.
All entries have to be submitted before 9pm Sunday 30th of May 2010 (UTC). Entries will be evaluated as they are submitted and time allows. Final winner will be announced shortly after that.
First Prize for SQL Speed Phreak Challenge #5: $100 and a license for SQL Data Generator (used when creating the sample data) !
There will also be three categorized runner-up prizes:
* Best Cursor-based solution
* Best CLR solution
* Best Unusual/experimental attempt
The winners of these prizes will get a licensed copy of either SQL Prompt or SQL Refactor, whichever they prefer.
Don't forget to have fun and to be creative,
//Peso
Phil Factor:
The SSN problem was one of the most enjoyable to take part in, because the problem as so simple to define, but the answer was tricky to get right. In the end, all the solutions except for cursor-based ones, were performing in under a second, which made the final order rather more difficult to determine.
When differences are in tens of milliseconds, you can get a different order on different machines: This is a problem but when the size of the test data was determined, the existing cursor-based solution was running in fourteen minutes. We never predicted the extent of the Speed Phreakery we could achieve. The first set-based SQL solutions had this down to less than a handful of seconds, and soon the solutions were fighting in around the half-second mark.
The CLR-based solutions probably were at a slight advantage with this problem though it would be nice to see if this is true of different sizes of data. Matt and JAhlen were neck and neck with this, though JAhlen was a node ahead over the finishing line.
Daniel’s SSIS solution was a sober warning that one should never neglect the obvious answers. It gets the award for the Best Unusual/experimental attempt.
Matt was the clear winner of the best SQL solution (OK, Fu Bar was banned from the best Cursor-based solution). I hope he will want the license for Prompt/Refactor, (excellent for quickly formatting code) though he writes his own excellent SQL tools which are well worth a look.
Winner of the First Prize for the SQL Speed Phreak Challenge #5: $100 and a license for SQL Data Generator (used when creating the sample data) was Johan Åhlén (JAhlen).
It was a very difficult decision as to whether Daniel or Johan won since it was a photofinish. We asked various experts, but nobody would venture an opinion. (The actual timings will vary considerably on different servers, be warned. The amount of memory, no. Of CPUs and general machine configuration will make a difference). I almost had to toss a coin!
Kathi Kellenberger, who wrote the excellent http://www.simple-talk.com/sql/performance/set-based-speed-phreakery-the-fifo-stock-inventory-sql-problem/ Set-based Speed Phreakery: The FIFO Stock Inventory SQL Problem and http://www.simple-talk.com/sql/performance/writing-efficient-sql-set-based-speed-phreakery/ Writing Efficient SQL: Set-Based Speed Phreakery, is studying the results of this competition with the intention of writing it up for Simple-Talk like the others.
Many thanks to all. Peso for all the hard work, and for constructing such an interesting problem. Matt for his energy in coming up with both CLR and TSQL solutions, Scott for kicking things off, Fu Bar for trying her best, Daniel for thinking laterally, Johan for showing the way with the CLR routines, and all those whose comments encouraged the participants to greater efforts.
[1]: http://www.simple-talk.com/sql/performance/set-based-speed-phreakery-the-fifo-stock-inventory-sql-problem/