|
After we have fixed the performance problems with the codes written by Robert Bar and his brother Reggie Bar, now has the time come to their cousin, Fu Bar. She was facing a not-so-simple problem by matching a list of user-entered SSN's (from a web page) with a list of validated SSN's. So far so good, right, but as we know, users aren't that particular careful when entering their personal information, so the SSN's have a number of errors in them. The rules are as follow (in addition to the standard rules here) 1) A perfect match is valid To clarify the non-adjacent rule: Your task is to rewrite Fu's code (her code takes about 70-75 minutes to run) to produce a list in following style: vSSN, ueSSN, Status vSSN is the validated SSN from dbo.Validated table. The sorting is vital, because helpdesk will receive this output later for manual correction so an output with order by vSSN, ueSSN and Status is mandatory. All entries have to be submitted before 9pm Sunday 30th of May 2010 (UTC). Entries will be evaluated as they are submitted and time allows. Final winner will be announced shortly after that. First Prize for SQL Speed Phreak Challenge #5: $100 and a license for SQL Data Generator (used when creating the sample data) ! There will also be three categorized runner-up prizes:
The winners of these prizes will get a licensed copy of either SQL Prompt or SQL Refactor, whichever they prefer. Don't forget to have fun and to be creative, //Peso Phil Factor: The SSN problem was one of the most enjoyable to take part in, because the problem as so simple to define, but the answer was tricky to get right. In the end, all the solutions except for cursor-based ones, were performing in under a second, which made the final order rather more difficult to determine. When differences are in tens of milliseconds, you can get a different order on different machines: This is a problem but when the size of the test data was determined, the existing cursor-based solution was running in fourteen minutes. We never predicted the extent of the Speed Phreakery we could achieve. The first set-based SQL solutions had this down to less than a handful of seconds, and soon the solutions were fighting in around the half-second mark. The CLR-based solutions probably were at a slight advantage with this problem though it would be nice to see if this is true of different sizes of data. Matt and JAhlen were neck and neck with this, though JAhlen was a node ahead over the finishing line. Daniel’s SSIS solution was a sober warning that one should never neglect the obvious answers. It gets the award for the Best Unusual/experimental attempt. Matt was the clear winner of the best SQL solution (OK, Fu Bar was banned from the best Cursor-based solution). I hope he will want the license for Prompt/Refactor, (excellent for quickly formatting code) though he writes his own excellent SQL tools which are well worth a look. Winner of the First Prize for the SQL Speed Phreak Challenge #5: $100 and a license for SQL Data Generator (used when creating the sample data) was Johan Åhlén (JAhlen). It was a very difficult decision as to whether Daniel or Johan won since it was a photofinish. We asked various experts, but nobody would venture an opinion. (The actual timings will vary considerably on different servers, be warned. The amount of memory, no. Of CPUs and general machine configuration will make a difference). I almost had to toss a coin! Kathi Kellenberger, who wrote the excellent http://www.simple-talk.com/sql/performance/set-based-speed-phreakery-the-fifo-stock-inventory-sql-problem/ Set-based Speed Phreakery: The FIFO Stock Inventory SQL Problem and http://www.simple-talk.com/sql/performance/writing-efficient-sql-set-based-speed-phreakery/ Writing Efficient SQL: Set-Based Speed Phreakery, is studying the results of this competition with the intention of writing it up for Simple-Talk like the others. Many thanks to all. Peso for all the hard work, and for constructing such an interesting problem. Matt for his energy in coming up with both CLR and TSQL solutions, Scott for kicking things off, Fu Bar for trying her best, Daniel for thinking laterally, Johan for showing the way with the CLR routines, and all those whose comments encouraged the participants to greater efforts.
(comments are locked)
|
I'm titling this one: "Compute Scalar" :)
SELECT val.SSN [vSSN], uen.SSN [ueSSN],
CASE WHEN SUBSTRING(val.SSN,1,1) = SUBSTRING(uen.SSN,1,1) THEN 0 ELSE 1 END +
CASE WHEN SUBSTRING(val.SSN,2,1) = SUBSTRING(uen.SSN,2,1) THEN 0 ELSE 1 END +
CASE WHEN SUBSTRING(val.SSN,3,1) = SUBSTRING(uen.SSN,3,1) THEN 0 ELSE 1 END +
CASE WHEN SUBSTRING(val.SSN,4,1) = SUBSTRING(uen.SSN,4,1) THEN 0 ELSE 1 END +
CASE WHEN SUBSTRING(val.SSN,5,1) = SUBSTRING(uen.SSN,5,1) THEN 0 ELSE 1 END +
CASE WHEN SUBSTRING(val.SSN,6,1) = SUBSTRING(uen.SSN,6,1) THEN 0 ELSE 1 END +
CASE WHEN SUBSTRING(val.SSN,7,1) = SUBSTRING(uen.SSN,7,1) THEN 0 ELSE 1 END +
CASE WHEN SUBSTRING(val.SSN,8,1) = SUBSTRING(uen.SSN,8,1) THEN 0 ELSE 1 END +
CASE WHEN SUBSTRING(val.SSN,9,1) = SUBSTRING(uen.SSN,9,1) THEN 0 ELSE 1 END [Status]
FROM (
SELECT v.[SSN], x.SSN2
FROM [dbo].[Validated] v
CROSS APPLY (SELECT SUBSTRING(v.SSN,2,1) + SUBSTRING(v.SSN,4,1) + SPACE(2)
+ SUBSTRING(v.SSN,1,1)
+ SUBSTRING(v.SSN,3,1)
+ SUBSTRING(v.SSN,5,1)
+ SUBSTRING(v.SSN,7,1)
+ SUBSTRING(v.SSN,9,1) [SSN2] UNION ALL
SELECT SUBSTRING(v.SSN,2,1) + SPACE(1) + SUBSTRING(v.SSN,6,1) + SPACE(1)
+ SUBSTRING(v.SSN,1,1)
+ SUBSTRING(v.SSN,3,1)
+ SUBSTRING(v.SSN,5,1)
+ SUBSTRING(v.SSN,7,1)
+ SUBSTRING(v.SSN,9,1) UNION ALL
SELECT SUBSTRING(v.SSN,2,1) + SPACE(2) + SUBSTRING(v.SSN,8,1)
+ SUBSTRING(v.SSN,1,1)
+ SUBSTRING(v.SSN,3,1)
+ SUBSTRING(v.SSN,5,1)
+ SUBSTRING(v.SSN,7,1)
+ SUBSTRING(v.SSN,9,1) UNION ALL
SELECT SPACE(1) + SUBSTRING(v.SSN,4,1) + SUBSTRING(v.SSN,6,1) + SPACE(1)
+ SUBSTRING(v.SSN,1,1)
+ SUBSTRING(v.SSN,3,1)
+ SUBSTRING(v.SSN,5,1)
+ SUBSTRING(v.SSN,7,1)
+ SUBSTRING(v.SSN,9,1) UNION ALL
SELECT SPACE(1) + SUBSTRING(v.SSN,4,1) + SPACE(1) + SUBSTRING(v.SSN,8,1)
+ SUBSTRING(v.SSN,1,1)
+ SUBSTRING(v.SSN,3,1)
+ SUBSTRING(v.SSN,5,1)
+ SUBSTRING(v.SSN,7,1)
+ SUBSTRING(v.SSN,9,1) UNION ALL
SELECT SPACE(2) + SUBSTRING(v.SSN,6,1) + SUBSTRING(v.SSN,8,1)
+ SUBSTRING(v.SSN,1,1)
+ SUBSTRING(v.SSN,3,1)
+ SUBSTRING(v.SSN,5,1)
+ SUBSTRING(v.SSN,7,1)
+ SUBSTRING(v.SSN,9,1) UNION ALL
SELECT SUBSTRING(v.SSN,2,1)
+ SUBSTRING(v.SSN,4,1)
+ SUBSTRING(v.SSN,6,1)
+ SUBSTRING(v.SSN,8,1)
+ SPACE(2) + SUBSTRING(v.SSN,5,1) + SUBSTRING(v.SSN,7,1) + SUBSTRING(v.SSN,9,1) UNION ALL
SELECT SUBSTRING(v.SSN,2,1)
+ SUBSTRING(v.SSN,4,1)
+ SUBSTRING(v.SSN,6,1)
+ SUBSTRING(v.SSN,8,1)
+ SPACE(1) + SUBSTRING(v.SSN,3,1) + SPACE(1) + SUBSTRING(v.SSN,7,1) + SUBSTRING(v.SSN,9,1) UNION ALL
SELECT SUBSTRING(v.SSN,2,1)
+ SUBSTRING(v.SSN,4,1)
+ SUBSTRING(v.SSN,6,1)
+ SUBSTRING(v.SSN,8,1)
+ SPACE(1) + SUBSTRING(v.SSN,3,1) + SUBSTRING(v.SSN,5,1) + SPACE(1) + SUBSTRING(v.SSN,9,1) UNION ALL
SELECT SUBSTRING(v.SSN,2,1)
+ SUBSTRING(v.SSN,4,1)
+ SUBSTRING(v.SSN,6,1)
+ SUBSTRING(v.SSN,8,1)
+ SPACE(1) + SUBSTRING(v.SSN,3,1) + SUBSTRING(v.SSN,5,1) + SUBSTRING(v.SSN,7,1) + SPACE(1) UNION ALL
SELECT SUBSTRING(v.SSN,2,1)
+ SUBSTRING(v.SSN,4,1)
+ SUBSTRING(v.SSN,6,1)
+ SUBSTRING(v.SSN,8,1)
+ SUBSTRING(v.SSN,1,1) + SPACE(2) + SUBSTRING(v.SSN,7,1) + SUBSTRING(v.SSN,9,1) UNION ALL
SELECT SUBSTRING(v.SSN,2,1)
+ SUBSTRING(v.SSN,4,1)
+ SUBSTRING(v.SSN,6,1)
+ SUBSTRING(v.SSN,8,1)
+ SUBSTRING(v.SSN,1,1) + SPACE(1) + SUBSTRING(v.SSN,5,1) + SPACE(1) + SUBSTRING(v.SSN,9,1) UNION ALL
SELECT SUBSTRING(v.SSN,2,1)
+ SUBSTRING(v.SSN,4,1)
+ SUBSTRING(v.SSN,6,1)
+ SUBSTRING(v.SSN,8,1)
+ SUBSTRING(v.SSN,1,1) + SPACE(1) + SUBSTRING(v.SSN,5,1) + SUBSTRING(v.SSN,7,1) + SPACE(1) UNION ALL
SELECT SUBSTRING(v.SSN,2,1)
+ SUBSTRING(v.SSN,4,1)
+ SUBSTRING(v.SSN,6,1)
+ SUBSTRING(v.SSN,8,1)
+ SUBSTRING(v.SSN,1,1) + SUBSTRING(v.SSN,3,1) + SPACE(2) + SUBSTRING(v.SSN,9,1) UNION ALL
SELECT SUBSTRING(v.SSN,2,1)
+ SUBSTRING(v.SSN,4,1)
+ SUBSTRING(v.SSN,6,1)
+ SUBSTRING(v.SSN,8,1)
+ SUBSTRING(v.SSN,1,1) + SUBSTRING(v.SSN,3,1) + SPACE(1) + SUBSTRING(v.SSN,7,1) + SPACE(1) UNION ALL
SELECT SUBSTRING(v.SSN,2,1)
+ SUBSTRING(v.SSN,4,1)
+ SUBSTRING(v.SSN,6,1)
+ SUBSTRING(v.SSN,8,1)
+ SUBSTRING(v.SSN,1,1) + SUBSTRING(v.SSN,3,1) + SUBSTRING(v.SSN,5,1) + SPACE(2))x
)val
JOIN
(SELECT ue.[SSN], y.SSN2
FROM [dbo].[UserEntered] ue
LEFT JOIN [dbo].[Validated] v ON (v.SSN = ue.SSN)
CROSS APPLY (SELECT SUBSTRING(ue.SSN,2,1) + SUBSTRING(ue.SSN,4,1) + SPACE(2)
+ SUBSTRING(ue.SSN,1,1)
+ SUBSTRING(ue.SSN,3,1)
+ SUBSTRING(ue.SSN,5,1)
+ SUBSTRING(ue.SSN,7,1)
+ SUBSTRING(ue.SSN,9,1) [SSN2] UNION ALL
SELECT SUBSTRING(ue.SSN,2,1) + SPACE(1) + SUBSTRING(ue.SSN,6,1) + SPACE(1)
+ SUBSTRING(ue.SSN,1,1)
+ SUBSTRING(ue.SSN,3,1)
+ SUBSTRING(ue.SSN,5,1)
+ SUBSTRING(ue.SSN,7,1)
+ SUBSTRING(ue.SSN,9,1) UNION ALL
SELECT SUBSTRING(ue.SSN,2,1) + SPACE(2) + SUBSTRING(ue.SSN,8,1)
+ SUBSTRING(ue.SSN,1,1)
+ SUBSTRING(ue.SSN,3,1)
+ SUBSTRING(ue.SSN,5,1)
+ SUBSTRING(ue.SSN,7,1)
+ SUBSTRING(ue.SSN,9,1) UNION ALL
SELECT SPACE(1) + SUBSTRING(ue.SSN,4,1) + SUBSTRING(ue.SSN,6,1) + SPACE(1)
+ SUBSTRING(ue.SSN,1,1)
+ SUBSTRING(ue.SSN,3,1)
+ SUBSTRING(ue.SSN,5,1)
+ SUBSTRING(ue.SSN,7,1)
+ SUBSTRING(ue.SSN,9,1) UNION ALL
SELECT SPACE(1) + SUBSTRING(ue.SSN,4,1) + SPACE(1) + SUBSTRING(ue.SSN,8,1)
+ SUBSTRING(ue.SSN,1,1)
+ SUBSTRING(ue.SSN,3,1)
+ SUBSTRING(ue.SSN,5,1)
+ SUBSTRING(ue.SSN,7,1)
+ SUBSTRING(ue.SSN,9,1) UNION ALL
SELECT SPACE(2) + SUBSTRING(ue.SSN,6,1) + SUBSTRING(ue.SSN,8,1)
+ SUBSTRING(ue.SSN,1,1)
+ SUBSTRING(ue.SSN,3,1)
+ SUBSTRING(ue.SSN,5,1)
+ SUBSTRING(ue.SSN,7,1)
+ SUBSTRING(ue.SSN,9,1) UNION ALL
SELECT SUBSTRING(ue.SSN,2,1)
+ SUBSTRING(ue.SSN,4,1)
+ SUBSTRING(ue.SSN,6,1)
+ SUBSTRING(ue.SSN,8,1)
+ SPACE(2) + SUBSTRING(ue.SSN,5,1) + SUBSTRING(ue.SSN,7,1) + SUBSTRING(ue.SSN,9,1) UNION ALL
SELECT SUBSTRING(ue.SSN,2,1)
+ SUBSTRING(ue.SSN,4,1)
+ SUBSTRING(ue.SSN,6,1)
+ SUBSTRING(ue.SSN,8,1)
+ SPACE(1) + SUBSTRING(ue.SSN,3,1) + SPACE(1) + SUBSTRING(ue.SSN,7,1) + SUBSTRING(ue.SSN,9,1) UNION ALL
SELECT SUBSTRING(ue.SSN,2,1)
+ SUBSTRING(ue.SSN,4,1)
+ SUBSTRING(ue.SSN,6,1)
+ SUBSTRING(ue.SSN,8,1)
+ SPACE(1) + SUBSTRING(ue.SSN,3,1) + SUBSTRING(ue.SSN,5,1) + SPACE(1) + SUBSTRING(ue.SSN,9,1) UNION ALL
SELECT SUBSTRING(ue.SSN,2,1)
+ SUBSTRING(ue.SSN,4,1)
+ SUBSTRING(ue.SSN,6,1)
+ SUBSTRING(ue.SSN,8,1)
+ SPACE(1) + SUBSTRING(ue.SSN,3,1) + SUBSTRING(ue.SSN,5,1) + SUBSTRING(ue.SSN,7,1) + SPACE(1) UNION ALL
SELECT SUBSTRING(ue.SSN,2,1)
+ SUBSTRING(ue.SSN,4,1)
+ SUBSTRING(ue.SSN,6,1)
+ SUBSTRING(ue.SSN,8,1)
+ SUBSTRING(ue.SSN,1,1) + SPACE(2) + SUBSTRING(ue.SSN,7,1) + SUBSTRING(ue.SSN,9,1) UNION ALL
SELECT SUBSTRING(ue.SSN,2,1)
+ SUBSTRING(ue.SSN,4,1)
+ SUBSTRING(ue.SSN,6,1)
+ SUBSTRING(ue.SSN,8,1)
+ SUBSTRING(ue.SSN,1,1) + SPACE(1) + SUBSTRING(ue.SSN,5,1) + SPACE(1) + SUBSTRING(ue.SSN,9,1) UNION ALL
SELECT SUBSTRING(ue.SSN,2,1)
+ SUBSTRING(ue.SSN,4,1)
+ SUBSTRING(ue.SSN,6,1)
+ SUBSTRING(ue.SSN,8,1)
+ SUBSTRING(ue.SSN,1,1) + SPACE(1) + SUBSTRING(ue.SSN,5,1) + SUBSTRING(ue.SSN,7,1) + SPACE(1) UNION ALL
SELECT SUBSTRING(ue.SSN,2,1)
+ SUBSTRING(ue.SSN,4,1)
+ SUBSTRING(ue.SSN,6,1)
+ SUBSTRING(ue.SSN,8,1)
+ SUBSTRING(ue.SSN,1,1) + SUBSTRING(ue.SSN,3,1) + SPACE(2) + SUBSTRING(ue.SSN,9,1) UNION ALL
SELECT SUBSTRING(ue.SSN,2,1)
+ SUBSTRING(ue.SSN,4,1)
+ SUBSTRING(ue.SSN,6,1)
+ SUBSTRING(ue.SSN,8,1)
+ SUBSTRING(ue.SSN,1,1) + SUBSTRING(ue.SSN,3,1) + SPACE(1) + SUBSTRING(ue.SSN,7,1) + SPACE(1) UNION ALL
SELECT SUBSTRING(ue.SSN,2,1)
+ SUBSTRING(ue.SSN,4,1)
+ SUBSTRING(ue.SSN,6,1)
+ SUBSTRING(ue.SSN,8,1)
+ SUBSTRING(ue.SSN,1,1) + SUBSTRING(ue.SSN,3,1) + SUBSTRING(ue.SSN,5,1) + SPACE(2))y
WHERE v.SSN IS NULL)uen
ON (val.SSN2 = uen.SSN2)
GROUP BY Val.SSN,uen.SSN
UNION ALL
SELECT v.[SSN] , ue.SSN , 0
FROM [dbo].[Validated] v
JOIN [dbo].[UserEntered] ue ON (ue.SSN = v.SSN)
ORDER BY [vSSN], [ueSSN], [status]
(comments are locked)
|
|
--Phil Factor 1A (I'll call this one the 'remove the hopeless and the obvious User-SSNS before you start' Good job Phil, I also am working on a solution that reduces the number of effective rows acted upon. With only 8K rows returned we should see some sub-second entries soon
May 17 '10 at 02:25 PM
Scot Hauder
The trick here is that any SSNs that have any hope of matching must have a complete match on either a string made up of the odd characters, or one of the even characters. You can safely junk the rest. This SQL is over-complicated because I'd wrongly assumed that we were doing a 'best-match' algorithm, and I altered it at the last minute. When I get a moment, I'll tidy it up.
May 18 '10 at 05:06 AM
Phil Factor
(comments are locked)
|
|
I call this the pure SQL CLR solution... J
Oh, by the way you need to add the SQL CLR assembly first... Vad kul att se dig här!
May 18 '10 at 11:31 AM
Peso
And for you non-swedish speakers; "Great to see you here!".
May 18 '10 at 11:32 AM
Peso
I get this error message trying to run the SQLCLR. "Msg 6532, Level 16, State 49, Line 1 .NET Framework execution was aborted by escalation policy because of out of memory. System.Threading.ThreadAbortException: Thread was being aborted. System.Threading.ThreadAbortException: at System.Collections.BitArray..ctor(Int32 length, Boolean defaultValue) at UserDefinedFunctions.Match()"
May 18 '10 at 11:41 AM
Peso
Om inte någon annan skriver ett snabbare CLR lösning, är du dömd att få pris för bästa CLR-baserad lösning. Välkommen till tävlingen, och tack för att komma in.
May 18 '10 at 01:04 PM
Phil Factor
Thanks. My SQL CLR solution works on my laptop (64-bit SQL Server 2008 R2 RTM with default settings) and takes approx 2 300 ms. If necessary I can make some changes so it takes less memory (but will be a bit slower).
May 18 '10 at 04:12 PM
JAhlen
(comments are locked)
|
|
New version (more memory conservative, but slower). I call this the pure SQL CLR solution 2... J
Code for the SQL CLR assembly... This is 100 ms faster than v1.
May 18 '10 at 06:20 PM
Peso
(comments are locked)
|
|
Matt v1 Titled the 'This probably really won't scale very well, but it's fast for the given data' version I'll try this after I run Fu's code again to get read count.
May 18 '10 at 06:21 PM
Peso
... and get some sleep while it's running! :)
May 18 '10 at 06:31 PM
Matt Whitfield ♦♦
... Just updated the order by clause
May 18 '10 at 06:56 PM
Matt Whitfield ♦♦
(comments are locked)
|
1 2 3 4 5 next page »


Hi Peso,
My question is with #5. I wanted to confirm that you DO want to allow the inverse of this: basically if a user-entered SSN matches perfectly to a validated SSN this validated SSN can be used with another user-entered SSN. Eg
select * from Validated v join UserEntered ue on ue.SSN = v.SSN where v.SSN='545013815'
has a perfect match but is also returned by Fu for a user that has entered it incorrectly: 545515815
Thx, have a great weekend
It's correct. The vSSN "545013815" has a perfect match with a user-entered SSN "545013815" and as such, the user-entered SSN "545013815" cannot be used for any other match. However, there can be more user-entered SSN's which can match the validated SSN "545013815", and it does with the user-entered SSN "545515815". Also make sure you have downloaded the latest code for Fu. I discovered she didn't comply to all the rules. Shame on her! :-) Fu's code was re-uploaded for an hour ago.
//Peso
Thanks Peso, I think that was it
I didn't think Fu modified the script that much but it went from 7min to 45min
That's the beauty of OR vs AND! :-)