We're trying to find an acceptable way to clean up production SSNs for use outside production. I've been looking at one way hashing with salt. I've looked at the HashBytes and Checksum functions. I can use these to generate a pretty good 4 byte hash of an SSN.
I'd really like to end up with a 9 digit number so the application actually works. Has anyone done anything like this? Any thoughts on how to accomplish this? It would need to be compliant with all the various privacy standards.
My data set is 100 million rows (20 million unique SSNs) and the SSN appears in multiple tables (yes, I know that's an issue). I'm starting to think that I may just need to build a temporary lookup table and generate 20 million or so random SSNs and then do updates to the original table. After that I can drop my original mapping table.
Answer by K. Brian Kelley ·
I would go with the temporary lookup table populated with fake values matched up with the real ones. Instead of doing just a random number generation, you could generate the 3 parts (area number, group number, and serial number) separately and combine them to form a 9 character string (I'm assuming that's how you're storing or you'd lose the leading zeroes). You could even keep track of the area number + group number and the current serial number in your SSN generation code so you can ensure there's no duplicates.
Answer by mrdenny ·
You could use the output from the CHECKSUM() function. Just pad the results with 0s at either end. Multiple values can have the same checksum, so while it's possible to reverse engineer the values, you would have to get all the possible values (which would be possible no matter what technique you use as SSNs are only 9 characters in length). And this way the values will be the same across all the tables.
Answer by Peso ·
American SSN's doesn't have a checksum, right?
What about a simple
UPDATE x SET x.SSN = x.recID FROM ( SELECT SSN, DENSE_RANK() OVER (ORDER BY SSN) + 99999999 AS recID FROM Table1 ) AS x
The new numbers are random in nature even if they are sequential.
Answer by chris 1 2 ·
Do you really care of the SSN's are valid everywhere? I mean, you want them all to be 9 digit numbers, but if you're not using them as keys then it's probably not a big deal right?
Otherwise, yeah, the solution you came up with in the 3rd paragraph is how I handle CC numbers: Create a bunch of random ones, link them together, hide the link table with security.
Answer by marlonribunal ·
You may find this eBook from Simple Talk Publishing and Red Gate Software helpful in protecting your data. I haven't read the book so I can't really tell if it can help you find ways to solve your problem. But it should at least give you solid concepts on protecting your data.
Answer by esabine ·
I also work in an area where we have SSNs, Fed Tax IDs, and CC numbers. We also have the same challenge but we address it with security and not obfuscation.
If you can't approach this from an access rights issue, I would recommend a table that stores the unique SSNs and then an auto assigned surrogate key. If it must be 9 digits then just seed it with 100000000 but you'll need to teach the users that this number isn't a tax ID.
Now 20MM rows is a lot. For a smaller population I'd suggest using foreign keys with ON UPDATE CASCADE and then allow SQL Server to cascade the updates out after you update the master table but.....20MM rows is a lot. Sounds like an off-hours maintenance effort to work through the updating.