x

How to hash SSN values

We're trying to find an acceptable way to clean up production SSNs for use outside production. I've been looking at one way hashing with salt. I've looked at the HashBytes and Checksum functions. I can use these to generate a pretty good 4 byte hash of an SSN.

I'd really like to end up with a 9 digit number so the application actually works. Has anyone done anything like this? Any thoughts on how to accomplish this? It would need to be compliant with all the various privacy standards.

My data set is 100 million rows (20 million unique SSNs) and the SSN appears
in multiple tables (yes, I know that's an issue). I'm starting to think
that I may just need to build a temporary lookup table and generate 20
million or so random SSNs and then do updates to the original table. After that I can drop my original mapping table.

more ▼

asked Oct 06, 2009 at 10:02 AM in Default

graz gravatar image

graz ♦
525 2 4 5

(comments are locked)
10|1200 characters needed characters left

7 answers: sort oldest

I would go with the temporary lookup table populated with fake values matched up with the real ones. Instead of doing just a random number generation, you could generate the 3 parts (area number, group number, and serial number) separately and combine them to form a 9 character string (I'm assuming that's how you're storing or you'd lose the leading zeroes). You could even keep track of the area number + group number and the current serial number in your SSN generation code so you can ensure there's no duplicates.

more ▼

answered Oct 07, 2009 at 10:40 PM

K. Brian Kelley gravatar image

K. Brian Kelley
933 2

This is basically what I'm doing now.
Oct 15, 2009 at 10:10 AM graz ♦
(comments are locked)
10|1200 characters needed characters left

Do you really care of the SSN's are valid everywhere? I mean, you want them all to be 9 digit numbers, but if you're not using them as keys then it's probably not a big deal right?

Otherwise, yeah, the solution you came up with in the 3rd paragraph is how I handle CC numbers: Create a bunch of random ones, link them together, hide the link table with security.

more ▼

answered Oct 06, 2009 at 11:31 AM

chris 1 2 gravatar image

chris 1 2
21

(comments are locked)
10|1200 characters needed characters left

American SSN's doesn't have a checksum, right?

What about a simple

UPDATE x SET x.SSN = x.recID FROM ( SELECT SSN, DENSE_RANK() OVER (ORDER BY SSN) + 99999999 AS recID FROM Table1 ) AS x

The new numbers are random in nature even if they are sequential.

more ▼

answered Oct 06, 2009 at 06:31 PM

Peso gravatar image

Peso
1.6k 5 6 8

The problem this is that I have SSN values in multiple tables. And a given SSN has to end up with the same value after it's obfuscated.
Oct 06, 2009 at 09:05 PM graz ♦
(comments are locked)
10|1200 characters needed characters left

I also work in an area where we have SSNs, Fed Tax IDs, and CC numbers. We also have the same challenge but we address it with security and not obfuscation.

If you can't approach this from an access rights issue, I would recommend a table that stores the unique SSNs and then an auto assigned surrogate key. If it must be 9 digits then just seed it with 100000000 but you'll need to teach the users that this number isn't a tax ID.

Now 20MM rows is a lot. For a smaller population I'd suggest using foreign keys with ON UPDATE CASCADE and then allow SQL Server to cascade the updates out after you update the master table but.....20MM rows is a lot. Sounds like an off-hours maintenance effort to work through the updating.

more ▼

answered Oct 07, 2009 at 12:47 PM

esabine gravatar image

esabine
12

(comments are locked)
10|1200 characters needed characters left

You could use the output from the CHECKSUM() function. Just pad the results with 0s at either end. Multiple values can have the same checksum, so while it's possible to reverse engineer the values, you would have to get all the possible values (which would be possible no matter what technique you use as SSNs are only 9 characters in length). And this way the values will be the same across all the tables.

more ▼

answered Oct 07, 2009 at 03:25 PM

mrdenny gravatar image

mrdenny
928 3

I considered this. The problem was the opposite. Checksum produced an integer with values up to 2 billion. SSNs are only 9 digits so values up to 1 billion. I was getting back a higher resolution number than I could deal with. I really wanted a 30 bit checksum.
Oct 07, 2009 at 03:47 PM graz ♦
In the event that the value is to large, simply trim the value down. Something like SELECT SUBSTRING(CAST(CASE WHEN CHECKSUM('123456789') < 0 THEN CHECKSUM('123456789')*-1 ELSE CHECKSUM('123456789') END AS VARCHAR(18)) + REPLICATE(0, 8), 0, 9)
Oct 07, 2009 at 04:11 PM mrdenny
This is interesting. I'll try this out and work and see if gets me where I need to go.
Oct 15, 2009 at 10:12 AM graz ♦
(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

New code box

There's a new way to format code on the site - the red speech bubble logo will automatically format T-SQL for you. The original code box is still there for XML, etc. More details here.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

SQL Server Central

Need long-form SQL discussion? SQLserverCentral.com is the place.

Topics:

x164

asked: Oct 06, 2009 at 10:02 AM

Seen: 3377 times

Last Updated: Oct 06, 2009 at 10:34 AM