I want to know why I should use an int as a lookup table's primary key instead of just using the lookup value as the primary key (which in most cases would be a string). I understand that using a nvarchar(50) rather than an int would use way more space if it is linked to a table with many records. But on the other hand using the lookup value directly would basically save us doing a join. I can imagine this would be a big saving if the join is always required. What are the advantages of using a int primary key (specifically for a lookup table) other than it being "the standard thing to do"?
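To make the two designs concrete, here is a minimal sketch (SQLite via Python, purely for illustration; table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Design 1: surrogate int key; the big table stores only the small int
CREATE TABLE StatusLookup (
    StatusId   INTEGER PRIMARY KEY,   -- surrogate key
    StatusName TEXT NOT NULL UNIQUE
);
CREATE TABLE OrdersA (
    OrderId  INTEGER PRIMARY KEY,
    StatusId INTEGER NOT NULL REFERENCES StatusLookup (StatusId)
);

-- Design 2: the lookup value itself is the key; the big table stores
-- the string, so reading the status needs no join
CREATE TABLE OrdersB (
    OrderId INTEGER PRIMARY KEY,
    Status  TEXT NOT NULL
);
""")

conn.execute("INSERT INTO StatusLookup VALUES (1, 'Shipped')")
conn.execute("INSERT INTO OrdersA VALUES (100, 1)")
conn.execute("INSERT INTO OrdersB VALUES (100, 'Shipped')")

# Design 1 needs a join to display the status text...
row_a = conn.execute("""
    SELECT s.StatusName
    FROM OrdersA o JOIN StatusLookup s ON s.StatusId = o.StatusId
    WHERE o.OrderId = 100
""").fetchone()

# ...design 2 does not
row_b = conn.execute("SELECT Status FROM OrdersB WHERE OrderId = 100").fetchone()

print(row_a[0], row_b[0])  # both yield 'Shipped'
```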
There are two articles that I would say are must-reads on this topic. The first is [Michelle Ufford's post on what makes a good clustered index key](http://www.simple-talk.com/sql/learn-sql-server/effective-clustered-indexes/). Primary keys don't necessarily need to be clustered indexes, but they are by default, and there's usually a good reason for that, especially if you are using surrogate keys. In this case, identity ints are NUSE: Narrow, Unique, Static, and Ever-increasing. That is ideal for a clustered index and also works just fine for a primary key.

Second, [Gail Shaw has a post on joining on ints versus strings](http://sqlinthewild.co.za/index.php/2011/02/15/are-int-joins-faster-than-string-joins-2/). In it, she shows that joining on an int column uses fewer resources and can be faster than joining on a string column __even when the values are the same__. Given that strings are usually considerably longer, the cost of joining on them is higher still.

The decision of whether to leave a data element on a table or move it off to a lookup table is a tough one, and it definitely has an "it depends" element to it because there are legitimate trade-offs. If you have a real lookup value (like a status), it is probably better to use a lookup table with a surrogate key, for space reasons, especially as the main table gets larger. A 1-byte tinyint (or 2-byte smallint) versus an average of about 20 bytes for character data (including the overhead of storing varchars) doesn't mean much at 100 rows, but at 1,000,000 rows that is about 18 MB of data which doesn't need to be cached, and roughly 2,000 fewer data pages for the table, from that one column alone.

There are a few other considerations for whether to create a lookup table or keep a column in place. If the column is directly associated with the natural key, it's generally not a good candidate to move. If the information is duplicated often (again, like a status), you'll want to avoid update inconsistencies, especially if more than one process can update it. For low-volume tables you won't really see a performance difference (or keeping the column in place will involve somewhat fewer reads), but on the other hand, lookup tables tend to stay in memory and involve fewer physical reads than wider tables.
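The space arithmetic above can be checked directly. A back-of-the-envelope sketch, using the assumptions from the paragraph (20-byte average string, 2-byte smallint surrogate, 8 KB SQL Server data pages):

```python
ROWS = 1_000_000
AVG_STRING_BYTES = 20   # assumed average for the character column, incl. varchar overhead
SURROGATE_BYTES = 2     # smallint
PAGE_BYTES = 8 * 1024   # SQL Server data page size

saved_bytes = ROWS * (AVG_STRING_BYTES - SURROGATE_BYTES)
saved_mb = saved_bytes / 1e6
saved_pages = saved_bytes // PAGE_BYTES

print(f"~{saved_mb:.0f} MB saved, ~{saved_pages} fewer data pages")
```

This ignores row headers and other per-row overhead, so it is an estimate, but it matches the "about 18 MB and roughly 2,000 pages" figures.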
I found [this article](http://www.databasedev.co.uk/primary_keys.html), which explains it quite well. Essentially, the primary key must be unique, and it is much harder to keep text unique than integers.

I also found this: you can use varchar as a PK, but integers are better for performance. They take up less space and, more importantly, they are much quicker to compare: an integer comparison can be executed in a single CPU instruction, whereas for strings each character must be compared in turn. That makes every PK lookup faster, and if you have join tables, every join will be doing lookups against the PK.
Don't forget that if the PK is the clustering key on the table, it will appear in each and every additional index you create. If the PK is a CHAR(N) column, it will probably be larger than an int/smallint/tinyint (depending on the value of N) and will inflate the size of every additional index. The points about index fragmentation, page splits and speed comparisons are all very valid (+1 to all).
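Rough arithmetic for that inflation (the row counts, index count and key sizes here are assumptions for illustration; real index rows also carry header and overhead bytes):

```python
ROWS = 1_000_000
NONCLUSTERED_INDEXES = 5
INT_KEY_BYTES = 4       # int clustering key
CHAR_KEY_BYTES = 20     # char(20) clustering key

# In SQL Server, every nonclustered index row carries the clustering key
# as its row locator, so a wider clustering key widens every other index too.
extra_per_row = CHAR_KEY_BYTES - INT_KEY_BYTES
extra_total_mb = ROWS * NONCLUSTERED_INDEXES * extra_per_row / 1e6

print(f"~{extra_total_mb:.0f} MB of extra index space across all indexes")
```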
I agree with all the points about using identity keys as clustered indexes to reduce fragmentation, but there's more to it.

If you usually search on exact matches of the lookup value, then I agree: use the identity column as the PK and clustered index, and put a nonclustered index on the lookup value. Keep in mind that your DB will NOT be normalized if you go for an identity PK plus a unique nvarchar value; I conceptually dislike unique constraints on non-PK columns, but I'm pragmatic and I like performance...

But if you do LIKE searches (WHERE LookupValue LIKE 'A%'), I strongly suggest you make the nvarchar column the clustered index. With a NON CLUSTERED index, you'd have to search the nonclustered index, find all the leaf entries that match 'A%', follow the address pointer for each one and insert the results into a spool table; with a CLUSTERED index, you just seek to the first value matching 'A%' in the B-tree and then read row by row until you hit the first non-match.

Regarding performance on joins, I don't really know what the difference is between int (and its kind) and varchar (and its kind). The difference was huge back in the 6.x days, but I believe it has been smaller from 7.0 onward. That's relative, though; I don't know the absolute numbers.
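The reason a prefix search benefits from an ordered (clustered) index is that `LIKE 'A%'` is equivalent to a range scan. A small sketch of that equivalence (SQLite via Python; SQLite's tables aren't SQL Server clustered indexes, but the range-scan idea is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Lookup (LookupValue TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO Lookup VALUES (?)",
                 [("Active",), ("Archived",), ("Closed",), ("New",)])

# LIKE 'A%' ...
like_rows = [r[0] for r in conn.execute(
    "SELECT LookupValue FROM Lookup WHERE LookupValue LIKE 'A%' "
    "ORDER BY LookupValue")]

# ... is the same as the half-open range ['A', 'B'), which an ordered index
# satisfies by seeking to the first 'A...' value and reading forward until
# the first non-match
range_rows = [r[0] for r in conn.execute(
    "SELECT LookupValue FROM Lookup WHERE LookupValue >= 'A' AND LookupValue < 'B' "
    "ORDER BY LookupValue")]

print(like_rows == range_rows, like_rows)
```

With a nonclustered index on the value, each of those matches would still need an extra lookup back to the base table, which is the spool-and-follow-pointers cost described above.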