question

Jaco Briers avatar image
Jaco Briers asked

Why use an int as a lookup table's primary key?

I want to know why I should use an int as a lookup table's primary key instead of just using the lookup value as the primary key (which in most cases would be a string). I understand that using a nvarchar(50) rather than an int would use way more space if it is linked to a table with many records. But on the other hand using the lookup value directly would basically save us doing a join. I can imagine this would be a big saving if the join is always required. What are the advantages of using a int primary key (specifically for a lookup table) other than it being "the standard thing to do"?
sql-serverprimary-keylookups
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Kevin Feasel avatar image
Kevin Feasel answered
There are two articles that I would say are must-reads on this topic. The first is [Michelle Ufford's post on what makes a good clustered index key]( http://www.simple-talk.com/sql/learn-sql-server/effective-clustered-indexes/). Now, primary keys don't necessarily need to be clustered indexes, but they are by default and there's usually a pretty good reason, especially if you are using surrogate keys. In this case, identity ints are NUSE: Narrow, Unique, Static, and Ever-Increasing. This is great for a clustered index, and also works just fine for a primary key. Also, [Gail Shaw has a post on joining on ints versus strings]( http://sqlinthewild.co.za/index.php/2011/02/15/are-int-joins-faster-than-string-joins-2/). In this, she shows us that joining on an int column uses fewer resources and can be faster than joining on a string column __even when the values are the same__. Given that strings are usually considerably longer, there is a higher cost associated with a join. The decision of whether to leave a data element on a table versus moving it off to a lookup table is a tough one and definitely has an "it depends" element to it because there are legitimate trade-offs. If you have a real lookup value (like a status), it would probably be better to use the lookup table and have a surrogate key for space reasons, especially as the main table gets larger. A 1-byte tinyint (or 2-byte smallint) versus an average of about 20 bytes for character data (including overhead costs for storing varchars) doesn't mean very much if you're talking about 100 rows, but if you're talking about 1,000,000 rows, that's about 18 MB of data which doesn't need to be cached and about 2000 fewer data pages for the table, just due to that one column. There are a few other concerns for whether to create a lookup table or keep a column in a table. If the column is directly associated with the natural key, that's generally not a good thing to move. If the information is duplicated quite often (like a status), you'll want to avoid update inconsistencies, especially if there is more than one process which can update a status. For low-volume tables, you're not really going to see a performance difference (or keeping the column on will involve somewhat fewer reads), but at the same time, those lookup tables stay in memory and involve fewer physical reads than wider tables.
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Jaco Briers avatar image Jaco Briers commented ·
Thank you for the good explanation, I definitely realized a good understanding of what goes on in the background is needed to make this decision. To use an int PK seems like the right thing to do if in doubt.
0 Likes 0 ·
Mrs_Fatherjack avatar image
Mrs_Fatherjack answered
I found this which explains it quite well. Essentially though the primary key must be unique and it's a lot harder to keep text unique than it is integers. [ http://www.databasedev.co.uk/primary_keys.html][1] [1]: http://www.databasedev.co.uk/primary_keys.html Also found this: You can use varchar as a PK, but integers are better for performance. They take up less space, and more importantly they are much quicker to compare. An integer comparison can be executed in one CPU instruction. For strings, each character must be compared in turn. That makes every PK lookup faster. If you have join tables, every join select will be looking up against the PK.
6 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Mrs_Fatherjack avatar image Mrs_Fatherjack commented ·
It just highlighted the pros and cons.
1 Like 1 ·
Jaco Briers avatar image Jaco Briers commented ·
Thanks for the link. I understand it is harder to keep text unique, but in the case of a lookup table, I think it is an advantage to use the text description as the primary key because it ensures that the lookup values stay unique. Also in the case of lookup tables, they are generally smaller (at most a couple of hundred records) and not that hard to manage.
0 Likes 0 ·
Fatherjack avatar image Fatherjack ♦♦ commented ·
Really? A reference to an Access article? Nothing in BOL or elsewhere?
0 Likes 0 ·
Jaco Briers avatar image Jaco Briers commented ·
After the edit: Agreed on it taking up less space and being faster to compare. But in terms of joins : there would be no need for ever doing a join with the lookup table because the PK = lookup value and would already be in the table that you are joining with.
0 Likes 0 ·
WilliamD avatar image WilliamD commented ·
@Jaco Briers - But what do you do when you have another table that needs to reference this value in the future? A second table makes more sense in the long run, IMO.
0 Likes 0 ·
Show more comments
Blackhawk-17 avatar image
Blackhawk-17 answered
One of the big reasons to use an INT is that it can be constantly increasing thus reducing or eliminating index fragmentation.
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

WilliamD avatar image
WilliamD answered
Don't forget that if the PK is the clustering key on the table it will appear in each and every index that is additionally created. If the PK is a CHAR(N) column, this will probably be larger in size than an int/smallint/tinyint (depends on value of N) and inflate the size of each additional index you create. The points about index fragmentation, page splits and speed comparisons are all very valid (+1 to all).
4 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Magnus Ahlkvist avatar image Magnus Ahlkvist commented ·
Is it really so? I thought any NON CLUSTERED index on a table had address pointers in its leaf level, not clustered key values?
0 Likes 0 ·
Kevin Feasel avatar image Kevin Feasel commented ·
If you have a clustered index, all non-clustered indexes will have the various columns on the index as well as the clustered index key. If you don't have a clustered index, it will include the RID.
0 Likes 0 ·
Magnus Ahlkvist avatar image Magnus Ahlkvist commented ·
Ok, I wasn't aware of that. +1 to @WilliamD for that then. I have been quickly reading up on this, and if I get it right, the clustered key will only appear in the leaf level of the non clustered index, if the clustered index is unique, which it will be if it's PK.
0 Likes 0 ·
WilliamD avatar image WilliamD commented ·
@Magnus - yes, the leaf level for non clustered, but it is still "wasted" space if the clustering key is larger than it needs to be.
0 Likes 0 ·
Magnus Ahlkvist avatar image
Magnus Ahlkvist answered
I agree with all the points about using identity keys as clustered indexes to reduce fragmentation. But there's more to it. If you usually search on exact matches in the lookup value, I agree that you could use identity column as PK and as clustered index, and that you put a non clustered index on the lookup value. Keep in mind that your DB will NOT bne normalized if you go for identity PK and unique nvarchar values. Therefore I conceptually dislike having unique in non-PK-columns, but I'm pragmatic and I like performance... But if you do LIKE searches (WHERE lookupValue LIKE 'A%') I strongly suggest you use the nvarchar column as clustered index. If it's a NON CLUSTERED index, you'd have to search through the NON CLUSTERED index, find all leafs that matches 'A%', follow the address pointer for each leaf and insert into a spool table, whereas if you have a CLUSTERED index, you just find the first column that matches 'A%' in the B-tree, and then read row by row until you find the first non-match. Regarding performance on joins, I don't really know what the performance difference is between int (and its likes) and varchar (and its likes). The performance difference was huge back in the 6.x days, but I think I heard that from 7.0 and forward, the performance difference is smaller. But that's relative, I don't know the absolute numbers.
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

TimothyAWiseman avatar image
TimothyAWiseman answered
Kevin's answer is absolutely fantastic, but I think no discussion on primary keys can be complete without considering the pros and cons of natural, artificial, and surrogate keys, and the best place for that is Joe Celko's article at [ http://www.informationweek.com/news/software/bi/showArticle.jhtml?articleID=201806814][1] [1]: http://www.informationweek.com/news/software/bi/showArticle.jhtml?articleID=201806814
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.