Why use an int as a lookup table's primary key?

I want to know why I should use an int as a lookup table's primary key instead of just using the lookup value as the primary key (which in most cases would be a string). I understand that using a nvarchar(50) rather than an int would use way more space if it is linked to a table with many records. But on the other hand using the lookup value directly would basically save us doing a join. I can imagine this would be a big saving if the join is always required.

What are the advantages of using a int primary key (specifically for a lookup table) other than it being "the standard thing to do"?
more ▼

asked Mar 25 '11 at 04:12 AM in Default

Jaco Briers gravatar image

Jaco Briers
23 1 1 3

(comments are locked)
10|1200 characters needed characters left

6 answers: sort voted first

There are two articles that I would say are must-reads on this topic. The first is Michelle Ufford's post on what makes a good clustered index key. Now, primary keys don't necessarily need to be clustered indexes, but they are by default and there's usually a pretty good reason, especially if you are using surrogate keys. In this case, identity ints are NUSE: Narrow, Unique, Static, and Ever-Increasing. This is great for a clustered index, and also works just fine for a primary key.

Also, Gail Shaw has a post on joining on ints versus strings. In this, she shows us that joining on an int column uses fewer resources and can be faster than joining on a string column even when the values are the same. Given that strings are usually considerably longer, there is a higher cost associated with a join.

The decision of whether to leave a data element on a table versus moving it off to a lookup table is a tough one and definitely has an "it depends" element to it because there are legitimate trade-offs. If you have a real lookup value (like a status), it would probably be better to use the lookup table and have a surrogate key for space reasons, especially as the main table gets larger. A 1-byte tinyint (or 2-byte smallint) versus an average of about 20 bytes for character data (including overhead costs for storing varchars) doesn't mean very much if you're talking about 100 rows, but if you're talking about 1,000,000 rows, that's about 18 MB of data which doesn't need to be cached and about 2000 fewer data pages for the table, just due to that one column.

There are a few other concerns for whether to create a lookup table or keep a column in a table. If the column is directly associated with the natural key, that's generally not a good thing to move. If the information is duplicated quite often (like a status), you'll want to avoid update inconsistencies, especially if there is more than one process which can update a status. For low-volume tables, you're not really going to see a performance difference (or keeping the column on will involve somewhat fewer reads), but at the same time, those lookup tables stay in memory and involve fewer physical reads than wider tables.
more ▼

answered Mar 25 '11 at 05:25 AM

Kevin Feasel gravatar image

Kevin Feasel
6.1k 3 5 11

Thank you for the good explanation, I definitely realized a good understanding of what goes on in the background is needed to make this decision. To use an int PK seems like the right thing to do if in doubt.
Mar 25 '11 at 08:49 AM Jaco Briers
(comments are locked)
10|1200 characters needed characters left

I found this which explains it quite well. Essentially though the primary key must be unique and it's a lot harder to keep text unique than it is integers.


Also found this:

You can use varchar as a PK, but integers are better for performance. They take up less space, and more importantly they are much quicker to compare.

An integer comparison can be executed in one CPU instruction. For strings, each character must be compared in turn. That makes every PK lookup faster. If you have join tables, every join select will be looking up against the PK.
more ▼

answered Mar 25 '11 at 04:27 AM

Mrs_Fatherjack gravatar image

4.7k 57 62 66

Thanks for the link. I understand it is harder to keep text unique, but in the case of a lookup table, I think it is an advantage to use the text description as the primary key because it ensures that the lookup values stay unique. Also in the case of lookup tables, they are generally smaller (at most a couple of hundred records) and not that hard to manage.
Mar 25 '11 at 04:42 AM Jaco Briers
Really? A reference to an Access article? Nothing in BOL or elsewhere?
Mar 25 '11 at 04:46 AM Fatherjack ♦♦
It just highlighted the pros and cons.
Mar 25 '11 at 05:26 AM Mrs_Fatherjack

After the edit: Agreed on it taking up less space and being faster to compare.

But in terms of joins : there would be no need for ever doing a join with the lookup table because the PK = lookup value and would already be in the table that you are joining with.
Mar 25 '11 at 05:47 AM Jaco Briers

@Jaco Briers - But what do you do when you have another table that needs to reference this value in the future?

A second table makes more sense in the long run, IMO.
Mar 25 '11 at 06:01 AM WilliamD
(comments are locked)
10|1200 characters needed characters left
One of the big reasons to use an INT is that it can be constantly increasing thus reducing or eliminating index fragmentation.
more ▼

answered Mar 25 '11 at 05:28 AM

Blackhawk-17 gravatar image

11.8k 28 30 35

(comments are locked)
10|1200 characters needed characters left

Don't forget that if the PK is the clustering key on the table it will appear in each and every index that is additionally created. If the PK is a CHAR(N) column, this will probably be larger in size than an int/smallint/tinyint (depends on value of N) and inflate the size of each additional index you create.

The points about index fragmentation, page splits and speed comparisons are all very valid (+1 to all).
more ▼

answered Mar 25 '11 at 05:44 AM

WilliamD gravatar image

25.8k 17 19 41

Is it really so? I thought any NON CLUSTERED index on a table had address pointers in its leaf level, not clustered key values?
Mar 25 '11 at 06:04 AM Magnus Ahlkvist
If you have a clustered index, all non-clustered indexes will have the various columns on the index as well as the clustered index key. If you don't have a clustered index, it will include the RID.
Mar 25 '11 at 06:40 AM Kevin Feasel

Ok, I wasn't aware of that. +1 to @WilliamD for that then.

I have been quickly reading up on this, and if I get it right, the clustered key will only appear in the leaf level of the non clustered index, if the clustered index is unique, which it will be if it's PK.
Mar 25 '11 at 06:50 AM Magnus Ahlkvist
@Magnus - yes, the leaf level for non clustered, but it is still "wasted" space if the clustering key is larger than it needs to be.
Mar 26 '11 at 12:34 PM WilliamD
(comments are locked)
10|1200 characters needed characters left

I agree with all the points about using identity keys as clustered indexes to reduce fragmentation. But there's more to it. If you usually search on exact matches in the lookup value, I agree that you could use identity column as PK and as clustered index, and that you put a non clustered index on the lookup value. Keep in mind that your DB will NOT bne normalized if you go for identity PK and unique nvarchar values. Therefore I conceptually dislike having unique in non-PK-columns, but I'm pragmatic and I like performance...

But if you do LIKE searches (WHERE lookupValue LIKE 'A%') I strongly suggest you use the nvarchar column as clustered index. If it's a NON CLUSTERED index, you'd have to search through the NON CLUSTERED index, find all leafs that matches 'A%', follow the address pointer for each leaf and insert into a spool table, whereas if you have a CLUSTERED index, you just find the first column that matches 'A%' in the B-tree, and then read row by row until you find the first non-match.

Regarding performance on joins, I don't really know what the performance difference is between int (and its likes) and varchar (and its likes). The performance difference was huge back in the 6.x days, but I think I heard that from 7.0 and forward, the performance difference is smaller. But that's relative, I don't know the absolute numbers.
more ▼

answered Mar 25 '11 at 06:17 AM

Magnus Ahlkvist gravatar image

Magnus Ahlkvist
15.9k 15 19 32

(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

New code box

There's a new way to format code on the site - the red speech bubble logo will automatically format T-SQL for you. The original code box is still there for XML, etc. More details here.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here



Answers and Comments

SQL Server Central

Need long-form SQL discussion? SQLserverCentral.com is the place.



asked: Mar 25 '11 at 04:12 AM

Seen: 3883 times

Last Updated: Mar 25 '11 at 04:40 AM