Indexing VarChar vs BigInt

This is a FACT Table in a Data Warehouse

It has a composite index as follows

ALTER TABLE [dbo].[Fact_Data] 
    [Column1_VarChar_10] ASC,
    [Column2_VarChar_10] ASC,
    [Column3_Int] ASC,
    [Column4_Int] ASC,
    [Column5_VarChar_10] ASC,
    [Column6_VarChar_10] ASC,
    [Column7_DateTime] ASC,
    [Column8_DateTime] ASC

In this structure, all of the varchar 10 columns have numeric values only. Is it going to be beneficial for me to change this 78 million row structure to hold BIGINT instead of VARCHAR in terms of querying and indexing?

Any other benefits / drawbacks that I should consider?

more ▼

asked Oct 21 '09 at 05:20 PM in Default

Raj More gravatar image

Raj More
1.7k 77 82 84

(comments are locked)
10|1200 characters needed characters left

2 answers: sort voted first

Performance: BIGINT comparisons are faster using VARCHAR. But the big question is this: How are these columns compared to other columns or values? If, for instance, you frequently use [Column1_VarChar_10] for an inner join to another table with a similar VARCHAR column, then switching this one to BIGINT will hinder performance. If, however, the other table already uses an INT or BIGINT or other numeric data type, then converting this one would improve the performance of that query.

Storage: BIGINT uses 8 bytes. VARCHAR(10) uses 2-12 bytes. It's impossible to determine which is better without knowing more about the data (min, max, mean, median, etc.).

Sorting consideration: Since the fields are currently VARCHAR, they are sorted as strings. That is much different than numerical sorting. Any value beginning with a 1 ('100', '1000', '123456789', etc.) will be considered less than '2'. That is the way it is behaving today with the current index. If you change the fields to BIGINT, the sort order will change. This is true not only with the index but also with any query using an ORDER BY on one of those columns. This could have undesired effects on end user reports.

Additional suggestion: An 8-column index is heavy. In this particular example, your key size will be 32-72 bytes. Since you said you only have 78 million rows, what about creating a surrogate INT (identity) key? That would only be 4 bytes long. Keep in mind the clustered index dictates the sort order of the storage in the table itself. If one of the columns in your key has new values that fluctuate a lot, it will greatly degrade insert performance. The identity surrogate key allows for fast inserts because the new value is always greater than the current maximum value.

more ▼

answered Oct 21 '09 at 07:45 PM

Tom Staab gravatar image

Tom Staab
5.8k 6 8 10

I think Tom has covered this pretty well. I'd really reexamine the need for a PK this large. It makes not only your data, but your index, and backups large.
Oct 21 '09 at 08:51 PM Steve Jones - Editor ♦♦
I agree completely with the need for a smaller/simpler primary key. The only question I have is whether or not it is early in the development cycle with this table. If only a small number of changes need to be made, I believe everyone agrees that significant space savings and performance gains could result from switching datatypes from varchar(10).
Oct 22 '09 at 12:52 AM Bob Hovious
@TomStaab: This is a fact table in my data warehouse. Should I still consider putting a surrogate key on it?
Oct 22 '09 at 09:31 AM Raj More
@Raj More: My initial answer was going to be "no" because fact table searches would use the composite key anyway, so the surrogate key would not help. I decided to double check via internet search. I found this article written by Bob Becker of the Kimball Group: http://www.ralphkimball.com/html/designtipsPDF/DesignTips2006/KU81FactTableSurrogateKeys.pdf. He says a fact table surrogate key can sometimes help when loading data into the fact table (i.e. ETL process) but not when querying it.
Oct 22 '09 at 11:16 AM Tom Staab
(comments are locked)
10|1200 characters needed characters left

Is it going to be beneficial for me to change this 78 million row structure to hold BIGINT instead of VARCHAR

I wouldn't bother. Instead, I'd be looking at having a much smaller set of keys for your Clustered Index. Remember that every non-clustered index must include the CIX key for every row, bloating your database significantly for very little gain.

Consider having a unique non-clustered index on those columns, but use a surrogate key of int (or bigint if you might need more than 2 billion rows one day) for your CIX.

Also... a 72-byte key will take a lot more non-leaf pages, increasing the cost of almost all your queries.

more ▼

answered Oct 21 '09 at 11:20 PM

Rob Farley gravatar image

Rob Farley
5.7k 15 18 20

(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

New code box

There's a new way to format code on the site - the red speech bubble logo will automatically format T-SQL for you. The original code box is still there for XML, etc. More details here.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here



Answers and Comments

SQL Server Central

Need long-form SQL discussion? SQLserverCentral.com is the place.



asked: Oct 21 '09 at 05:20 PM

Seen: 3775 times

Last Updated: Oct 22 '09 at 09:30 AM