
Raj More asked

Indexing VarChar vs BigInt

This is a fact table in a data warehouse.

It has a composite clustered primary key, defined as follows:

ALTER TABLE [dbo].[Fact_Data] 
ADD  CONSTRAINT [PK_Fact_Data] 
PRIMARY KEY CLUSTERED 
(
    [Column1_VarChar_10] ASC,
    [Column2_VarChar_10] ASC,
    [Column3_Int] ASC,
    [Column4_Int] ASC,
    [Column5_VarChar_10] ASC,
    [Column6_VarChar_10] ASC,
    [Column7_DateTime] ASC,
    [Column8_DateTime] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, 
SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, 
ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON
) ON [PRIMARY]
GO

In this structure, all of the VARCHAR(10) columns hold numeric values only. Would it be beneficial, in terms of querying and indexing, to change this 78-million-row table to hold BIGINT instead of VARCHAR?

Are there any other benefits or drawbacks I should consider?

Tags: performance, indexing

Tom Staab answered

Performance: BIGINT comparisons are faster than VARCHAR comparisons. But the big question is this: how are these columns compared to other columns or values? If, for instance, you frequently use [Column1_VarChar_10] in an inner join to another table with a similar VARCHAR column, then switching this one column to BIGINT will hinder performance because of the implicit conversion. If, however, the other table already uses INT, BIGINT, or another numeric data type, then converting this one would improve the performance of that query.
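
As a rough sketch of that join concern (the dimension table, its columns, and the renamed fact column below are all hypothetical, just for illustration):

-- Hypothetical names: Dim_Account, AccountCode, AccountName, Column1_BigInt.
-- If the fact column becomes BIGINT while the dimension column stays VARCHAR(10),
-- SQL Server implicitly converts the VARCHAR side to BIGINT (higher precedence),
-- which can prevent an index seek on Dim_Account.AccountCode.
SELECT f.Column3_Int, d.AccountName
FROM dbo.Fact_Data AS f
INNER JOIN dbo.Dim_Account AS d
    ON f.Column1_BigInt = d.AccountCode;  -- implicit CONVERT on d.AccountCode

-- If both sides use BIGINT, no conversion is needed:
--     ON f.Column1_BigInt = d.AccountCode_BigInt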

Storage: BIGINT always uses 8 bytes. VARCHAR(10) uses 2 to 12 bytes (roughly 2 bytes of variable-length overhead plus 1 byte per character stored). It's impossible to say which is smaller without knowing more about the data (min, max, mean, median, etc.).
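
If you want a rough idea of what your current values actually cost, something like this sketch (using one of the column names from the question) gives a ballpark comparison:

-- DATALENGTH returns the bytes used by the stored value; the +2 approximates
-- the variable-length overhead per VARCHAR column.
SELECT
    AVG(DATALENGTH(Column1_VarChar_10) + 2) AS avg_varchar_bytes_per_row,
    8                                       AS bigint_bytes_per_row
FROM dbo.Fact_Data;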

Sorting consideration: Since the fields are currently VARCHAR, they are sorted as strings, which is very different from numeric sorting. Any value beginning with a 1 ('100', '1000', '123456789', etc.) sorts before '2'. That is how the current index behaves today. If you change the fields to BIGINT, the sort order will change, not only in the index but also in any query using an ORDER BY on one of those columns. This could have undesired effects on end-user reports.
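
A quick illustration with made-up values:

DECLARE @t TABLE (v VARCHAR(10));
INSERT INTO @t VALUES ('2'), ('100'), ('1000'), ('123456789');

SELECT v FROM @t ORDER BY v;                  -- string order: 100, 1000, 123456789, 2
SELECT v FROM @t ORDER BY CAST(v AS BIGINT);  -- numeric order: 2, 100, 1000, 123456789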

Additional suggestion: An 8-column index is heavy. In this particular example, your key size will be 32 to 72 bytes. Since you said you only have 78 million rows, what about creating a surrogate INT (identity) key? That would be only 4 bytes long. Keep in mind that the clustered index dictates the physical sort order of the table itself, so if new rows arrive with key values that fluctuate a lot, insert performance degrades badly. An identity surrogate key allows fast inserts because the new value is always greater than the current maximum value.
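
A sketch of what that could look like (the column name is illustrative, and swapping the clustered key on a 78-million-row table needs careful planning and testing before doing it for real):

-- Add a surrogate identity column (4 bytes for INT).
ALTER TABLE dbo.Fact_Data
    ADD Fact_Data_Id INT IDENTITY(1,1) NOT NULL;

-- Drop the wide clustered primary key, then recreate the PK on the surrogate.
ALTER TABLE dbo.Fact_Data DROP CONSTRAINT PK_Fact_Data;

ALTER TABLE dbo.Fact_Data
    ADD CONSTRAINT PK_Fact_Data PRIMARY KEY CLUSTERED (Fact_Data_Id);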

Rob Farley answered

"Is it going to be beneficial for me to change this 78 million row structure to hold BIGINT instead of VARCHAR?"

I wouldn't bother. Instead, I'd look at having a much smaller key for your clustered index. Remember that every non-clustered index must include the clustered index key for every row, which bloats your database significantly for very little gain.

Consider having a unique non-clustered index on those columns instead, and use a surrogate key of INT (or BIGINT if you might ever need more than about 2 billion rows) for your clustered index.
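
Something along these lines, assuming a surrogate identity column (called Fact_Data_Id here purely for illustration) has already been added as the clustered primary key:

-- Enforce uniqueness of the original business key with a non-clustered index.
CREATE UNIQUE NONCLUSTERED INDEX UQ_Fact_Data_BusinessKey
ON dbo.Fact_Data
(
    Column1_VarChar_10, Column2_VarChar_10,
    Column3_Int,        Column4_Int,
    Column5_VarChar_10, Column6_VarChar_10,
    Column7_DateTime,   Column8_DateTime
);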

Also... a 72-byte key will take a lot more non-leaf pages, increasing the cost of almost all your queries.
