question

Raj More avatar image
Raj More asked

Is the new date better than the old date?

We have a CalendarTimeUTC Dimension table in our Data Warehouse that looks like this: ![alt text][1] Sample data: ![alt text][2] The PK in the table is the CalendarTimeUTCId (clustered). This used to be a Int field. All fact tables have CalendarId in them (some of these are multi-billion row partitioned tables). We want to go from putting a DATE value to DATE+HOUR value in this field. Now with the advent of SQL 2008, and the shiny new implementation of DATE, is there a reason to switch the dimension from INT to DATETIME? How will it affect the Index Size in the fact tables? More importantly, how will it affect performance? [1]: /upfiles/CalendarTimeTable.jpg [2]: /upfiles/Old_and_New_Data.jpg
sql-server-2005sql-server-2008-r2performanceindexdate
2 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Raj More avatar image Raj More commented ·
@Blackhawk-17: Data structure & Sample provided
1 Like 1 ·
Blackhawk-17 avatar image Blackhawk-17 commented ·
Can you give some examples of the INT values you store?
0 Likes 0 ·

1 Answer

·
Kevin Feasel avatar image
Kevin Feasel answered
An int is 4 bytes and a datetime is 8, so switching int to datetime will be a 4-byte increase for each CalendarId reference you change. This will increase the size of any index which includes that datetime key, as well as the table storing the datetime foreign key itself. On the other hand, if you take a standard int and append the hours to the end, you might not be pleased with the results; 9999-12-31 at 11 PM translates to 9,999,123,123, which would overflow an int field. That would be bigint territory, which is itself 8 bytes. The Kimball folks say that you should use a surrogate key even for a date dimension because you might not know the date, so you can put in a reasonable "I do not know" key like -1. If you use a datetime, you lose that luxury and either put in a fake datetime or have null values. That's why I would personally recommend that, if you need to do this, go with a bigint before putting the actual datetime on your fact table. Another alternative, and something I would *really* personally recommend, could be to keep hour separate, as that means your date dimension won't balloon up. If you have 100 years of dates, you'll have roughly 36,500 days. But if you include hours, that jumps to roughly 876,000 records. Instead, if you have an Hours dimension, you could have a 4-byte Date dimension int surrogate key and a 1-byte Hour dimension tinyint surrogate key. Even if you need to go down to the second, have a Time dimension instead with 3600 total records and a smallint surrogate key. Then, Date+Time = 6 bytes, as opposed to the 8-byte bigint for Date+Hour, and you have growth of N + 24 (or N + 3600) as opposed to N * 24 (or, egads, N * 3600) records, where N is the number of days in your Date dimension.
3 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Blackhawk-17 avatar image Blackhawk-17 commented ·
When you're getting that granular you could look at a date data type which is only 3 bytes and save 25% per record/index entry. As an aside... datetime2 can be 6 bytes of storage... @Kevin Feasel - a "-1" and a "fake" date are pretty well the same end result: a known unknown.
2 Likes 2 ·
Kevin Feasel avatar image Kevin Feasel commented ·
True on all three counts. I'd say the defense for #3 is that -1 is obviously fake, and can also be your "unknown" surrogate key for all dimensions so you have a consistent "unknown" record. In contrast, maybe something did happen on 1/1/1900...or was it in 1753? Or 9999? That's why, even with the date type being 3 bytes, I'd still go with an int and a -1 "unknown" record. It's one case where the extra byte (and its few billion friends on the fact table) could still spare some grief and make it easier on report writers.
0 Likes 0 ·
Pavel Pawlowski avatar image Pavel Pawlowski commented ·
I totally agree with @Kevin to go by using separate dimensions (Date and Time). You will save storage and have a greater possibilities. And as Kevin mentioned Kimball, they are smart enough and have experiences with such implementations and they know what they propose.
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.