question

dataminor asked

Strategy for refreshing a reporting table

I want to develop a process for refreshing some reporting tables. For the moment it's enough to assume we have a fact table that gets refreshed every night. The refresh takes about 30 seconds, during which a report run against the table would see zero rows and give the wrong answer. This is clearly unacceptable. To move away from this, I want to know whether there are risks associated with the following strategy (assume my fact table is just called MyFacts for the purposes of this discussion):

1. Do a SELECT...INTO to create a staging table (MyFactsStg) that contains all the data with the correct types and nullability (but no indexes or keys yet).
2. Rename the indexes on MyFacts to *Old.
3. Add the primary key and create the indexes on MyFactsStg, using the object names that were previously on MyFacts (but which have since been renamed to *Old). MyFactsStg is now, in terms of DDL, an exact copy of MyFacts, but with more current rows.
4. Rename table MyFacts to MyFactsOld.
5. Rename table MyFactsStg to MyFacts.
6. Drop table MyFactsOld.

Taking it at face value, it seems the only downtime for the users will be a few milliseconds between steps 4 and 5. As far as my testing goes, if someone is using MyFacts while I'm trying to rename it, sp_rename appears to wait patiently until it has exclusive access to the table and then renames it, so it appears all but seamless to the end user. What worries me is that I didn't read about this strategy in a book, so it's probably full of gotchas that I haven't thought of yet.
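For concreteness, here is a minimal sketch of steps 2 through 6. The index name IX_MyFacts_Date, constraint name PK_MyFacts, and columns FactID/FactDate are illustrative assumptions, not taken from the post:

    -- Sketch only: index/constraint/column names below are assumed.
    -- Step 2: move the existing names out of the way
    EXEC sp_rename 'dbo.MyFacts.IX_MyFacts_Date', 'IX_MyFacts_Date_Old', 'INDEX';
    EXEC sp_rename 'dbo.PK_MyFacts', 'PK_MyFacts_Old', 'OBJECT';

    -- Step 3: key and index the staging table under the original names
    ALTER TABLE dbo.MyFactsStg ADD CONSTRAINT PK_MyFacts PRIMARY KEY CLUSTERED (FactID);
    CREATE NONCLUSTERED INDEX IX_MyFacts_Date ON dbo.MyFactsStg (FactDate);

    -- Steps 4 and 5: the swap; sp_rename waits for exclusive access to each table
    BEGIN TRAN;
    EXEC sp_rename 'dbo.MyFacts', 'MyFactsOld';
    EXEC sp_rename 'dbo.MyFactsStg', 'MyFacts';
    COMMIT TRAN;

    -- Step 6: discard the stale copy
    DROP TABLE dbo.MyFactsOld;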
Tags: reporting tables, rename, synonyms, into
1 comment

dataminor commented:
If it clarifies anything, by "refresh" I mean bring it up to date so it reflects all activity including any inserts, updates and deletes since the last refresh. In this instance it's easiest not to get bogged down in CDC and just repopulate the table from scratch.
Kev Riley answered
You could also use [synonyms][1] to handle the switch, making the time between old and new data practically zero.

[1]: http://msdn.microsoft.com/en-us/library/ms191230.aspx
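A minimal sketch of that approach, assuming two identically structured tables (MyFactsA and MyFactsB are illustrative names) that the load alternates between:

    -- Sketch only: MyFactsA/MyFactsB are assumed alternating copies;
    -- reports always query dbo.MyFacts, which is a synonym.
    BEGIN TRAN;
    IF EXISTS (SELECT 1 FROM sys.synonyms WHERE name = N'MyFacts')
        DROP SYNONYM dbo.MyFacts;
    CREATE SYNONYM dbo.MyFacts FOR dbo.MyFactsB; -- repoint at the freshly loaded copy
    COMMIT TRAN;

Wrapping the drop/create in a single transaction keeps the window in which the synonym does not exist effectively at zero.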
1 comment

dataminor commented:
So I'd always have two tables and just point the synonym (using a drop and create) to the most current table?
pipthegeek answered
The only downside I can see with your approach is the complexity. Suddenly adding or removing an index on MyFacts is a breaking change and requires changing the script that handles the rename. There is also the issue of recovering from failure to consider: if the ETL has half run and then failed, the next run needs to be able to recognise this and tidy up before starting.

I think your idea is valid, but I would suggest looking at SNAPSHOT ISOLATION as well; it may or may not make things simpler in your situation. You could turn on read committed snapshot, so that all transactions running at the READ COMMITTED isolation level would in fact use snapshot isolation instead. Basically, SQL Server will ensure they see a transactionally consistent snapshot of the data from the point in time that they started. This CAN require significant tempdb space, but we haven't found that to be the case. The only caveat is that you really need the entire delete/refill process on MyFacts to run in a single transaction.

We use this method to update our gazetteer: the updates are staged into staging tables in the destination database, and then a SQL script handles the final ETL from the staging tables into the main table as a single transaction. We couldn't sensibly use the stage/rename technique because we are only changing a fairly small percentage of the data, and creating an entire copy and then re-indexing it just to apply the changes would be pretty wasteful.
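A minimal sketch of that setup, assuming the reporting database is named MyReportingDb (an illustrative name):

    -- Sketch only: database name is assumed.
    -- Readers at READ COMMITTED then see a consistent snapshot instead of blocking.
    ALTER DATABASE MyReportingDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

    -- The delete/refill must be a single transaction so readers see either
    -- the old rows or the new rows, never an empty table.
    BEGIN TRAN;
        DELETE FROM dbo.MyFacts;
        INSERT dbo.MyFacts SELECT * FROM dbo.MyFactsStg;
    COMMIT TRAN;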
3 comments

pipthegeek commented:
You ought to test snapshot isolation and see what the typical extra storage is likely to be. We were really surprised at how little extra it needed, even with artificially long "read" transactions running to force the previous versions to be kept for longer.
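One way to measure that during such a test is to watch the tempdb version store; a sketch using a standard dynamic management view:

    -- Sketch: version store pages currently reserved in tempdb, in MB (8 KB pages)
    SELECT SUM(version_store_reserved_page_count) * 8 / 1024.0 AS version_store_mb
    FROM tempdb.sys.dm_db_file_space_usage;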
dataminor commented:
"...adding or removing an index on MyFacts is a breaking change..." Ah. Yes I see your point.
dataminor commented:
"...If the ETL has half run and then failed, the next run needs to be able to recognise this and tidy up before starting..." Agreed. The sproc does a conditional drop of MyFactsStg and MyFactsOld at the start, but I think I need to use at least two transactions. I don't think I'm going to have much joy convincing the DBAs to turn on isolation levels. When we discussed it previously, they were lukewarm about the extra storage requirements and so on.
Scot Hauder answered
This doesn't have to be that complex:

1. CREATE your staging table with the same columns/types as the destination table; don't worry about indexes or anything else. Do not use SELECT...INTO, use a CREATE TABLE statement instead.
2. INSERT the refreshed data into the StagingTable.
3. If step 2 succeeds, do something like this:

    BEGIN TRAN
        -- exclusive table lock held for the duration of the transaction
        DELETE FROM ReportTable WITH (HOLDLOCK, TABLOCKX)
        INSERT ReportTable SELECT * FROM StagingTable
    COMMIT TRAN

Holding an exclusive lock for the duration of the transaction will block any reads from reports, and they will simply wait for the transaction to complete.
3 comments

dataminor commented:
Thanks Scot. More good ideas! Where you say "...Do not use SELECT..INTO, use CREATE TABLE statement instead..." what's the issue with SELECT...INTO? In this instance, assuming 30 seconds to load, it seems the worst the users would experience is a 30-second delay. This will be fine for the table referred to in the original post. If there's a quicker way (such as the synonyms @Kev has suggested) then there are other tables that will benefit from that. We want to refresh some of them much more regularly, so minimising the changeover time for them will be important.
Scot Hauder commented:
30 seconds to load what? To get the original data into the staging table, or to copy the data from the staging table to the report table?
dataminor commented:
To copy the data from the staging table to the report table - although now I think about it, it should be a fair bit less than that, because we're only talking about one step of a process that takes 30 seconds all up. I'll have to test to see for sure.
