I know that a huge update like 50% of the data will cause an index fragmentation.
The table ( when clustered ) and the index ( nonclustered ) stores data "ordered", lets say, from 0 to 10.
So, does updating it to -1 will cause all remaining values to be moved on the index and be slower then updating it to 9999 ( or max int size )?
What is better/faster update it to -1 or 9999 ? or it doen't matter ? AND if we consider the defragmentation of the index as a required step ?
Edit: A news question: Wont the update to the maximum value ( or 9999 ) affect further work with the table, for example, if 50% of the data is on the last pages of the table, any include/update will be in the middle of the data, so any pages after that could be affected...
Edit2: I performed the following tests:
But i got inconclusive results, the difference was in miliseconds on both update and reorganize index even when a lot of insertions appeared in the middle of the pages.
I am starting to guess there is no difference between minimum value ( or -1 ) and maximum value ( 9999 ) when updating an indexed column, What do you think ?
Have you considered partitioned views? Partitioned tables?
answered Oct 22 '09 at 04:36 PM
I'm staring at Kalen Delany's book on the SQL 2005 storage engine, but I don't recall anything in it precisely on point. I suspect that it would be slightly faster to set to 9999 because as rows were moved to new pages, the new pages could be populated without having to update pointers to the first page starting with 0 in your subject column.
In other words, after the first page of -1s is populated, it would point to the first page containing 0's as the next page. When a second page for -1s is created, the first page of -1s might have to be altered to point to the second page of -1s rather than the starting page for 0s. The new page of -1s would point to the first page of 0s.
However, I'm not sure whether or not such an update would have to happen for each new page, or just for the first page of 9999s. Also, I would expect both pages involved to be in ram at the time of the update, so the performance difference should be miniscule.
Fragmentation should be a wash, because I wouldn't expect page splits to occur on the new pages with either value, and the old pages would be fragmented identically. However this is all speculation.
My strong suggestion is to create a sample table with a million rows or so rows and conduct some time trials. Nothing beats testing to get a definitive answer.
answered Oct 22 '09 at 01:19 PM
I can't say for sure if -1 or 9999 is faster (but probably 9999). But i would say that this is theoretical.
If it is a non clustered index, i would drop the index and then update the column and then add the index with ONLINE=on an SORT_IN_TEMPDB=on, but then you probably say that you cannot get a service window to do this update. If thats the case, you probably can't do your update anyway, due to the fact that your update will lock the entire table during the transaction. Of course, you can prevent lock escalation, but that will slow you down even more, i would even force a tablock to get the best performance.
But then you say, I use a where condition on this column and need the index, but the index will not help you if as much as 50M out of 100M will be updated, and SQL will do a table or index scan anyway.
answered Oct 22 '09 at 01:58 PM
If you are on SQL Server 2008 then you can create a filtered index - which would ignore values that didn't match the filter. Therefore when you updated the rows, they would simply be removed from the index, which, while leaving spaces, wouldn't fragment it any further. This is, of course, assuming that you don't need the inactive rows to be accessed (i.e. you don't have a query that says 'select the inactive rows').
You would then find that the cost of re-organising that index afterwards was significantly reduced.
answered Oct 22 '09 at 03:25 PM
Matt Whitfield ♦♦
If the index is nonclustered, then I recommend using a filter to exclude inactive records. That way, your index won't even include those values and thus will perform faster for active records.
If you don't filter the index, then I think a very high number (beyond reach of valid values) would be better because it might help with certain queries that perform an index scan. I have no real knowledge or testing experience, but that is my gut feeling.
Edit: different approach
Is there a reason you need this inactive data? If so, is there a reason it can't be moved to a different table? I know this can cause a problem with referential integrity and foreign keys, but I suggestion that might work even if that is the case. This may or may not be worthwhile given your situation.
I think most queries could remain unchanged if the current table name is used for the active records. For those queries that need to check just inactive records, change to that table. For queries that need to check both, use the view. Obviously, if you need to check both a lot, this is not a good solution because it will mean using an outer join view a lot.
We did something like this on a project a few years ago, and it worked quite well. (I just hope I'm not forgetting any details we discovered along the way.)