We are going to be bulk loading a hundred million records at a time and we need to be VERY performance conscious. I have seen several methods but need the best performing one.... ideas?
I'm a big fan of bcp. For large tables, I'll test the loads with keys/indexes in place and without them. If there's a significant advantage to the latter, I'll script dropping the keys/indexes, doing the bcp in, and then recreating the keys/indexes (the clustered one first). And if you need to clear out a table before piping in the new data, use truncate table if possible.
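For example, here's a rough sketch of that pattern; the table, index, file, and server names are all made up for illustration, and the bcp flags shown are just the common ones for a fast native-format load:

-- Assumed objects: dbo.LargeTable with clustered PK PK_LargeTable and
-- non-clustered index IX_LargeTable_NaturalKey; data file C:\loads\data.dat.

-- 1. Clear the target (if appropriate) and drop the indexes before the load
TRUNCATE TABLE dbo.LargeTable;
DROP INDEX IX_LargeTable_NaturalKey ON dbo.LargeTable;
ALTER TABLE dbo.LargeTable DROP CONSTRAINT PK_LargeTable;

-- 2. Run bcp from the command line, e.g.:
--    bcp MyDb.dbo.LargeTable in C:\loads\data.dat -n -b 100000 -T -S MyServer
--    (-n = native format, -b = batch size, -T = trusted connection)

-- 3. Recreate the clustered key first, then the non-clustered index
ALTER TABLE dbo.LargeTable ADD CONSTRAINT PK_LargeTable PRIMARY KEY CLUSTERED (ID);
CREATE NONCLUSTERED INDEX IX_LargeTable_NaturalKey ON dbo.LargeTable (SourceSystem, SourceKey);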
I agree with Susan... BCP is great for stuff like this
If SSIS is an avenue you're looking to go down, be VERY careful, especially if you do a "Lookup Match". Doing this will basically process your data row by row, much like a cursor would (not a good performance choice!! :) lol).
There's also a fast parse option available in SSIS which gives faster performance:
http://weblogs.sqlteam.com/mladenp/archive/2006/05/26/9992.aspx
Usually I get the data into a staging table and then do the upsert via a stored procedure. The load would be BULK INSERT or SSIS. Try them both for performance - I prefer BULK INSERT for ease of use. It stopped supporting UTF-8 though (not sure if that was 2005 or 2008), so if you are getting data in that format SSIS is probably best.
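As a minimal sketch of the staging load with BULK INSERT - the table name, file path, and terminators are placeholders for whatever the real feed looks like:

-- Assumed staging table dbo.Import_Sales and a plain delimited file;
-- adjust terminators/options to match the actual feed.
BULK INSERT dbo.Import_Sales
FROM 'C:\loads\sales_20240101.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 100000,  -- commit in chunks instead of one huge transaction
    TABLOCK                    -- allows minimal logging in simple/bulk-logged recovery
);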
Once the data is in the staging table you can then insert unfound rows and update found rows. For this you will need a natural key - have a non-clustered index on it in the destination table to do the check. For this number of rows you might be better off with an artificial key too (identity) and making that the clustered index. You can then efficiently find rows that match and check the values for an update. Also, depending on how much memory you have, it might be best to run the process in batches from the import (staging) table.
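To illustrate (all names invented here), the destination might look like this - artificial identity key as the clustered index, natural key covered by a non-clustered index for the found/unfound check:

-- Hypothetical destination table
CREATE TABLE dbo.Sales
(
    SaleID       bigint IDENTITY(1,1) NOT NULL,
    SourceSystem varchar(20)   NOT NULL,
    SourceKey    varchar(50)   NOT NULL,
    Amount       decimal(18,2) NULL,
    CONSTRAINT PK_Sales PRIMARY KEY CLUSTERED (SaleID)
);

-- Non-clustered index on the natural key so the match check
-- doesn't have to scan the whole table.
CREATE NONCLUSTERED INDEX IX_Sales_NaturalKey
    ON dbo.Sales (SourceSystem, SourceKey);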
There is a temptation to do all this in SSIS, but you often end up either doing a database read for each row to insert or loading all the natural keys and a checksum from the destination table. Whatever you do, don't be tempted to try the Slowly Changing Dimension task!
You will need two statements - update existing rows, then insert new ones. It might be worth adding an indexed identity to the staging table and using it to flag inserts and updates in another table, then joining to that for the actual processing.
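A sketch of the two-statement version against the hypothetical staging/destination pair above (same made-up names); for 100M rows you would wrap this in batches keyed on the staging identity:

-- 1. Update rows that already exist in the destination
UPDATE d
SET    d.Amount = s.Amount
FROM   dbo.Sales AS d
JOIN   dbo.Import_Sales AS s
       ON  s.SourceSystem = d.SourceSystem
       AND s.SourceKey    = d.SourceKey;

-- 2. Insert rows with no match on the natural key
INSERT INTO dbo.Sales (SourceSystem, SourceKey, Amount)
SELECT s.SourceSystem, s.SourceKey, s.Amount
FROM   dbo.Import_Sales AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM dbo.Sales AS d
                   WHERE d.SourceSystem = s.SourceSystem
                     AND d.SourceKey    = s.SourceKey);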
I'm currently doing this in 2008 using a MERGE statement (to populate an audit trail at the same time). I had my doubts about the usefulness of MERGE but it's quite handy.
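Roughly, the MERGE version looks like this (2008+), again with made-up table names and a hypothetical dbo.Sales_Audit table for the audit trail:

MERGE dbo.Sales AS d
USING dbo.Import_Sales AS s
      ON  d.SourceSystem = s.SourceSystem
      AND d.SourceKey    = s.SourceKey
WHEN MATCHED THEN
    UPDATE SET d.Amount = s.Amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (SourceSystem, SourceKey, Amount)
    VALUES (s.SourceSystem, s.SourceKey, s.Amount)
-- $action records whether each row was an INSERT or an UPDATE
OUTPUT $action, inserted.SourceSystem, inserted.SourceKey, GETDATE()
    INTO dbo.Sales_Audit (Action, SourceSystem, SourceKey, AuditDate);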