I am writing ETL that brings in about 50M rows into one of my tables and then creates a Clustered Index. Which option should I do: 1. Sort in source, insert sorted, create clustered index 2. Read from source unsorted, sort while creating clustered index in my system
If you can insert the data sorted, I would create the clustered index first, then insert. But if you can't insert data sorted, you should create the clustered index later. Whichever is fastest is hard to tell, it depends on how fast you can sort the data in the source, how fast is your disk, how fast is your CPU(s), how fast and big is your RAM etc. Test the two scenarios with substantially smaller data sets. When you do insert, you might (probably will) benefit from splitting your inserts into smaller packages.
As you mentioned that you insert sorted data first then create Index so I would recommend to create cluster index first then insert the sorted data. Or you can also try given below process- (1) Without sort data insert in table. (2) Create Clustered Index with option - with (SORT_IN_TEMPDB=ON).