I have a table with 70 columns in a prod database and need to work on the contents of a Text type column. My theory is to extract the relevant columns out to a quieter instance to create a Full Text Index and then let rip with the analysis. Records with certain content will be tagged and then their records in live will be updated depending on tags allocated.
So, I tried SSIS and its slow so I am trying BCP. I created a format file for the table and edited it to the 5 columns I am interested in (inc PK column for source table and Text column) and then use BCP with queryout to extract the data (8M rows). When reviewing the data in the destination table the identity column has wrong values and dupes (if I set a PK on the table before insert the bcp in fails with PK violation error.)
my format file
I have tried this with the IDENTITY on the source off but it makes no difference. The output file (.dat) has the PK values in it, in order as expected.
I don't quite understand the problem here:
Your source table has a HistoryID (Bigint, PK) and the other columns. You spew these out into your CRHIST file. You read that into a second instance to do the data crunching, batched in 100k lots.
Are your duplicates occuring in the export to data file, or the insert into target table?
Have you tried running bcp without the batching? Set the target DB into BULK_LOGGED recovery to speed that up. 8 million rows should fly into a table, especially when the data is as small as you have there (130MB by my calculation - or have i misread the table def?).
You can always create the PK afterwards.
Alternatively, have you tried to grab the data via a linked server and
I think I got the bugger!
Test data so you can give it a try: