question

Fatherjack avatar image
Fatherjack asked

BCP - Text data and Identity columns

I have a table with 70 columns in a prod database and need to work on the contents of a Text type column. My theory is to extract the relevant columns out to a quieter instance to create a Full Text Index and then let rip with the analysis. Records with certain content will be tagged and then their records in live will be updated depending on tags allocated. So, I tried SSIS and its slow so I am trying BCP. I created a format file for the table and edited it to the 5 columns I am interested in (inc PK column for source table and Text column) and then use BCP with queryout to extract the data (8M rows). When reviewing the data in the destination table the identity column has wrong values and dupes (if I set a PK on the table before insert the bcp in fails with PK violation error.) my format file > 10.0 > 5 > 1 SQLBIGINT 0 8 "@#\t\t#@" 1 HistoryID "" > 2 SQLCHAR 0 7 "@#\t\t#@" 2 Client\_Ref SQL_Latin1_General\_CP1\_CI\_AS > 3 SQLCHAR 0 1 "@#\t\t#@" 3 Type\_Ref SQL_Latin1_General_CP1_CI_AS > 4 SQLCHAR 0 1 "@#\t\t#@" 4 Category\_Ref > SQL_Latin1_General\_CP1\_CI\_AS > 5 SQLCHAR 0 0 "\\r\\n" 5 Notes SQL\_Latin1\_General\_CP1\_CI\_AS **bcp out** bcp "SELECT [ch].[HistoryID] , [ch].[Client_Ref] , [ch].[Category_Ref] , [ch].[Type_Ref] , [ch].[Notes] FROM database.[dbo].table AS ch WITH (NOLOCK) where ch.client_ref is not null" queryout E:\CRHist.datx -U username -P pwd -c -S svr1\inst1 -t @#\t\t#@ -r \r\n -C ACP -E **bcp in** bcp doombar.dbo.Table in e:\crhist.datx -S svr2\inst2 -T -f e:\crhist_c.fmt -e e:\BCP_ERR.Log -m 900 -b 100000 -E I have tried this with the IDENTITY on the source off but it makes no difference. The output file (.dat) has the PK values in it, in order as expected. [Edit] I have since been looking at the destination data. In the source a row may have a HistoryID of 1,2,3,4 etc, in the destination is is always 49,50,51,52 etc. I am wondering if the delimiter is affecting the perceived value of the first column? ascii('@')= 64 and ascii('#')=35 so not sure how `1@# #@CLIENTREF@# #@D@#` in the .dat file is changing to `49,'CLIENTREF'` once its imported... `2177801` becomes `13845291861487922` though so not a straight addition of 48 :( [Edit II] Have identified the issue (i think) - the text column has a \r\n string in it and is breaking the format of the data file. What can I use as alternatives to \r\n? [Edit III] OK, its working. I changed the row terminator to \\r\\r\\n and the identity values import cleanly.
sql-server-2008bcpidentity-insert
11 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

WilliamD avatar image WilliamD commented ·
It will be the delimeter, what was the reason for such a strange one? is it at all possible to go for something a little more "normal", a double-pipe "||" for example - that may help
0 Likes 0 ·
Fatherjack avatar image Fatherjack ♦♦ commented ·
desperation! I tried double and even triple tab (\\t\\t\\t) as the delimiter but they crop up in the text column and then the import goes all to hell. Will try || and BRB.
0 Likes 0 ·
Fatherjack avatar image Fatherjack ♦♦ commented ·
using "||" as the delimiter results in Unexpected EOF encountered. Zero rows imported.
0 Likes 0 ·
WilliamD avatar image WilliamD commented ·
Have you tried using -C RAW or -C OEM instead of ACP for the encoding?
0 Likes 0 ·
Fatherjack avatar image Fatherjack ♦♦ commented ·
wheels fell off when using RAW, not tried OEM ... BRB ... No difference :(
0 Likes 0 ·
WilliamD avatar image WilliamD commented ·
tried the linked server, or is that not a possibility?
0 Likes 0 ·
Fatherjack avatar image Fatherjack ♦♦ commented ·
Have identified the issue (i think) - the text column has a \\r\\n string in it and is breaking the format of the data file. What can I use as alternatives to \\r\\n
0 Likes 0 ·
WilliamD avatar image WilliamD commented ·
BCP is really fast, but flimsy with the format files. I'm surprised that SSIS isn't as fast - do you have some sort of error in the setup there? You were at SQLBits, did you see Thomas Kejser showing how fast SSIS can be?
0 Likes 0 ·
WilliamD avatar image WilliamD commented ·
For the \r\n issue, i suggest a strange character again, similar to double pipe to make sure it can't really happen, square rbrackets maybe? "]["
0 Likes 0 ·
SQL Kiwi avatar image SQL Kiwi commented ·
Wouldn't it have been easier to use native format? It's faster, and you don't have to worry about terminators. I'm surprised the SSIS solution was 'slow' - had you set it up to use fast load on the destination table, and met all the conditions for minimally-logged fast loads?
0 Likes 0 ·
Fatherjack avatar image Fatherjack ♦♦ commented ·
@SQLKIWI - SSIS was slower than I expected and I wanted a comparison so that I can choose the best solution. Yes - destination was fast load. I dont know enough SSIS to say I have done everything for fast loads - is there a source I can reference to check? Native seemed to have issues with the Text column contents
0 Likes 0 ·

1 Answer

·
WilliamD avatar image
WilliamD answered
I don't quite understand the problem here: Your source table has a HistoryID (Bigint, PK) and the other columns. You spew these out into your CRHIST file. You read that into a second instance to do the data crunching, batched in 100k lots. Are your duplicates occuring in the export to data file, or the insert into target table? Have you tried running bcp without the batching? Set the target DB into BULK_LOGGED recovery to speed that up. 8 million rows should fly into a table, especially when the data is as small as you have there (130MB by my calculation - or have i misread the table def?). You can always create the PK afterwards. Alternatively, have you tried to grab the data via a linked server and `SELECT INTO`? **EDIT:** I think I got the bugger! Test data so you can give it a try: CREATE TABLE [dbo].[TableA] ( [Col1] [int] NOT NULL, [Col2] [varchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ) ON [PRIMARY] GO CREATE TABLE [dbo].[TableZ] ( [Col1] [int] NOT NULL, [Col2] [varchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ) ON [PRIMARY] GO INSERT INTO Sandbox.dbo.TableA SELECT 1,'blah' UNION ALL SELECT 2,'hjdf' -- export, pipes as column splitter, newline for row xp_cmdshell'bcp "SELECT Col1,Col2 FROM Sandbox.dbo.TableA TA" queryout "C:\CRHist.datx" -T -c -S -t"||" -r"\n" -C OEM' -- create a format file according to the target table xp_cmdshell'bcp Sandbox.dbo.TableZ format nul -f c:\format.Fmt -n -T ' -- use the format file with bulk insert INSERT SandBox.dbo.TableZ SELECT Col1-48, -- Interprets the int as an ASCII code FAIL!!! Col2 FROM OPENROWSET(bulk N'C:\CRHist.datx',FORMATFILE = 'c:\format.Fmt') AS a DROP TABLE SandBox.dbo.TableA DROP TABLE SandBox.dbo.TableZ
4 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Fatherjack avatar image Fatherjack ♦♦ commented ·
the .dat file is well formed and accurate ( at least for the 1st few hundred rows that I have inspected so I assume all of it), however the HistoryID value in the destination is altered to random values and includes duplicates. Thus stopping me adding the PK and making sensible use of the data. Somehow the data is being altered during the import. See my edit to the question.
0 Likes 0 ·
Fatherjack avatar image Fatherjack ♦♦ commented ·
Oh, .dat file is 1.2GB uncompressed. Speed of insert isnt the issue, simply data integrity currently!
0 Likes 0 ·
WilliamD avatar image WilliamD commented ·
tried the linked server, or is that not a possibility?
0 Likes 0 ·
Fatherjack avatar image Fatherjack ♦♦ commented ·
haven't tried linked servers - was happy with disconnected transfer of data rather than having to wait for quiet time on both source and dest. Rapidly thinking slow perf SSIS is better option. Never had a good experience with BCP :(
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.