question

Tatyana avatar image
Tatyana asked

SSIS VS. SQL: where to check record's validity before inserting?

Hi, I have an SSIS package that processes an incoming spreadsheet. Before inserting/updating records I need to check if a date in each of them - say, "Job Close Date" - is not before a corresponding date - say, "Job Open Date" - that is stored, for each "Job" in a table is the database. I use a store procedure that is called from SSIS package and performs dates check and, if "Job Close Date" is valid, goes on and performs insert/update. My boss does not like this technique, he says that the best practice would be to put this logic into the package using, say, conditional split. Is this correct that performing check-ups and insert/update via SQL stored procedure is worse than doing this in the SSIS package? If yes, why? And, anyway, I don't see how conditional split could check "Job Close Date" record by record... could it? Thank you, - Tatyana
ssissplitconditional
2 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Mark avatar image Mark commented ·
Do you want to reject the whole batch of records if there is an invalid date or process all of the records with valid dates? The logic could probably be writen into the SSIS package for either situation, but I typically think of a SSIS packages rejecting the whole file if one error occurs.
1 Like 1 ·
Tatyana avatar image Tatyana commented ·
Hi Mark, I want to process all the records with valid dates, and reject (and possibly save them in an error output file) those with invalid dates. I think that "logically" invalid values would not cause rejecting the whole file as it might be in a case of an invalid format. It would just go on and insert me all the records...
0 Likes 0 ·
Daniel Ross avatar image
Daniel Ross answered
i agree with Kannan, it would be better to perform the checks on the data before inserting them into the table. Use a conditional split, and ignore the data that is invalid. If you are inserting the data into a table, then use SSIS, however if you are updating existing data, then you should always insert the data into a temp table and then use SQL to perform the updates. Using the OLEDB Command transform to perform updates only updates one row at a time (RBAR) and should be avoided. And, because you are removing the data you don't need in the SISS package, you can remove the where clauses in the update statement that you would normally use.
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Tatyana avatar image Tatyana commented ·
I think this answer together with Daniel's comment above give me a complete understanding of my options and their reasons. Adding another source to a routine certainly opens way to many possibilities that I did not consider before. Thank you, Daniel!
0 Likes 0 ·
aRookieBIdev avatar image
aRookieBIdev answered
My preference would be to do this in a SSIS package. The Conditional split gives you two outputs ( conditions based on the validity of Job close date) The out puts can then be inserted/updated parallelly. Yes it does check the condition for each row. Based on properties like DefaultBufferMaxRows and DefaultBufferSize in the data flow task you could further enhance the performance. Thanks, Kannan
2 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Daniel Ross avatar image Daniel Ross commented ·
now you are getting into a more complex package. There a two main ways of getting data from a table. 1. you can perform a lookup and add the data from the table into the dataflow, or two you can add another OLEDB datasource and use the merge join transformation. The difference of the two can vary depending on the data, as always you should test the two different methods. After you have the Job Close Date then you can use this in the Conditional Split. You should also test the SQL method, it may well come out that the SQL method is faster than getting the existing data from the package. If it turns out that the SQL method is a faster method and your boss is open to constructive criticism (Some aren't) then show your boss your results!
1 Like 1 ·
Tatyana avatar image Tatyana commented ·
Thank you, Kannan. From you explanation (and Daniel Ross's comment below) I understood that it is, indeed, more effecient to perform inserts/updates from SSIS. However, I don't see how the Conditional Split can be used here. If I had one "Job close date" for all the records, I could put it into the condition expression; however, it is different for each row (based on a "Job ID") and is stored in another table in the database. In the Conditional Split Editor I can reach for the values of Variables or values of the Columns in the currently processed file, but not of the fields of tables stored in the database. If I'm wrong, maybe you could give me a sample of an expression that would allow to compare a value in a record of a currently processed file with a corresponding value in the database? Thank you!
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.