x

SSIS VS. SQL: where to check record's validity before inserting?

Hi, I have an SSIS package that processes an incoming spreadsheet. Before inserting/updating records I need to check if a date in each of them - say, "Job Close Date" - is not before a corresponding date - say, "Job Open Date" - that is stored, for each "Job" in a table is the database. I use a store procedure that is called from SSIS package and performs dates check and, if "Job Close Date" is valid, goes on and performs insert/update. My boss does not like this technique, he says that the best practice would be to put this logic into the package using, say, conditional split. Is this correct that performing check-ups and insert/update via SQL stored procedure is worse than doing this in the SSIS package? If yes, why? And, anyway, I don't see how conditional split could check "Job Close Date" record by record... could it? Thank you, - Tatyana
more ▼

asked Jan 21 '11 at 07:26 AM in Default

Tatyana gravatar image

Tatyana
413 15 15 18

Do you want to reject the whole batch of records if there is an invalid date or process all of the records with valid dates? The logic could probably be writen into the SSIS package for either situation, but I typically think of a SSIS packages rejecting the whole file if one error occurs.
Jan 21 '11 at 08:57 AM Mark
Hi Mark, I want to process all the records with valid dates, and reject (and possibly save them in an error output file) those with invalid dates. I think that "logically" invalid values would not cause rejecting the whole file as it might be in a case of an invalid format. It would just go on and insert me all the records...
Jan 23 '11 at 04:55 PM Tatyana
(comments are locked)
10|1200 characters needed characters left

2 answers: sort newest

i agree with Kannan, it would be better to perform the checks on the data before inserting them into the table. Use a conditional split, and ignore the data that is invalid.

If you are inserting the data into a table, then use SSIS, however if you are updating existing data, then you should always insert the data into a temp table and then use SQL to perform the updates. Using the OLEDB Command transform to perform updates only updates one row at a time (RBAR) and should be avoided. And, because you are removing the data you don't need in the SISS package, you can remove the where clauses in the update statement that you would normally use.
more ▼

answered Jan 23 '11 at 03:53 PM

Daniel Ross gravatar image

Daniel Ross
2.9k 10 12 13

I think this answer together with Daniel's comment above give me a complete understanding of my options and their reasons. Adding another source to a routine certainly opens way to many possibilities that I did not consider before. Thank you, Daniel!
Jan 25 '11 at 05:11 AM Tatyana
(comments are locked)
10|1200 characters needed characters left

My preference would be to do this in a SSIS package. The Conditional split gives you two outputs ( conditions based on the validity of Job close date) The out puts can then be inserted/updated parallelly. Yes it does check the condition for each row. Based on properties like DefaultBufferMaxRows and DefaultBufferSize in the data flow task you could further enhance the performance.

Thanks, Kannan
more ▼

answered Jan 21 '11 at 07:38 AM

aRookieBIdev gravatar image

aRookieBIdev
2.3k 42 54 58

Thank you, Kannan. From you explanation (and Daniel Ross's comment below) I understood that it is, indeed, more effecient to perform inserts/updates from SSIS. However, I don't see how the Conditional Split can be used here. If I had one "Job close date" for all the records, I could put it into the condition expression; however, it is different for each row (based on a "Job ID") and is stored in another table in the database. In the Conditional Split Editor I can reach for the values of Variables or values of the Columns in the currently processed file, but not of the fields of tables stored in the database. If I'm wrong, maybe you could give me a sample of an expression that would allow to compare a value in a record of a currently processed file with a corresponding value in the database? Thank you!
Jan 23 '11 at 05:33 PM Tatyana

now you are getting into a more complex package. There a two main ways of getting data from a table. 1. you can perform a lookup and add the data from the table into the dataflow, or two you can add another OLEDB datasource and use the merge join transformation. The difference of the two can vary depending on the data, as always you should test the two different methods. After you have the Job Close Date then you can use this in the Conditional Split.

You should also test the SQL method, it may well come out that the SQL method is faster than getting the existing data from the package. If it turns out that the SQL method is a faster method and your boss is open to constructive criticism (Some aren't) then show your boss your results!
Jan 23 '11 at 09:39 PM Daniel Ross
(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

New code box

There's a new way to format code on the site - the red speech bubble logo will automatically format T-SQL for you. The original code box is still there for XML, etc. More details here.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

SQL Server Central

Need long-form SQL discussion? SQLserverCentral.com is the place.

Topics:

x900
x22
x14

asked: Jan 21 '11 at 07:26 AM

Seen: 3485 times

Last Updated: Jan 21 '11 at 07:26 AM