SSIS Buffers, Synchronous Transforms and Estimated Row Size

Can someone provide further insight into how the SSIS buffer manager calculates the 'Estimated Row Size'?

I know that metadata of source components is examined and that the maximum size of variable-length strings is used. Am I right in thinking that it also depends on the Transformations used in the buffer's execution tree?

I ask because the Derived Column Transformation is synchronous (no memory is copied as buffers are reused). This means that any Derived Column Transforms in the same execution tree affect the number of rows the buffer can handle. This obviously extends to other Transformations and is not limited to synchronous ones.

I feel I'm answering my own question here but would really appreciate a nod from someone more familiar with SSIS internals.

It's also worth pointing out as every online article I've consulted on this cites 'the metadata [about your] source data' as the basis for the Estimated Row Size. I've yet to find one that mentions Transformations!

more ▼

asked May 24, 2012 at 04:37 PM in Default

avatar image

0 1 1 2

(comments are locked)
10|1200 characters needed characters left

1 answer: sort voted first

I do not think you would find any mentioning of the Transformations to calculate the Estimated Row Size ;) You are mixing it up.

Estimated Row size is used to calculate the size of the buffer.

Asynchronous and Synchronous transformations defines whether additional buffers would be needed or the incoming input buffers can be utilized.

I have kept it simple, so you do not get more confused.

more ▼

answered May 28, 2012 at 12:27 PM

avatar image

Usman Butt
14.6k 6 13 21

I know that Estimated Row Size is used for calculating the size of the buffer. My point is that I've run tests which indicate that the Estimated Row Size, and therefore buffer size, is governed by Transormations also. You say: 'Asynchronous and Synchronous transformations defines whether additional buffers would be needed or the incoming input buffers can be utilized.' but that doesn't make sense in the context of synchronous transforms where buffers are always reutilised (such as in the Derived Column Transform). If they are reutilised, where do these 'additional buffers' come in?

This is the test I've performed:

I've created an OLEDB source component returning three integers (DT_I4). I'd expect the memory required for a single row to be 12 bytes (24 if the error output occupies memory space in the same buffer).

I've set the data flow properties as follows:

DefaultBufferMaxRows: 500,000

DefaultBufferSize: 10,485,760

The required buffer size is therefore be 12bytes 500,000 = 6,000,000. This is below DefaultBufferSize so SSIS should scale up the number of rows per 10,485,760 byte buffer to 873,813 (10,485,760 / 12). On execution, however, I observe exactly half, 436,907 rows/buffer ((10,485,760 / 24*), leading me to believe that the error output occupies half the memory space within the buffer.

With these results at hand, I added a Derived Column Transform to insert a further 7 integer columns to the pipeline. The memory required for a given row should now be 40 bytes (80 bytes if the error output requires its own space in the buffer). Using the same buffer size properties as before:

DefaultBufferMaxRows: 500,000

DefaultBufferSize: 10,485,760

The required buffer size should be 40bytes * 500,000 = 20,000,000 bytes. This exceeds 10,485,760 so SSIS scales down the actual number of rows/buffer to the observed 131,072 rows/buffer. Again, this is exactly half of BufferSize/RowSize (or exactly 10,485,760 / 80) implying the error output is using half the buffer.

What does this show?

1) The Derived Column Transform has reduced the number of rows/buffer indicating that the part of the buffer (70% in this case) is allocated to the Dervied Columns.

2) The error output in the OLEDB source component utilises half the memory of its output buffer.

If this is incorrect, please elaborate and let me know where I've gone wrong. And please, don't worry about confusing me further! :)

May 28, 2012 at 02:56 PM andrew.diniz

@andrew.diniz I appreciate your motivation and testing. But, interestingly, how did you calculate the Estimated Row size? ;) Through the Derived Column Transformation? I do not think so. ;-) Through the meta-data, yes seems very much so?

Now, coming to the point i.e. your saying, derived column transformation decreased the rows per buffer? It is not actually the transformation. It is actually the amount of information you wanted. You have increased the number of columns and hence changed the meta-data. So, SSIS engine had to change the buffer size to keep up with the default values. I am supplementing it using your example with a different approach. You have the 3 four bytes integers from OLE DB Source, derived column transformation with 3 columns, but instead of adding these as new columns, use the Replace column property. Hence, replacing all 3 input columns. You will see that the buffer size would not change. It happened, because the columns did change but the meta-data (the number of bytes) remained the same and you are still using the derived column transformation.

Both Asynchronous and Synchronous transformations are more specific to performance and getting into their details is off the topic. They have nothing to do with the buffer size. It is always the meta-data that corresponds to the estimated row size and buffer size.

For your second point i.e. error output; since I don’t have to bother making you confused, if error output is part of the buffer, should not it be considered for estimating the row size? If yes, then should not the estimated row size for your example 1 be 24bytes * 500,000 = 12,000,000 ;-)

BTW, I hope you have already read this old but very thorough article

May 29, 2012 at 02:03 PM Usman Butt

Yes, I'm Estimating the row size using the meta data. My point is that we have to include the metadata of fields added to the pipeline from Transformations - not just Data Source Components. So really, I'm not 'mixing it up'! ;)

Regarding the reduction in rows/buffer as a result of the Derived Column transform... Semmantics. Admitedly, it would have been more accurate to write 'the presence of the additional columns created by the Derived Column Transform...' but my understanding is ok and I'd expect the behaviour you described in your example of replacing existing columns).

The final point is still bothering me. I had considered what you said (24bytes 500,000=12,000,000) and was not convinced but it's the closest I could come to an explanation (albeit incorrect). Can you explain why the actual rows/buffer out of the Source Component is exactly half* the predicted one? This is really poorly documented. The only possibly relevant resource I've found states that Data Source Components are Asynchronous and output 2 sets buffers: one for standard ouput and one for error. I don't see how this explains what I'm observing.

And of course, I have read that whitepaper :)

May 29, 2012 at 02:28 PM andrew.diniz

If you think your understanding is correct, then it is alright :-) Regarding exactly half rows/buffer size, you need to let know your testing and environment details. I remember things changed a bit with the versions, service packs etc. Also, through "show advanced editor" option, you could count the columns as well as sum up the number of bytes. There, you will also see two extra columns for ErrorCode and ErrorColumn.

May 30, 2012 at 09:32 AM Usman Butt
(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here



Answers and Comments

SQL Server Central

Need long-form SQL discussion? SQLserverCentral.com is the place.



asked: May 24, 2012 at 04:37 PM

Seen: 2936 times

Last Updated: May 30, 2012 at 09:32 AM

Copyright 2018 Redgate Software. Privacy Policy