question

Ben Rees avatar image
Ben Rees asked

Cleaning data with T-SQL

Hi - does anyone have any tips for cleansing data with T-SQL? I've got a lot of columns of data to clean up and it seems the only way of cleaning it all is to handcraft T-SQL code to:

1) first profile my data, then, 2) Clean it.

For example, I have a column of datetimes with a lot of noise in it - nulls, future dates (there shouldn't be!) and dates that are far too old (i.e. they were incorrectly entered). I need to end up with some script that strips out (or possibly even cleans) the invalid rows. Trouble is, I need to do this for a lot of columns...

Does anyone know of a quicker way of doing this - is there something like this in SSIS or elsewhere that will make this job easier/quicker? A profiling tool of some sort?

Thank!

business-intelligencedata-cleansing
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

TimothyAWiseman avatar image
TimothyAWiseman answered

I think you will get the best results by using hand crafted T-sql. However, SSIS does have some tools built in, as described in this article by http://msdn.microsoft.com/en-us/magazine/cc163731.aspx .

10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Grant Fritchey avatar image
Grant Fritchey answered

There's nothing built into SSIS or TSQL to facilitate this. You'd need to track down a third party tool.

10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

GPO avatar image
GPO answered
I know it's a necessary evil in organisations where there's a buck-passing attitude that "app is near enough, the BI team can sort the junk out later" but cleanse at your own peril and only as part of a comprehensive feedback loop. 1. If there is stuff that just should not be getting into the data, the fault needs to be fed back and fixed in the source application and database ASAP. 2. If the business managers are too lazy or flippant to care about getting the source application working properly, a business intelligence solution is probably beyond them anyway. It's a tool. It's not a magic wand. 3. Whatever "cleansing" you do, make sure your decisions are fed back to the owners of the data and that they AGREE and SIGN OFF on them. 4. As Jeff Moden says "If the data is worth having, it's worth validating." I'd just add to that "...as close to the point of entry as possible."
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.