Do you think it's reasonable to use SQL to automate string manipulation when files coming in are never the same? I have used SAS, Excel, and FoxPro in the past to do this type data cleansing and just want to make sure I'm headed down the right path in learning the manipulation techniques in SQL. Things that would need to be performed: - cleansing & parsing addresses - calculations on dates/dates & times for employment records - merging of records - identifying duplicates - reporting a summary of data issues and resolutions. These are the most common things. Please any advice would be helpful on what would be a development software to use or if i should continue to learn the sql techniques necessary.
Where is the data being collected? Its fine and dandy to have PoSh, SSIS, Excel Macros, Python, Perl, C# and any number of other methods to catch and cleanse data but where it is collected is the best place to fix the issue. Get the UI improved to 'encourage' the user to enter valid, well formatted values that then support the database functions rather than distract effort from other areas just to clean up a mess. I know its not easy but I would seriously be reporting the time/effort spent on sorting the mess so that someone can see that a bit of time from a developer could make things permanently a lot smoother later down the data life-cycle.
As other answers have made clear, there are lots of options for this. I run into this regularly, and I personally use either [Python] or IronPython. I personally find Python to be more intuitive and easier to use than things like PowerShell, more flexible than SSIS, and skips the compile step for languages like C#. Now, with that said, PowerShell, SSIS, and C# are all fine tools that I have also used for this purpose depending on the details of the situation. And in particular, if you want to do substantially string manipulation inside of SQL Server for some reason, then most likely your best option is to create a CLR assembly in C# or
VB.NET and attach it to SQL server. So, I do not think there is a best string manipulation software all around. All other things being equal, I use and recommend Python, but there are certainly times when SSIS, PowerShell, C#, etc. are very good candidates. One of the key questions is, which of the many tools available are you most comfortable with? :