|
Say the source data comes in excel format, below is how I import the data.
I was wondering how to make this procedure more efficient in every step? I have some idea but not complete. For step 2&6, using scripts that can check automatically and print out all error row/column data. So it's easier to remove all errors once. For step 3&5, is there any way to automatically update the table without manually go through the importing steps? Could the community advise, please? Thanks.
(comments are locked)
|
|
In summary, to import a dirty excel file, I would write a script to clean it up if possible (it almost always is depending on how it is dirty) and then use DTS and that should be the entire process. No additional steps should normally be needed to ge the data into SQL. Here is my comments on each step:
[Edited to address the comments] Regarding Python libraries that work with excel, I would start with http://www.python-excel.org/. It has the downloads for xlrd and xlwt which together let you handle excel from within Python quite nicely. pyExcelerator is another option I have used in the past, though I think that xlrd and xlwt are superior for most use cases. As to the date format, that should be relatively easy to handle though precisely though how you do it depends on the exact situations. If they are stored as dates in excel (or you can easily convert them to that) then you should have no problem importing them into SQL as dates. If they are stored as text that is in one of the standard date formats, then SQL is pretty good (but not perfect so do some testing!) about doing implicit conversions with very little user work. And if they are truly messed up then the Python datetime library is very good at parsing any date format that is remotely close to any of the standard formats so you can use that as one of many options to do that cleanup in an automated fashion. Thanks for the reply.
May 13 '10 at 10:08 PM
Stan
Also, to explain how dirty the source data is: 1. might have duplicate columns that violate key restriction, 2. datatime format in Excel cells might be inconsistent, eg. some may be 'May 13 2010 5:00pm', some 'May 13 2010'. 3. some rows that have several columns empty
May 13 '10 at 10:15 PM
Stan
(comments are locked)
|

