question

red68 asked

Load .txt and .csv flat files quickly and with minimal effort

I have looked at and worked with a lot of different scenarios for loading files. SSIS is good, but if the metadata changes you have to remap the flat file connection manager and other things, and I don't want the department to do this continually. I am looking for a fast, efficient approach: read the header record, create a table, and load the file whether it is comma (,) or pipe (|) delimited. We can just create a "StagingTable" and load the contents into it. Any ideas? I prefer SSIS but am open to stored procs, etc. We remote into a Windows server and from there connect to the SQL database server, so if we use bulk insert the files have to be on the database server; I haven't figured out how to make a UNC path work. Ideas from anyone are appreciated. Think about loading files continuously, week by week, for processing: next week the metadata can be the same or it can change. Thanks!
ssis2012

@red68 -- considering the dynamic nature of this problem, I'd look into a solution using PowerShell (just because I am more comfortable with it). It looks like you need a trial-and-error solution that checks for things and behaves accordingly, and scripting is good at that, IMHO. There is a cmdlet in PowerShell called Import-Csv ( https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/import-csv?view=powershell-5.1 ). I'd start with it and then branch out from there to see if it can help solve the issue. Maybe there is a way of doing this in SSIS; I am just not that advanced in it.
I tried the PowerShell solution and it appears to be slow on large files, like 10 million rows; it is super fast for small ones, around 100,000 rows. I will keep researching.
@red68 -- well, you will experience slowness when you reach those heights with text files, or with any large data set for that matter. My recommendation would be to start thinking about how you can make this process run in parallel: for instance, splitting the files into multiple pieces, using workflows, and loading in a way that won't create any issues on your database server, as long as the compute can handle it.
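The file-splitting idea above can be sketched as follows (Python used purely for illustration, since the rest of this thread is C#/T-SQL; the chunk naming and chunk size are arbitrary assumptions):

```python
def split_file(path, lines_per_chunk=1_000_000):
    """Split a large delimited file into chunk files that can be
    loaded in parallel; the header row is repeated in every chunk
    so each piece remains independently loadable."""
    chunks = []
    with open(path) as src:
        header = src.readline()
        chunk, n, rows = None, 0, 0
        for line in src:
            if chunk is None:
                n += 1
                name = f"{path}.part{n}"
                chunk = open(name, "w")
                chunk.write(header)
                chunks.append(name)
            chunk.write(line)
            rows += 1
            if rows == lines_per_chunk:
                chunk.close()
                chunk, rows = None, 0
        if chunk:
            chunk.close()
    return chunks
```

Each chunk file could then be handed to its own parallel loader process.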
@red68 Are you open to a solution with the following logic?

- Use C# code to open and read just the first line of the delimited file in order to figure out the names of the columns, then create a staging table with all columns as varchar.
- Use a T-SQL script to insert the data into this newly created staging table via bulk insert.

This method will be as fast as it can be and will have no problem loading 10 million rows or more pretty quickly. Please let me know and I can type up the required C# and T-SQL; it would be simple code and should not take long to write. You can then translate it into an SSIS package with one script task and one T-SQL task, and no manual remapping would be required.
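The header-driven logic described above can be sketched as follows (Python purely for illustration; the delimiter detection and varchar(100) sizing mirror the C# answer below, and the table name is a placeholder):

```python
def build_staging_ddl(header_line, table="dbo.StagingTable"):
    """Detect the delimiter from the header row (comma wins over
    pipe) and generate a CREATE TABLE statement typing every
    column as varchar(100), as the proposed staging load does."""
    delimiter = "," if "," in header_line else "|"
    columns = header_line.rstrip("\r\n").split(delimiter)
    cols = ", ".join(c + " varchar(100) null" for c in columns)
    return delimiter, "create table " + table + "(" + cols + ");"

delimiter, ddl = build_staging_ddl("Id|FullName|City\r\n")
# delimiter is "|"; ddl declares the three varchar(100) columns
```

The generated DDL is then executed before the bulk insert, so no column mapping ever has to be maintained by hand.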
Perfect, that is exactly what I am looking for, Oleg. I will wait for your code. Thanks a million.
Oleg answered
As discussed in the comments, here is the solution in C#. It may be split into SSIS package steps if need be, but that would be overkill since SSIS adds quite a bit of weight; creating a SQL Agent job with a single step calling the EXE would do instead. Optionally, this C# solution can easily be restated as a T-SQL script; the pure T-SQL solution is listed below the C# one.

C# code (requires using directives for System, System.IO, System.Collections.Generic, System.Data and System.Data.SqlClient):

```csharp
static void Main(string[] args)
{
    try
    {
        using (StreamReader reader = new StreamReader("c:\\temp\\source.csv"))
        {
            // Script to drop the staging table if it already exists from a previous import.
            // This might be necessary in case each source file has a different shape/delimiter.
            string dropTable = @"
                if exists (
                    select 1 from sys.objects
                    where [object_id] = object_id(N'dbo.StagingTable') and [type] = N'U'
                )
                drop table dbo.StagingTable;";

            // Script used to create the staging table before bulk inserting data into it.
            string createTable = "create table dbo.StagingTable(";
            string headers = reader.ReadLine();

            // The line below assumes that the delimiter is pipe or comma. Enhance if necessary.
            string delimiter = headers.Contains(",") ? "," : "|";
            string bulkInsert = "bulk insert dbo.StagingTable from 'c:\\temp\\source.csv' " +
                "with (fieldterminator = '" + delimiter + "', firstrow = 2);";

            List<string> columns = new List<string>(headers.Split(delimiter.ToCharArray()));
            columns = columns.ConvertAll(item => item + " varchar(100) null");
            createTable += string.Join(", ", columns) + ");";
            reader.Close();

            using (SqlConnection cn = new SqlConnection(
                "Integrated Security=true;Initial Catalog=Test;server=."))
            {
                cn.Open();
                using (SqlCommand cm = new SqlCommand())
                {
                    cm.CommandType = CommandType.Text;
                    cm.Connection = cn;

                    // Drop the table if it exists
                    cm.CommandText = dropTable;
                    cm.ExecuteNonQuery();

                    // Create the staging table
                    cm.CommandText = createTable;
                    cm.ExecuteNonQuery();

                    // Bulk insert records from the delimited file
                    cm.CommandText = bulkInsert;
                    cm.ExecuteNonQuery();
                } // instance of command object is disposed

                cn.Close();
            } // connection is disposed
        } // reader is disposed
    }
    catch (Exception ex)
    {
        Console.WriteLine("Error occurred: " + ex.Message);
        // throw; /* re-throw the error if needed */
    }
    // Console.ReadLine();
}
```

When I tested this solution on my old Windows 7 laptop with 8 GB of RAM, importing a million rows took about 10 seconds, i.e. roughly 100K records per second. This is still better than pumping data from a flat file via an SSIS package, and the import speed should be better if the SQL Server instance is more powerful than my old laptop :) The same can be accomplished with T-SQL. I will add the pure T-SQL solution to this answer in about 30 minutes (or however long it takes to "translate" the C# to T-SQL). Even though ordinarily C# is about **what you can do to a row** while T-SQL is about **what you can do to a column**, it should be a straightforward exercise in this case because there is no "row processing" in the C# above.
*** Edit (added T-SQL solution, plus handling of double quotes as a text qualifier) ***

The T-SQL script below accomplishes the same thing and also handles double quotes as a text qualifier (which works if everything is qualified, header and data lines alike). Usually there is a reason a text qualifier is used: it handles the scenario where a column's value includes the delimiter as part of the value. In that case the common convention is to qualify all fields and values in the file with double quotes, so using a separate process to "prepare" the file by removing all double quotes is ill advised. For example, suppose there is a field named **FullName**, and the value of this column in one of the records is **Bozo, The Clown**. If the file is comma-delimited, this would break a bulk insert that uses comma as the field terminator, unless the file is text-qualified with double quotes. One way to handle this situation is to specify the field terminator as **","**. This still allows bulk insert to be used dynamically, without the need for a format file. The only downside is that the first column ends up prefixed, and the last column suffixed, with a double quote. This necessitates a final touch up (an update of all rows of the staging table, replacing the double quote in the first and last columns with blank). The good news is that this final touch up is executed only when necessary.
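To see why the composite **","** terminator works, here is a small Python sketch (illustration only; in the actual solution bulk insert does the splitting, and this touch-up logic runs as a T-SQL update):

```python
def split_qualified(line, base_delim=","):
    """Split a double-quote-qualified row on the composite
    delimiter '","' and strip the stray edge quotes that this
    terminator leaves on the first and last fields."""
    fields = line.rstrip("\r\n").split(f'"{base_delim}"')
    fields[0] = fields[0].lstrip('"')
    fields[-1] = fields[-1].rstrip('"')
    return fields

# An embedded comma survives because only '","' is a field boundary:
print(split_qualified('"1","Bozo, The Clown","Chicago"'))
# → ['1', 'Bozo, The Clown', 'Chicago']
```

Note how the comma inside **Bozo, The Clown** is preserved, which is exactly what the plain comma terminator would get wrong.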
Here is the updated script:

```sql
declare @headers varchar(max);
declare @delimiter varchar(3); /* char(1) is not enough if a text qualifier is present */
declare @createTable nvarchar(max);
declare @rowTerminator varchar(6);
declare @bulkInsert nvarchar(max);
declare @lastColumn int;
declare @updateScript nvarchar(max);

-- Read the header row, figure out the delimiter and column names,
-- and generate the bulk insert script
select
    @headers = replace(substring(t.BulkColumn, 1,
        charindex(char(10), t.BulkColumn) - 1), char(13), ''),
    @delimiter = case
        when charindex('","', @headers) > 0 then '","'
        when charindex('"|"', @headers) > 0 then '"|"'
        when charindex(',', @headers) > 0 then ','
        else '|' end,
    -- The outer replace removes the 2 stray double quotes
    -- (at the beginning and the end of the headers)
    @createTable = replace('create table dbo.StagingTable(' +
        replace(@headers, @delimiter, ' varchar(50) null,') +
        ' varchar(50) null)', '"', ''),
    @rowTerminator = case
        when charindex(char(13) + char(10), t.BulkColumn) > 0 then '0x0d0a'
        else '0x0a' end,
    @bulkInsert = 'bulk insert dbo.StagingTable from ''c:\temp\4357-test2.txt'' with (
        fieldterminator = ''' + @delimiter + ''', firstrow = 2,
        rowterminator = ''' + @rowTerminator + ''');'
from openrowset(bulk 'c:\temp\4357-test2.txt', single_clob) t;

-- Drop the staging table if it already exists
if exists (
    select 1 from sys.objects
    where [object_id] = object_id(N'dbo.StagingTable') and [type] = N'U'
)
    drop table dbo.StagingTable;

-- Create the staging table
exec sp_executesql @statement = @createTable;

-- Bulk insert records from the delimited file
exec sp_executesql @statement = @bulkInsert;

-- Final touch up, needed only if the file uses double quotes as the text
-- qualifier: in that case the bulk insert worked, but the first column ended
-- up prefixed and the last column suffixed with a double quote. Remove both.
if charindex('"', @delimiter) > 0 begin
    select @lastColumn = max(column_id)
    from sys.columns
    where [object_id] = object_id(N'dbo.StagingTable');

    select @updateScript = 'update dbo.StagingTable set ' + stuff((
        select ', ' + quotename([name]) + ' = replace(' + quotename([name]) + ', ''"'', '''')'
        from sys.columns
        where [object_id] = object_id(N'dbo.StagingTable')
            and column_id in (1, @lastColumn)
        for xml path('')), 1, 2, '');

    exec sp_executesql @statement = @updateScript;
end;
go
```

A test of the T-SQL script on my laptop showed a considerable improvement in import rate: it took only 3 seconds to read the headers, create the table and bulk insert a million rows into the staging table, i.e. about 300K rows per second. Testing 10 million rows took 40 seconds. Hope this helps. Oleg

I get an error using the T-SQL script: "Invalid length parameter passed to the LEFT or SUBSTRING function." My header does not end with a '|' but with a CRLF; I believe this is causing charindex to return a 0 or something. Any help greatly appreciated. Also, if I enter a new filename, I get the following error messages:

Msg 102, Level 15, State 1, Line 1
Incorrect syntax near '|'.
Msg 208, Level 16, State 82, Line 1
Invalid object name 'dbo.StagingTable'.

Thanks!
@red68 I know that the lines in the files do not end with a pipe; they usually end with a carriage return followed by a linefeed (char(13) + char(10) in T-SQL, a.k.a. CRLF). If the lines do end with these characters, there is no way the script fails to handle it. On the other hand, it is possible that the lines end with just a linefeed (char(10) only). If that is the case, adjust the line setting the value of the @headers variable to search just for the linefeed and always discard the carriage return whether it is present or not. The variable is then set correctly regardless of whether the lines end with LF only or with CRLF. Here is the updated line of code setting the @headers variable:

```sql
@headers = replace(substring(t.BulkColumn, 1, charindex(char(10), t.BulkColumn) - 1), char(13), ''),
```

Also, there is no way you would get script errors just by changing one file name for another, unless an error is introduced accidentally. When pasting the file name, please note that there are 2 places: one of them requires the path wrapped in single quotes, and the other in a pair of single quotes.
Thanks, I will give this a shot and let you know.
Still getting "Incorrect syntax near '|'". Here is the code; I don't see a syntax issue:

```sql
select
    @headers = replace(substring(t.BulkColumn, 1,
        charindex(char(10), t.BulkColumn) - 1), char(13), ''),
    @delimiter = case when charindex(',', @headers) > 0 then ',' else '|' end,
    @createTable = 'create table dbo.StagingTable(' +
        replace(@headers, ',', ' varchar(100) null,') + ' varchar(100) null)',
    @bulkInsert = 'bulk insert dbo.StagingTable from ''E:\PipeLoad\AWL\DeNovoAH00231370_TU_Linked_OP.txt'' ' +
        'with (fieldterminator = ''' + @delimiter + ''', firstrow = 2);'
from openrowset(bulk 'E:\PipeLoad\AWL\DeNovoAH00231370_TU_Linked_OP.txt', single_clob) t;
```
@red68 I copied and pasted the script from your comment and it parses and runs just fine, which probably means something else in your SSMS script window is causing the syntax error. Please run the script below and let me know whether you get any errors:

```sql
declare @headers varchar(max);
declare @delimiter char(1);
declare @createTable nvarchar(max);
declare @bulkInsert nvarchar(max);

select
    @headers = replace(substring(t.BulkColumn, 1,
        charindex(char(10), t.BulkColumn) - 1), char(13), ''),
    @delimiter = case when charindex(',', @headers) > 0 then ',' else '|' end,
    @createTable = 'create table dbo.StagingTable(' +
        replace(@headers, ',', ' varchar(100) null,') + ' varchar(100) null)',
    @bulkInsert = 'bulk insert dbo.StagingTable from ''E:\PipeLoad\AWL\DeNovoAH00231370_TU_Linked_OP.txt'' ' +
        'with (fieldterminator = ''' + @delimiter + ''', firstrow = 2);'
from openrowset(bulk 'E:\PipeLoad\AWL\DeNovoAH00231370_TU_Linked_OP.txt', single_clob) t;

select @headers Headers, @delimiter Delimiter,
    @createTable CreateTable, @bulkInsert BulkInsert;
```

Thank you.
red68 answered
[link text][1] [1]: /storage/temp/4357-test.txt

test.txt (154 B)

@red68 The script works with the attached file. The only way I found to reproduce the error is by NOT changing the assignment of the create table variable, i.e. replace(@headers, ',', ' varchar(100) null,') is wrong (as it originally was in my answer), while replace(@headers, @delimiter, ' varchar(100) null,') is correct.
Thanks, I will test it with this test file. There could be a record in the 11 million record file with an issue. Thanks for all the help.
It worked as well for me. Thanks again.
[link text][1] [1]: /storage/temp/4358-unix.txt
unix.txt (740 B)
@red68 This file comes from Unix, meaning that it does not have the normal CRLF line terminator but uses just LF instead. For Windows files there is no need to provide the rowterminator option to bulk insert, while for Unix files there is. To make it work for both, add the rowterminator option and set it to '0x0d0a' for Windows files or '0x0a' for files which come from Unix. Here is the relevant part of the script, which works for both once the @rowTerminator variable is declared as varchar(6) (I will not bother handling files from Mac, which use just CR to terminate the lines):

```sql
select
    -- other fields
    @rowTerminator = case
        when charindex(char(13) + char(10), t.BulkColumn) > 0 then '0x0d0a'
        else '0x0a' end,
    @bulkInsert = 'bulk insert dbo.StagingTable from ''c:\temp\4358-unix.txt'' with (
        fieldterminator = ''' + @delimiter + ''', firstrow = 2,
        rowterminator = ''' + @rowTerminator + ''');'
from openrowset(bulk 'c:\temp\4358-unix.txt', single_clob) t;
```
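The terminator detection described above can be sketched in Python as well (illustration only; the returned hex codes are the ones bulk insert expects for its rowterminator option):

```python
def detect_row_terminator(sample_bytes):
    """Return the bulk insert rowterminator hex code: CRLF (0x0d0a)
    for Windows files, LF (0x0a) for Unix files. Mac CR-only files
    are deliberately not handled, matching the comment above."""
    return "0x0d0a" if b"\r\n" in sample_bytes else "0x0a"

print(detect_row_terminator(b"a|b\nrow1\n"))  # Unix file → 0x0a
```

A sample of the first few KB of the file is enough to decide, since the header row alone contains the terminator.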
