Question

tanglesoft asked:

Loading XML in SQL Server

I have an 80MB XML file with a reasonably complex structure, e.g.:

    Root
      Header
      Member
        Payroll
          PayrollDetail
      Benefits
        Benefit
          BenefitElement
            Beneficiary

I want to load the XML into a replica schema that has exactly the same structure as the XML. Currently I am using `OPENROWSET(BULK 'file', SINGLE_BLOB)` to read the file into an XML-typed variable, then the `.value()` and `.nodes()` methods to shred the XML and load the target tables. This all works fine, but it is slow: the load takes about 10 minutes. Another issue is that it is sometimes hard to determine the parent of an element. Quite often a parent has no unique keys to identify it, so I have to go to the parent's parent or further up the hierarchy. Interested to know what other people's approach would be. I'd be happy to write this in C# if anyone thinks that would be faster or more maintainable. Cheers, Stuart
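For reference, the pattern described above looks roughly like the sketch below (file path, table, and element names are illustrative, not the poster's actual schema). Nesting `.nodes()` calls with `CROSS APPLY` is one way to keep each child row tied to its parent without hunting for natural keys, because each inner `.nodes()` is scoped to the row produced by the level above it:

```sql
-- Sketch only: object and element names are hypothetical.
DECLARE @xmlDoc XML;

-- Read the whole file into an XML variable in one shot.
SELECT @xmlDoc = CAST(BulkColumn AS XML)
FROM OPENROWSET(BULK 'C:\data\members.xml', SINGLE_BLOB) AS src;

-- Shred Member/Payroll/PayrollDetail while preserving the parent link:
-- each CROSS APPLY .nodes() stays scoped to the row above it, so every
-- PayrollDetail row is automatically associated with its own Payroll.
WITH XMLNAMESPACES (DEFAULT 'http://CQuence/DataHub/LNFF')
SELECT
    M.node.value('(MemberId/text())[1]',  'INT')           AS MemberId,
    P.node.value('(PayrollId/text())[1]', 'INT')           AS PayrollId,
    D.node.value('(Amount/text())[1]',    'DECIMAL(18,2)') AS Amount
FROM @xmlDoc.nodes('/Root/Member')        AS M(node)
CROSS APPLY M.node.nodes('Payroll')       AS P(node)
CROSS APPLY P.node.nodes('PayrollDetail') AS D(node);
```

Note that shredding large documents this way is known to be CPU-intensive, which matches the 10-minute load time reported above.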
sql-server · xml

Valentino Vranken answered:
I hope I understood your question correctly: in the end, you want to load tables based on data from the XML, right? In that case I would give SSIS in combination with XSLT a try, using the method I explained here: [Loading Complex XML Using SSIS][1]. I have to admit I haven't tested this method with large files, but I think 80MB should be feasible. If you happen to give this a test, keep us in the loop. Would be interesting to know!

[1]: http://blog.hoegaerden.be/2011/04/20/loading-complex-xml-using-ssis/

tanglesoft answered:
Thanks Valentino. The performance of the T-SQL approach gets worse still: my target schema has datetime and numeric columns which, when NULL, are represented as empty nodes in the XML, so I'm using CASE expressions like this:

    CASE
        WHEN R.ref.value('declare namespace NS="http://CQuence/DataHub/LNFF";
            NS:Benefits[1]/NS:Benefit[1]/NS:BenefitElement[1]/NS:ElementVersionEndDate[1]', 'VARCHAR(20)') = ''
        THEN NULL
        ELSE CONVERT(DATETIME, R.ref.value('declare namespace NS="http://CQuence/DataHub/LNFF";
            NS:Benefits[1]/NS:Benefit[1]/NS:BenefitElement[1]/NS:ElementVersionEndDate[1]', 'VARCHAR(20)'), 103)
    END AS ElementVersionEndDate,

This destroys the performance. One alternative is to create a parallel set of staging tables with varchar columns, load the XML into those, and carry out the type conversion from there. However, I like your idea of using SSIS, so I am going to investigate that first. My first step will be to write an XSLT that converts the source XML to CSV. I will let you know how the performance compares.
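A possible simplification of the CASE pattern above, sketched against the same namespace but otherwise untested against the poster's schema: `NULLIF` turns the empty string into NULL before the `CONVERT`, so each column needs only one `.value()` call instead of two.

```sql
-- Sketch: @xmlDoc and the row axis '...' are placeholders for the
-- poster's actual variable and XPath; only the column expression matters.
SELECT
    CONVERT(DATETIME,
            NULLIF(R.ref.value('declare namespace NS="http://CQuence/DataHub/LNFF";
                NS:Benefits[1]/NS:Benefit[1]/NS:BenefitElement[1]/NS:ElementVersionEndDate[1]',
                'VARCHAR(20)'), ''),   -- empty node becomes NULL here
            103) AS ElementVersionEndDate
FROM @xmlDoc.nodes('...') AS R(ref);
```

Halving the number of `.value()` calls per nullable column may help, since each call re-evaluates its XQuery expression, but it will not change the fundamental cost of shredding the document.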
1 comment

Valentino Vranken commented:
Nice, looking forward to the result of your test :)
tanglesoft answered:
After 5 days of development I have completed the core loader code, using SSIS rather than T-SQL to load the XML. In summary, the 21-minute load time was reduced to 48 seconds.

In fairness the end solution cheats a bit along the way, but the requirement at the start was to load 80MB of XML containing 500,000+ elements. The approach I employed is to use the XML Task to merge the 5 source documents into a single document. The result is then fed into an XML Source, which splits the XML document out into delimited files based on the types defined in the source document. In some cases 3rd- and 4th-order elements are merged with their parents so that the data can be restructured later on. The extraction of the various XML element files runs in parallel. Each file is then loaded into SQL Server, which is also where type conversion occurs; again this runs in parallel.

The solution is far easier to maintain than the original T-SQL model and an order of magnitude quicker. It may be possible to eliminate the intermediate file stage, but I was concerned about the memory usage.
2 comments

Valentino Vranken commented:
So your XMLTask is using "Merge" as OperationType, did I get that right? Interesting.
tanglesoft commented:
The XML we receive from a third party is split into multiple XML files with 5000 customers in each. Each file is well formed in its own right. Initially I processed them individually, but ran into a problem with the autogenerated identifier that SSIS creates for each element. The identifier is only unique on a per-file basis, so I had the choice of carrying another key through into the load files or using the merge option on the XML Task.
