x

Has anyone tried variations on 8K Splitter?

Hi,

I am in a situation, where I am using the famous [8K Splitter function][1] developed by well-known, well-regarded Mr. Jeff Moden. Often it is referred on this forum as well. So, I believe I must share my findings and seek valuable suggestions/corrections from this great community. To start with, the original function mentioned in the article has been further optimized and is called “[DelimitedSplit8K_T1]” (BTW, we are using it currently). It is available in the attachment section of the aforementioned link. But for ready reference I am posting it here

CREATE FUNCTION [dbo].[DelimitedSplit8K_T1]
--===== Define I/O parameters
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
 RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
     -- enough to cover VARCHAR(8000)
  WITH E1(N) AS (
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                ),                          --10E+1 or 10 rows
       E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
       E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
 cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                     -- for both a performance gain and prevention of accidental "overruns"
                 SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                ),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
                 SELECT 1 UNION ALL -- does away with 0 base CTE, and the OR condition in one go!
                 SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
                ),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
                 SELECT s.N1,
                        ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
                   FROM cteStart s
                )
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
 SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
        Item       = SUBSTRING(@pString, l.N1, l.L1)
   FROM cteLen l

;
GO

So far I have come up with some modified versions but 2 of them are performing better than the DelimitedSplit8K_T1. But those would need some testing and there could be more room for improvement. I am posting them here. I have used some comments to give the main idea and differences.

/*===== USES BINARY COLLATION TO SPEED UP THE STRING COMPARISON=======*/
CREATE FUNCTION [dbo].[DelimitedSplit8K_T1_With_Binary_Collation]
--===== Define I/O parameters
        (@pString VARCHAR(8000) , @pDelimiter VARCHAR(1)/*===USE OF VARCHAR TO AVOID IMPLICIT CONVERSION WHILE COMPARING====*/

)
RETURNS TABLE WITH SCHEMABINDING AS
 RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
     -- enough to cover VARCHAR(8000)
  WITH E1(N) AS (
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                ),                          --10E+1 or 10 rows
       E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
       E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
 cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                     -- for both a performance gain and prevention of accidental "overruns"
                 SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                ),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
                 SELECT 1 UNION ALL -- does away with 0 base CTE, and the OR condition in one go!
                 SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString ,t.N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN
                ),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
                 SELECT s.N1,
                        ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN,s.N1),0)-s.N1,8000)
                 FROM cteStart s
                )
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
 SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
        Item       = SUBSTRING(@pString , l.N1, l.L1)
   FROM cteLen l
;
GO


/*====SO FAR SEEMS TO BE THE BEST PERFORMER===========================*/
/*===== USES BINARY COLLATION TO SPEED UP THE STRING COMPARISON=======*/
/*===== FIRST DELIMITED ITEM HANDLED SEPERATELY=======================*/
CREATE FUNCTION [dbo].[DelimitedSplit8K_T1_First_Item_Handling_Seperately]
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter VARCHAR(1)/*===MADE IT VARCHAR TO AVOID IMPLICIT CONVERSION WHILE COMPARING====*/
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
     -- enough to cover VARCHAR(8000)
  WITH E1(N) AS (
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                ),                          --10E+1 or 10 rows
       E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
       E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
 cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                     -- for both a performance gain and prevention of accidental "overruns"
                 SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                ),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
                 /*====FIRST ITEM HANDLED AT THE END======*/
                 --SELECT 1 UNION ALL -- does away with 0 base CTE, and the OR condition in one go!
                 /*=======================================*/
                 SELECT N+1
           FROM cteTally  
           WHERE SUBSTRING(@pString ,N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN
                ),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
                 SELECT N1,
                        ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN,N1),0) - N1,8000)
                   FROM cteStart 
                )
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
/*====FIRST ITEM HANDLING===============*/
SELECT ItemNumber = 0, Item = SUBSTRING(@pString , 1, ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN),0)-1, 8000)) 
UNION ALL  
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY N1),
        Item      = SUBSTRING(@pString , N1, L1)
   FROM cteLen
;
GO

Any suggestions,corrections,optimization would be highly appreciated. I am using the same performance tests available in the above said article. Thanks in advance.

[1]: http://www.sqlservercentral.com/articles/Tally+Table/72993/
more ▼

asked Apr 26, 2012 at 07:25 AM in Default

Usman Butt gravatar image

Usman Butt
13.9k 6 8 14

@Usman There are 5 pages worth of pretty intersting discussion on the main site [starting from here][1]. Jeff Moden's 8K splitter, my attempt on it with XML and Pavel Pawlowski's CLR solution are pretty thoroughly examined there and Pavel's CLR function is a clear winner.

[1]: http://www.sqlservercentral.com/Forums/Topic943562-338-2.aspx
Apr 26, 2012 at 02:50 PM Oleg
Many thanks Oleg for pointing to a very informative discussion. And sorry for the delay.
May 23, 2012 at 10:02 AM Usman Butt
@Oleg I forgot to let know that CLR is not permissible, so had to stick with TSQL solution.
May 24, 2012 at 04:56 AM Usman Butt
Since my version is faster and I have not find any bugs yet. Marking my answer as accepted. Last day for the bounty as well ;)
May 31, 2012 at 02:24 PM Usman Butt
(comments are locked)
10|1200 characters needed characters left

2 answers: sort voted first

We have managed to complete the 10000 rows, 1133 elements i.e. near to 8k characters performance test (described in the above-mentioned pointer) in 19 secs approximately. Which is quite a drastic improvement for us.

The major change was the conversion of tally table column's data type from BIGINT to INT. Another change (which shows marginally increased performance) was to add pre-calculated column which have the value for N + 1.

Following performance test includes the two versions we have come up with

USE [tempdb]
GO
SET NOCOUNT ON;
GO
IF OBJECT_ID('dbo.CsvTest') IS NOT NULL
DROP TABLE CsvTest
GO

DECLARE @NumberOfElements INT,
       @NumberOfRows INT


/*======== PARAMETER VALUES ==============================*/
SELECT @NumberOfElements    = 1133
,      @NumberOfRows   = 10000
/*========================================================*/


SELECT TOP (@NumberOfRows) --Controls the number of rows in the test table        
    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
    (
        SELECT CAST(STUFF( --=== STUFF get's rid of the leading comma                
                ( --=== This builds CSV row with a leading comma                 
                SELECT TOP (@NumberOfElements) --Controls the number of CSV elements in each row                        
                    ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                FROM sys.All_Columns t3      --Classic cross join pseudo-cursor                  
                CROSS JOIN sys.All_Columns t4 --can produce row sets up 121 million.                 
                WHERE t1.Object_ID <> t3.Object_ID --Without this line, all rows would be the same                    
                FOR XML PATH('')
                )
                ,1,1,'') AS VARCHAR(8000))
                ) AS Csv
INTO CsvTest
FROM sys.All_Columns t1      --Classic cross join pseudo-cursor
   CROSS JOIN sys.All_Columns t2 --can produce row sets up 16 million rows

GO
ALTER TABLE CsvTest
    ADD PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100;
    GO
     IF OBJECT_ID('dbo.Tally','U') IS NOT NULL
        DROP TABLE dbo.Tally;

--===== Create and populate the Tally table on the fly.
     -- This ISNULL function makes the column NOT NULL
 SELECT TOP 11000
        IDENTITY(INT,1,1) AS N, 
        ISNULL(ISNULL(CAST(ROW_NUMBER() OVER ( ORDER BY (SELECT NULL)) AS INT),0)+ 1, 0) N1
   INTO dbo.Tally
   FROM sys.all_columns ac1
  CROSS JOIN sys.all_columns ac2
;
GO

--===== Add a CLUSTERED Primary Key to maximize performance
  ALTER TABLE dbo.Tally
    ADD CONSTRAINT PK_Tally_N 
        PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
;
--===== Update Stats
UPDATE STATISTICS dbo.Tally WITH FULLSCAN
;
GO

--=====================================================================================================================
--      Conditionally drop and recreate the Split8KTally_With_Column_N1 splitter function.
--=====================================================================================================================
IF OBJECT_ID(N'dbo.Split8KTally_With_Column_N1', N'IF') IS NOT NULL
DROP FUNCTION dbo.Split8KTally_With_Column_N1;
GO
--===== Tally Table (Split8KTally_With_Column_N1 ITVF) ===============================================================================
/*===== THIS FUNCTION USES COLUMN "N1" OF TALLY TABLE HAVING PRE-CALCULATED VALUES FOR N+1 ===========*/
CREATE FUNCTION dbo.Split8KTally_With_Column_N1 (
    @pString VARCHAR(8000), @pDelimiter VARCHAR(1)
)
RETURNS TABLE 
AS
RETURN
    /*======= FIND THE FIRST DELIMITED ITEM EXPLICITLY, AS WE SKIPPED THE FIRST ITEM TO AVOID "OR" CONDITION IN THE WHERE CLAUSE=====*/
    SELECT ItemNumber = CAST(0 AS BIGINT) --STARTING WITH ZERO FOR SORT AND TO AVOID ROW_NUMBER + 1 LATER
    ,       ItemValue = SUBSTRING(@pString, 1,
                             ISNULL(NULLIF(CHARINDEX(@pDelimiter,
                                          @pString COLLATE Latin1_General_BIN), 0)
                                    - 1, 8000 ))

            UNION ALL   -- CONCATENATE THE OUTPUTS
    /*======= FIND REST OF THE DELIMITED ITEMS USING TALLY TABLE  ======================================*/
            SELECT ItemNumber = ROW_NUMBER() OVER ( ORDER BY [N] )
    ,       ItemValue = SUBSTRING(@pString, [N1],
                             ISNULL(NULLIF(CHARINDEX(@pDelimiter ,
                                          @pString COLLATE Latin1_General_BIN, [N1]), 0)
                                    - [N1], 8000) )
            FROM     [dbo].[Tally] WITH ( NOLOCK ) --NOLOCK HINT IS NOT NECESSARY
            WHERE    [N] BETWEEN 1 AND LEN(@pString) 
            AND      SUBSTRING(@pString, [N], 1)  = @pDelimiter  

GO
--=====================================================================================================================
--      Conditionally drop and recreate the Split8KTally_With_N_PLUS_1 splitter function.
--=====================================================================================================================
IF OBJECT_ID('dbo.Split8KTally_With_N_PLUS_1','IF') IS NOT NULL
DROP FUNCTION [dbo].Split8KTally_With_N_PLUS_1
GO
--===== Tally Table (Split8KTally_With_N_PLUS_1 ITVF) ===============================================================================
/*===== THIS FUNCTION USES RUNTIME CALCULATION "N + 1" INSTEAD OF USING THE PRE-CALCULATED COLUMN "N1" OF THE TALLY TABLE===========*/
CREATE FUNCTION dbo.Split8KTally_With_N_PLUS_1 (
    @pString VARCHAR(8000), @pDelimiter VARCHAR(1)
)
RETURNS TABLE 
AS
RETURN

    /*======= FIND THE FIRST DELIMITED ITEM EXPLICITLY, AS WE SKIPPED THE FIRST ITEM TO AVOID "OR" CONDITION IN THE WHERE CLAUSE=====*/
    SELECT ItemNumber = CAST(0 AS BIGINT) --STARTING WITH ZERO FOR SORT AND TO AVOID ROW_NUMBER + 1 LATER
    ,       ItemValue = SUBSTRING(@pString, 1,
                             ISNULL(NULLIF(CHARINDEX(@pDelimiter,
                                          @pString COLLATE Latin1_General_BIN), 0)
                                    - 1, 8000 ))

            UNION ALL -- CONCATENATE THE OUTPUTS
    /*======= FIND REST OF THE DELIMITED ITEMS USING TALLY TABLE  ======================================*/
            SELECT ItemNumber = ROW_NUMBER() OVER ( ORDER BY [N] )
    ,       ItemValue = SUBSTRING(@pString, (N+1),
                             ISNULL(NULLIF(CHARINDEX(@pDelimiter ,
                                          @pString COLLATE Latin1_General_BIN, (N+1)), 0)
                                    - (N+1), 8000) )
            FROM     [dbo].[Tally] WITH ( NOLOCK )--NOLOCK HINT IS NOT NECESSARY
            WHERE    [N] BETWEEN 1 AND LEN(@pString) 
            AND      SUBSTRING(@pString, [N], 1)  = @pDelimiter  

GO

--Tally Test
PRINT '/*====== dbo.Split8KTally_With_N_PLUS_1 =================*/'
DBCC FREEPROCCACHE WITH NO_INFOMSGS
SET STATISTICS TIME ON
DECLARE @ItemNumber BIGINT
,   @Item VARCHAR(8000) ;
SELECT  @ItemNumber = V.ItemNumber
,       @Item = V.ItemValue
FROM    dbo.CsvTest D
        CROSS APPLY dbo.Split8KTally_With_N_PLUS_1(D.Csv, ',') V
SET STATISTICS TIME OFF
PRINT '/*========================================================*/'
PRINT CHAR(10) + CHAR(13)
GO
PRINT '/*====== dbo.Split8KTally_With_Column_N1 =================*/'
DBCC FREEPROCCACHE WITH NO_INFOMSGS
SET STATISTICS TIME ON
DECLARE @ItemNumber BIGINT
,   @Item VARCHAR(8000) ;
SELECT  @ItemNumber = V.ItemNumber
,       @Item = V.ItemValue
FROM    dbo.CsvTest D
        CROSS APPLY dbo.Split8KTally_With_Column_N1(D.Csv, ',') V
SET STATISTICS TIME OFF
PRINT '/*========================================================*/'
PRINT CHAR(10) + CHAR(13)
GO
more ▼

answered May 23, 2012 at 10:18 AM

Usman Butt gravatar image

Usman Butt
13.9k 6 8 14

Since my version is faster and I have not find any bugs yet. Marking my answer as accepted ;)
May 31, 2012 at 02:25 PM Usman Butt
Let me be the first one to upvote it :)
May 31, 2012 at 02:29 PM Oleg
@Oleg Thanks. I know you are quite busy these days (read it on another thread), but I will post the VARCHAR(MAX) testing results as well. Your xml solution and RCTE solution (surprisingly) worked really well. ;)
Jun 01, 2012 at 05:58 AM Usman Butt
(comments are locked)
10|1200 characters needed characters left

As I said, we did some testing on VARCHAR(MAX) versions as well. As tally table solution does not behave well with wider strings. Two of them performed really well

  1. Inline Recursive CTE solution (Split_RCTE, Tweaked a little bit for better performance)
  2. TVF XML solution by Oleg (Split_XML_Solution_By_Oleg)

But surprisingly, RCTE was winner on some occasions with good margin as compared to losing. If I do the same 10000 rows, 1133 elements with 125-150 characters width, RCTE beat the XML solution marginally. But if the numbers of elements are decreased to 500, the RCTE was almost twice as fast as XML. But I would not count out any of the solutions, as they could fit according the requirement and environment.

For RCTE solution, I must say it is more resource intensive. The index spool and SORT operators in the execution plan clearly indicates that it would hit the memory, tempdb and processors hard. It also reflects that with more CPUs, memory and capacity planned tempdb, the solution is viable to a better performance as compared to other solutions. Having said that, I still may tilt towards the XML solution.

To me, why the RCTE performs much better than the tally table solution (both used the Charindex and Substring), is because the tally table solution compare each character with the delimiter, and due to the Out of Row phenomenon, this does not scale well. Whereas, RCTE solution does it for only the required number of times. Moreover, till VARCHAR(8000), the

SUBSTRING(@pString, N,1) = @pDelimiter

is dealt as a predicate in addition to the Seek predicate. But for VARCHAR(MAX), this is divided into two steps i.e. A Seek predicate followed by a Filter predicate which decreases the performance quite a bit.

Now the performance test for RCTE and XML solutions

USE [tempdb]
GO
SET NOCOUNT ON;
GO
IF OBJECT_ID(N'dbo.iFunction', N'V') IS NOT NULL
DROP VIEW iFunction
GO

SET QUOTED_IDENTIFIER ON
SET ANSI_NULLS ON
GO
CREATE VIEW dbo.iFunction AS 
/**********************************************************************************************************************
 Purpose:
 This view is callable from UDF's which allows us to indirectly get a NEWID() within a function where we can't do such
 a thing directly in the function.  This view also solves the same problem for GETDATE().

 Usage:
 SELECT MyNewID FROM dbo.iFunction; --Returns a GUID
 SELECT MyDate  FROM dbo.iFunction; --Returns a Date


 Revision History:
 Rev 00 - 06 Jun 2004 - Jeff Moden - Initial creation
 Rev 01 - 06 Mar 2011 - Jeff Moden - Formalize code.  No logic changes. 
**********************************************************************************************************************/
SELECT MyNewID = NEWID(),
       MyDate  = GETDATE();
GO

IF OBJECT_ID(N'dbo.CsvTest', N'U') IS NOT NULL
DROP TABLE CsvTest
GO

DECLARE @MaxElementWidth INT, 
 @MinElementWidth INT, 
 @NumberOfElements INT,
 @NumberOfRows INT


/*======== PARAMETER VALUES ==============================*/
SELECT @MaxElementWidth = 150 
,    @MinElementWidth = 125 
,    @NumberOfElements = 500
,    @NumberOfRows = 10000
/*========================================================*/


SELECT TOP (@NumberOfRows) --Controls the number of rows in the test table
        ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
        CSV =
        (--==== This creates each CSV
         SELECT CAST(
                    STUFF( --=== STUFF get's rid of the leading comma
                         ( --=== This builds CSV row with a leading comma
                          SELECT TOP (@NumberOfElements) --Controls the number of CSV elements in each row
                                 ','
                               + LEFT(--==== Builds random length variable within element width constraints
                                      LEFT(REPLICATE('1234567890',CEILING(@MaxElementWidth/10.0)), @MaxElementWidth),
                                      ABS(CHECKSUM((SELECT MyNewID FROM dbo.iFunction)))
                                            % (@MaxElementWidth - @MinElementWidth + 1) + @MinElementWidth
                                     )
                            FROM sys.All_Columns ac3            --Classic cross join pseudo-cursor
                           CROSS JOIN sys.All_Columns ac4       --can produce row sets up 16 million.
                           WHERE ac3.Object_ID <> ac1.Object_ID --Without this line, all rows would be the same.
                             FOR XML PATH('')
                         )
                    ,1,1,'')
                AS VARCHAR(MAX))
        )
   INTO CsvTest
   FROM sys.All_Columns ac1      --Classic cross join pseudo-cursor
   CROSS JOIN sys.All_Columns ac2 --can produce row sets up 16 million rows

GO
PRINT '/*====== dbo.CSVTest Population completed ================*/'
PRINT CHAR(10) + CHAR(13)
GO
ALTER TABLE CsvTest
    ADD PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100;
    GO

IF OBJECT_ID(N'dbo.Split_RCTE', N'IF') IS NOT NULL
DROP FUNCTION dbo.Split_RCTE
GO
SET QUOTED_IDENTIFIER ON
SET ANSI_NULLS ON
GO
--Create Split_RCTE function VARCHAR(MAX) version
CREATE FUNCTION dbo.Split_RCTE
    (
     @pString VARCHAR(MAX)
    ,@pDelimiter VARCHAR(1)
    )
RETURNS TABLE
    WITH SCHEMABINDING
AS
RETURN
    WITH    cteSplit
              AS ( SELECT   StartPosition = 0
                   ,        EndPosition = CONVERT(INT, CHARINDEX(@pDelimiter,
                                                              @pString COLLATE Latin1_General_BIN))
                   UNION ALL
                   SELECT   StartPosition = EndPosition + 1
                   ,        EndPosition = CONVERT(INT, CHARINDEX(@pDelimiter,
                                                   @pString COLLATE Latin1_General_BIN,
                                                   EndPosition + 1))
                   FROM     cteSplit
                   WHERE    EndPosition > 0
                 )
    SELECT  [ItemNumber] = ROW_NUMBER() OVER ( ORDER BY StartPosition )
    ,       SUBSTRING(@pString, StartPosition,
   CASE EndPosition
 WHEN 0 THEN CONVERT(INT, LEN(@pString)) + 1
 ELSE EndPosition - StartPosition
   END) ItemValue
    FROM    cteSplit
GO

IF OBJECT_ID(N'dbo.Split_XML_Solution_By_Oleg', N'TF') IS NOT NULL
DROP FUNCTION dbo.Split_XML_Solution_By_Oleg
GO

--Create Split_XML function VARCHAR(MAX) version

SET QUOTED_IDENTIFIER ON
SET ANSI_NULLS ON
GO
CREATE FUNCTION dbo.Split_XML_Solution_By_Oleg
    (
     @Parameter VARCHAR(MAX)
    ,@Delimiter VARCHAR(1)
    )
RETURNS @Result TABLE
    (
     ItemNumber INT
    ,ItemValue VARCHAR(MAX)
    )
AS 
    BEGIN
        DECLARE @XML XML ;
        SET @Parameter = ( SELECT   @Parameter
                         FOR XML PATH('')
                         ) ;
        SELECT  @XML = '<r>' + REPLACE(@Parameter, @Delimiter, '</r><r>') + '</r>' ;

        INSERT  INTO @Result
                (
                 ItemNumber
                ,ItemValue
                )
                SELECT  ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL) ) AS ItemNumber
                ,       Item.value('text()[1]', 'VARCHAR(MAX)') AS ItemValue
                FROM    @XML.nodes('//r') R ( Item ) ;
        RETURN ;
    END ;
GO
PRINT '/*====== dbo.Split_RCTE ==================================*/'
DBCC FREEPROCCACHE WITH NO_INFOMSGS
SET STATISTICS TIME ON
DECLARE @ItemNumber BIGINT
,   @Item VARCHAR(MAX) ;
SELECT  @ItemNumber = V.ItemNumber
,       @Item = V.ItemValue
FROM    dbo.CsvTest D
        CROSS APPLY dbo.Split_RCTE(D.Csv, ',') V
OPTION  ( MAXRECURSION 0 )
SET STATISTICS TIME OFF
PRINT '/*========================================================*/'
PRINT CHAR(10) + CHAR(13)
GO
PRINT '/*====== dbo.Split_XML ===================================*/'
DBCC FREEPROCCACHE WITH NO_INFOMSGS
SET STATISTICS TIME ON
DECLARE @ItemNumber BIGINT
,   @Item VARCHAR(MAX) ;
SELECT  @ItemNumber = V.ItemNumber
,       @Item = V.ItemValue
FROM    dbo.CsvTest D
        CROSS APPLY dbo.Split_XML_Solution_By_Oleg(D.Csv, ',') V
SET STATISTICS TIME OFF
PRINT '/*========================================================*/'
PRINT CHAR(10) + CHAR(13)
GO
As always, everyone's mileage may differ. Please note that testing was done on SQL Server 2005. For SQL 2008, the conversion to INT may not be needed in RCTE solution.
more ▼

answered Jun 08, 2012 at 11:15 AM

Usman Butt gravatar image

Usman Butt
13.9k 6 8 14

@Usman Thank you for bringing my function back to life. @Pavel Pawlowski's CLR solution is probably the best for this specific problem but if wheeling out a CLR is not an option, it is nice to explore other alternatives. My mileage did vary on the box I use here at work (4 cores, 8 GB of RAM, Windows 7 64 bit). For some reason, the xml solution beat the RCTE, maybe my box is tired :) :

/*== dbo.Split_XML ==*/

 SQL Server Execution Times:
   CPU time = 118857 ms,  elapsed time = 120583 ms.
  
/*== dbo.Split_RCTE ==*/
  
 SQL Server Execution Times:
   CPU time = 134660 ms,  elapsed time = 138365 ms.
  

I also checked my inline TVF xml function, but really sucks, despite the fact that it is inline:

create function dbo.udf_SplitXmlInlineTvf
(
    @Parameter varchar(max), @Delimiter char(1)
)
returns table with schemabinding as

return
    select cast(row_number() over (order by 
        (select null)) as int) ItemNumber,
        R.Item.value('text()[1]', 'varchar(max)') ItemValue
    from (select cast(''+replace(@Parameter, 
        @Delimiter, '')+'' as xml)) X(N)
    cross apply N.nodes('//r') R(Item);
go
I guess the cross apply of the cross apply is not a good way to go.
Jun 08, 2012 at 03:22 PM Oleg
@Oleg I totally agree that CLR is the best option. Moreover, thanks for sharing the test results. As I said earlier, everyone's mileage may differ ;) But what version of SQL Server you tested upon? What MAXDOP setting you had?
Jun 11, 2012 at 07:33 AM Usman Butt

Nicely done. It would be nice to see what the results from your runs actually were because I had to stop the rCTE solution after more than 3 minutes.

Also, the collation trick didn't help me at all because I use the default collation. It’s still a good idea, though.

Last but not least, the real key to the performance improvement was switching from the Itzik-style cteTally to a real Tally Table. The DelimitedSplit8K-T1 function runs in the same amount of time as your variations when it's changed to a Tally Table. Don’t forget to add the length check in when you do.

The CLR is still going to beat it but we're not doing bad for a T-SQL only solution.,Nicely done. It would be nice to see what the results from your runs actually were because I had to stop the rCTE solution after more than 3 minutes. Also, the collation trick didn't help me at all because I use the default collation. Last but not least, the real key to the spead improvement was switiching from the Itzik-style cteTally to a real Tally Table. The DelimitedSplit8K function runs in the same amout of time as your variations when it's changed to a Tally Table.
Jun 16, 2012 at 04:31 AM Jeff Moden

@Usman Here is what I used:

OS: Windows 7 SP1
**SQL Server**: 2008 R2 SP1, 10.50.2500.0
**Processor**: Intel Q9650 3 GHz, 4 cores, 4 logical processors
**RAM**: 8 GB, both memory and maxdop are set to "all it can eat" for SQL Server
Jun 19, 2012 at 04:21 PM Oleg

Just noticed a bug in dbo.Split_XML_Solution_By_Oleg : the parameter @Delimiter is declared but never used

The line

SELECT @XML = '' + REPLACE(@Parameter, ',', '') + '' ;

should read

SELECT @XML = '' + REPLACE(@Parameter, @Delimiter, '') + '' ;
Jul 04, 2012 at 03:19 PM Kev Riley ♦♦
(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

New code box

There's a new way to format code on the site - the red speech bubble logo will automatically format T-SQL for you. The original code box is still there for XML, etc. More details here.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

SQL Server Central

Need long-form SQL discussion? SQLserverCentral.com is the place.

Topics:

x343
x9
x1

asked: Apr 26, 2012 at 07:25 AM

Seen: 1769 times

Last Updated: Jul 05, 2012 at 11:35 AM