Saturday, September 26, 2009

High Performance String Split Functions

To split a string into pieces is a quiet common task in T-SQL and there are many different solutions to handle this. So, why yet another one? This article should not only show yet another way to split strings, but it tries to compare all the different approaches. Before the competition starts, let me present the participants.
  • WHILE Loop
  • Numbers Table
  • XML
  • SQL CLR

First I want to show you the different approaches. At the end of the article I'll show you some performance comparisons.

WHILE Loop
If you search the internet for functions to split strings you will primary find solutions based on WHILE loops. Since this is the most common approach we should start with it.

How does the cursor work?

The common topology is to start at the first position of the string and search the first occurrence of the specified separator by using CHARINDEX. Now use SUBSTRING to get the first item from position 1 to the first found separator position; now find the next item starting at the previously found position and so on.
Other versions work with PATINDEX instead of CHARINDEX but this is much slower and not needed for the requirement.

This should be a quiet fast WHILE loop based solution.
IF (OBJECT_ID('tvf_SplitString_Cursor') IS NULL)
   EXECUTE('CREATE FUNCTION tvf_SplitString_Cursor() 
            RETURNS @t TABLE (Id INT) AS 
            BEGIN INSERT INTO @t VALUES(1) 
            RETURN END');
GO
ALTER FUNCTION tvf_SplitString_Cursor
(
   @text NVARCHAR(MAX), 
   @separator NVARCHAR(255)
)
   RETURNS @ReturnData TABLE (Item NVARCHAR(4000))
AS
BEGIN
   DECLARE 
      @pos           INT
      ,@next         INT
      ,@separatorLen INT = LEN(@separator);
   

   -- Get the first occurence to get the item before the first delimiter
   SELECT @pos = CHARINDEX(@separator, @text, 1);
   
   -- Insert the first item
   INSERT INTO @ReturnData
      SELECT SUBSTRING(@text, 1, @pos - 1);
   
   -- Step over the first delimiter
   SELECT @pos = @pos + @separatorLen;

   WHILE (1 = 1)
   BEGIN
      -- Get the next delimiter position from our previous position
      SELECT @next = CHARINDEX(@separator, @text, @pos);
      
      IF (@next = 0) BREAK -- nothing more to do

      -- Insert the next found item
      INSERT INTO @ReturnData
         SELECT SUBSTRING(@text, @pos, @next - @pos);

      -- Step of the delimiter
      SELECT @pos = @next + @separatorLen;
   END

   -- Due to the fact that our cursor only jumps from delimter 
   -- to delimiter the last item would be lost
   INSERT INTO @ReturnData
      SELECT SUBSTRING(@text, @pos, LEN(@text) - 1);

   RETURN;
END

Numbers Table
A Numbers table is a table which contains incremental numbers from 1 to N, where N depends on your requirements to the table. This tables can be used for different solutions, as I showed here.

The following script is based on Jeff Moden's great article The "Numbers" or "Tally" Table: What it is and how it replaces a loop at SQLServerCentral.com. I don't want to say "It's my idea".

How can I use a Numbers table to split a string?

The general idea is to transform the horizontal string into a vertical table where each row contains one character of the source string. The WHERE clause is used to return only the specified separators. Whenever we reach a separator we use SUBSTRING to extract the item from the current position to the next occurrence of the separator. To determine the next position of the separator we use CHARINDEX, just like the WHILE loop based approach.

Here is a sample, how to use a Numbers table to split a string into pieces.
DECLARE 
   @text VARCHAR(100)      = ',a,b,c,'
   ,@separator CHAR(1)     = ',';

SELECT
   -- Extract the current part from specified string
   -- Start position is the current position within the numbers table 
   --    plus the length of the separator
   -- The length is the next position of the separator minus the
   --    current position and the len of the separator
   SUBSTRING(
      @text
      ,t.Num + LEN(@separator)
      ,CHARINDEX(@separator, @text, t.Num + 1) - t.Num - LEN(@separator)
      ) AS Item
FROM dbo.Numbers t
WHERE 
   -- Scan only until the length of the text - 1 
   -- to avoid wrong parameter for SUBSTRING
   t.Num < LEN(@text)
   -- The current item has to match to the specified delimiter
   AND SUBSTRING(@text, t.Num, LEN(@separator)) = @separator
If you want to use a Numbers table to split strings you have to keep two important things in mind. The first thing is, the source string always needs to be enclosed with a leading and trailing separator. This depends on the topology of the approach to get the items between the separators. As second you have to ensure that your Numbers table contains enough rows. The count of rows has to be at least the maximal length of the source strings plus two separators. A way to ensure a Numbers table with enough rows is to use a common table expression (CTE) in this case.
WITH 
n1 (Num) AS (SELECT 1 UNION ALL SELECT 1),   -- 2
n2 (Num) AS (SELECT a.Num FROM n1 a, n1 b),  -- 4
n3 (Num) AS (SELECT a.Num FROM n2 a, n2 b),  -- 16
n4 (Num) AS (SELECT a.Num FROM n3 a, n3 b),  -- 256
n5 (Num) AS (SELECT a.Num FROM n4 a, n4 b),  -- 65536
-- 4.294.967.296
Numbers (Num) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) 
                  FROM n5 a, n5 b)
SELECT
      *
   FROM Numbers
   WHERE Num < 100000;
This provides a Numbers table with up to 4.294.967.296 which fits for string with up to 4GB. XML
Some solutions to split a string are based on a conversion to XML to use XQuery to extract the items. Sounds strange? It is ;-). How can I do that? To split a string using XML first you have to convert the source string into a XML data type and replace the specified separator with XML tags. After that you can use XQuery to get the items within the string. The following script shows how to work with XML to split strings.
DECLARE 
   @text VARCHAR(100)      = 'a,b,c'
   ,@separator CHAR(1)     = ',';

WITH cte (data) AS
(
   SELECT CONVERT(XML, '<y>' 
                       + REPLACE(@text, @separator, '</y><y>') 
                       + '</y>')
)
SELECT
      T.C.value('.', 'nvarchar(2000)')
   FROM cte
      CROSS APPLY cte.data.nodes('y') T(C)
This method should work with most kind of data, but it runs into problems if the source string contains XML language elements like "<", ">" or "&".
SQL CLR
Now the (hopefully) last possibility to split strings into pieces is to use SQL CLR. How to do? Well, there are many different ways to split a string using .NET. The simplest way would be the build in function String.Split() which is available on every instance of a System.String. The most flexible way to split a string with .NET would be a regular expression. Nevertheless we use a non build-in raw function to handle that. Why?? String.Split() is definitely a cool function but we are speaking about a function for a Server and String.Split() causes the complete string twice in memory which can affect the performance when your server in on load. Same depends on regular expressions, which is way to hungry for resources and a functional overkill to just split a string. We use a function which iterates through the complete string and searches for the specified separator. Whenever a separator is reached we use yield return to return the item between the last position of a separator and and the current position. The functionality is quiet alike the WHILE loop approach but enclosed within a compiled .NET assembly. Here is the body of your function.
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Collections;


public partial class StoredProcedures
{
   [Microsoft.SqlServer.Server.SqlFunction(
      FillRowMethodName = "FillRowSplitString",
      TableDefinition = "Item nvarchar(4000)",
      IsDeterministic = true,
      SystemDataAccess = SystemDataAccessKind.None
      )
   ]
   public static IEnumerable SplitString(
      [SqlFacet(MaxSize = -1, IsFixedLength=false, IsNullable=false)]
      SqlString text,
      [SqlFacet(MaxSize = 1, IsFixedLength = true, IsNullable = false)]
      SqlString separator
      )
   {
      // init variables
      char s = separator.Value[0];
      string buffer = text.Value;
      int lastIndex = 0;
      int len = 0;
      int i;

      // loop through the string
      for (i = 0; i < buffer.Length; i++)
      {
         // check if the current character matches to the separator
         if (buffer[i] == s)
         {
            // return the item from last and current positions
            yield return buffer.Substring(lastIndex, len);
            lastIndex = i + 1;
            len = 0;
         }
         else
         {
            len++;
         }
      }

      // return the last item
      yield return buffer.Substring(lastIndex, buffer.Length - lastIndex);
   }

   // fill output row
   public static void FillRowSplitString(
      object obj,
      out SqlString item
      )
   {
      item = (string)obj;
   }
};
Processing Performance Now, that we know all possible ways to split a string within T-SQL, let me compare the performance of the different approaches. Since the performance tests depend on the hardware, a short description of my test environment.
CPUIntel Quad 2.4GHz Q6600
RAM8 GB
Operating SystemWindows Server 2008 x64 SP2
SQL ServerSQL Server 2008 x64 SP2
Networklocalhost
It migth be quiet boring to split "a,b,c" by "," and show you zero durations for all tests. So I worked with whole table strings. The columsn of the following table mean:
  • Item Length: Is the length of the resulting items within the source text.
  • Item count is the count of resulting items for each row.
  • Row count is the count of source rows which have to be handled.
  • Data type is the column type of the source table.
Here you can find a table with some of my test results.
# Item Length Item Count Row Count Data Type Method Duration (ms)
1 10 50 1000 NVARCHAR(4000)
WHILE2850
Numbers763
CLR1246
XML27273
2 50 50 100 NVARCHAR(4000)
WHILE307
Numbers213
CLR163
XML7190
3 50 50 1000 NVARCHAR(4000)
WHILE3853
Numbers2133
CLR1593
XML72116
4 200 19 100 NVARCHAR(4000)
WHILE203
Numbers293
CLR80
XML3850
5 200 19 100 NVARCHAR(MAX)
WHILE206
Numbers773
CLR76
XML4120
6 10 50 1000 VARCHAR(8000)
WHILE2750
Numbers643
CLR1263
XML16673
7 50 50 100 VARCHAR(8000)
WHILE340
Numbers180
CLR163
XML2723
8 50 50 1000 VARCHAR(8000)
WHILE3086
Numbers1726
CLR1586
XML71693
9 200 19 100 VARCHAR(8000)
WHILE153
Numbers230
CLR80
XML3850
10 200 19 100 NVARCHAR(MAX)
WHILE156
Numbers706
CLR80
XML4196
11 10 50 10000 VARCHAR(8000)
WHILE27610
Numbers6396
CLR12726
12 50 50 5000 VARCHAR(8000)
WHILE15466
Numbers8596
CLR8043
13 50 50 5000 VARCHAR(MAX)
WHILE15363
Numbers24456
CLR8060
14 1000 20 200 VARCHAR(MAX)
WHILE516
Numbers18933
CLR310
15 1000 50 200 VARCHAR(MAX)
WHILE1176
Numbers42420
CLR800
Note: To ensure fair tests, I changed the WHILE loop function from NVARCHAR to VARCHAR for test with non-unicode data types. Note: I kept the XML approach away from the last five test since these test reached a higher level of data. Concurrent Work Last but not least we have to look at the concurrency behavior. I did these tests with the following string configuration.
Item Length50
Item Count50
Row Count500
Data TypeNVARCHAR(4000)
And here are the test results.
Threads Cursor Numbers CLR
115401068779
3330522591646
5438423652246
101120847455249
1001238393706053512
Conclusion
We've seen four completely different ways to split strings within T-SQL. XML appears to be the wrong technology to do the job. XML is a validated data type which has to created on the fly by using REPLACE which causes a complete copy of all data. XQuery is really fast if source data are already available in XML, but the conversion from text to XML seems to be way to expensive. WHILE loops are the most common approach to handle this kind of job. They are okay for smaller strings or in cases where they are not needed so often. They are the wrong solution for high performance split routines. The way to use a Numbers table for this kind of work sounds curious if you never did before. As you see, the Numbers table is a great tool once again. For text with small items this is the best performing solution. Especially the concurrency performance is really great. There are some issues though. You have to ensure the leading and trailing separator. If you have to split data from client side, which can add the separators before the data reach the server, everything is fine. If you have to work with data which are already stored in database - without being enclosed - you have to add the separators on the fly; in this case you may end up in same problem as the XML since all data have to be copied in memory. My tests showed me that the performance of this approach goes down if you encapsulate it into a table-valued function (even a single-statement table-valued function). The Numbers table split becomes slower with the length of result items within the source text. There are quiet less tasks which should be done with SQL CLR since now. Anyway, splitting strings is one of the tasks .NET is optimized for. It performs great for any kind and count of data. Even the concurrency performance is not as perfect as the Numbers table split but even better than linear.

3 comments:

  1. Hi Flo,

    With a little help from some friends, there's a new "DelimitedSplit8K" function out that smokes the old one. The article for it is at the following url:

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    Good "seeing" you again. It's been a while.

    --Jeff Moden.

    ReplyDelete
  2. Hey Jeff

    Sorry for the late response. Just noticed that email notification for comments was turned of..

    Just read your article and it's another great resource man! Thanks for the URL, that's definitely a new entry in my favorites.
    (I have to ask this, feelin' dirty after using CLR? :-D)

    Flo

    ReplyDelete
  3. Now it's my turn to apologize for the late response. Couldn't type for that whole time... I was too busy washing my mouth out from talking about CLR. ;-D

    Actually, I thought SQLCLR was the best thing since sliced bread and round wheels. Still do, actually. It's what some people have done with them that's absolutely frightning. I had one fellow submit a CLR for me to promote to prod that did (drum roll please) a Modulus! Isn't that a brilliant use? ;-)

    Once again, good "seeing" you, ol' friend. I'll try not to wait 2 years (almost to the day) before I stop by again.

    ReplyDelete