To split a string into pieces is a quiet common task in T-SQL and there are many different solutions to handle this. So, why yet another one? This article should not only show yet another way to split strings, but it tries to compare all the different approaches. Before the competition starts, let me present the participants.
- WHILE Loop
- Numbers Table
- XML
- SQL CLR
First I want to show you the different approaches. At the end of the article I'll show you some performance comparisons.
WHILE Loop
If you search the internet for functions to split strings you will primary find solutions based on WHILE loops. Since this is the most common approach we should start with it.
How does the cursor work?
The common topology is to start at the first position of the string and search the first occurrence of the specified separator by using CHARINDEX. Now use SUBSTRING to get the first item from position 1 to the first found separator position; now find the next item starting at the previously found position and so on.
Other versions work with PATINDEX instead of CHARINDEX but this is much slower and not needed for the requirement.
This should be a quiet fast WHILE loop based solution.
WHILE Loop
If you search the internet for functions to split strings you will primary find solutions based on WHILE loops. Since this is the most common approach we should start with it.
How does the cursor work?
The common topology is to start at the first position of the string and search the first occurrence of the specified separator by using CHARINDEX. Now use SUBSTRING to get the first item from position 1 to the first found separator position; now find the next item starting at the previously found position and so on.
Other versions work with PATINDEX instead of CHARINDEX but this is much slower and not needed for the requirement.
This should be a quiet fast WHILE loop based solution.
IF (OBJECT_ID('tvf_SplitString_Cursor') IS NULL) EXECUTE('CREATE FUNCTION tvf_SplitString_Cursor() RETURNS @t TABLE (Id INT) AS BEGIN INSERT INTO @t VALUES(1) RETURN END'); GO ALTER FUNCTION tvf_SplitString_Cursor ( @text NVARCHAR(MAX), @separator NVARCHAR(255) ) RETURNS @ReturnData TABLE (Item NVARCHAR(4000)) AS BEGIN DECLARE @pos INT ,@next INT ,@separatorLen INT = LEN(@separator); -- Get the first occurence to get the item before the first delimiter SELECT @pos = CHARINDEX(@separator, @text, 1); -- Insert the first item INSERT INTO @ReturnData SELECT SUBSTRING(@text, 1, @pos - 1); -- Step over the first delimiter SELECT @pos = @pos + @separatorLen; WHILE (1 = 1) BEGIN -- Get the next delimiter position from our previous position SELECT @next = CHARINDEX(@separator, @text, @pos); IF (@next = 0) BREAK -- nothing more to do -- Insert the next found item INSERT INTO @ReturnData SELECT SUBSTRING(@text, @pos, @next - @pos); -- Step of the delimiter SELECT @pos = @next + @separatorLen; END -- Due to the fact that our cursor only jumps from delimter -- to delimiter the last item would be lost INSERT INTO @ReturnData SELECT SUBSTRING(@text, @pos, LEN(@text) - 1); RETURN; END
Numbers Table
A Numbers table is a table which contains incremental numbers from 1 to N, where N depends on your requirements to the table. This tables can be used for different solutions, as I showed here.
The following script is based on Jeff Moden's great article The "Numbers" or "Tally" Table: What it is and how it replaces a loop at SQLServerCentral.com. I don't want to say "It's my idea".
How can I use a Numbers table to split a string?
The general idea is to transform the horizontal string into a vertical table where each row contains one character of the source string. The WHERE clause is used to return only the specified separators. Whenever we reach a separator we use SUBSTRING to extract the item from the current position to the next occurrence of the separator. To determine the next position of the separator we use CHARINDEX, just like the WHILE loop based approach.
Here is a sample, how to use a Numbers table to split a string into pieces.
The following script is based on Jeff Moden's great article The "Numbers" or "Tally" Table: What it is and how it replaces a loop at SQLServerCentral.com. I don't want to say "It's my idea".
How can I use a Numbers table to split a string?
The general idea is to transform the horizontal string into a vertical table where each row contains one character of the source string. The WHERE clause is used to return only the specified separators. Whenever we reach a separator we use SUBSTRING to extract the item from the current position to the next occurrence of the separator. To determine the next position of the separator we use CHARINDEX, just like the WHILE loop based approach.
Here is a sample, how to use a Numbers table to split a string into pieces.
DECLARE @text VARCHAR(100) = ',a,b,c,' ,@separator CHAR(1) = ','; SELECT -- Extract the current part from specified string -- Start position is the current position within the numbers table -- plus the length of the separator -- The length is the next position of the separator minus the -- current position and the len of the separator SUBSTRING( @text ,t.Num + LEN(@separator) ,CHARINDEX(@separator, @text, t.Num + 1) - t.Num - LEN(@separator) ) AS Item FROM dbo.Numbers t WHERE -- Scan only until the length of the text - 1 -- to avoid wrong parameter for SUBSTRING t.Num < LEN(@text) -- The current item has to match to the specified delimiter AND SUBSTRING(@text, t.Num, LEN(@separator)) = @separator
If you want to use a Numbers table to split strings you have to keep two important things in mind. The first thing is, the source string always needs to be enclosed with a leading and trailing separator. This depends on the topology of the approach to get the items between the separators. As second you have to ensure that your Numbers table contains enough rows. The count of rows has to be at least the maximal length of the source strings plus two separators. A way to ensure a Numbers table with enough rows is to use a common table expression (CTE) in this case.
WITH n1 (Num) AS (SELECT 1 UNION ALL SELECT 1), -- 2 n2 (Num) AS (SELECT a.Num FROM n1 a, n1 b), -- 4 n3 (Num) AS (SELECT a.Num FROM n2 a, n2 b), -- 16 n4 (Num) AS (SELECT a.Num FROM n3 a, n3 b), -- 256 n5 (Num) AS (SELECT a.Num FROM n4 a, n4 b), -- 65536 -- 4.294.967.296 Numbers (Num) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM n5 a, n5 b) SELECT * FROM Numbers WHERE Num < 100000;This provides a Numbers table with up to 4.294.967.296 which fits for string with up to 4GB. XML
Some solutions to split a string are based on a conversion to XML to use XQuery to extract the items. Sounds strange? It is ;-).
How can I do that?
To split a string using XML first you have to convert the source string into a XML data type and replace the specified separator with XML tags. After that you can use XQuery to get the items within the string.
The following script shows how to work with XML to split strings.
DECLARE @text VARCHAR(100) = 'a,b,c' ,@separator CHAR(1) = ','; WITH cte (data) AS ( SELECT CONVERT(XML, '<y>' + REPLACE(@text, @separator, '</y><y>') + '</y>') ) SELECT T.C.value('.', 'nvarchar(2000)') FROM cte CROSS APPLY cte.data.nodes('y') T(C)
This method should work with most kind of data, but it runs into problems if the source string contains XML language elements like "<", ">" or "&".
SQL CLR
Now the (hopefully) last possibility to split strings into pieces is to use SQL CLR.
How to do?
Well, there are many different ways to split a string using .NET. The simplest way would be the build in function String.Split() which is available on every instance of a System.String. The most flexible way to split a string with .NET would be a regular expression. Nevertheless we use a non build-in raw function to handle that. Why?? String.Split() is definitely a cool function but we are speaking about a function for a Server and String.Split() causes the complete string twice in memory which can affect the performance when your server in on load. Same depends on regular expressions, which is way to hungry for resources and a functional overkill to just split a string.
We use a function which iterates through the complete string and searches for the specified separator. Whenever a separator is reached we use yield return to return the item between the last position of a separator and and the current position. The functionality is quiet alike the WHILE loop approach but enclosed within a compiled .NET assembly.
Here is the body of your function.
using System.Data.SqlTypes; using Microsoft.SqlServer.Server; using System.Collections; public partial class StoredProcedures { [Microsoft.SqlServer.Server.SqlFunction( FillRowMethodName = "FillRowSplitString", TableDefinition = "Item nvarchar(4000)", IsDeterministic = true, SystemDataAccess = SystemDataAccessKind.None ) ] public static IEnumerable SplitString( [SqlFacet(MaxSize = -1, IsFixedLength=false, IsNullable=false)] SqlString text, [SqlFacet(MaxSize = 1, IsFixedLength = true, IsNullable = false)] SqlString separator ) { // init variables char s = separator.Value[0]; string buffer = text.Value; int lastIndex = 0; int len = 0; int i; // loop through the string for (i = 0; i < buffer.Length; i++) { // check if the current character matches to the separator if (buffer[i] == s) { // return the item from last and current positions yield return buffer.Substring(lastIndex, len); lastIndex = i + 1; len = 0; } else { len++; } } // return the last item yield return buffer.Substring(lastIndex, buffer.Length - lastIndex); } // fill output row public static void FillRowSplitString( object obj, out SqlString item ) { item = (string)obj; } };Processing Performance Now, that we know all possible ways to split a string within T-SQL, let me compare the performance of the different approaches. Since the performance tests depend on the hardware, a short description of my test environment.
CPU | Intel Quad 2.4GHz Q6600 |
RAM | 8 GB |
Operating System | Windows Server 2008 x64 SP2 |
SQL Server | SQL Server 2008 x64 SP2 |
Network | localhost |
- Item Length: Is the length of the resulting items within the source text.
- Item count is the count of resulting items for each row.
- Row count is the count of source rows which have to be handled.
- Data type is the column type of the source table.
# | Item Length | Item Count | Row Count | Data Type | Method | Duration (ms) |
---|---|---|---|---|---|---|
1 | 10 | 50 | 1000 | NVARCHAR(4000) | ||
WHILE | 2850 | |||||
Numbers | 763 | |||||
CLR | 1246 | |||||
XML | 27273 | |||||
2 | 50 | 50 | 100 | NVARCHAR(4000) | ||
WHILE | 307 | |||||
Numbers | 213 | |||||
CLR | 163 | |||||
XML | 7190 | |||||
3 | 50 | 50 | 1000 | NVARCHAR(4000) | ||
WHILE | 3853 | |||||
Numbers | 2133 | |||||
CLR | 1593 | |||||
XML | 72116 | |||||
4 | 200 | 19 | 100 | NVARCHAR(4000) | ||
WHILE | 203 | |||||
Numbers | 293 | |||||
CLR | 80 | |||||
XML | 3850 | |||||
5 | 200 | 19 | 100 | NVARCHAR(MAX) | ||
WHILE | 206 | |||||
Numbers | 773 | |||||
CLR | 76 | |||||
XML | 4120 | |||||
6 | 10 | 50 | 1000 | VARCHAR(8000) | ||
WHILE | 2750 | |||||
Numbers | 643 | |||||
CLR | 1263 | |||||
XML | 16673 | |||||
7 | 50 | 50 | 100 | VARCHAR(8000) | ||
WHILE | 340 | |||||
Numbers | 180 | |||||
CLR | 163 | |||||
XML | 2723 | |||||
8 | 50 | 50 | 1000 | VARCHAR(8000) | ||
WHILE | 3086 | |||||
Numbers | 1726 | |||||
CLR | 1586 | |||||
XML | 71693 | |||||
9 | 200 | 19 | 100 | VARCHAR(8000) | ||
WHILE | 153 | |||||
Numbers | 230 | |||||
CLR | 80 | |||||
XML | 3850 | |||||
10 | 200 | 19 | 100 | NVARCHAR(MAX) | ||
WHILE | 156 | |||||
Numbers | 706 | |||||
CLR | 80 | |||||
XML | 4196 | |||||
11 | 10 | 50 | 10000 | VARCHAR(8000) | ||
WHILE | 27610 | |||||
Numbers | 6396 | |||||
CLR | 12726 | |||||
12 | 50 | 50 | 5000 | VARCHAR(8000) | ||
WHILE | 15466 | |||||
Numbers | 8596 | |||||
CLR | 8043 | |||||
13 | 50 | 50 | 5000 | VARCHAR(MAX) | ||
WHILE | 15363 | |||||
Numbers | 24456 | |||||
CLR | 8060 | |||||
14 | 1000 | 20 | 200 | VARCHAR(MAX) | ||
WHILE | 516 | |||||
Numbers | 18933 | |||||
CLR | 310 | |||||
15 | 1000 | 50 | 200 | VARCHAR(MAX) | ||
WHILE | 1176 | |||||
Numbers | 42420 | |||||
CLR | 800 |
Item Length | 50 |
Item Count | 50 |
Row Count | 500 |
Data Type | NVARCHAR(4000) |
Threads | Cursor | Numbers | CLR |
---|---|---|---|
1 | 1540 | 1068 | 779 |
3 | 3305 | 2259 | 1646 |
5 | 4384 | 2365 | 2246 |
10 | 11208 | 4745 | 5249 |
100 | 123839 | 37060 | 53512 |
We've seen four completely different ways to split strings within T-SQL.
XML appears to be the wrong technology to do the job. XML is a validated data type which has to created on the fly by using REPLACE which causes a complete copy of all data. XQuery is really fast if source data are already available in XML, but the conversion from text to XML seems to be way to expensive.
WHILE loops are the most common approach to handle this kind of job. They are okay for smaller strings or in cases where they are not needed so often. They are the wrong solution for high performance split routines.
The way to use a Numbers table for this kind of work sounds curious if you never did before. As you see, the Numbers table is a great tool once again. For text with small items this is the best performing solution. Especially the concurrency performance is really great. There are some issues though. You have to ensure the leading and trailing separator. If you have to split data from client side, which can add the separators before the data reach the server, everything is fine. If you have to work with data which are already stored in database - without being enclosed - you have to add the separators on the fly; in this case you may end up in same problem as the XML since all data have to be copied in memory. My tests showed me that the performance of this approach goes down if you encapsulate it into a table-valued function (even a single-statement table-valued function). The Numbers table split becomes slower with the length of result items within the source text.
There are quiet less tasks which should be done with SQL CLR since now. Anyway, splitting strings is one of the tasks .NET is optimized for. It performs great for any kind and count of data. Even the concurrency performance is not as perfect as the Numbers table split but even better than linear.