Tuesday, March 13, 2012

Performance Comparison of SQL Server Bulk Select Operations

After a few mainly architecture-related posts, it's time to get back to some more technical stuff.

Sometimes we face the requirement to match a large list of entities, coming from a file or some other data exchange interface, against our database.

Possible scenarios include loading data from an OLTP database into a data warehouse, synchronizing data between databases in a distributed environment, or receiving master data from an external system (like customer or product information).

Whenever we need to import data from another source, there are two kinds of scenarios: either we only need to insert new rows, or we also need to merge updated information into existing rows. In this post we will focus on the second scenario, especially on how to select large amounts of data by specific criteria.

While selecting single rows by one or more criteria is straightforward, it can become a performance bottleneck if we need to load hundreds or thousands of rows. In this post we look at different ways to implement such bulk select operations and compare their performance.

Test Data

In the following sections I will work with a simple table called "TestData100k" that contains 100,000 rows.

-- Create table
CREATE TABLE TestData100K (
 Id int IDENTITY(1,1) NOT NULL,
 SomeGuid UNIQUEIDENTIFIER NOT NULL,
   PRIMARY KEY CLUSTERED ([Id] ASC)
);

-- Fill 100k rows
INSERT INTO TestData100K
SELECT TOP(100000)
   NEWID()
FROM master.sys.all_columns a
   CROSS JOIN master.sys.all_columns b

For the sake of simplicity we will use a text file containing only the IDs that we match against our test table:
1
2
3
...
n
We will select rows by their ID and return the "SomeGuid" column as a placeholder for information to be processed in subsequent operations.
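
The C# snippets in the following sections share a few helpers (CreateOpenConnection, CreateIdReader and the fields _cnStr, _fileName and _count) that are not shown in this post. A rough sketch of how they could look; the connection string, file name and class name are only assumptions:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.IO;

class BulkSelectTests {
   // assumptions: local SQL Server instance, test database and ID file
   private readonly string _cnStr = "Data Source=.;Initial Catalog=Test;Integrated Security=SSPI";
   private readonly string _fileName = "ids.txt";
   private readonly int _count = 10000;   // number of IDs in the file

   // opens a new connection to the test database
   private SqlConnection CreateOpenConnection() {
      var cn = new SqlConnection(_cnStr);
      cn.Open();
      return cn;
   }

   // enumerates the IDs from the text file, one per line
   private IdFileEnumerable CreateIdReader() {
      return new IdFileEnumerable(_fileName);
   }

   // disposable enumerable over the ID file, so it can be used in a using block
   private sealed class IdFileEnumerable : IEnumerable<int>, IDisposable {
      private readonly StreamReader _reader;

      public IdFileEnumerable(string fileName) { _reader = new StreamReader(fileName); }

      public IEnumerator<int> GetEnumerator() {
         string line;
         while ((line = _reader.ReadLine()) != null)
            yield return int.Parse(line);
      }

      IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

      public void Dispose() { _reader.Dispose(); }
   }
}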

Single Select Statements

The simplest solution is to process the data row by row.
Dictionary<int, Guid> guids = new Dictionary<int, Guid>(_count);
 
using (var cn = CreateOpenConnection())
using (var tx = cn.BeginTransaction())
using (var cmd = new SqlCommand("SELECT SomeGuid FROM TestData100k WHERE Id = @id", cn, tx))
using (var ids = CreateIdReader()) {
   cmd.Parameters.Add("@id", SqlDbType.Int);
   foreach (var id in ids) {
      cmd.Parameters[0].Value = id;
      var guid = (Guid)cmd.ExecuteScalar();
      guids.Add(id, guid);
   }
   tx.Commit();
}
However, as we will see in the performance comparison section, this solution might not be the best one when we need to work with larger amounts of data.

Bulk Copy (BCP)

When talking to database developers or DBAs, they will most likely tell you to use BCP. BCP is a powerful SQL Server API that allows copying large amounts of data into a SQL Server database. BCP does not work like regular INSERT statements; instead, it copies data to a table in large streams. This means you cannot use custom TSQL to implement logic while copying data to the server. Therefore the common approach when working with BCP is to copy all data into a staging table and then use TSQL to implement the business logic that works with the data from the staging table.
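
The staging table itself is not shown in this post; assuming it simply mirrors the list of IDs (the name StageBulkIds is taken from the code further below), it could look like this:

-- Hypothetical staging table for the imported IDs;
-- the clustered primary key helps the join against TestData100k
CREATE TABLE StageBulkIds (
   Id INT NOT NULL PRIMARY KEY CLUSTERED
);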

Probably the most common way to use BCP is the bcp.exe utility, a command line tool that is installed with SQL Server. Its usage is described in thousands of good articles all over the internet, so I will not add the 1,001st description here.

What we are going to do here is talk about the .NET implementation of the BCP API, provided by the SqlBulkCopy class. The nice benefit of using this class is that we can reuse existing .NET business logic while importing. The class provides two different ways to write data to the server: we can import an instance of a DataTable, or we can import an instance of an IDataReader. My personal advice is to prefer an IDataReader when importing large amounts of data, since we usually don't need the considerable overhead of a DataTable, and the DataTable based approach requires all data to be loaded into the client's memory before it can be written to the server.

// write data to server
// The IdDataReader represents an implementation of the IDataReader interface that 
// streams over the file and reads the IDs row by row
using (var idReader = new IdDataReader())
using (var bcp = new SqlBulkCopy(_cnStr, SqlBulkCopyOptions.TableLock)) {
   bcp.BatchSize = _count + 1;
   bcp.DestinationTableName = "StageBulkIds";
   bcp.ColumnMappings.Add(new SqlBulkCopyColumnMapping(0, 0));
   bcp.WriteToServer(idReader);
}
 
// select guids
const string sql = @"SELECT td.Id, td.SomeGuid"
                     + " FROM StageBulkIds ids"
                     + " JOIN TestData100k td ON ids.Id = td.Id";
using (var cn = CreateOpenConnection())
using (var cmd = new SqlCommand(sql, cn)) {
   Dictionary<int, Guid> guids = new Dictionary<int, Guid>(_count);
   using (var reader = cmd.ExecuteReader()) {
      while (reader.Read()) {
         guids.Add((int)reader[0], (Guid)reader[1]);
      }
   }
}
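
The IdDataReader class itself is not shown in this post either. A rough sketch of such a streaming IDataReader follows; it assumes the file name and that SqlBulkCopy, with the ordinal column mapping used above, mainly touches FieldCount, Read and GetValue, so most other interface members are simple defaults or stubs:

using System;
using System.Data;
using System.IO;

// Hypothetical sketch of the IdDataReader used above: it streams the IDs from
// the text file line by line instead of loading them all into memory.
class IdDataReader : IDataReader {
   private readonly StreamReader _file;
   private string _line;

   // the post's version presumably knows the file name itself; a default is assumed here
   public IdDataReader() : this("ids.txt") { }
   public IdDataReader(string fileName) { _file = new StreamReader(fileName); }

   public int FieldCount { get { return 1; } }
   public bool Read() { return (_line = _file.ReadLine()) != null; }
   public object GetValue(int i) { return int.Parse(_line); }

   public void Dispose() { _file.Dispose(); }
   public void Close() { _file.Close(); }
   public bool IsClosed { get { return false; } }
   public int Depth { get { return 0; } }
   public int RecordsAffected { get { return -1; } }
   public bool NextResult() { return false; }
   public DataTable GetSchemaTable() { throw new NotSupportedException(); }

   // IDataRecord members that are not needed for this scenario
   public string GetName(int i) { return "Id"; }
   public int GetOrdinal(string name) { return 0; }
   public Type GetFieldType(int i) { return typeof(int); }
   public string GetDataTypeName(int i) { return "int"; }
   public bool IsDBNull(int i) { return false; }
   public object this[int i] { get { return GetValue(i); } }
   public object this[string name] { get { return GetValue(0); } }
   public int GetValues(object[] values) { values[0] = GetValue(0); return 1; }
   public int GetInt32(int i) { return (int)GetValue(i); }
   public string GetString(int i) { return _line; }
   public bool GetBoolean(int i) { throw new NotSupportedException(); }
   public byte GetByte(int i) { throw new NotSupportedException(); }
   public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length) { throw new NotSupportedException(); }
   public char GetChar(int i) { throw new NotSupportedException(); }
   public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length) { throw new NotSupportedException(); }
   public Guid GetGuid(int i) { throw new NotSupportedException(); }
   public short GetInt16(int i) { throw new NotSupportedException(); }
   public long GetInt64(int i) { throw new NotSupportedException(); }
   public float GetFloat(int i) { throw new NotSupportedException(); }
   public double GetDouble(int i) { throw new NotSupportedException(); }
   public decimal GetDecimal(int i) { throw new NotSupportedException(); }
   public DateTime GetDateTime(int i) { throw new NotSupportedException(); }
   public IDataReader GetData(int i) { throw new NotSupportedException(); }
}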

IN Statements (and Their Issues)

Another commonly used solution is to build a SQL statement that contains a large list of IN arguments to be matched.
var lines = File.ReadLines(_fileName);
string sql = "SELECT Id, SomeGuid FROM TestData100k WHERE Id IN (" 
             + string.Join(",", lines)
             + ");";
         
using (var cn = CreateOpenConnection())
using (var tx = cn.BeginTransaction())
using (var cmd = new SqlCommand(sql, cn, tx)) {
   Dictionary<int, Guid> guids = new Dictionary<int, Guid>(_count);
   using (var reader = cmd.ExecuteReader()) {
      while (reader.Read()) {
         guids.Add((int)reader[0], (Guid)reader[1]);
      }
   }
   tx.Commit();
}
Unfortunately, this solution brings up some issues:
  • IN statements only work when we match rows by one single column. If we need to match data by more than one column, this approach does not work.
  • Since it is not possible to send the list of IDs as a parameter, we have to build the SQL statement by string concatenation, which opens the door for SQL injection.
  • Since every statement looks different, due to the varying arguments within the IN clause, SQL Server cannot reuse existing execution plans and needs to compile each statement anew.
  • Because we send a SQL statement with thousands of arguments within the IN clause, the whole statement becomes very large. This costs a lot of CPU time in SQL Server's compile engine before the execution even starts. If the list becomes too large, we can even get an out-of-memory error from the compiler (not from the SQL Server as a whole). The screenshot below shows a CPU utilization of almost 2 seconds for a list of 10,000 INT IDs (see the snippet after this list for a way to observe this yourself).
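
If you want to see this compile overhead on your own system, one simple way is to let SET STATISTICS TIME report the parse and compile time separately from the execution time; the IN list below is just a placeholder for the generated one:

SET STATISTICS TIME ON;

-- stand-in for the generated statement with thousands of IDs in the IN clause
SELECT Id, SomeGuid
FROM TestData100k
WHERE Id IN (1, 2, 3 /* ..., many more generated IDs */);

SET STATISTICS TIME OFF;
-- The Messages tab now shows "SQL Server parse and compile time"
-- in addition to the execution times.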


Server Side String Split Functions

Another approach is to concatenate the IDs into one string, send it as a text parameter to the server, and split the IDs into a temp table on the server side. After that, the temp table can be used for further processing.

You can use one of the functions that I posted at High Performance String Split Functions or any other, preferably fast, one from the internet.

For the performance tests in this post I used a CLR based split function, since these are the fastest split functions, as Jeff Moden and Paul White recently showed again in a great article on SQLServerCentral.com, "Tally OH! An Improved SQL 8K “CSV Splitter” Function".
string sql = @"CREATE TABLE #Ids (Id INT NOT NULL PRIMARY KEY CLUSTERED);"
 
            + " INSERT INTO #Ids"
            + " SELECT item"
            + " FROM clr_SplitString (@ids, ',');"
 
            + " SELECT td.Id, td.SomeGuid"
            + " FROM #Ids ids"
            + "    JOIN TestData100k td ON ids.Id = td.Id";
 
using (var cn = CreateOpenConnection())
using (var tx = cn.BeginTransaction())
using (var cmd = new SqlCommand(sql, cn, tx)) {
   Dictionary<int, Guid> guids = new Dictionary<int, Guid>(_count);
   var lines = File.ReadLines(_fileName);
   var ids = string.Join(",", lines);
 
   cmd.Parameters.Add("@ids", SqlDbType.VarChar, -1).Value = ids;
   using (var reader = cmd.ExecuteReader()) {
      while (reader.Read()) {
         guids.Add((int)reader[0], (Guid)reader[1]);
      }
   }
   tx.Commit();
}
As a side note: even though this concatenation looks like another open door for SQL injection, the IDs are sent as a VARCHAR parameter that is never executed as a SQL statement on the server side, so it cannot cause SQL injection issues. However, parsing the list of arguments on the server side can, of course, still cause type conversion errors.

One disadvantage of this solution is that, like IN statements, it only supports matching on a single column. If we need to match our data by more than one column, this approach becomes tricky, since we would need to send and parse a hierarchical list of IDs.
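
For illustration, a CLR based split function like the clr_SplitString used above could be sketched roughly as follows. This is only a minimal example, not the optimized function from the referenced article, and the assembly would still have to be deployed and the function registered with CREATE FUNCTION ... EXTERNAL NAME:

using System;
using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public static class StringSplitFunctions {
   // Minimal CLR table valued split function; returns one row per list item.
   [SqlFunction(FillRowMethodName = "FillSplitRow",
                TableDefinition = "item NVARCHAR(4000)")]
   public static IEnumerable clr_SplitString(SqlString input, SqlString delimiter) {
      if (input.IsNull || delimiter.IsNull || delimiter.Value.Length == 0)
         return new string[0];

      return input.Value.Split(new[] { delimiter.Value[0] },
                               StringSplitOptions.RemoveEmptyEntries);
   }

   // called by SQL Server for every element returned by the enumerable above
   public static void FillSplitRow(object row, out SqlString item) {
      item = new SqlString((string)row);
   }
}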

Table Valued Parameters

As I've already shown in other posts, table valued parameters are a really powerful SQL Server feature. Instead of only sending scalar values to the server, they are a native way to send whole sets of data in one network round trip, and they can be used directly within TSQL.

The first thing we need to do is create an appropriate user defined table type on SQL Server.
CREATE TYPE IntIDs AS TABLE (
 Id INT NOT NULL PRIMARY KEY CLUSTERED
)
Second, we need to implement an IEnumerable<SqlDataRecord> that will be used by ADO.NET to create the rows to be sent to SQL Server. (As with SqlBulkCopy, it is also possible to send a DataTable, although with the same drawbacks.)
class IntRecordEnumerator : IEnumerable<SqlDataRecord> {
   public IEnumerator<SqlDataRecord> GetEnumerator() {
      SqlMetaData metaData = new SqlMetaData("Id", SqlDbType.Int);
 
      using (StreamReader sr = new StreamReader(_fileName)) {
         string line;
         while (null != (line = sr.ReadLine())) {
            SqlDataRecord record = new SqlDataRecord(metaData);
            record.SetInt32(0, int.Parse(line));
            yield return record;
         }
      }
   }
 
   System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
      return this.GetEnumerator();
   }
}
Now we can use a SqlCommand to send all the data in one batch as a SqlParameter.
string sql = @"SELECT td.Id, td.SomeGuid"
               + " FROM @Ids ids"
               + "    JOIN TestData100k td ON ids.Id = td.Id";
 
using (var cn = CreateOpenConnection())
using (var tx = cn.BeginTransaction())
using (var cmd = new SqlCommand(sql, cn, tx)) {
   // parameter for our IDs. SqlDbType.Structured tells ADO.NET that we
   // use a table valued parameter
   var p = cmd.Parameters.Add("@ids", SqlDbType.Structured);
   // type of the Table Valued Type on SQL Server
   p.TypeName = "IntIDs";
   // the enumerator that provides the values to be filled 
   // into the table type variable on server side
   p.Value = new IntRecordEnumerator();
 
   Dictionary<int, Guid> guids = new Dictionary<int, Guid>(_count);
   using (var reader = cmd.ExecuteReader()) {
      while (reader.Read()) {
         guids.Add((int)reader[0], (Guid)reader[1]);
      }
   }
   tx.Commit();
}

Performance Comparison

Now that we have seen several different ways to meet our requirement, it's time for some performance considerations.

First, let's compare all options we've seen in this post.


For the sake of correctness I have to admit that the performance of the single statements, compared to the IN statements, might not be 100% valid for a production scenario. I used a local instance of SQL Server, which means that network latency is almost zero. In another environment the single statements might take more time, depending on the network architecture. Nevertheless, the IN statements are outperformed by far by the other approaches - and we should keep in mind all the other issues that come with IN statements.

Since three techniques sit so close to each other at the bottom of the above diagram, let's do some more performance tests, this time with larger numbers of IDs and with a focus on those three approaches (String Split, BCP and Table Valued Parameters).


As we can see, while the string split approach is really fast compared with the IN statements or single selects, it reaches its limits compared with the other two solutions. The BCP approach shows, as expected, great performance, even when working with really large amounts of data. However, the surprise (even to me) is that table valued parameters are not just fast, but about 30% faster than using SQL Server's BCP API.