A Strange Kind of a Relational EDI


Today I'll speak about a very different kind of data exchange. The following approach might sound strange for the first moment (it did to me!). However, after some days thinking about, I think this might be a new way which probably provides some powerful new solutions.

First as first. Let's start up with a general discussion about different types and directions of EDIs (Electronic Data Interchange).

Many of us work with EDIs to exchange data between different systems and/or locations. I think there are two main types of data interchange directions.

The first direction of EDIs are data-loader routines or processes which load data from 3rd level applications. Especially todays EAI (Enterprise Application Integration) applications are often related to many different other applications. This might be an order system which takes customer information from a CRM tool and some employee information from a HR-software. The important attribute here for this kind of data exchange is the communication with other applications which are not part of your own project.

The second kind of EDIs are synchronization routines or processes which are used to keep systems in sync. These processes are often used to exchange data between different servers (e.g. locations) of a distributed system. Another sample for sync routines is a offline support of a product catalog on a sales notebook. Last but not least synchronization processes are used for staging areas between operational and analysis systems like SSAS. The important attribute of this kind of data exchange in this post is the communication between systems of the same project or company.

As you see, I want to point the difference of communication between your own project modules and external systems.

Next we'll go on with a short listing of the most common available EDI formats and their strengths and weaknesses.

An EDI Format Inventory
Todays most common EDI formats seem to be CSV, XML. Sure there are several other - more specific - EDI formats like EDIFACT, BORD128, binary serialization and so forth. Though, for sake of brevity those formats are not scope of this post.

CSV
CSV files (separated by comma, tab, semicolon or whatever) are especially used to exchange huge amounts of data. The main strength of this format is its compactness. CSV files contain almost no overhead. The files are human readable with tools like Notepad (or any other text editor) or Excel. Especially data storage systems like SQL Server, many other RDBMSs or SSIS provide build in functionality to write and read CSV files.

On the other hand, CSV files have some weaknesses. First, the files are not relational, which means you cannot specify links between different kinds of data. Sure you can use distinct files for each object (or data) type and put everything together into an ZIP archive. Nevertheless, the files are not integrity checked. Another issue of CSV files is a missing search functionality. Microsofts OLEDB Jet-Provider enables simple SQL queries on CSV files but this is not a standard on other operational systems. Finally CSV files are not type safe. Your third column might be specified as an amount of goods, but you can never be sure that data in all rows are really stored in a decimal format and even if they are, you need to know the formatting information of the data. E.g. an amount of "1,234.56" in German format is "1.234,56". With date/time formats it becomes more funny ;-)

XML
Over the last some years XML became more and more the most common format for almost any kind of data exchange. Why? Here just a subset of reasons...

  • It supports references (links) between different objects. This could be direct parent-child relations or foreign-key like references to other XML-objects within or even without the current XML file.
  • With XPath it supports a full featured XML query language.
  • Almost all todays development kits provide build in XML support.
  • XML files can be validated via XSDs for data types, required and optional object properties and required relations.
  • It is human readable with any kind of text editor, web browsers and many IDEs.

A common approach to use XML for data exchange is object serialization and there are two generally different ways to handle XML serialization. In smaller projects it often makes sense to directly serialize the existing Domain Objects. In more complex domain models serialization is often done by creation of DTOs, since the domain model might be too complex and contain too much validation which might be not needed (or desired) while (de-)serialization. DTOs are a simplified and coarse grained simplified versions of fine grained domain objects.

Sure, XML can also be used without any build in serialization methods. Since this slightly eliminates the "easy to use" factor, this is usually only needed for very large amounts of data or very flexible requirements.

Independent of all the strength of XML, it also has some weaknesses which might be important for some requirements.

XML is a very verbose data format. Though, this can be handled by using any kind of compression (like ZIP, GZip and so on).

Another weakness of XML is XPath. Don't get me wrong, XPath is a powerful query language and I (almost) always get what I want. Nevertheless, most times original data storages are relational databases. When using XML you always have to either implement some kind of XPath based Mapper (when working with plain XML) or any object based way to find all de-serialized objects and their relations (when working with serialization).

A Relational EDI
Wouldn't it be cool to work with SQL when working with EDI, eh? In my opinion, yes! So, why don't do this?

Did you ever think about using embedded databases (like SQL Server Compact, SQLite or others)? Sounds strange to you? It did sound strange to me, too. However, who cares? After several days thinking about this approach, this appears to be a probably very new kind of data exchange.

When you work with an O/R-Mapper you might even get this feature as a gift. Even if you don't work with an O/R-Mapper, most usual SQL statements are supported by those new embedded database systems. When working with SQL Server Compact you might think about using Microsofts Sync Framework to load and update your data from your "database interchange".

Embedded databases support constraints primary keys, foreign keys, indexes (and some even more). Armed with these constraints, embedded databases provide a ensured relational integrity. You can also be sure that all your data are stored in correct format.

One potential issue of embedded databases is the file size. A new, empty SQL Server Compact database has a initial size of 20KB (for being empty :-D ). 20KB appear to be not too much for me, but what about file growing when data become loaded? After this initial size it grows less than XML files. I tried to create an XML and a SQL Server Compact database each containing all Customers (91 rows) and Orders (921 rows) from good old Northwind database. The XML file ended up with 437KB, the SQL Server Compact database ended up with 184KB (containing both primary keys and a foreign key from Orders to Customers). To be fair, I also tried to compress both resulting files with WinRAR. The XML file was compressed to 30KB, the database file was compressed to 31KB. Sure, "large" and "small" always depend on your environment, but in my opinion this compression appears to be okay.

Is missing human readability an issue of embedded databases? I'd say, yes ant no. You cannot open the files with a text editor. But you can use tools like SSMS (in case of SQL Server Compact) or any other database query tool. If you work with an O/R-Mapper or any other Data Access Layer which supports different database systems you can probably even use your tailor made business application to browse the data within the "interchange".

Nevertheless, the most important issue of using embedded databases as EDI is the fact that it is completely no standard today. For this reason this should only be used for in-project synchronization between different storages at the moment.

Conclusion
If network traffic is one of the issues you fight with when exchanging EDIs between systems, CSV is definitely the way to go.

If you need a standardized communication to any outside (of your project) module, XML is usually the way to go.

If you have to synchronize larger amounts of data between different parts of your own system, you might think about using embedded databases doing this job. At least, give it a try or some hours to think about ;-).

What's the Problem with List<T> Properties in BLL?

Today I'll speak about a practice for business logic layers (BLL).

The .NET List<T> is a nice and really handy class for in many destinations. However, this class is not the best solution for libraries like business logic layers (BLL) to provide related domain objects.

Above a very simple part of a fictive business logic layer (BLL).
class Order {
   public int Id { get; set; }
   // ...
}

class Customer {
   public int Id { get; set; }
   public List<Order> Orders { get; private set; }
   // ...
   internal void SetOrders(List<Order> orders) {
      this.Orders = orders;
   }
   // ...
}
Using the List<Order> in this solution can cause some problems.

Lacy Loading
You should always try to load all information needed for a specific business case. Though, there are reasons why this makes no sense. For instance if the don't know which related data are needed when you start your business transaction. Another reason might be a too huge count of related objects. Here is lazy loading a nice pattern.

In our fictive layer, there are cases when we need the orders of our customer, but there might be several other cases when we don't need the related orders. You can use a Builder to compose different customers for different business cases. However, sometimes the required relations to handle a business case are not clear when the business transaction starts.

Always to load too much information might become a performance problem. Lazy loading is a nice practice (when it is used correct) to ensure related information if they are needed. One way to handle lazy loading is AOP but sometimes you can end up in a ripple loading. For instance, if you have to handle thousand customers and their orders it makes more sense to load all (or a page of 100) orders in one database query. In this case you might prefer a special implementation which handles this loading.


Validation
You can derive from List<T> but it does not provide any way to validate the Order which become accessed/removed/added. Probably you want to validate newly added orders to ensure they are valid and/or not yet attached to any customer. What about deletion? It should not be possible to delete an order which is already billed. For these business cases, you need an observable list or a read-only list.

Presentation
This very depends on how flexible your BLL wants to be but think about your presentation layer. If you want to be able to bind your orders list to a data bindable control like the .NET DataGridView. The List<T> does not provide all required methods to enable add/remove handling for the grid.

Notice, I do not speak about any coupling between business and presentation layer but it might be neat if your BLL provides some binable objects which can be reused in win-, wpf- and/or web-applications.


Custom Behavior
As a last reason for now (there might be more) why List<T> should not be used here is customization of behavior. Probably your list needs to be sorted by any order information. Maybe you need an observer list which organizes this sorting, even if you change one or more of the contained orders.

Better Solution
A better approach to provide related 0...* objects is usually the IList<T> interface. This interface provides (almost) all methods which are available by using List<T> but it provides a very low coupled representation of the Order property. Feel free to use List<T> for common initialization within your BLL. If you ever need a special handling you are able to provide another list though.
class Customer {
   public int Id { get; set; }
   public IList<Order> Orders { get; private set; }
   // ...
   internal void SetOrders(IList<Order> orders) {
      this.Orders = orders;
   }
}
Another solution is to provide a completely custom list interface or base class (I always prefer interfaces). This also works fine and keeps you able to provide different lists with different behavior. Though, this has a larger startup cost for your project. IMO IList<T> fits fine for most consumers of a BLL.

Table-Valued Parameters - A Performance Comparison


As I wrote in some of my previous posts, table-valued parameters - which have been introduced with SQL Server 2008 - appear to be a great feature to enable client side bulk operations when working with SQL Server. They enable client data provider like ADO.NET to send a complete set of data to the server instead of calling serveral single row operations. Today we'll do some comparsions.

I'll start with a short (had to correct this...) description of the test environment. To use the best possible performance for both ways I'll use stored procedures.

The first test results focus the performance of single statements vs. bulk operations in a single client environemnt - means without any other concurrent activities. The second, more important tests cover a performance comparsion in a multi-client scenario. The last part will show a comparison for some larger bulk operations.

Test Environment

If you already read my previous posts you can skip this section. I try to keep this post uncoupled from my previous post since this covers another general topic (TVPs). So I have to repreat the test environment and the general way to use table-valued parameters from ADO.NET.

For my tests I'll used the following simple table to INSERT/UPDATE rows.

CREATE TABLE TestMany
(
   Id int IDENTITY(1,1) NOT NULL
      PRIMARY KEY CLUSTERED 
   ,SomeInt int NULL
   ,SomeDate datetime NULL
   ,SomeText varchar(100) NULL,
);

Whenever we intend to use table-valued parameters in SQL Server 2008, we need a table-valued type defined in SQL Server which can be used as a collection of Data Transfer Objects (DTOs) to transfer .NET native data to SQL Server.

CREATE TYPE TestManyMergeType AS TABLE(
   Id int NOT NULL
   ,Action char(1) NOT NULL
      CHECK (Action IN ('I', 'D', 'U'))
   ,SomeInt int NULL
   ,SomeDate datetime NULL
   ,SomeText varchar(100) NULL
   
   ,PRIMARY KEY CLUSTERED 
      (Id, Action)
);

This type covers our destination table with some slight exceptions. Since we will use table-valued parameters to push all changed client data in one package to SQL Server, we need a column which tells SS2k8 how to handle a distinct row. Therefore, we'll use the Action column which specifies the current row to be Inserted, Updated or Deleted. We don't use an IDENTITY specification for our "Id" column since we need to be able to specify an destination "Id" to be handled. To insert new rows we can use negative, unique "Id" values which are required to map newly inserted rows on client-side to their new server ids. To get the best selectivity we use a primary key over the "Id" and "Action" columns.

To provide a stored procedure which gets a table-valued parameter, the parameter has to be specified as READONLY in SS2k8.

CREATE PROCEDURE SaveTestMany
   @data TestManyMergeType READONLY
AS
   SET NOCOUNT ON;
   
   -- update
   UPDATE trg SET 
         SomeInt = src.SomeInt
         ,SomeDate = src.SomeDate
         ,SomeText = src.SomeText
      FROM TestMany trg 
         JOIN @data src 
         ON trg.Id = src.Id AND src.Action = 'U';

   -- insert
   INSERT INTO TestMany
      SELECT SomeInt,SomeDate,SomeText 
      FROM @data 
      WHERE Action = 'I' 
      ORDER BY Id

   -- return last IDENTITY and count of added rows
   SELECT SCOPE_IDENTITY(), @@ROWCOUNT;
GO

The important most important part of this procedure is the ORDER BY clause in the INSERT statement. This ensures that we can use the information about the last generated identity value (by SCOPE_IDENTITY()) and the count of inserted rows (by @@ROWCOUNT) to map the provided surrogate ids to the generated server ids. Maybe you think this might be dangerous without a MAXDOP option or anything else and maybe you are right, I just can tell you that I tried different order conditions and all rows have always been inserted in an exact order. So feel free to trust this or not. If you don't trust it (what is a good thing) do your own tests (what is a good thing, too) or just feel free to leave this post (what would be a really bad thing).

To provide a table-valued parameter to a stored procedure from ADO.NET you have two possibilities. You can either provide your data by passing a System.Data.DataTable (which appears be too blown just to provide simple data records in my opinion) or a custom implementation of an IEnumerable<SqlDataRecord>. I'll use the enumerator in my tests since it is way faster and does not require to copy all domain objects into a data table.

class TestManyEnumerator : IEnumerable<SqlDataRecord>
{
   public TestManyEnumerator(IEnumerable<TestMany> data) {
      _data = data;
   }

   private IEnumerable<TestMany> _data;

   public IEnumerator<SqlDataRecord> GetEnumerator() {
      SqlMetaData[] metaData = {
         new SqlMetaData("Id", SqlDbType.Int),
         new SqlMetaData("MergeAction", SqlDbType.Char, 1),
         new SqlMetaData("SomeInt", SqlDbType.Int),
         new SqlMetaData("SomeDate", SqlDbType.DateTime),
         new SqlMetaData("SomeText", SqlDbType.VarChar, 100),
      };

      foreach (var item in _data) {
         SqlDataRecord record = new SqlDataRecord(metaData);

         record.SetValue(0, item.Id);
         record.SetValue(1, item.ChangeState);
         record.SetValue(2, item.SomeInt);
         record.SetValue(3, item.SomeDate);
         record.SetValue(4, item.SomeText);

         yield return record;
      }
   }

   System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
      return this.GetEnumerator();
   }
}

This enumerator handles domain objects from type TestMany which map exactly to our above specified table-valued T-SQL type. ChangeState specifies the state ('I', 'D', 'U') of a particular object.

To call our procedure with a set of data to be updated we have to specify a SqlParameter with SqlDbType.Structured, the name of the destination table-valued type as TypeName and provide the previous enumerator as Value.

using (SqlConnection cn = GetConnection(true))
using (SqlTransaction tran = cn.BeginTransaction()) {
   // get test data
   IList<TestMany> data = GetTestData();

   // create a new command as stored procedure
   SqlCommand cmd = new SqlCommand("MergeTestMany", cn);
   cmd.CommandType = CommandType.StoredProcedure;
   cmd.Transaction = tran;
   // create a new parameter ("Structured" defines a TVP)
   SqlParameter p = cmd.Parameters.Add("@data", SqlDbType.Structured);
   // specify the name of the TVP
   p.TypeName = "TestManyMergeType";
   p.Value = new TestManyEnumerator(data);

   // catch the returned IDENTITY information
   using (SqlDataReader reader = cmd.ExecuteReader()) {
      reader.Read();
      int lastId = reader.IsDBNull(0) ? -1 : (int)reader.GetDecimal(0);
      int count = reader.GetInt32(1);
      // synchronize all newly added keys
      SyncAddedIds(data, count, lastId);
   }

   tran.Commit();
}

For a more detailed explanatin and the implementation of GetTestData():IList<TestMany> and GetConnection(open):SqlConnection please have a look here. For a generic approach, have a look at A Generic BDP Approach for SQL Server 2008.

The procedures to insert single rows into our TestMany table are just simple as they can be.

CREATE PROCEDURE [dbo].[InsertOneTestMany]
   @id INT OUTPUT
   ,@someInt INT
   ,@someDate DATETIME
   ,@someText VARCHAR(100)
AS
   SET NOCOUNT ON;
   
   INSERT INTO TestMany (SomeInt, SomeDate, SomeText)
      SELECT @someInt, @someDate, @someText;
   
   SELECT @id = SCOPE_IDENTITY();
GO

CREATE PROCEDURE [dbo].[UpdateOneTestMany]
   @id INT
   ,@someInt INT
   ,@someDate DATETIME
   ,@someText VARCHAR(100)
AS
   SET NOCOUNT ON;
   
   UPDATE TestMany SET 
         SomeInt = @someInt
         ,SomeDate = @someDate
         ,SomeText = @someText
      WHERE Id = @id;
GO

I think there is nothing which needs to be explained.

We're done with our test enviroment.


Not part of this Post

Calling a single statement procedure from .NET straight forward and part of many tutorials. Therefore I kept this source code away from here. I also don't show how to use multi-threading to call the above procedures in a concurrency scenario.

I kept DELETE operations out of the party since they are often done by special purge processes instead of usual business applications.

Since I don't have a home network, I did my tests with a SQL Server running on a VM on my local host. Therefore I cannot show you any comparison differences depending on network latency.

Single Client Tests

Let's compare the performance of single INSERT/UPDATE statements vs. bulk operations when using a single client connection. To get (almost) linear test results I executed each test configuration 1,000 times. One important thing is that I always called the specified count of INSERT and UPDATE operations in one round trip. So, "Count Insert/Update" with value "1" calls one INSERT and one UPDATE operation, count "50" affects hundred rows. I did my tests with a destination table with just 100,000 rows and witha a table with 10,000,000 rows. Both table have been recreated after/before each test configuration and flushed by calling "DBCC DROPCLEANBUFFERS() WITH NO_INFOMSGS;". To ensure as many index read operations as possible I arranged all operations over the complete complete id range of the table.

Duration on a destination table with 100,000 rows after 1,000 executions.

Insert/UpdateDuration
CountSingle StatementsBulk Operation
11,299 ms3,285 ms
32,319 ms2,995 ms
53,201 ms3,007 ms
105,553 ms3,121 ms
2011,923 ms3,449 ms
5027,644 ms4,769 ms

Duration on a destination table with 10,000,000 rows after 1,000 executions.

Insert/UpdateDuration
CountSingle StatementsBulk Operation
12,631 ms4,076 ms
33,614 ms4,200 ms
54,608 ms4,470 ms
107,541 ms4,381 ms
2012,308 ms4,846 ms
5026,173 ms6,336 ms

As you can see, there is a point at about five INSERT/UPDATE operations (means 10 rows!) in one single transaction. With less than ten operations for the same table single statements are faster than bulk operations with table-valued parameters. If we hit this count of rows to be changed, table-valued parameters become way faster than classical single row operations.

A work load between one and fifty INSERT and UPDATE tasks shows that single row operations take about 10 times longer (what's a good thing), the duration of bulk operations increase only about 1.5 times (what appears to be a better thing ;-) ).

Multi-Client Tests

As initially written. SS2k8 is a server and no geeky feature has any worth if it works only in a single-client scenario. Therefore, now we'll have a look at some multi-client tests.

Since concurrent executions take more time I ran the following tests only 100 times. I also changed the count of rows within the destination table to 500,000 what was a convenient value to arrange the threads over the whole table - to get the highest count of reads).

The following results show the overall duration of 100 executions.

Insert/UpdateDuration
CountSingle StatementsBulk Operation
1 Thread
1159 ms298 ms
3237 ms358 ms
5344 ms371 ms
10562 ms391 ms
201,040 ms427 ms
502457 ms558 ms
2 Threads
1387 ms348 ms
3679 ms368 ms
5854 ms438 ms
101,597 ms421 ms
203,513 ms575 ms
506,839 ms830 ms
5 Threads
11,466 ms936 ms
32,528 ms943 ms
53,542 ms1,007 ms
105,837 ms1,097 ms
2010,995 ms1,293 ms
5026,092 ms2,251 ms
10 Threads
14,820 ms1,604 ms
313,290 ms1,936 ms
512,624 ms1,838 ms
1028,829 ms2,012 ms
2045,414 ms2,789 ms
5097,498 ms4,610 ms

Again, remember "1" INSERT/UPDATE means two operations within one transaction. Since single row operations usually scale almost linear they are still the faster solution for single operations. However, as we see bulk operations using with table-valued parameters perform really fine in a multi-threaded environment.

Some Larger Bulks

Okay, we speak about bulk operations so finally, let's do some real bulk operations.

The following results show the complete duration of 10 executions.

Insert/UpdateDuration
CountSingle StatementsBulk Operation
1,0004,845 ms370 ms
5,00023,190 ms2,033 ms
10,00046,539 ms4,854 ms
20,00095,086 ms7,209 ms

In my opinion 400,000 changed rows (remember we did 20,000 INSERT and UPDATE operations and we executed 10 times!) in 7 seconds appear to be a really great performance. ;-)

Conclusion

Apparently table-valued parameters might be a great feature for systems which work with bulk data manipulation. This might be some web or rich client applications but especially backend processes like data loader or synchronization processes.

What does this mean? Probably nothing for a general statement. If you think bulk operations with table-valued parameters might be a nice feature for your project. Always do your own tests! We did no tests with a real network and we did not test DELETE operations.

As a general advice, always try to stay innovative but validate everything. My intention was not to show you that you have to change all your running code, but to think about some new features which might be helpful.

A Generic BDP Approach for SQL Server 2008

As I promised in my previous post (SQL Server and Bulk Data Persistence (BDP), today I'll show a possible generic approach of how to integrate Bulk Data Persistence without writing a new enumerator for each domain object. In addition we will encapsulate the ID synchronization into a small framework.

If you did not yet read the previous posts, please read it before you goon here. I see you as a person who knows the idea of BDP and how to handle it with SQL Server 2008.

Our Task List
Let's reflect the steps of the previous post to figure out all tasks which have to be done:
  • We have to provide SqlMetaData objects to map our domain object properties
  • We have to provide an enumerator through the objects and create a SqlDataRecord
  • We have to call some T-SQL to do the persistence
  • We have to map all the new server ids to our domain objects
Sounds really complicated? It's not as hard as it sounds. It's more complicated than a command or simple strategy pattern but not much more than a usual Unit of Work pattern.

ObjectState Enumeration
Before we start with our tasks, we have to change a dirty non-OO cheat we previously used. We change the "pseudo enum" to represent the state of a domain object to a real enum ObjectState.
public enum ObjectState
{
   Added,
   Deleted,
   Changed,
   Unchanged,
}
Okay, back to our task list...

Column Mapping
The next thing we need is a general class to map the properties of a domain object into a SqlMetaData object. Above we can see a possible SqlMergeColumn<T> class which represents all information which are generally needed to create the meta data objects.
class SqlMergeColumn<T>
{
   internal SqlMergeColumn(string columnName, SqlDbType sqlType,
                           Func<T, object> getData) {
      ColumnName = columnName;
      SqlType = sqlType;
      _getData = getData;
   }

   // callback to get the value from a current item
   Func<T, object> _getData;
   
   public string ColumnName { get; private set; }
   public SqlDbType SqlType { get; private set; }
   public int MaxLength { get; internal set; }
   public byte Precision { get; internal set; }
   public byte Scale { get; internal set; }

   public virtual object GetData(T item) {
      // get the field value or DBNull
      return _getData(item) ?? DBNull.Value;
   }

   public SqlMetaData GetMetaData() {
      // get meta data for this property mapping
      if (MaxLength != 0)
         return new SqlMetaData(ColumnName, SqlType, MaxLength);
      if (Precision != 0)
         return new SqlMetaData(ColumnName, SqlType, Precision, Scale);
      return new SqlMetaData(ColumnName, SqlType);
   }
}
As you see, the larger part of the class is a simple representation of properties which are needed to create a SqlMetaData object. The MaxLength property is needed for data types with variable length like text or binary data-types. The Precision and Scale properties are required for some numeric types like decimal.

The GetData(T):object method gets an object of generic type of the class. (I always try to avoid reflection if possible, since it is quiet slow and always hard do maintain.) The GetData(T):object returns a value of the specified item by utilizing a delegate Func<T,TReturn> which was provided by the constructor of the class.

The last method GetMetaData():SqlMetaData is a typical strategy method which returns the meta data for a SqlDataRecord enumerator depending on the current configuration of an instance of the class.

We have to handle two special column types. First, we need the identity column to be able to join data for UPDATE and DELETE and synchronize the server identity values with our newly added objects. Second, we need column which can tell us the type of the BDP action (add, change, delete) to be executed. To stay OO geeky we use two additional classes which inherit from SqlMergeColumn<T>.
// represents an identity merge column
sealed class SqlMergeIdColumn<T> : SqlMergeColumn<T>
{
   public SqlMergeIdColumn(string columnName, Func<T, int> getIdData, 
                           Action<T, int> setIdData)
      : base(columnName, SqlDbType.Int, (t) => getIdData(t)) {
      _setIdData = setIdData;
   }

   // private fields
   Action<T, int> _setIdData;

   public int GetIdData(T item) {
      return (int)base.GetData(item);
   }

   public void SetIdData(T item, int id) {
      _setIdData(item, id);
   }
}

// represents a merge action column
sealed class SqlMergeActionColumn<T> : SqlMergeColumn<T>
{
   public SqlMergeActionColumn(string columnName, 
                               Func<T, ObjectState> getMergeAction)
      : base(columnName, SqlDbType.Char, (t) => getMergeAction(t)) {
      MaxLength = 1;
   }

   public override object GetData(T item) {
      switch (GetMergeAction(item)) {
         case ObjectState.Added: return 'I';
         case ObjectState.Changed: return 'U';
         case ObjectState.Deleted: return 'D';
         default: return 'X';
      }
   }

   public ObjectState GetMergeAction(T item) {
      return (ObjectState)base.GetData(item);
   }
}
The SqlMergeIdColumn<T> class provides two additional methods GetIdValue(T):int to return an System.Int32 id value and SetIdValue(T,int) to set a new id value after our database operation.

The SqlMergeActionColumn<T> overwrites the GetData(T):object method to transform the ObjectState enum into a System.Char for our data mapping.

GetMergeAction(T):ObjectState returns the enum value of a specified domain object.

Now, since we have column classes to specify a mapping to a table-valued parameter, we need a collection to store those mappings. I use a System.Collections.ObjectModel.Collection<T> inherited collection in this sample.
// represents a collection of SqlMergeColumn<T> objects
sealed class SqlMergeColumnCollection<T>
   : System.Collections.ObjectModel.Collection<SqlMergeColumn<T>>
{
   // add a data column
   public void AddDataColumn(string columnName, SqlDbType sqlType,
                             Func<T, object> getData) {
      this.AddDataColumn(columnName, sqlType, 0, getData);
   }

   // add a data column
   public void AddDataColumn(string columnName, SqlDbType sqlType,
                             int maxLength, Func<T, object> getData) {
      SqlMergeColumn<T> mapping = new SqlMergeColumn<T>(
         columnName, sqlType, getData);
      mapping.MaxLength = maxLength;
      base.Add(mapping);
   }

   // add a data column
   public void AddDataColumn(string columnName, SqlDbType sqlType,
                             byte precision, byte scale, 
                             Func<T, object> getData) {
      SqlMergeColumn<T> mapping = 
         new SqlMergeColumn<T>(columnName, sqlType, getData);
      mapping.Precision = precision;
      mapping.Scale = scale;
      base.Add(mapping);
   }

   // set the id column
   public void SetIdColumn(string columnName, 
                           Func<T, int> getIdData,
                           Action<T, int> setIdData) {
      base.Add(new SqlMergeIdColumn<T>(columnName, getIdData, setIdData));
   }

   // set the merge action column
   public void SetMergeColumn(string columnName, 
                              Func<T, ObjectState> getMergeAction) {
      base.Add(new SqlMergeActionColumn<T>(columnName, getMergeAction));
   }

   // get the id column
   internal SqlMergeIdColumn<T> GetIdColumn() {
      return this.OfType<SqlMergeIdColumn<T>>().First();
   }

   // get the merge column
   internal SqlMergeActionColumn<T> GetMergeColumn() {
      return this.OfType<SqlMergeActionColumn<T>>().First();
   }

   // get only the data columns
   internal IEnumerable<SqlMergeColumn<T>> GetDataColumns() {
      return this.Where(c => c.GetType() == typeof(SqlMergeColumn<T>));
   }
}
We've got three different implementations of AddDataColumn to add usual data column mappings. In addition we've got two methods (SetIdColumns and SetMergeColumn) to specify the special id and merge-action columns. At last we have three methods (GetIdColumn, GetMergeColumn and GetDataColumns) to access the different column types. I marked those methods as internal since they are only needed within the BDP framework and usually not important for a consumer of the framework.

Note, this framework is just a sample implementation. To keep it simple, I didn't add the possibility to specify the order of all columns. This means, currently all columns have to be specified in same order as defined in the table-valued type. Feel free to add an order parameter for all methods if you need this.

Data Enumerator
As we know, we have to provide a IEnumerable<SqlDataRecord> as value of a .NET SqlParameter which represents a table-valued parameter. The following listing shows a generic implementation of this class.
sealed class Sql2k8MergeEnumerable<T> : IEnumerable<SqlDataRecord>
{
   // generic class to handle field access by lambda expression
   // create a new enumerator
   internal Sql2k8MergeEnumerable(IEnumerable<T> data, 
                                  SqlMergeColumnCollection<T> mappings) {
      _data = data;
      _mappings = mappings;
   }

   // private fields
   IEnumerable<T> _data;
   SqlMergeColumnCollection<T> _mappings;

   // the enumerator which will be called from ADO.NET for a SqlParameter value
   public IEnumerator<SqlDataRecord> GetEnumerator() {
      // get all SqlMetaData for specified column mappings
      SqlMetaData[] metaData =
        _mappings.Select(mapping => mapping.GetMetaData()).ToArray();

      foreach (var item in _data) {
         // create a new record
         SqlDataRecord record = new SqlDataRecord(metaData);
         // fill the record with its values
         for (int i = 0; i < _mappings.Count; i++)
            record.SetValue(i, _mappings[i].GetData(item));
         // return the current record
         yield return record;
      }
   }

   IEnumerator IEnumerable.GetEnumerator() {
      return this.GetEnumerator();
   }
}
The class gets the data to iterate through and an instance of our above implemented merge column collection.

The GetEnumerator():IEnumerator<SqlDataRecord> method utilizes the column mappings collection to create the meta data objects we need for a data record. After this it iterates through all data and returns the data records. To fill the records it uses the GetData method we already defined on our column classes.

Merge Command
A quick look back to our task list shows to missing tasks. We need something to generate the persisting T-SQL and we have to sync the client surrogate ids to the server side generated identity values. (I added both functionalities into one class, feel free to change this for a lower coupling.)
class Sql2k8MergeCommand<T>
{
   public Sql2k8MergeCommand(string tableName, string tableTypeName, 
                             IEnumerable<T> data) {
      _tableName = tableName;
      _tableTypeName = tableTypeName;
      _data = data;
      Columns = new SqlMergeColumnCollection<T>();
   }

   // private fields
   string _tableName;
   string _tableTypeName;
   IEnumerable<T> _data;

   public SqlMergeColumnCollection<T> Columns { get; private set; }
   public SqlConnection Connection { get; set; }
   public SqlTransaction Transaction { get; set; }

   // execute the merge command
   public void Execute() {
      // create an SqlCommand
      SqlCommand cmd = new SqlCommand();
      cmd.Connection = Connection;
      cmd.Transaction = Transaction;
      // get the dynamic SQL
      cmd.CommandText = GetCommandText();
      // create a parameter which gets all data as table-valued parameter
      SqlParameter data = cmd.Parameters.Add("@data", SqlDbType.Structured);
      data.TypeName = _tableTypeName;
      data.Value = new Sql2k8MergeEnumerable<T>(_data, Columns);

      // execute the command and read the IDENTITY and INSERT row count
      using (SqlDataReader reader = cmd.ExecuteReader()) {
         SyncSurrogateIds(reader);
      }
   }

   // get the sql command text to be executed
   private string GetCommandText() {
      StringBuilder update = new StringBuilder(0x100);
      StringBuilder insert = new StringBuilder(0x100);

      // build columns for INSERT and UPDATE
      foreach (var column in Columns.GetDataColumns()) {
         insert.AppendFormat("{0},", column.ColumnName);
         update.AppendFormat("trg.{0}=src.{0},", column.ColumnName);
      }
      // remove the trailing ","
      insert.Remove(insert.Length - 1, 1);
      update.Remove(update.Length - 1, 1);

      string id = Columns.GetIdColumn().ColumnName;
      string merge = Columns.GetMergeColumn().ColumnName;
      StringWriter sql = new StringWriter(new StringBuilder(0x400));

      // delete
      sql.WriteLine("DELETE trg FROM {0} trg WHERE EXISTS (", _tableName);
      sql.WriteLine("   SELECT * FROM @data src");
      sql.WriteLine("   WHERE src.{0} = trg.{0} AND src.{1} = 'D');", 
         id, merge);
      // update
      sql.WriteLine("UPDATE trg SET {0} FROM {1} trg", update, _tableName);
      sql.WriteLine("   JOIN @data src ON trg.{0} = src.{0} AND src.{1} = 'U';",
         id, merge);
      // insert
      sql.WriteLine("INSERT INTO {0} ({1}) SELECT {1} FROM @data", 
         _tableName, insert);
      sql.WriteLine("   WHERE {0} = 'I' ORDER BY {1};", merge, id);
      // return identity information
      sql.WriteLine("SELECT SCOPE_IDENTITY(), @@ROWCOUNT;");

      return sql.ToString();
   }

   private void SyncSurrogateIds(SqlDataReader reader) {
      int lastId;
      int count;

      // get identity information from reader
      reader.Read();
      lastId = reader.IsDBNull(0) ? -1 : (int)reader.GetDecimal(0);
      count = reader.GetInt32(1);
      // no new rows added
      if (count == 0)
         return;
      
      SqlMergeActionColumn<T> mergeCol = Columns.GetMergeColumn();
      SqlMergeIdColumn<T> idCol = Columns.GetIdColumn();
      // get client surrogate ids
      SortedList<int, T> idLookup = new SortedList<int, T>(count);
      foreach (var item in _data)
         if (mergeCol.GetMergeAction(item) == ObjectState.Added)
            idLookup.Add(idCol.GetIdData(item), item);

      // map client ids to server ids
      IList<int> clientIds = new List<int>(idLookup.Keys);
      foreach (var clientId in clientIds)
         idCol.SetIdData(idLookup[clientId], lastId - --count);
   }
}
The merge command class orchestrates all other classes and represents the consumer interface for this simple Bulk Data Persistence framework.

The constructor gets the name of the database table, the name of the table-valued type and the data to be persisted.

The properties Connection and Transaction are used to configure the destination database.

The Columns:SqlMergeColumnCollection<T> property enables the consumer to configure the mapping between a domain object and a database table.

The Execute() method performs the database operation. It creates an ADO.NET SqlCommand, specifies the dynamic T-SQL by calling the GetCommandText():string method, executes the command and utilizes the SyncSurrogateIds(SqlDataReader) method to synchronize the ids of added rows.

As I wrote in my previous post, probably the SQL Server 2008 MERGE statement performs better than classical separate INSERT/UPDATE/DELETE operations. I have to investigate this and will share my results in a separate post.

How to use the Framework
Okay. Now we've got a shiny new framework. How to use?

I recycle the test environment we've already seen in the previous post to persist TestMany objects.

Below a T-SQL DDL which specifies the destination table and the table-valued type DTO.
-- the database table
CREATE TABLE TestMany (
   Id int NOT NULL IDENTITY(1,1)
      PRIMARY KEY CLUSTERED
   ,SomeInt int NULL
   ,SomeDate datetime NULL
   ,SomeText varchar(100) NULL
);

-- the table-valued type DTO
CREATE TYPE TestManyMergeType AS TABLE(
   Id int NOT NULL
   ,Action char(1) NOT NULL
      CHECK (Action IN ('D', 'U', 'I'))
   ,SomeInt int NULL
   ,SomeDate datetime NULL
   ,SomeText varchar(100) NULL
   
   ,PRIMARY KEY CLUSTERED 
      (Id, Action)
);
And our domain object which maps to the database table.
public class TestMany
{
   public TestMany(int id, int someInt, DateTime someDate,
                   string someText, ObjectState changeState) {
      Id = id;
      SomeInt = someInt;
      SomeDate = someDate;
      SomeText = someText;
      ChangeState = changeState;
   }

   public int Id { get; set; }
   public int SomeInt { get; set; }
   public DateTime SomeDate { get; set; }
   public string SomeText { get; set; }
   public ObjectState ChangeState { get; set; }
}
Anyway, what do we need to consume our framework?

We need one class which inherits our SqlMergeCommand<T> class to specify the column mapping.
class TestManyMergeCommand : Sql2k8MergeCommand<TestMany>
{
   public TestManyMergeCommand(IEnumerable<TestMany> data)
      : base("TestMany", "TestManyMergeType", 
             data.Where(t => t.ChangeState != ObjectState.Unchanged)) {
      // map the identity column
      Columns.SetIdColumn("Id", (t) => t.Id, (t, id) => t.Id = id);
      // map the merge action column
      Columns.SetMergeColumn("Action", (t) => t.ChangeState);
      // map the data columns
      Columns.AddDataColumn("SomeInt", SqlDbType.Int, 
                            (t) => t.SomeInt);
      Columns.AddDataColumn("SomeDate", SqlDbType.DateTime, 
                            (t) => t.SomeDate);
      Columns.AddDataColumn("SomeText", SqlDbType.VarChar, 100, 
                            (t) => t.SomeText);
   }
}
The class calls the base constructor to specify the destination table and the name of the table-valued type. The body of the constructor does the mapping. Done.

(At this point I have to say "Thanks god for Lambda Expressions" to provide the callback methods. :-P )

Last but not least a sample how to use this class to persist data.
public void GenericMergeTest() {
   using (SqlConnection cn = GetConnection(true))
   using (SqlTransaction tran = cn.BeginTransaction()) {
      // get test data
      IList<TestMany> data = GetTestData(0);

      TestManyMergeCommand mergeCommand = new TestManyMergeCommand(data);
      mergeCommand.Connection = cn;
      mergeCommand.Transaction = tran;
      mergeCommand.Execute();

      tran.Commit();
   }
}
As you see, we just have to initialize our custom command with data to be persisted, specify the database information and execute the command.

Conclusion
As you see, the complete framework takes about 350 lines of code what sounds quiet fair to me.

The merge command class can be plugged in to any existing O/R-Mapper framework which provides a "Unit of Work" functionality and a method/event to catch when data become persisted. If you like to use Bulk Data Persistence you don't have to rewrite your existing persistence layer.

Note, I did not add any error handling to this sample. If you like to use this in your production environment you have to add this. (E.g. to avoid duplicate column names for the mapping.)

If you already use an O/R-Mapper, you should change the way how to map your properties to columns. In this case you don't need a custom command for each of your domain objects, but you need a way to map the meta data of your O/R-Mapper to a merge command.