Monday, December 21, 2009

A Strange Kind of a Relational EDI


Today I'll speak about a very different kind of data exchange. The following approach might sound strange at first (it did to me!). However, after some days of thinking about it, I believe it might be a new way which provides some powerful new solutions.

First things first. Let's start with a general discussion about the different types and directions of EDI (Electronic Data Interchange).

Many of us work with EDIs to exchange data between different systems and/or locations. I think there are two main directions of data interchange.

The first direction of EDI is data-loader routines or processes which load data from third-party applications. Especially today's EAI (Enterprise Application Integration) applications are often connected to many different other applications. This might be an order system which takes customer information from a CRM tool and some employee information from HR software. The important attribute here is that this kind of data exchange communicates with other applications which are not part of your own project.

The second kind of EDI is synchronization routines or processes which are used to keep systems in sync. These processes are often used to exchange data between different servers (e.g. locations) of a distributed system. Another example of a sync routine is offline support for a product catalog on a sales notebook. Last but not least, synchronization processes are used for staging areas between operational and analysis systems like SSAS. The important attribute of this kind of data exchange, for this post, is that the communication happens between systems of the same project or company.

As you see, I want to point out the difference between communication among your own project modules and communication with external systems.

Next we'll go on with a short overview of the most common EDI formats and their strengths and weaknesses.

An EDI Format Inventory

Today's most common EDI formats seem to be CSV and XML. Sure, there are several other - more specific - EDI formats like EDIFACT, BORD128, binary serialization and so forth. However, for the sake of brevity those formats are out of scope for this post.

CSV

CSV files (separated by comma, tab, semicolon or whatever) are especially used to exchange huge amounts of data. The main strength of this format is its compactness; CSV files contain almost no overhead. The files are human readable with tools like Notepad (or any other text editor) or Excel. Data storage systems like SQL Server, many other RDBMSs and SSIS provide built-in functionality to write and read CSV files.

On the other hand, CSV files have some weaknesses. First, the files are not relational, which means you cannot specify links between different kinds of data. Sure, you can use distinct files for each object (or data) type and put everything together into a ZIP archive. Nevertheless, the files are not integrity checked. Another issue of CSV files is the missing search functionality. Microsoft's OLE DB Jet provider enables simple SQL queries on CSV files, but this is not a standard on other operating systems. Finally, CSV files are not type safe. Your third column might be specified as an amount of goods, but you can never be sure that the data in all rows is really stored in a decimal format, and even if it is, you need to know the formatting information of the data. E.g. an amount of "1,234.56" in German format is "1.234,56". With date/time formats it becomes even more funny ;-)
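To illustrate the formatting pitfall, here is a small C# snippet (my own illustration, not taken from any EDI tooling) which shows how the same decimal value is written and parsed differently depending on the culture.
using System;
using System.Globalization;

class CsvFormattingDemo {
   static void Main() {
      decimal amount = 1234.56m;

      // the same value renders differently per culture
      Console.WriteLine(amount.ToString(CultureInfo.GetCultureInfo("en-US"))); // 1234.56
      Console.WriteLine(amount.ToString(CultureInfo.GetCultureInfo("de-DE"))); // 1234,56

      // a CSV consumer has to know the producer's culture to parse correctly
      decimal parsed = decimal.Parse("1.234,56", CultureInfo.GetCultureInfo("de-DE"));
      Console.WriteLine(parsed); // 1234.56 (printed using the current culture)
   }
}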

XML

Over the last few years XML has become the most common format for almost any kind of data exchange. Why? Here is just a subset of reasons...

  • It supports references (links) between different objects. These can be direct parent-child relations or foreign-key like references to other XML objects within or even outside of the current XML file.
  • With XPath it supports a full featured XML query language.
  • Almost all of today's development kits provide built-in XML support.
  • XML files can be validated via XSDs for data types, required and optional object properties and required relations.
  • It is human readable with any kind of text editor, web browsers and many IDEs.

A common approach to using XML for data exchange is object serialization, and there are two generally different ways to handle XML serialization. In smaller projects it often makes sense to directly serialize the existing domain objects. In more complex domain models, serialization is often done by creating DTOs, since the domain model might be too complex and contain too much validation which might not be needed (or desired) during (de-)serialization. DTOs are simplified, coarse-grained versions of fine-grained domain objects.
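As a small illustration of the DTO approach (the CustomerDto type, its properties and the file name are made up for this example), the built-in XmlSerializer is enough to write such a coarse-grained object to XML.
using System.IO;
using System.Xml.Serialization;

public class CustomerDto {
   public int Id { get; set; }
   public string Name { get; set; }
   public int[] OrderIds { get; set; }   // foreign-key like references
}

class XmlDtoDemo {
   static void Main() {
      CustomerDto dto = new CustomerDto {
         Id = 1,
         Name = "Alfreds Futterkiste",
         OrderIds = new[] { 10643, 10692 }
      };

      XmlSerializer serializer = new XmlSerializer(typeof(CustomerDto));
      using (StreamWriter writer = new StreamWriter("customer.xml"))
         serializer.Serialize(writer, dto);
   }
}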

Sure, XML can also be used without any built-in serialization methods. Since this slightly eliminates the "easy to use" factor, this is usually only needed for very large amounts of data or very flexible requirements.

Despite all the strengths of XML, it also has some weaknesses which might be important for some requirements.

XML is a very verbose data format. However, this can be handled by using any kind of compression (like ZIP, GZip and so on).

Another weakness of XML is XPath. Don't get me wrong, XPath is a powerful query language and I (almost) always get what I want. Nevertheless, most of the time the original data stores are relational databases. When using XML you always have to either implement some kind of XPath based mapper (when working with plain XML) or an object based way to find all de-serialized objects and their relations (when working with serialization).
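For illustration, such a lookup with the built-in XmlDocument API might look like the following sketch (the file name and XML structure are assumptions); the relational equivalent would be a simple JOIN/WHERE in SQL.
using System;
using System.Xml;

class XPathLookupDemo {
   static void Main() {
      XmlDocument doc = new XmlDocument();
      doc.Load("customers.xml");

      // all orders belonging to customer "ALFKI"
      XmlNodeList orders = doc.SelectNodes("/Data/Orders/Order[@CustomerId='ALFKI']");
      Console.WriteLine("Found {0} orders", orders.Count);
   }
}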

A Relational EDI

Wouldn't it be cool to work with SQL when working with EDI, eh? In my opinion, yes! So, why not do this?

Did you ever think about using embedded databases (like SQL Server Compact, SQLite or others) as the interchange format? Sounds strange to you? It sounded strange to me, too. However, who cares? After several days of thinking about this approach, it appears to be a potentially very useful new kind of data exchange.

When you work with an O/R-Mapper you might even get this feature as a gift. Even if you don't work with an O/R-Mapper, most common SQL statements are supported by these embedded database systems. When working with SQL Server Compact you might think about using Microsoft's Sync Framework to load and update your data from your "database interchange".

Embedded databases support constraints like primary keys, foreign keys and indexes (and some support even more). Armed with these constraints, embedded databases ensure relational integrity. You can also be sure that all your data is stored in the correct format.
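As a rough sketch of how such a "database interchange" file could be created with SQL Server Compact (assuming the System.Data.SqlServerCe provider is referenced; the table names are just examples):
using System.Data.SqlServerCe;
using System.IO;

class ExchangeFileBuilder {
   public void Build(string fileName) {
      string connectionString = "Data Source=" + fileName;

      if (File.Exists(fileName))
         File.Delete(fileName);
      using (SqlCeEngine engine = new SqlCeEngine(connectionString))
         engine.CreateDatabase();

      using (SqlCeConnection cn = new SqlCeConnection(connectionString))
      using (SqlCeCommand cmd = cn.CreateCommand()) {
         cn.Open();
         // the constraints travel with the file, so the receiver gets
         // guaranteed relational integrity and correct data types
         cmd.CommandText = "CREATE TABLE Customers (Id int PRIMARY KEY, Name nvarchar(100))";
         cmd.ExecuteNonQuery();
         cmd.CommandText = "CREATE TABLE Orders (Id int PRIMARY KEY, CustomerId int, "
            + "CONSTRAINT FK_Orders_Customers FOREIGN KEY (CustomerId) REFERENCES Customers(Id))";
         cmd.ExecuteNonQuery();
      }
   }
}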

One potential issue of embedded databases is the file size. A new, empty SQL Server Compact database has an initial size of 20KB (for being empty :-D ). 20KB does not appear to be too much to me, but what about file growth when data is loaded? After this initial size it grows less than XML files do. I tried to create an XML file and a SQL Server Compact database each containing all Customers (91 rows) and Orders (921 rows) from the good old Northwind database. The XML file ended up at 437KB, the SQL Server Compact database ended up at 184KB (containing both primary keys and a foreign key from Orders to Customers). To be fair, I also compressed both resulting files with WinRAR. The XML file was compressed to 30KB, the database file was compressed to 31KB. Sure, "large" and "small" always depend on your environment, but in my opinion these sizes appear to be okay.

Is the missing human readability an issue of embedded databases? I'd say yes and no. You cannot open the files with a text editor. But you can use tools like SSMS (in case of SQL Server Compact) or any other database query tool. If you work with an O/R-Mapper or any other data access layer which supports different database systems, you can probably even use your tailor-made business application to browse the data within the "interchange".

Nevertheless, the most important issue of using embedded databases as EDI is the fact that it is not a standard at all today. For this reason, at the moment it should only be used for in-project synchronization between different stores.

Conclusion

If network traffic is one of the issues you fight with when exchanging data between systems, CSV is definitely the way to go.

If you need a standardized communication to any outside (of your project) module, XML is usually the way to go.

If you have to synchronize larger amounts of data between different parts of your own system, you might think about using embedded databases to do this job. At least give it a try or spend some hours thinking about it ;-).

Friday, November 27, 2009

What's the Problem with List<T> Properties in BLL?

Today I'll speak about a practice for business logic layers (BLL).

The .NET List<T> is a nice and really handy class for many purposes. However, this class is not the best solution for libraries like business logic layers (BLL) to provide related domain objects.

Below is a very simple part of a fictional business logic layer (BLL).
class Order {
   public int Id { get; set; }
   // ...
}

class Customer {
   public int Id { get; set; }
   public List<Order> Orders { get; private set; }
   // ...
   internal void SetOrders(List<Order> orders) {
      this.Orders = orders;
   }
   // ...
}
Using the List<Order> in this solution can cause some problems.

Lazy Loading

You should always try to load all the information needed for a specific business case. However, there are reasons why this sometimes makes no sense. For instance, you might not know which related data is needed when you start your business transaction. Another reason might be a very large number of related objects. Here, lazy loading is a nice pattern.

In our fictional layer there are cases when we need the orders of our customer, but there might be several other cases when we don't need the related orders. You can use a Builder to compose different customers for different business cases. However, sometimes the required relations to handle a business case are not clear when the business transaction starts.

Always loading too much information can become a performance problem. Lazy loading is a nice practice (when used correctly) to load related information only when it is needed. One way to handle lazy loading is AOP, but sometimes you can end up with ripple loading. For instance, if you have to handle a thousand customers and their orders it makes more sense to load all (or a page of 100) orders in one database query. In this case you might prefer a special implementation which handles this loading.
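A minimal sketch of such a special implementation might look like this (my own variation of the Customer class above; the loader delegate is an assumption, e.g. a repository or DAL call, and the related orders are only fetched on first access):
using System;
using System.Collections.Generic;

class Customer {
   private IList<Order> _orders;
   private readonly Func<int, IList<Order>> _orderLoader;

   public Customer(int id, Func<int, IList<Order>> orderLoader) {
      Id = id;
      _orderLoader = orderLoader;
   }

   public int Id { get; private set; }

   // the orders are loaded lazily, on first access only
   public IList<Order> Orders {
      get { return _orders ?? (_orders = _orderLoader(Id)); }
   }
}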

Validation

You can derive from List<T>, but it does not provide any way to validate the orders which are accessed, removed or added. You probably want to validate newly added orders to ensure they are valid and/or not yet attached to any customer. What about deletion? It should not be possible to delete an order which has already been billed. For these business cases you need an observable list or a read-only list.
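For illustration, an observable order list based on Collection<T> could hook the add and remove operations like this (my own sketch; the removal rule is passed in as a predicate, e.g. "not yet billed", since the sample Order class above carries no such flag):
using System;
using System.Collections.ObjectModel;

class OrderCollection : Collection<Order> {
   private readonly Predicate<Order> _isRemovable;

   public OrderCollection(Predicate<Order> isRemovable) {
      _isRemovable = isRemovable;
   }

   protected override void InsertItem(int index, Order item) {
      if (item == null)
         throw new ArgumentNullException("item");
      // further business validation (e.g. "not yet attached to a customer") goes here
      base.InsertItem(index, item);
   }

   protected override void RemoveItem(int index) {
      if (!_isRemovable(this[index]))
         throw new InvalidOperationException("This order must not be removed (e.g. it is already billed).");
      base.RemoveItem(index);
   }
}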

Presentation

This depends very much on how flexible your BLL needs to be, but think about your presentation layer. Suppose you want to be able to bind your orders list to a data-bindable control like the .NET DataGridView. List<T> does not provide all the required methods to enable add/remove handling for the grid.

Notice, I am not speaking about any coupling between the business and presentation layers, but it might be neat if your BLL provides some bindable objects which can be reused in WinForms, WPF and/or web applications.
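As a sketch (my own, yet another variation of the Customer class, using the standard BindingList<T> from System.ComponentModel), the property can stay loosely typed while the backing list raises the change notifications a grid needs:
using System.Collections.Generic;
using System.ComponentModel;

class Customer {
   // BindingList<T> implements IBindingList, so a bound DataGridView
   // picks up add/remove notifications at runtime
   private readonly BindingList<Order> _orders = new BindingList<Order>();

   public int Id { get; set; }

   // consumers still only see IList<Order>
   public IList<Order> Orders {
      get { return _orders; }
   }
}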

Custom Behavior

As a last reason for now (there might be more) why List<T> should not be used here: customization of behavior. Perhaps your list needs to be sorted by some order information. Maybe you need an observing list which maintains this sorting, even if you change one or more of the contained orders.

Better Solution

A better approach to provide related 0..* objects is usually the IList<T> interface. This interface provides (almost) all the methods which are available on List<T>, but it gives you a loosely coupled representation of the Orders property. Feel free to use List<T> for common initialization within your BLL. If you ever need special handling, you are still able to provide another list implementation.
class Customer {
   public int Id { get; set; }
   public IList<Order> Orders { get; private set; }
   // ...
   internal void SetOrders(IList<Order> orders) {
      this.Orders = orders;
   }
}
Another solution is to provide a completely custom list interface or base class (I always prefer interfaces). This also works fine and keeps you able to provide different lists with different behavior. However, this has a larger startup cost for your project. IMO IList<T> fits fine for most consumers of a BLL.

Monday, November 16, 2009

Table-Valued Parameters - A Performance Comparison

As I wrote in some of my previous posts, table-valued parameters - which were introduced with SQL Server 2008 - appear to be a great feature to enable client side bulk operations when working with SQL Server. They enable client data providers like ADO.NET to send a complete set of data to the server instead of calling several single row operations. Today we'll do some comparisons.

I'll start with a short description of the test environment. To get the best possible performance for both approaches I'll use stored procedures.

The first test results focus on the performance of single statements vs. bulk operations in a single client environment - meaning without any other concurrent activities. The second, more important tests cover a performance comparison in a multi-client scenario. The last part shows a comparison for some larger bulk operations.

Test Environment

If you have already read my previous posts you can skip this section. I try to keep this post decoupled from my previous posts since it covers another general topic (TVPs), so I have to repeat the test environment and the general way to use table-valued parameters from ADO.NET.

For my tests I used the following simple table to INSERT/UPDATE rows.

CREATE TABLE TestMany
(
   Id int IDENTITY(1,1) NOT NULL
      PRIMARY KEY CLUSTERED 
   ,SomeInt int NULL
   ,SomeDate datetime NULL
   ,SomeText varchar(100) NULL
);
Whenever we intend to use table-valued parameters in SQL Server 2008, we need a table-valued type defined in SQL Server which can be used as a collection of Data Transfer Objects (DTOs) to transfer .NET native data to SQL Server.
CREATE TYPE TestManyMergeType AS TABLE(
   Id int NOT NULL
   ,Action char(1) NOT NULL
      CHECK (Action IN ('I', 'D', 'U'))
   ,SomeInt int NULL
   ,SomeDate datetime NULL
   ,SomeText varchar(100) NULL
   
   ,PRIMARY KEY CLUSTERED 
      (Id, Action)
);
This type covers our destination table with some slight exceptions. Since we will use table-valued parameters to push all changed client data to SQL Server in one package, we need a column which tells SS2k8 how to handle each distinct row. Therefore, we use the Action column which specifies whether the current row is to be Inserted, Updated or Deleted. We don't use an IDENTITY specification for our "Id" column since we need to be able to specify a destination "Id" to be handled. To insert new rows we can use negative, unique "Id" values, which are required to map newly inserted rows on the client side to their new server ids. To get the best selectivity we use a primary key over the "Id" and "Action" columns.

To provide a stored procedure which gets a table-valued parameter, the parameter has to be specified as READONLY in SS2k8.

CREATE PROCEDURE MergeTestMany
   @data TestManyMergeType READONLY
AS
   SET NOCOUNT ON;
   
   -- update
   UPDATE trg SET 
         SomeInt = src.SomeInt
         ,SomeDate = src.SomeDate
         ,SomeText = src.SomeText
      FROM TestMany trg 
         JOIN @data src 
         ON trg.Id = src.Id AND src.Action = 'U';

   -- insert
   INSERT INTO TestMany (SomeInt, SomeDate, SomeText)
      SELECT SomeInt,SomeDate,SomeText 
      FROM @data 
      WHERE Action = 'I' 
      ORDER BY Id

   -- return last IDENTITY and count of added rows
   SELECT SCOPE_IDENTITY(), @@ROWCOUNT;
GO
The most important part of this procedure is the ORDER BY clause in the INSERT statement. It ensures that we can use the information about the last generated identity value (from SCOPE_IDENTITY()) and the count of inserted rows (from @@ROWCOUNT) to map the provided surrogate ids to the generated server ids. Maybe you think this might be dangerous without a MAXDOP option or anything else, and maybe you are right. I can only tell you that I tried different order conditions and the rows have always been inserted in exact order. So feel free to trust this or not. If you don't trust it (which is a good thing), do your own tests (which is a good thing, too), or just leave this post (which would be a really bad thing).

To provide a table-valued parameter to a stored procedure from ADO.NET you have two possibilities. You can either provide your data as a System.Data.DataTable (which appears to be too bloated just to provide simple data records, in my opinion) or as a custom implementation of IEnumerable<SqlDataRecord>. I'll use the enumerator in my tests since it is way faster and does not require copying all domain objects into a data table.

class TestManyEnumerator : IEnumerable<SqlDataRecord>
{
   public TestManyEnumerator(IEnumerable<TestMany> data) {
      _data = data;
   }

   private IEnumerable<TestMany> _data;

   public IEnumerator<SqlDataRecord> GetEnumerator() {
      SqlMetaData[] metaData = {
         new SqlMetaData("Id", SqlDbType.Int),
         new SqlMetaData("MergeAction", SqlDbType.Char, 1),
         new SqlMetaData("SomeInt", SqlDbType.Int),
         new SqlMetaData("SomeDate", SqlDbType.DateTime),
         new SqlMetaData("SomeText", SqlDbType.VarChar, 100),
      };

      foreach (var item in _data) {
         SqlDataRecord record = new SqlDataRecord(metaData);

         record.SetValue(0, item.Id);
         record.SetValue(1, item.ChangeState);
         record.SetValue(2, item.SomeInt);
         record.SetValue(3, item.SomeDate);
         record.SetValue(4, item.SomeText);

         yield return record;
      }
   }

   System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
      return this.GetEnumerator();
   }
}
This enumerator handles domain objects of type TestMany which map exactly to the table-valued T-SQL type specified above. ChangeState specifies the state ('I', 'D', 'U') of a particular object.

To call our procedure with a set of data to be updated we have to specify a SqlParameter with SqlDbType.Structured, the name of the destination table-valued type as TypeName and provide the previous enumerator as Value.

using (SqlConnection cn = GetConnection(true))
using (SqlTransaction tran = cn.BeginTransaction()) {
   // get test data
   IList<TestMany> data = GetTestData();

   // create a new command as stored procedure
   SqlCommand cmd = new SqlCommand("MergeTestMany", cn);
   cmd.CommandType = CommandType.StoredProcedure;
   cmd.Transaction = tran;
   // create a new parameter ("Structured" defines a TVP)
   SqlParameter p = cmd.Parameters.Add("@data", SqlDbType.Structured);
   // specify the name of the TVP
   p.TypeName = "TestManyMergeType";
   p.Value = new TestManyEnumerator(data);

   // catch the returned IDENTITY information
   using (SqlDataReader reader = cmd.ExecuteReader()) {
      reader.Read();
      int lastId = reader.IsDBNull(0) ? -1 : (int)reader.GetDecimal(0);
      int count = reader.GetInt32(1);
      // synchronize all newly added keys
      SyncAddedIds(data, count, lastId);
   }

   tran.Commit();
}
For a more detailed explanation and the implementation of GetTestData():IList<TestMany> and GetConnection(open):SqlConnection please have a look here. For a generic approach, have a look at A Generic BDP Approach for SQL Server 2008.

The procedures to insert and update single rows in our TestMany table are as simple as they can be.

CREATE PROCEDURE [dbo].[InsertOneTestMany]
   @id INT OUTPUT
   ,@someInt INT
   ,@someDate DATETIME
   ,@someText VARCHAR(100)
AS
   SET NOCOUNT ON;
   
   INSERT INTO TestMany (SomeInt, SomeDate, SomeText)
      SELECT @someInt, @someDate, @someText;
   
   SELECT @id = SCOPE_IDENTITY();
GO

CREATE PROCEDURE [dbo].[UpdateOneTestMany]
   @id INT
   ,@someInt INT
   ,@someDate DATETIME
   ,@someText VARCHAR(100)
AS
   SET NOCOUNT ON;
   
   UPDATE TestMany SET 
         SomeInt = @someInt
         ,SomeDate = @someDate
         ,SomeText = @someText
      WHERE Id = @id;
GO
I think there is nothing which needs to be explained.

We're done with our test environment.

Not part of this Post

Calling a single statement procedure from .NET is straightforward and part of many tutorials, therefore I left that source code out. I also don't show how to use multi-threading to call the above procedures in a concurrency scenario.

I kept DELETE operations out of the party since they are often done by special purge processes instead of usual business applications.

Since I don't have a home network, I did my tests with a SQL Server running on a VM on my local host. Therefore I cannot show you any comparison differences depending on network latency.

Single Client Tests

Let's compare the performance of single INSERT/UPDATE statements vs. bulk operations when using a single client connection. To get (almost) linear test results I executed each test configuration 1,000 times. One important thing is that I always called the specified count of INSERT and UPDATE operations in one round trip. So "Count Insert/Update" with value "1" calls one INSERT and one UPDATE operation, and a count of "50" affects a hundred rows. I did my tests with a destination table with just 100,000 rows and with a table with 10,000,000 rows. Both tables have been recreated before each test configuration and flushed by calling "DBCC DROPCLEANBUFFERS() WITH NO_INFOMSGS;". To ensure as many index read operations as possible I spread all operations over the complete id range of the table.
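The measurement harness itself is not shown here; as a rough sketch (ExecuteSingleStatements and ExecuteBulkOperation are assumed helpers which call the procedures above), one test configuration could be timed like this:
using System;
using System.Diagnostics;

class TestRunner {
   public void Run(int insertUpdateCount) {
      const int executions = 1000;

      Stopwatch watch = Stopwatch.StartNew();
      for (int i = 0; i < executions; i++)
         ExecuteSingleStatements(insertUpdateCount);   // InsertOneTestMany/UpdateOneTestMany
      Console.WriteLine("Single statements: {0} ms", watch.ElapsedMilliseconds);

      watch = Stopwatch.StartNew();
      for (int i = 0; i < executions; i++)
         ExecuteBulkOperation(insertUpdateCount);      // MergeTestMany with a TVP
      Console.WriteLine("Bulk operation: {0} ms", watch.ElapsedMilliseconds);
   }

   // assumed helpers, not shown in this post
   void ExecuteSingleStatements(int count) { }
   void ExecuteBulkOperation(int count) { }
}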

Duration on a destination table with 100,000 rows after 1,000 executions.

Count (Insert/Update) | Single Statements | Bulk Operation
1                     | 1,299 ms          | 3,285 ms
3                     | 2,319 ms          | 2,995 ms
5                     | 3,201 ms          | 3,007 ms
10                    | 5,553 ms          | 3,121 ms
20                    | 11,923 ms         | 3,449 ms
50                    | 27,644 ms         | 4,769 ms
Duration on a destination table with 10,000,000 rows after 1,000 executions.

Count (Insert/Update) | Single Statements | Bulk Operation
1                     | 2,631 ms          | 4,076 ms
3                     | 3,614 ms          | 4,200 ms
5                     | 4,608 ms          | 4,470 ms
10                    | 7,541 ms          | 4,381 ms
20                    | 12,308 ms         | 4,846 ms
50                    | 26,173 ms         | 6,336 ms
As you can see, there is a break-even point at about five INSERT/UPDATE operations (which means 10 rows!) in one single transaction. With fewer than about ten changed rows on the same table, single statements are faster than bulk operations with table-valued parameters. Once we hit this count of rows to be changed, table-valued parameters become way faster than classic single row operations.

Growing the workload from one to fifty INSERT and UPDATE tasks shows that single row operations take only about 10 times longer (which is a good thing), while the duration of bulk operations increases only by about a factor of 1.5 (which appears to be an even better thing ;-) ).

Multi-Client Tests

As initially written, SS2k8 is a server, and no geeky feature is worth anything if it only works in a single-client scenario. Therefore, we'll now have a look at some multi-client tests.

Since concurrent executions take more time I ran the following tests only 100 times. I also changed the count of rows in the destination table to 500,000, which was a convenient value to spread the threads over the whole table (to get the highest count of reads).

The following results show the overall duration of 100 executions.

Count (Insert/Update) | Single Statements | Bulk Operation

1 Thread
1                     | 159 ms            | 298 ms
3                     | 237 ms            | 358 ms
5                     | 344 ms            | 371 ms
10                    | 562 ms            | 391 ms
20                    | 1,040 ms          | 427 ms
50                    | 2,457 ms          | 558 ms

2 Threads
1                     | 387 ms            | 348 ms
3                     | 679 ms            | 368 ms
5                     | 854 ms            | 438 ms
10                    | 1,597 ms          | 421 ms
20                    | 3,513 ms          | 575 ms
50                    | 6,839 ms          | 830 ms

5 Threads
1                     | 1,466 ms          | 936 ms
3                     | 2,528 ms          | 943 ms
5                     | 3,542 ms          | 1,007 ms
10                    | 5,837 ms          | 1,097 ms
20                    | 10,995 ms         | 1,293 ms
50                    | 26,092 ms         | 2,251 ms

10 Threads
1                     | 4,820 ms          | 1,604 ms
3                     | 13,290 ms         | 1,936 ms
5                     | 12,624 ms         | 1,838 ms
10                    | 28,829 ms         | 2,012 ms
20                    | 45,414 ms         | 2,789 ms
50                    | 97,498 ms         | 4,610 ms
Again, remember that a count of "1" INSERT/UPDATE means two operations within one transaction. Since single row operations usually scale almost linearly, they are still the faster solution for single operations. However, as we see, bulk operations with table-valued parameters perform really well in a multi-threaded environment.

Some Larger Bulks

Okay, we are speaking about bulk operations, so finally let's do some real bulk operations.

The following results show the complete duration of 10 executions.

Count (Insert/Update) | Single Statements | Bulk Operation
1,000                 | 4,845 ms          | 370 ms
5,000                 | 23,190 ms         | 2,033 ms
10,000                | 46,539 ms         | 4,854 ms
20,000                | 95,086 ms         | 7,209 ms
In my opinion, 400,000 changed rows (remember we did 20,000 INSERT and 20,000 UPDATE operations, executed 10 times!) in 7 seconds appears to be really great performance. ;-)

Conclusion

Apparently table-valued parameters can be a great feature for systems which work with bulk data manipulation. This might be web or rich client applications, but especially backend processes like data loaders or synchronization processes.

What does this mean? Probably nothing as a general statement. If you think bulk operations with table-valued parameters might be a nice feature for your project, always do your own tests! We did no tests over a real network and we did not test DELETE operations.

As a general piece of advice, always try to stay innovative but validate everything. My intention was not to show you that you have to change all your running code, but to make you think about some new features which might be helpful.

Saturday, November 14, 2009

A Generic BDP Approach for SQL Server 2008

As I promised in my previous post (SQL Server and Bulk Data Persistence (BDP)), today I'll show a possible generic approach for integrating Bulk Data Persistence without writing a new enumerator for each domain object. In addition, we will encapsulate the ID synchronization into a small framework.

If you have not yet read the previous posts, please read them before you go on here. From here on I assume you know the idea of BDP and how to handle it with SQL Server 2008.

Our Task List

Let's reflect the steps of the previous post to figure out all tasks which have to be done:
  • We have to provide SqlMetaData objects to map our domain object properties
  • We have to provide an enumerator through the objects and create a SqlDataRecord
  • We have to call some T-SQL to do the persistence
  • We have to map all the new server ids to our domain objects
Sounds really complicated? It's not as hard as it sounds. It's more complicated than a command or simple strategy pattern but not much more than a usual Unit of Work pattern.

ObjectState Enumeration

Before we start with our tasks, we have to fix a dirty non-OO cheat we used previously. We replace the "pseudo enum" that represents the state of a domain object with a real ObjectState enum.
public enum ObjectState
{
   Added,
   Deleted,
   Changed,
   Unchanged,
}
Okay, back to our task list...

Column Mapping

The next thing we need is a general class to map the properties of a domain object into a SqlMetaData object. Below we can see a possible SqlMergeColumn<T> class which holds all the information that is generally needed to create the meta data objects.
class SqlMergeColumn<T>
{
   internal SqlMergeColumn(string columnName, SqlDbType sqlType,
                           Func<T, object> getData) {
      ColumnName = columnName;
      SqlType = sqlType;
      _getData = getData;
   }

   // callback to get the value from a current item
   Func<T, object> _getData;
   
   public string ColumnName { get; private set; }
   public SqlDbType SqlType { get; private set; }
   public int MaxLength { get; internal set; }
   public byte Precision { get; internal set; }
   public byte Scale { get; internal set; }

   public virtual object GetData(T item) {
      // get the field value or DBNull
      return _getData(item) ?? DBNull.Value;
   }

   public SqlMetaData GetMetaData() {
      // get meta data for this property mapping
      if (MaxLength != 0)
         return new SqlMetaData(ColumnName, SqlType, MaxLength);
      if (Precision != 0)
         return new SqlMetaData(ColumnName, SqlType, Precision, Scale);
      return new SqlMetaData(ColumnName, SqlType);
   }
}
As you see, the larger part of the class is a simple representation of the properties which are needed to create a SqlMetaData object. The MaxLength property is needed for data types with variable length like text or binary data types. The Precision and Scale properties are required for some numeric types like decimal.

The GetData(T):object method takes an object of the class's generic type and returns the value of the mapped property by invoking the Func<T, object> delegate which was provided to the constructor of the class. (I always try to avoid reflection if possible, since it is quite slow and always hard to maintain.)

The last method GetMetaData():SqlMetaData is a typical strategy method which returns the meta data for a SqlDataRecord enumerator depending on the current configuration of an instance of the class.

We have to handle two special column types. First, we need the identity column to be able to join data for UPDATE and DELETE and to synchronize the server identity values with our newly added objects. Second, we need a column which can tell us the type of the BDP action (add, change, delete) to be executed. To stay OO geeky we use two additional classes which inherit from SqlMergeColumn<T>.
// represents an identity merge column
sealed class SqlMergeIdColumn<T> : SqlMergeColumn<T>
{
   public SqlMergeIdColumn(string columnName, Func<T, int> getIdData, 
                           Action<T, int> setIdData)
      : base(columnName, SqlDbType.Int, (t) => getIdData(t)) {
      _setIdData = setIdData;
   }

   // private fields
   Action<T, int> _setIdData;

   public int GetIdData(T item) {
      return (int)base.GetData(item);
   }

   public void SetIdData(T item, int id) {
      _setIdData(item, id);
   }
}

// represents a merge action column
sealed class SqlMergeActionColumn<T> : SqlMergeColumn<T>
{
   public SqlMergeActionColumn(string columnName, 
                               Func<T, ObjectState> getMergeAction)
      : base(columnName, SqlDbType.Char, (t) => getMergeAction(t)) {
      MaxLength = 1;
   }

   public override object GetData(T item) {
      switch (GetMergeAction(item)) {
         case ObjectState.Added: return 'I';
         case ObjectState.Changed: return 'U';
         case ObjectState.Deleted: return 'D';
         default: return 'X';
      }
   }

   public ObjectState GetMergeAction(T item) {
      return (ObjectState)base.GetData(item);
   }
}
The SqlMergeIdColumn<T> class provides two additional methods: GetIdData(T):int to return a System.Int32 id value and SetIdData(T,int) to set a new id value after our database operation.

The SqlMergeActionColumn<T> class overrides the GetData(T):object method to transform the ObjectState enum into a System.Char for our data mapping.

GetMergeAction(T):ObjectState returns the enum value of a specified domain object.

Now, since we have column classes to specify a mapping to a table-valued parameter, we need a collection to store those mappings. In this sample I use a collection inherited from System.Collections.ObjectModel.Collection<T>.
// represents a collection of SqlMergeColumn<T> objects
sealed class SqlMergeColumnCollection<T>
   : System.Collections.ObjectModel.Collection<SqlMergeColumn<T>>
{
   // add a data column
   public void AddDataColumn(string columnName, SqlDbType sqlType,
                             Func<T, object> getData) {
      this.AddDataColumn(columnName, sqlType, 0, getData);
   }

   // add a data column
   public void AddDataColumn(string columnName, SqlDbType sqlType,
                             int maxLength, Func<T, object> getData) {
      SqlMergeColumn<T> mapping = new SqlMergeColumn<T>(
         columnName, sqlType, getData);
      mapping.MaxLength = maxLength;
      base.Add(mapping);
   }

   // add a data column
   public void AddDataColumn(string columnName, SqlDbType sqlType,
                             byte precision, byte scale, 
                             Func<T, object> getData) {
      SqlMergeColumn<T> mapping = 
         new SqlMergeColumn<T>(columnName, sqlType, getData);
      mapping.Precision = precision;
      mapping.Scale = scale;
      base.Add(mapping);
   }

   // set the id column
   public void SetIdColumn(string columnName, 
                           Func<T, int> getIdData,
                           Action<T, int> setIdData) {
      base.Add(new SqlMergeIdColumn<T>(columnName, getIdData, setIdData));
   }

   // set the merge action column
   public void SetMergeColumn(string columnName, 
                              Func<T, ObjectState> getMergeAction) {
      base.Add(new SqlMergeActionColumn<T>(columnName, getMergeAction));
   }

   // get the id column
   internal SqlMergeIdColumn<T> GetIdColumn() {
      return this.OfType<SqlMergeIdColumn<T>>().First();
   }

   // get the merge column
   internal SqlMergeActionColumn<T> GetMergeColumn() {
      return this.OfType<SqlMergeActionColumn<T>>().First();
   }

   // get only the data columns
   internal IEnumerable<SqlMergeColumn<T>> GetDataColumns() {
      return this.Where(c => c.GetType() == typeof(SqlMergeColumn<T>));
   }
}
We've got three different overloads of AddDataColumn to add ordinary data column mappings. In addition, we've got two methods (SetIdColumn and SetMergeColumn) to specify the special id and merge-action columns. Finally, we have three methods (GetIdColumn, GetMergeColumn and GetDataColumns) to access the different column types. I marked those methods as internal since they are only needed within the BDP framework and are usually not important for a consumer of the framework.

Note, this framework is just a sample implementation. To keep it simple, I didn't add the possibility to specify the order of the columns. This means that currently all columns have to be specified in the same order as defined in the table-valued type. Feel free to add an order parameter to all methods if you need this.

Data Enumerator

As we know, we have to provide an IEnumerable<SqlDataRecord> as the value of a .NET SqlParameter which represents a table-valued parameter. The following listing shows a generic implementation of this class.
sealed class Sql2k8MergeEnumerable<T> : IEnumerable<SqlDataRecord>
{
   // generic class to handle field access by lambda expression
   // create a new enumerator
   internal Sql2k8MergeEnumerable(IEnumerable<T> data, 
                                  SqlMergeColumnCollection<T> mappings) {
      _data = data;
      _mappings = mappings;
   }

   // private fields
   IEnumerable<T> _data;
   SqlMergeColumnCollection<T> _mappings;

   // the enumerator which will be called from ADO.NET for a SqlParameter value
   public IEnumerator<SqlDataRecord> GetEnumerator() {
      // get all SqlMetaData for specified column mappings
      SqlMetaData[] metaData =
        _mappings.Select(mapping => mapping.GetMetaData()).ToArray();

      foreach (var item in _data) {
         // create a new record
         SqlDataRecord record = new SqlDataRecord(metaData);
         // fill the record with its values
         for (int i = 0; i < _mappings.Count; i++)
            record.SetValue(i, _mappings[i].GetData(item));
         // return the current record
         yield return record;
      }
   }

   IEnumerator IEnumerable.GetEnumerator() {
      return this.GetEnumerator();
   }
}
The class receives the data to iterate through and an instance of the merge column collection we implemented above.

The GetEnumerator():IEnumerator<SqlDataRecord> method utilizes the column mappings collection to create the meta data objects we need for a data record. After this it iterates through all data and returns the data records. To fill the records it uses the GetData method we already defined on our column classes.

Merge Command

A quick look back at our task list shows two missing tasks. We need something to generate the persisting T-SQL and we have to sync the client surrogate ids to the server-side generated identity values. (I added both functionalities to one class; feel free to change this for lower coupling.)
class Sql2k8MergeCommand<T>
{
   public Sql2k8MergeCommand(string tableName, string tableTypeName, 
                             IEnumerable<T> data) {
      _tableName = tableName;
      _tableTypeName = tableTypeName;
      _data = data;
      Columns = new SqlMergeColumnCollection<T>();
   }

   // private fields
   string _tableName;
   string _tableTypeName;
   IEnumerable<T> _data;

   public SqlMergeColumnCollection<T> Columns { get; private set; }
   public SqlConnection Connection { get; set; }
   public SqlTransaction Transaction { get; set; }

   // execute the merge command
   public void Execute() {
      // create an SqlCommand
      SqlCommand cmd = new SqlCommand();
      cmd.Connection = Connection;
      cmd.Transaction = Transaction;
      // get the dynamic SQL
      cmd.CommandText = GetCommandText();
      // create a parameter which gets all data as table-valued parameter
      SqlParameter data = cmd.Parameters.Add("@data", SqlDbType.Structured);
      data.TypeName = _tableTypeName;
      data.Value = new Sql2k8MergeEnumerable<T>(_data, Columns);

      // execute the command and read the IDENTITY and INSERT row count
      using (SqlDataReader reader = cmd.ExecuteReader()) {
         SyncSurrogateIds(reader);
      }
   }

   // get the sql command text to be executed
   private string GetCommandText() {
      StringBuilder update = new StringBuilder(0x100);
      StringBuilder insert = new StringBuilder(0x100);

      // build columns for INSERT and UPDATE
      foreach (var column in Columns.GetDataColumns()) {
         insert.AppendFormat("{0},", column.ColumnName);
         update.AppendFormat("trg.{0}=src.{0},", column.ColumnName);
      }
      // remove the trailing ","
      insert.Remove(insert.Length - 1, 1);
      update.Remove(update.Length - 1, 1);

      string id = Columns.GetIdColumn().ColumnName;
      string merge = Columns.GetMergeColumn().ColumnName;
      StringWriter sql = new StringWriter(new StringBuilder(0x400));

      // delete
      sql.WriteLine("DELETE trg FROM {0} trg WHERE EXISTS (", _tableName);
      sql.WriteLine("   SELECT * FROM @data src");
      sql.WriteLine("   WHERE src.{0} = trg.{0} AND src.{1} = 'D');", 
         id, merge);
      // update
      sql.WriteLine("UPDATE trg SET {0} FROM {1} trg", update, _tableName);
      sql.WriteLine("   JOIN @data src ON trg.{0} = src.{0} AND src.{1} = 'U';",
         id, merge);
      // insert
      sql.WriteLine("INSERT INTO {0} ({1}) SELECT {1} FROM @data", 
         _tableName, insert);
      sql.WriteLine("   WHERE {0} = 'I' ORDER BY {1};", merge, id);
      // return identity information
      sql.WriteLine("SELECT SCOPE_IDENTITY(), @@ROWCOUNT;");

      return sql.ToString();
   }

   private void SyncSurrogateIds(SqlDataReader reader) {
      int lastId;
      int count;

      // get identity information from reader
      reader.Read();
      lastId = reader.IsDBNull(0) ? -1 : (int)reader.GetDecimal(0);
      count = reader.GetInt32(1);
      // no new rows added
      if (count == 0)
         return;
      
      SqlMergeActionColumn<T> mergeCol = Columns.GetMergeColumn();
      SqlMergeIdColumn<T> idCol = Columns.GetIdColumn();
      // get client surrogate ids
      SortedList<int, T> idLookup = new SortedList<int, T>(count);
      foreach (var item in _data)
         if (mergeCol.GetMergeAction(item) == ObjectState.Added)
            idLookup.Add(idCol.GetIdData(item), item);

      // map client ids to server ids
      IList<int> clientIds = new List<int>(idLookup.Keys);
      foreach (var clientId in clientIds)
         idCol.SetIdData(idLookup[clientId], lastId - --count);
   }
}
The merge command class orchestrates all other classes and represents the consumer interface for this simple Bulk Data Persistence framework.

The constructor gets the name of the database table, the name of the table-valued type and the data to be persisted.

The properties Connection and Transaction are used to configure the destination database.

The Columns:SqlMergeColumnCollection<T> property enables the consumer to configure the mapping between a domain object and a database table.

The Execute() method performs the database operation. It creates an ADO.NET SqlCommand, specifies the dynamic T-SQL by calling the GetCommandText():string method, executes the command and utilizes the SyncSurrogateIds(SqlDataReader) method to synchronize the ids of added rows.

As I wrote in my previous post, the SQL Server 2008 MERGE statement probably performs better than classic separate INSERT/UPDATE/DELETE operations. I still have to investigate this and will share my results in a separate post.

How to use the Framework

Okay. Now we've got a shiny new framework. How do we use it?

I recycle the test environment we've already seen in the previous post to persist TestMany objects.

Below is the T-SQL DDL which specifies the destination table and the table-valued type DTO.
-- the database table
CREATE TABLE TestMany (
   Id int NOT NULL IDENTITY(1,1)
      PRIMARY KEY CLUSTERED
   ,SomeInt int NULL
   ,SomeDate datetime NULL
   ,SomeText varchar(100) NULL
);

-- the table-valued type DTO
CREATE TYPE TestManyMergeType AS TABLE(
   Id int NOT NULL
   ,Action char(1) NOT NULL
      CHECK (Action IN ('D', 'U', 'I'))
   ,SomeInt int NULL
   ,SomeDate datetime NULL
   ,SomeText varchar(100) NULL
   
   ,PRIMARY KEY CLUSTERED 
      (Id, Action)
);
And our domain object which maps to the database table.
public class TestMany
{
   public TestMany(int id, int someInt, DateTime someDate,
                   string someText, ObjectState changeState) {
      Id = id;
      SomeInt = someInt;
      SomeDate = someDate;
      SomeText = someText;
      ChangeState = changeState;
   }

   public int Id { get; set; }
   public int SomeInt { get; set; }
   public DateTime SomeDate { get; set; }
   public string SomeText { get; set; }
   public ObjectState ChangeState { get; set; }
}
Anyway, what do we need to consume our framework?

We need one class which inherits from our Sql2k8MergeCommand<T> class to specify the column mapping.
class TestManyMergeCommand : Sql2k8MergeCommand<TestMany>
{
   public TestManyMergeCommand(IEnumerable<TestMany> data)
      : base("TestMany", "TestManyMergeType", 
             data.Where(t => t.ChangeState != ObjectState.Unchanged)) {
      // map the identity column
      Columns.SetIdColumn("Id", (t) => t.Id, (t, id) => t.Id = id);
      // map the merge action column
      Columns.SetMergeColumn("Action", (t) => t.ChangeState);
      // map the data columns
      Columns.AddDataColumn("SomeInt", SqlDbType.Int, 
                            (t) => t.SomeInt);
      Columns.AddDataColumn("SomeDate", SqlDbType.DateTime, 
                            (t) => t.SomeDate);
      Columns.AddDataColumn("SomeText", SqlDbType.VarChar, 100, 
                            (t) => t.SomeText);
   }
}
The class calls the base constructor to specify the destination table and the name of the table-valued type. The body of the constructor does the mapping. Done.

(At this point I have to say "Thank god for lambda expressions" for providing the callback methods. :-P )

Last but not least, a sample of how to use this class to persist data.
public void GenericMergeTest() {
   using (SqlConnection cn = GetConnection(true))
   using (SqlTransaction tran = cn.BeginTransaction()) {
      // get test data
      IList<TestMany> data = GetTestData(0);

      TestManyMergeCommand mergeCommand = new TestManyMergeCommand(data);
      mergeCommand.Connection = cn;
      mergeCommand.Transaction = tran;
      mergeCommand.Execute();

      tran.Commit();
   }
}
As you see, we just have to initialize our custom command with data to be persisted, specify the database information and execute the command.

Conclusion

As you see, the complete framework takes about 350 lines of code, which sounds quite fair to me.

The merge command class can be plugged into any existing O/R-Mapper framework which provides "Unit of Work" functionality and a method/event to catch the moment when data is persisted. If you want to use Bulk Data Persistence you don't have to rewrite your existing persistence layer.
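As a hedged sketch of what such a plug-in point could look like (IUnitOfWork, its Flushing event and FlushingEventArgs are hypothetical names, since this post does not target a concrete O/R-Mapper):
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

interface IUnitOfWork {
   event EventHandler<FlushingEventArgs> Flushing;
}

class FlushingEventArgs : EventArgs {
   public IEnumerable<TestMany> ChangedObjects { get; set; }
   public SqlConnection Connection { get; set; }
   public SqlTransaction Transaction { get; set; }
}

static class BulkPersistencePlugin {
   public static void Attach(IUnitOfWork unitOfWork) {
      unitOfWork.Flushing += (sender, e) => {
         // replace the default single-row persistence with one bulk round trip
         TestManyMergeCommand command = new TestManyMergeCommand(e.ChangedObjects);
         command.Connection = e.Connection;
         command.Transaction = e.Transaction;
         command.Execute();
      };
   }
}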

Note, I did not add any error handling to this sample. If you want to use this in a production environment you have to add it (e.g. to avoid duplicate column names in the mapping).

If you already use an O/R-Mapper, you should change the way the properties are mapped to columns. In that case you don't need a custom command for each of your domain objects, but you do need a way to map the meta data of your O/R-Mapper to a merge command.

Tuesday, November 10, 2009

SQL Server and Bulk Data Persistence

In my last post I presented Bulk Data Persistence (BDP) as a general concept to send bulks of data from a client to a database server.

Today I'll show you some ways to use BDP with SQL Server.

Sample Environment

To avoid duplicate source code I'll start with a short description of the environment we will work with.

The database contains one table, "TestMany", with about 1,000,000 rows. The following DDL statement describes the structure of the "TestMany" table.
CREATE TABLE dbo.TestMany (
   Id int IDENTITY(1,1) NOT NULL
      PRIMARY KEY CLUSTERED
   ,SomeInt int NULL
   ,SomeDate datetime NULL
   ,SomeText varchar(100) NULL
);
We use a domain object of type TestMany which is a 1:1 mapping to our database table, with an additional property to represent the state of the object.
public class TestMany
{
   public TestMany(int id, int someInt, DateTime someDate, 
                   string someText, char changeState) {
      Id = id;
      SomeInt = someInt;
      SomeDate = someDate;
      SomeText = someText;
      ChangeState = changeState;
   }

   public int Id { get; set; }
   public int SomeInt { get; set; }
   public DateTime SomeDate { get; set; }
   public string SomeText { get; set; }
   public char ChangeState { get; set; }
}
Most likely you noticed the type char of the ChangeState property. Usually the change state of an object should be represented as an enum; I chose char just to keep the following samples simple. To maintain the value of ChangeState I'll use the following pseudo enum.
public static class ObjectState
{
   public const char Added = 'I';
   public const char Deleted = 'D';
   public const char Changed = 'U';
   public const char Unchanged = 'X';
}
Since we'll send a complete package of data to the database we have to tell it how to handle specific rows. Therefore I'll use the simple characters "I" (for INSERT), "U" (for UPDATE) and "D" (for DELETE). The Unchanged state is not really needed in our samples but is added for completeness.

To avoid writing a connection string over and over again I use a simple helper method which gets a connection string from app.config and opens the connection automatically.
private SqlConnection GetConnection(bool open) {
   SqlConnection cn;
   cn = new SqlConnection(Properties.Settings.Default.SandboxOnSql2k801);
   if (open)
      cn.Open();
   return cn;
}
To create TestMany objects to be persisted we'll use the following GetTestData() method.
private IList<TestMany> GetTestData() {
   List<TestMany> data = new List<TestMany>();

   // to update
   for (int i = 1; i <= 1000; i++)
      data.Add(new TestMany(
         i, 
         i, 
         DateTime.Now.AddSeconds(i),
         string.Format("Updated {0}", i),
         ObjectState.Changed));

   // to delete
   for (int i = 20000; i <= 21000; i++)
      data.Add(new TestMany(
         i, 
         0, 
         DateTime.Now,
         null,
         ObjectState.Deleted));

   // to insert
   for (int i = -1; i >= -1000; i--)
      data.Add(new TestMany(
         i, 
         i, 
         DateTime.Now.AddSeconds(i),
         string.Format("Added {0}", i),
         ObjectState.Added));

   return data;
}
Important: To get BDP to work we need a unique surrogate id for each new object; this id is sent to the database and mapped back to the newly generated id on the client. In the method above I use negative integer numbers for this.

Since we will persist a complete bulk of objects in one database round trip, we need a slightly different way to get the server-side generated IDENTITY values for newly inserted objects. When persisting single objects we can use the SCOPE_IDENTITY() (or @@IDENTITY) function to get the id of the last inserted row. With BDP we send all new rows in one operation, so we have to synchronize all new ids after this operation. The following method gets all data which has been handled in a BDP operation, the count of inserted rows (returned by the database) and the id of the last inserted row (returned by the database). Later in this post we'll see how we get this information and how to keep it valid. For the moment I just want to show you how to synchronize the client ids with the specified last id and the count of added objects. It probably makes sense to skip over this method for now and come back after we have done the first BDP operations.
private void SyncAddedIds(IEnumerable<TestMany> data, int count, int lastId) {

   // get surrogate client ids for new added items
   SortedList<int, TestMany> idLookup = new SortedList<int, TestMany>(count);
   foreach (var item in data)
      if (item.ChangeState == ObjectState.Added)
         idLookup.Add(item.Id, item);

   // map client ids to server ids
   IList<int> clientIds = new List<int>(idLookup.Keys);
   foreach (var clientId in clientIds)
      idLookup[clientId].Id = lastId - --count;
}

SQL Server 2008 Configuration

To use BDP with SQL Server 2008, we need to define table-valued types for the data to be persisted in bulk. We will use these types as simple Data Transfer Objects (DTOs) to transfer structured data from the client to the database.

Notice: We are not required to use stored procedures for BDP. We will see a sample using an SP, but we will also see another sample which works without any procedure.

The table-valued type DTO for our TestMany class and table looks just like our class, plus a column Action which represents the kind of operation to be executed for each row.
CREATE TYPE TestManyMergeType AS TABLE (
   Id int NOT NULL
   ,Action char(1) NOT NULL
      CHECK (Action IN ('D', 'U', 'I'))
   ,SomeInt int NULL
   ,SomeDate datetime NULL
   ,SomeText varchar(100) NULL
   ,PRIMARY KEY CLUSTERED (Id, Action)
);
The PRIMARY KEY on the Id and Action columns is important to ensure integrity and optimizes the lookup performance when we merge data. I also defined a CHECK constraint for the "Action" column for my personal paranoia :-) - to avoid wrong data.

As a little hint: Keep all lookup columns (Id and Action in this case) at the beginning of the table. Try to avoid placing them at the end, especially after any columns with variable length (like SomeText). This might be another paranoia statement in the case of our DTO, but it is important with real tables.

With this type our database is ready to handle BDP operations. Let's start with some samples.

SQL Server 2008 and Stored Procedures

We start with the stored procedure sample because it separates the two parts of the party and makes it easier to discuss each of them separately.

Let's start with the stored procedure definition.
CREATE PROCEDURE MergeTestMany
   @data TestManyMergeType READONLY
AS
   -- delete
   DELETE trg 
      FROM TestMany trg 
      WHERE EXISTS (
         SELECT * 
         FROM @data src 
         WHERE src.Id = trg.Id AND src.Action = 'D'
         );

   -- update
   UPDATE trg SET 
         SomeInt = src.SomeInt
         ,SomeDate = src.SomeDate
         ,SomeText = src.SomeText
      FROM TestMany trg 
         JOIN @data src 
         ON trg.Id = src.Id AND src.Action = 'U';

   -- insert
   INSERT INTO TestMany (SomeInt, SomeDate, SomeText)
      SELECT SomeInt,SomeDate,SomeText 
      FROM @data 
      WHERE Action = 'I' 
      ORDER BY Id -- ## IMPORTANT to match ids in client

   -- return last IDENTITY and count of added rows
   SELECT SCOPE_IDENTITY(), @@ROWCOUNT;
The procedure gets one table-valued parameter with the melodic name @data (feel free to change this to a more specific name). Table-valued parameters are a new and very powerful feature of SQL Server 2008 to send a complete set of data to the server. The procedure body consists of four separate parts.

First we use a DELETE statement to delete all provided data which is marked with a "D" action and matched by the specified Id. (At this point I have to say thanks to Paul White - a real SQL guru - from SQLServerCentral.com for his explanation of the left semi join when using EXISTS here.)

The second part of the procedure uses a joined UPDATE to update all existing data marked with a "U" action with their new values.

The third part simply inserts all provided DTOs which are marked with "I". Spot the ORDER BY on the source data. This ordering is very important to ensure the rows are inserted in the order of the surrogate ids specified by the client, so the client application can calculate the generated server IDENTITY values.

Last but not least, we return the last generated IDENTITY value and the count of rows which have been newly inserted into our table. Feel free to scroll up to the SyncAddedIds method, which should be more comprehensible now; this is the trickiest and most important part of getting a well working bulk insert operation.

Let's move to the client side implementation.

To provide a table-valued parameter from ADO.NET to SQL Server 2008, there are two different ways. You can provide a DataTable or an implementation of IEnumerable<Microsoft.SqlServer.Server.SqlDataRecord> which returns all data to be transferred to the parameter. Personally, I prefer the enumerator since a DataTable always represents unneeded overhead and requires relatively much memory.

The following listing shows a quite straightforward implementation of this enumerable for our TestMany domain object.
class TestManyEnumerator : IEnumerable<SqlDataRecord>
{
   public TestManyEnumerator(IEnumerable<TestMany> data) {
      _data = data;
   }

   private IEnumerable<TestMany> _data;

   public IEnumerator<SqlDataRecord> GetEnumerator() {
      SqlMetaData[] metaData = {
         new SqlMetaData("Id", SqlDbType.Int),
         new SqlMetaData("MergeAction", SqlDbType.Char, 1),
         new SqlMetaData("SomeInt", SqlDbType.Int),
         new SqlMetaData("SomeDate", SqlDbType.DateTime),
         new SqlMetaData("SomeText", SqlDbType.VarChar, 100),
      };

      foreach (var item in _data) {
         SqlDataRecord record = new SqlDataRecord(metaData);

         record.SetValue(0, item.Id);
         record.SetValue(1, item.ChangeState);
         record.SetValue(2, item.SomeInt);
         record.SetValue(3, item.SomeDate);
         record.SetValue(4, item.SomeText);

         yield return record;
      }
   }

   System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
      return this.GetEnumerator();
   }
}
As you see, GetEnumerator() first creates an array of SqlMetaData which is required for the constructor of a SqlDataRecord object. After this we set the values of the current TestMany object on the record and return it.

The next sample shows how to call our stored procedure from ADO.NET.
using (SqlConnection cn = GetConnection(true))
using (SqlTransaction tran = cn.BeginTransaction()) {
   // get test data
   IList<TestMany> data = GetTestData();

   // create a new command as stored procedure
   SqlCommand cmd = new SqlCommand("MergeTestMany", cn);
   cmd.CommandType = CommandType.StoredProcedure;
   cmd.Transaction = tran;
   // create a new parameter ("Structured" defines a TVP)
   SqlParameter p = cmd.Parameters.Add("@data", SqlDbType.Structured);
   // specify the name of the TVP
   p.TypeName = "TestManyMergeType";
   p.Value = new TestManyEnumerator(data);

   // catch the returned IDENTITY information
   using (SqlDataReader reader = cmd.ExecuteReader()) {
      reader.Read();
      int lastId = reader.IsDBNull(0) ? -1 : (int)reader.GetDecimal(0);
      int count = reader.GetInt32(1);
      // synchronize all newly added keys
      SyncAddedIds(data, count, lastId);
   }

   tran.Commit();
}
As you see, to call a procedure which receives a table-valued parameter we have to create a SqlParameter of type SqlDbType.Structured, set the name of our T-SQL table-valued type as TypeName and provide the enumerator shown above as the value of the parameter. This enables ADO.NET to send all changes made on our objects to the procedure in a single RPC call.

A SqlDataReader reads the IDENTITY information returned by the procedure, and SyncAddedIds assigns the newly generated ids to all of our added domain objects.

Sadly, the current version of the Microsoft JDBC driver apparently does not yet support table-valued parameters. However, since there are many voices on the web (and on the Microsoft JDBC Driver Team Blog) asking for it, I really hope this great feature will be added in the next version.

SQL Server 2008 without Procedures

Sure, even if we don't work with stored procedures in our environment (perhaps because we use an O/R mapper), we are still able to realize BDP.

Whenever we change domain objects in memory for a specific business case, at some point we have to persist the data to the database. Usually we generate dynamic SQL statements at this point to send the changes to the database. This is where we simply replace the existing single-row CRUD statements with bulk operation statements, as we've already seen in the procedural approach. ADO.NET is able to bind table-valued parameters to any custom SQL command.

Important: As in the procedural approach, we need a table-valued type which can be used as a DTO when communicating with the database, and an implementation of IEnumerable<SqlDataRecord> (or a DataTable) which can be provided as the value of the SqlParameter.

When we compare the following snippet with the stored procedure solution, we'll see that the only difference is the custom SQL instead of a stored procedure call.
using (SqlConnection cn = GetConnection(true))
using (SqlTransaction tran = cn.BeginTransaction()) {
   // get test data
   IList<TestMany> data = GetTestData();

   // create the complete DELETE/UPDATE/INSERT sql command
   String sql = @"
      DELETE trg FROM TestMany trg WHERE EXISTS(
         SELECT * FROM @data src WHERE src.Id = trg.Id AND src.Action = 'D'
         );

      UPDATE trg SET SomeInt=src.SomeInt, 
                     SomeDate=src.SomeDate, 
                     SomeText=src.SomeText
         FROM TestMany trg 
         JOIN @data src ON trg.Id = src.Id AND src.Action='U';

      INSERT INTO TestMany (SomeInt, SomeDate, SomeText)
         SELECT SomeInt, SomeDate, SomeText 
         FROM @data 
         WHERE Action='I' 
         ORDER BY Id; -- <- IMPORTANT to match ids in client

      SELECT SCOPE_IDENTITY(), @@ROWCOUNT;
      ";

   SqlCommand cmd = new SqlCommand(sql, cn);
   cmd.Transaction = tran;
   // create a new table-valued parameter for our TestManyMergeType type
   SqlParameter p = cmd.Parameters.Add("@data", SqlDbType.Structured);
   p.TypeName = "TestManyMergeType";
   // provide the data
   p.Value = new TestManyEnumerator(data);

   // execute the sql command and map the returned server ids
   using (SqlDataReader reader = cmd.ExecuteReader()) {
      reader.Read();
      int lastId = reader.IsDBNull(0) ? -1 : (int)reader.GetDecimal(0);
      int count = reader.GetInt32(1);
      SyncAddedIds(data, count, lastId);
   }

   tran.Commit();
}
Of course, usually we will never put such hand-written SQL statements into our client application. These statements will be generated by a generic implementation within an O/R mapper or some other data access layer.

BDP and SQL Server 2005

Sadly, table-valued types were only introduced with SQL Server 2008, which means they are not available in SQL Server 2005 and previous versions. The DTOs could be realized with user-defined CLR types, but SQL Server does not support arrays like some other database systems do. Therefore BDP cannot be used this way with SQL Server 2005.

One workaround might be to create an XML document which contains the data to be persisted, but this is a hard way to go and requires thorough performance tests. I successfully used this for larger operations (more than 10,000 rows in one batch), but I have no experience with general solutions for client layers.
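The following is only a rough sketch of how such an XML workaround could look for the insert path of our TestMany sample: the change set is passed as an xml parameter and shredded on the server with nodes()/value(). The method name PersistViaXml and the attribute layout are my own assumptions for illustration, not part of the samples above.
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;
using System.Xml.Linq;

// hedged sketch of the XML workaround for SQL Server 2005 (insert path only)
static void PersistViaXml(SqlConnection cn, SqlTransaction tran, IEnumerable<TestMany> data)
{
   // serialize the change set into a small XML document
   XElement xml = new XElement("rows",
      from item in data
      select new XElement("r",
         new XAttribute("Action", item.ChangeState),
         new XAttribute("SomeInt", item.SomeInt),
         new XAttribute("SomeDate", item.SomeDate),
         new XAttribute("SomeText", item.SomeText)));

   // shred the XML on the server; only the INSERT branch is shown here
   const string sql = @"
      INSERT INTO TestMany (SomeInt, SomeDate, SomeText)
      SELECT r.value('@SomeInt',  'int'),
             r.value('@SomeDate', 'datetime'),
             r.value('@SomeText', 'varchar(100)')
      FROM @xml.nodes('/rows/r[@Action=""I""]') AS t(r);";

   SqlCommand cmd = new SqlCommand(sql, cn, tran);
   cmd.Parameters.Add("@xml", SqlDbType.Xml).Value = xml.ToString();
   cmd.ExecuteNonQuery();
}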

Conclusion

We have seen how we can realize Bulk Data Persistence when using SQL Server 2008. We need a table-valued type defined in SQL Server, which can be used as a Data Transfer Object, and an implementation of IEnumerable<SqlDataRecord>.

Hence, if you do not want to use user-defined table-valued types in your database, BDP is not the right way to go in your system.

Outstanding

Maybe the idea of BDP sounds nice to you. Nevertheless, one important part is still missing here: a comparison with a single-statement based solution. I'll provide this in a following post and reference it here.
Edit: Table-Valued Parameters - A Performance Comparison

Since writing this by hand for every type would be way too much source code for lazy people like me, I'll also provide a possible implementation of a generic approach which can be used for any kind of domain object.
Edit: A Generic BDP Approach for SQL Server 2008

For the database gurus out there I'll provide a special post where I'll compare the classical INSERT/DELETE/UPDATE statements to the shiny new MERGE statement which was introduced with SQL Server 2008. So far my tests are not extensive enough to make a real statement about it.

Saturday, November 7, 2009

Bulk Data Persistence Pattern

"Bulk Data Persistence" (BDP) covers a problem I've seen many times while doing my business or speaking with people all over the world. BDP is an extension of the "Unit of Work" (UoW) pattern defined by Martin Fowler in his great book Patterns of Enterprise Application Architecture (PoEAA). BDP describes a different way for clients to communicate with their persistence layer to save changes.

A Real-Life Scenario

Imagine it's Sunday morning. You get up, go to the bathroom to brush your teeth and take a shower. After that you get dressed.

Now you leave your house/flat, go to the baker and buy one bread roll (a bap if you live in England or a Semmel if you live in Bavaria - like I do :-D ); back at home you put your roll (bap/Semmel) onto the kitchen table. Then you leave your house again, go to the baker to buy the roll (bap/Semmel) for your husband or wife; back home you put it onto the kitchen table. Since you have two children and mom is currently visiting, you end up with five trips to the baker and back.

Sure, you'd never do this. You go to the baker and get all the rolls (baps/Semmeln) for the whole family in one round trip. However, this is a good picture of how software often interacts with databases.

A Technical Scenario

Imagine a system made to enter orders.

The user creates a new order and adds ten order positions. The implementation might be realized with a UoW design. When the save button is hit, the business logic layer (BLL) validates the order and all order positions against the existing business rules. When everything is fine, the client opens a connection to the persistence tier (usually a database). The data access layer (DAL) maps the order object to the "Orders" table and sends one row to the database. When the order has been inserted, the DAL maps the first order position object to the OrderPositions table and sends one row to the database. When this operation returns, it maps the next order position and sends the next row to the database. And so on...

Hence, quite the same scenario as the real-life baker sample.

Tipping the Balance

Today's systems become larger and larger, and the amount of data to be persisted grows heavily, too. Scalability is a very important design goal for applications like web applications or service endpoints. A scalable application design enables the hosting provider to use load balancing technologies to run the application on several boxes.

Unfortunately, the operational persistence layer is almost never scalable in the same way. It is possible to use replication or data warehouse architectures to move reporting processes away from the operational system. Nevertheless, business C(R)UD operations always have to work on the one and only operational database, and this database is a common bottleneck of many systems.

A short look into the Unit of Work (UoW)

Since the Bulk Data Persistence (BDP) pattern is an extension of the Unit of Work (UoW) pattern, I'll start with a short description of the UoW pattern.

The UoW describes a client-side transaction object which registers all object changes during one complete business operation. In our order system sample, an order only makes sense together with all of its order positions. The UoW transaction object holds the order and all order positions until everything has been stored in the database.
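Put into code, the registration side of a UoW might look like the following minimal sketch (my own simplification for this post, not Fowler's exact interface).
class UnitOfWork
{
   private readonly List<object> _added   = new List<object>();
   private readonly List<object> _changed = new List<object>();
   private readonly List<object> _removed = new List<object>();

   public void RegisterNew(object entity)     { _added.Add(entity); }
   public void RegisterDirty(object entity)   { _changed.Add(entity); }
   public void RegisterRemoved(object entity) { _removed.Add(entity); }

   public void Commit()
   {
      // this is the part we look at in this post: how the registered
      // objects are handed over to the persistence layer
      PersistChanges(_added, _changed, _removed);
   }

   private void PersistChanges(
      IEnumerable<object> added,
      IEnumerable<object> changed,
      IEnumerable<object> removed)
   {
      // classic approach: one database operation per registered object
      // BDP approach: one database operation per object type
   }
}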

The following diagram shows the most important part of the UoW pattern needed to understand BDP.



I abstracted the data persistence part since this is the important piece we will look at in the following text.

A classic implementation of the data persistence uses a single database operation for each object that was changed within the UoW. The following diagram describes this approach.



In our sample order system we would generate 11 separate database operations: one to insert a row into the Orders table and ten to insert the rows into the OrderPositions table.
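In code, this classic variant inside the data access layer might look like the following sketch (the mapper objects and their method names are purely illustrative).
// classic persistence: one round trip per changed object
orderMapper.Insert(order);                  // 1 call for the Orders row

foreach (OrderPosition position in order.Positions)
{
   orderPositionMapper.Insert(position);    // 10 calls, one per OrderPositions row
}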

The Bulk Data Persistence Pattern

Bulk Data Persistence means aggregating all data operations of one object type and sending them to the database system as one single database operation.



The idea of BDP is to let the database system do what it is really optimized for: working with bulks of data. This usually results in much faster statement execution and even better responsiveness for concurrent applications, since taking one larger lock causes less overhead than the many small locks taken by many single persistence operations. Last but not least, the network round trips of the clients are reduced to a minimum, which also speeds up execution.

In our order system sample there would be just two database operations: one to insert the new Order row and one to insert all the OrderPosition rows.
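As a sketch, the same persistence step with BDP collapses to one call per object type (again with purely illustrative mapper names).
// bulk persistence: one round trip per object type
orderMapper.Insert(order);                        // 1 call for the Orders row
orderPositionMapper.InsertAll(order.Positions);   // 1 call for all ten OrderPositions rows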

In this sample, 11 single statements don't seem to be a problem, but consider the number of concurrent users.

Another scenario could be a process which synchronizes data between the nodes of a distributed database system, or a process which imports data from another system. Both of these scenarios might have to save thousands of rows.

What BDP does not cover

Please don't misunderstand: the idea of BDP is not to move business logic into the database! Don't try to move all changed object types of one UoW in one package to the database. When your database handles complete messages of business objects, it becomes a data service, and that's usually not the intention in today's software development. Instead, send one data package per type. BDP covers the same functionality and contract as any data access layer and persistence tier that persists single rows.

Outstanding

I'll provide some samples over the next days to show how it works and what the benefit is. Links to these samples will be added here.

Saturday, October 10, 2009

Factory Pattern

The factory pattern is probably the most common creational pattern.

What is a "creational" pattern?

A creational pattern is a pattern which describes ways to create objects. You can call it an abstraction of the new operator. The gain of factory classes is that they encapsulate the creation of objects in a central position - the factory - and ensure that an object is initialized correctly. A well known factory is the static XmlWriter.Create() method. It gets either a file name or an output stream and an optional XmlWriterSettings parameter which specifies things like the formatting behavior of the returned XmlWriter.
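For example, a typical use of this factory might look like this (the file name is just for illustration).
XmlWriterSettings settings = new XmlWriterSettings { Indent = true };

// the factory decides which concrete XmlWriter implementation we get back
using (XmlWriter writer = XmlWriter.Create("orders.xml", settings))
{
   writer.WriteStartElement("orders");
   writer.WriteEndElement();
}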

Sample Scenario
Before we start to show how the factory works, I'll define a sample scenario. Our system is an ordering system which handles "Order" and "Customer" objects, and we have the following classes.
public class Order : ICreationInformation
{
   public DateTime CreationDate { get; set; }
   public String CreationUser { get; set; }
   public String OrderNo { get; set; }
   public Decimal Amount { get; set; }
   public Customer Customer { get; set; }
}

public class Customer : ICreationInformation
{
   public DateTime CreationDate { get; set; }
   public String CreationUser { get; set; }
   public String Name { get; set; }

   public override string ToString()
   {
      return this.Name;
   }
}
As you can see, the "Order" as well as the "Customer" contain the system fields "CreationDate" and "CreationUser". In addition, an order only becomes valid if it has a related "Customer".

Problem of using the "new" operator
Our ordering system might have more than one place where orders are created. If you use the new operator wherever a new "Order" can be created, you always have to specify the "CreationDate" and the "CreationUser" there. First, you might forget to set this information at one point - remember Murphy's law: "Anything that can go wrong will go wrong." Second, you duplicate your source code. If you start with a "CreationDate" taken from the local system, a future requirement might be to use a server date instead, since many computers don't have a correctly configured local time. In that case you have to crawl all over your code to find every position where you used the local system time to set a "CreationDate".

Factory
A factory class is a class which encapsulates the creation and correct initialization of objects.

The following listing shows a sample factory to create "Customer" and "Order" objects.
public class Factory
{
   public Customer CreateCustomer()
   {
      Customer customer = new Customer();

      customer.CreationDate = DateTime.Now;
      customer.CreationUser =
         System.Security.Principal.WindowsIdentity.GetCurrent().Name;

      return customer;
   }

   public Order CreateOrder()
   {
      Order order = new Order();

      order.CreationDate = DateTime.Now;
      order.CreationUser =
         System.Security.Principal.WindowsIdentity.GetCurrent().Name;

      return order;
   }
}
Once created we can use our factory wherever we need a new "Order" or "Customer" object.
Factory factory = new Factory();

Customer customer = factory.CreateCustomer();
customer.Name = "ALFKI";

Order order = factory.CreateOrder();
order.Customer = customer;
To ensure that your objects cannot be created via their constructors from outside the factory, it makes sense to define an internal constructor - provided your business objects and the factory live in a separate assembly. Then it is impossible to create one of these business objects without correct initialization.
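For the "Customer" class this could look like the following sketch. (Note that the generic approach shown below relies on a public parameterless constructor, so there you would combine an internal constructor with Activator.CreateInstance, as mentioned later.)
public class Customer : ICreationInformation
{
   // callable only from inside the business assembly, e.g. by the factory
   internal Customer()
   {
   }

   public DateTime CreationDate { get; set; }
   public String CreationUser { get; set; }
   public String Name { get; set; }
}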

In our simple example one factory is enough to create both types of objects. In a real-world scenario it might make sense to use a specific factory class for each type of business object.

Abstract Factory
An abstract factory is a factory which can create different types of objects with a single implementation. In .NET an abstract factory can be realized with a generic method (or class).

Maybe you noticed the ICreationInformation interface which was added to our "Order" and "Customer" classes. Here is the definition of this interface, which provides the properties "CreationDate" and "CreationUser".
interface ICreationInformation
{
   DateTime CreationDate { get; set; }
   String CreationUser { get; set; }
}
In addition, we don't specify any constructor, so both classes have an implicit public parameterless constructor. This enables us to create a generic function which handles the creation of both objects with one implementation.
public class AbstractFactory
{
   public T Create<T>()
      where T : ICreationInformation, new()
   {
      T instance = new T();

      instance.CreationDate = DateTime.Now;
      instance.CreationUser =
         System.Security.Principal.WindowsIdentity.GetCurrent().Name;

      return instance;
   }
}
As you see, the method "Create" works with a generic type "T" with two constraints: the specified type has to be an implementation of ICreationInformation and it needs a public parameterless constructor. If your business objects don't provide such a constructor, you can use the non-generic static method System.Activator.CreateInstance(), which creates objects and also lets you specify constructor arguments.
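If we drop the new() constraint, a variation of the Create method based on Activator.CreateInstance could look like this sketch (not used in the following samples).
public T Create<T>()
   where T : ICreationInformation
{
   // the second argument allows non-public parameterless constructors as well
   T instance = (T)Activator.CreateInstance(typeof(T), true);

   instance.CreationDate = DateTime.Now;
   instance.CreationUser =
      System.Security.Principal.WindowsIdentity.GetCurrent().Name;

   return instance;
}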

The usage of this factory is quite similar to the usual factory; you just have only one method and have to specify the generic type.
AbstractFactory factory = new AbstractFactory();

Customer customer = factory.Create<Customer>();
customer.Name = "ALFKI";

Order order = factory.Create<Order>();
order.Customer = customer;
Now that's cool. First, because we have just one implementation for probably hundreds of different objects which implement our ICreationInformation interface and provide a public parameterless constructor. Second, if you are a little bit like me, you still feel like a kid sometimes and have to admit these generic approaches are always cool! :-)

I'm afraid to say these abstract factories are not as cool as they look at first glance. The generic approach is a huge restriction for a factory. Suppose you want to be more restrictive than we have been so far; for example, you might not want to allow the user of your factory to create any "Order" without specifying a "Customer". In a usual factory you just change the "CreateOrder" function to take an instance of a "Customer" object. An abstract factory cannot handle special business cases like this.

Combine Both
As we saw, an abstract factory is nice for doing some general initialization of objects, but it's a weak solution for specific requirements. Still, it helps to avoid duplicated code. So why not combine both? The application which consumes your business layer should use a non-abstract factory to create new business objects, but that factory can use the abstract factory internally.

Here is a new version of our factory classes.
class AbstractFactory
{
   internal T Create<T>()
      where T : ICreationInformation, new()
   {
      T instance = new T();

      instance.CreationDate = DateTime.Now;
      instance.CreationUser =
         System.Security.Principal.WindowsIdentity.GetCurrent().Name;

      return instance;
   }
}

public class Factory
{
   AbstractFactory _abstract = new AbstractFactory();

   public Customer CreateCustomer()
   {
      Customer customer = _abstract.Create<Customer>();
      return customer;
   }

   public Order CreateOrder(Customer customer)
   {
      if (customer == null)
         throw new ArgumentNullException("customer");

      Order order = _abstract.Create<Order>();
      order.Customer = customer;

      return order;
   }
}
As you see, the abstract factory became an internal class, so it cannot be used from outside our assembly. The usual factory holds an instance of the abstract factory to benefit from it and avoid duplicated code for the ICreationInformation interface members. The "CreateOrder" method became more restrictive to ensure that no order can be created without a specified instance of a "Customer".

Now the consumer can use a safe and handy factory, and correct initialization is ensured.
Factory factory = new Factory();

Customer customer = factory.CreateCustomer();
customer.Name = "ALFKI";

Order order = factory.CreateOrder(customer);

Conclusion
As you saw, the factory pattern is a helpful way to ensure correctly initialized objects. The abstract factory is more of a tool for less complex requirements, but it can be very helpful for basic initialization.

Monday, October 5, 2009

Easy Working with Threads

.NET 4.0 introduces new parallel programming support (the Task Parallel Library) which will simplify working with threads. So, as long as we are just .NET 3.5 coders we have to wait and keep our hands off this complicated multi-threading stuff, don't we? We don't! I'll show you some approaches for working cleanly and simply in a multi-threaded environment.

If you just need to start threads which do some work without returning any information back to the caller, you usually won't run into problems. Start your threads and let them go.

Unfortunately, most of the time this is not how you work with threads. Usually you have an application thread which dispatches a set of tasks but needs the results of those tasks to continue its work. We'll simulate this with a "Task" class which will be provided to the worker methods in the following samples.

Our Task Class

In the following samples we will work with a "Task" class which will be provided to our threads to simulate asynchronous work. Here is the definition of this class.
// represents a task which shall be handled by 
// asynchronous working thread
class Task
{
   public int Id { get; set; }
   public int ReturnValue { get; set; }
}

Classic Thread Approach

Let's start with the old-school solution, which uses lock to ensure thread-safe access to a shared counter.
// monitor the count of working threads
static int _workingTasks;

static void Main(string[] args)
{
   // count of tasks to be simulated
   int taskCount = 10;
   // hold the dispatched tasks to work with their results
   List<Task> tasks = new List<Task>();

   // first set the complete count of working tasks
   _workingTasks = taskCount;

   for (int i = 0; i < taskCount; i++)
   {
      // create a new thread
      Thread thread = new Thread(new ParameterizedThreadStart(DoWork));

      // create and remember the task
      Task task = new Task { Id = i };
      tasks.Add(task);
      // start the thread
      thread.Start(task);
   }

   while (_workingTasks != 0)
   {
      // wait until all tasks have been done
      Thread.Sleep(1);
   }

   // show the return values after all threads finished
   tasks.ForEach(t => 
      Console.WriteLine("Thread {0} returned: {1}", t.Id, t.ReturnValue));

   Console.ReadKey();
}

// work method
static void DoWork(object o)
{
   Task task = (Task)o;
   int id = task.Id;

   for (int i = 0; i < 10; i++)
   {
      Console.WriteLine("Thread {0} is working", id);
      // simulate some long work
      Thread.Sleep(200);

      task.ReturnValue++;
   }

   // we have to lock the monitoring variable to ensure
   // nobody else can work with it until we have decremented it
   lock (typeof(Program))
   {
      _workingTasks--;
   }
}
As you see, there are quite a lot of things to keep in mind.

We have to use the C# lock statement to ensure thread-safe access to our monitoring variable. You don't have to lock a whole System.Type; in fact, locking on typeof(Program) or on this is discouraged, because foreign code could take the same lock. I only used lock(typeof(Program)) here because I worked with static methods; in production code a dedicated private lock object is the usual choice.
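Such a dedicated lock object is just a private field; a sketch (the field name is arbitrary):
// a dedicated lock target that no foreign code can accidentally lock on
private static readonly object _syncRoot = new object();

// ... and inside DoWork:
lock (_syncRoot)
{
   _workingTasks--;
}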

We use Thread.Sleep(1) to poll the state of the dispatched tasks.

Now we'll start to simplify this work.

Using volatile

The first thing we can do to slightly simplify our method is to define our monitoring variable as

static volatile int _workingTasks;

The volatile keyword tells .NET that a field may be read and written by several threads, so its value is never cached per thread and the polling loop in Main always sees the latest state. Keep in mind that volatile only guarantees visibility, not atomicity: a decrement is still a read-modify-write operation. So instead of the lock we use Interlocked.Decrement, which performs the decrement atomically. (The compiler emits warning CS0420 when a volatile field is passed by ref; for the Interlocked methods this warning can be ignored.)

Since we used lock just once in our previous sample, this simply replaces the call of
lock (typeof(Program))
{
   _workingTasks--;
}
to this

Interlocked.Decrement(ref _workingTasks);

Does it seem pointless to work with volatile and Interlocked in such a trivial case? Keep in mind, this is a very, very simple sample. Not needing a lock becomes really handy if you have to do several different things with shared member fields.

Avoid the explicit Polling

The next step to simplify our multi-threading is to remove the explicit polling. What does this mean? So far we have worked with a member field "_workingTasks" which monitored the state of our working threads. Imagine a larger class with several multi-threading implementations; you would need several member fields to monitor the different threading activities. Another way to wait for the execution of a thread is to Join() it.
static void Main(string[] args)
{
   // count of tasks to be simulated
   int taskCount = 10;
   // hold the dispatched tasks to work with their results
   List<Task> tasks = new List<Task>(taskCount);
   // remember all threads
   List<Thread> threads = new List<Thread>(taskCount);

   for (int i = 0; i < taskCount; i++)
   {
      // create and remember a new thread
      Thread thread = new Thread(new ParameterizedThreadStart(DoWork));
      threads.Add(thread);

      // create and remember the task
      Task task = new Task { Id = i };
      tasks.Add(task);
      // start the thread
      thread.Start(task);
   }

   // --> HERE <--
   // wait until all threads are finished
   threads.ForEach(thread => thread.Join());

   // show the return values after all threads finished
   tasks.ForEach(task =>
      Console.WriteLine(
         "Thread {0} returned: {1}", 
         task.Id, 
         task.ReturnValue));

   Console.ReadKey();
}

// work method
static void DoWork(object o)
{
   Task task = (Task)o;
   int id = task.Id;

   for (int i = 0; i < 10; i++)
   {
      Console.WriteLine("Thread {0} is working", id);
      // simulate some long work
      Thread.Sleep(200);

      task.ReturnValue++;
   }
}
As you see, we don't need our monitoring variable any more. Using Join() makes it possible to implement multi-threading in a much more encapsulated way.

Working with the ThreadPool

Keep in mind that threads are a system resource which is expensive to create. If you have to do many small tasks, creating all those threads can cost more than the work itself. If you have complex tasks which have to coordinate their state with your application thread or other threads, you should use the System.Threading.Thread class because it gives you the most flexibility. But if you just have to dispatch tasks and wait for them (as in our sample), always creating a new thread can be the wrong way. A good approach in this scenario is to reuse threads for subsequent tasks.

Q: Okay, cool!! Let's start to code a custom thread manager!

A: Er... nope. It's already available.

For the kind of work we have to do with our "Task" objects, you can use the System.Threading.ThreadPool class; it is made especially for this. It is a pool of threads in which you can schedule work items.

Scheduling means you can queue a larger number of small jobs in the pool. Whenever a pooled thread becomes available, the next scheduled (queued) task is started. By default .NET determines the number of available threads from your environment, and it starts additional threads if they seem useful. However, you can inspect and customize this behavior with static methods such as ThreadPool.GetMaxThreads, ThreadPool.SetMaxThreads and ThreadPool.SetMinThreads.

So let's use the ThreadPool to run our tasks. For this purpose I extended our "Task" class with an additional property.
class Task
{
   public int Id { get; set; }
   public int ReturnValue { get; set; }
   public AutoResetEvent AutoResetEvent { get; set; }
}
The AutoResetEvent is a special kind of WaitHandle: one thread can wait on it while another thread signals it, and it resets itself automatically after a waiting thread has been released.
static void Main(string[] args)
{
   // count of tasks to be simulated
   int taskCount = 10;
   // hold the dispatched tasks to work with their results
   List<Task> tasks = new List<Task>(taskCount);

   for (int i = 0; i < taskCount; i++)
   {
      // create and remember the task
      Task task = 
         new Task 
         { 
            Id = i, 
            AutoResetEvent = new AutoResetEvent(false) 
         };
      tasks.Add(task);

      // queue the task in ThreadPool
      ThreadPool.QueueUserWorkItem(new WaitCallback(DoWork), task);
   }

   // wait until all queued tasks are finished
   tasks.ForEach(task => task.AutoResetEvent.WaitOne());

   // show the return values after all threads finished
   tasks.ForEach(task =>
      Console.WriteLine(
         "Thread {0} returned: {1}", 
         task.Id, 
         task.ReturnValue));

   Console.ReadKey();
}

// work method
static void DoWork(object o)
{
   Task task = (Task)o;
   int id = task.Id;

   for (int i = 0; i < 10; i++)
   {
      Console.WriteLine("Thread {0} is working", id);
      // simulate some long work
      Thread.Sleep(200);
      
      task.ReturnValue++;
   }

   // notify the application thread that the task finished
   task.AutoResetEvent.Set();
}
As you see, we don't need the monitoring member fields any more; we don't even need a Thread object any more. The AutoResetEvent class provides a WaitOne() method which our application thread uses to wait until each task has finished, and the worker releases the wait handle by calling the Set() method.

Conclusion

I hope I could show you some tricks to simplify working with multiple threads in your application.
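As a small outlook on the .NET 4.0 parallel programming support mentioned at the beginning: the whole dispatch-and-wait dance could shrink to a few lines with the Task Parallel Library. The following is only a sketch against .NET 4.0 (it does not compile with .NET 3.5), and DoSomeWork is a made-up helper standing in for the loop body of our DoWork method.
static void Main(string[] args)
{
   // System.Threading.Tasks.Task is fully qualified here to avoid
   // a clash with our own Task class from the samples above
   var jobs = Enumerable.Range(0, 10)
      .Select(id => System.Threading.Tasks.Task.Factory.StartNew<int>(() => DoSomeWork(id)))
      .ToArray();

   // wait for all tasks, then read their results
   System.Threading.Tasks.Task.WaitAll(jobs);

   for (int i = 0; i < jobs.Length; i++)
      Console.WriteLine("Task {0} returned: {1}", i, jobs[i].Result);

   Console.ReadKey();
}

// made-up placeholder for the work of one task
static int DoSomeWork(int id)
{
   int result = 0;
   for (int i = 0; i < 10; i++)
   {
      Console.WriteLine("Task {0} is working", id);
      // simulate some long work
      Thread.Sleep(200);
      result++;
   }
   return result;
}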

Almost every new computer has two or more CPU cores, so start to use them ;-).