Wednesday, June 6, 2012

C# Quiz: Working with Value Types

A while ago, I read Jeffrey Richter's book CLR via C#. This book is a really great resource for everyone who wants to know how the CLR works internally.

One really interesting topic was the difference between reference types (classes) and value types (structs) in .NET.

Here's a little quiz. Give yourself a few seconds to figure out what the console output will be. When you think you have it, scroll down and compare your assumptions with the actual output.

using System;

namespace ConsoleApplication1 {
   interface IValueProvider {
      void SetValue(int i);
   }

   struct MyValueProvider : IValueProvider {
      private int _value;

      public MyValueProvider(int value) {
         _value = value;
      }

      public void SetValue(int value) {
         _value = value;
      }

      public override string ToString() {
         return _value.ToString();
      }
   }

   class Program {
      static void Main(string[] args) {
         MyValueProvider v = new MyValueProvider(1);
         Console.WriteLine(v);

         MyValueProvider v2 = v;
         v2.SetValue(2);
         Console.WriteLine(v);

         IValueProvider i = v;
         i.SetValue(3);
         Console.WriteLine(v);

         object o = v;
         ((MyValueProvider)o).SetValue(4);
         Console.WriteLine(o);

         ((IValueProvider)o).SetValue(5);
         Console.WriteLine(o);
      }
   }
}




Give yourself a few seconds before you scroll down to read the disclosure.




You think you got it?



Sure?



Disclosure

Here is the output to the console:

1
1
1
1
5

Kudos to you if you got everything 100% correct! If you were surprised by one or more of the outputs, let me explain.

Boxing and Unboxing

The main reason for the above output is how the CLR does boxing and unboxing.

Reference types (classes) like strings, database connections and most other types in the .NET library always live on the application domain's managed heap, which is managed by the garbage collector. Value types (structs) like Guids, ints and the other primitives usually live on the thread's local call stack (or inline within the object that contains them).

Whenever the CLR needs a reference to a value type, it has to create a copy of the stack-local value on the heap (this is called boxing). Whenever a boxed value on the heap is cast back into a value type, it is copied from the heap into a local piece of stack memory (this is called unboxing).
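To see the mechanics in isolation, here is a minimal sketch (independent of the quiz code) of a value being boxed and unboxed:

int number = 42;

// Boxing: the int is copied from the stack into a new object on the heap.
object boxed = number;

// Unboxing: the boxed value is copied back into a local stack variable.
int copy = (int)boxed;

// Changing the local copy never affects the boxed object on the heap.
copy = 100;
Console.WriteLine(boxed); // still prints 42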

The next section walks through the explanations for each part of the quiz.

Explanations

First section:
MyValueProvider v = new MyValueProvider(1);
Console.WriteLine(v);
Shows "1", nothing special no surprise.

Second section:
MyValueProvider v2 = v;
v2.SetValue(2);
Console.WriteLine(v);
Shows "1". MyValueProvider is a struct what means that "v" is no reference to an instance of MyValueProvider on the heap but the instance itself and allocated on the stack. The assignment "v2 = v" causes a copy of MyValueProvider into a new instance on the stack. As a result, any changes of "v2" will not affect "v".

Third section:
IValueProvider i = v;
i.SetValue(3);
Console.WriteLine(v);
Shows "1". MyValueProvider implements the interface IValueProvider. While MyValueProvider is a value type, interfaces are always handled as reference types and so "i" causes a boxing of our value onto the heap. Any changes of the interface reference will not affect the original instance "v".

Fourth section:
object o = v;
((MyValueProvider)o).SetValue(4);
Console.WriteLine(o);
Shows "1". The assignment of "v" with a variable of type Object causes a boxing of the value type onto the heap. "((MyValueProvider)o)" causes a local (stack) de-referencing (unboxing) of the previously heap copied version of our value. Since we now have another, new instance of our value type the rest of the line ".SetValue(4)" will only affect the unboxed instance but will not affect the boxed value on the heap. (If you would try to assign a property instead of calling a method, you would actually get a compiler error, since C# compiler knows that the property assignment would never affect the boxed value type.)

Fifth section:
((IValueProvider)o).SetValue(5);
Console.WriteLine(o);
Shows "5". As explained for third output line, interfaces do always need a reference on the heap. But this time we "o" already represents a heap reference. CLR does not need to create a copy of the boxed value but can use the existing one. The boxed value type becomes cast into IValueProvider and calls the method on it. Therefore a call of the interface method will change the value of our boxed value type instance.

Little Conclusion

Some of the results might look a little spooky at first glance. However, we work with value types all the time and everything usually works fine. There is a general guideline that helps to avoid running into the issues shown above.

Value types should always be immutable, which means they should not change their internal state once they are created. If you keep this in mind, you are protected against unexpected boxing/unboxing behavior.
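As a minimal sketch (my variation, not part of the original quiz), an immutable version of MyValueProvider would drop the mutating SetValue and return a new instance instead:

struct MyValueProvider {
   private readonly int _value;

   public MyValueProvider(int value) {
      _value = value;
   }

   // Instead of mutating the existing instance, return a new copy.
   public MyValueProvider WithValue(int value) {
      return new MyValueProvider(value);
   }

   public override string ToString() {
      return _value.ToString();
   }
}

With this design it no longer matters whether an instance is boxed or copied; its value can never change behind your back.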

Tuesday, May 15, 2012

Simplify Null-Reference Handling of Properties in C#

A few months ago I wrote about a simple design rule that helped us to get almost completely rid of the annoying "Object reference not set to an instance of an object" exception, here.

In this post I want to show another way to simplify the handling of null references, in particular for collection properties that can be null.

Let's start with a small piece of code:
class Customer {
   public string Name { get; set; }
   public ICollection<Contact> Contacts { get; set; }
}

// ...

Customer customer = GetCustomer(1);
foreach (var contact in customer.Contacts)
   Console.WriteLine(contact.FirstName);

Looks fine, and there seems to be nothing special to talk about. Due to the rule I showed in the other post, GetCustomer will never return null. Unfortunately, the Contacts property is still not guaranteed to be non-null. If Contacts is null, we are again facing our old "friend", the NullReferenceException.

The straightforward approach is to always add an if-condition to ensure that our property is set to a value before we iterate through it.
Customer customer = GetCustomer();
if (customer.Contacts != null) {
   foreach (var contact in customer.Contacts)
      Console.WriteLine(contact.FirstName);
}
My problem with this is that developers (like me) tend to be lazy. We don't want to add the condition and the additional scope every single time.

Wouldn't it be nice to be able to write something like this?
Customer customer = GetCustomer();
foreach (var contact in customer.Contacts.NotNull()) // <= notice the NotNull()
   Console.WriteLine(contact.FirstName);
Luckily, extension methods (available since C# 3.0 and .NET 3.5) make this possible. Extension methods are not only static at design time, they also behave like any other static method at run time: a call of an extension method on a null reference is routed into the extension method without any reference validation.

Due to this behavior of .NET, we are able to create this very simple extension method.
// Extension methods must be defined in a non-generic static class.
public static class CollectionExtensions {
   public static IEnumerable<T> NotNull<T>(this IEnumerable<T> collection) {
      return collection ?? new T[0];
   }
}

Yes, this only helps when iterating through the list. But, from my experience, this covers about 90% of use cases.

Attention: keep in mind to always return an empty, immutable collection, like an array. If you returned a new List<T> instead, you might run into unexpected behavior. The list never gets assigned to the parent object, so any changes made to it would not have any effect on the parent.
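To illustrate the trap with a hypothetical, broken variant (not part of the original post) that returns a mutable list:

// Hypothetical, broken variant that returns a mutable list instead of an array.
public static ICollection<T> NotNullBroken<T>(this ICollection<T> collection) {
   return collection ?? new List<T>();
}

// ...

// customer.Contacts is null here.
customer.Contacts.NotNullBroken().Add(new Contact());
// The contact was added to a temporary list only; customer.Contacts is
// still null, so the new contact is silently lost.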

Monday, May 14, 2012

Yet Another Dependency Injection Container

Well, this might sound odd to some of you. However, in this post I want to present yet another dependency injection (DI) container.

If you are doing fine with yours, feel free to stop reading here. If there are some things that could work better, or if you are just interested in new options, give me a chance.

Due to the fact that the container's feature set isn't that small anymore, this post is only a rough overview of some of its features. Please find a complete documentation attached at the bottom of this post.

Why Yet Another Dependency Injection Framework?

Yes, at first glance FluffyContainer (don't ask why this name) is yet another dependency container. However, FluffyContainer comes with a combination of two aspects that I've been missing in other dependency injection (DI) frameworks until now.

In my experience there are two kinds of DI frameworks out there. One kind is feature rich, but relatively slow at creating objects. The other kind is really fast, but poor in features. FluffyContainer combines both aspects.

Overview of Features

FluffyContainer supports all common DI framework features, like:
  • Type to type mappings (with and without named mappings)
  • Mapping to custom factories
  • Singleton behavior for mappings
  • Calling of parameterized constructors
  • Field injection
  • Property injection
  • XML based mapping configuration
  • In code mapping configuration

In addition to this, it also supports some more advanced features, like:
  • About 20 times faster than other containers
  • Caching of initialized mappings, which makes the creation of additional container instances very cheap
  • Two types of internal factories: one with a focus on high performance and another that focuses on low-trust environments (like some ASP.NET environments)
  • Thread Singletons
  • The container and most of its related components are thread safe; including management of singletons
  • Custom type initialization by automatically calling a configurable init method
  • No need to reference .NET DLLs like System.Web that would require a full .NET Framework installation and wouldn't work with the Client Profile
  • Customization with builders (aka plug-ins)

Performance

One of the main reasons for the development of FluffyContainer was the poor performance of other DI frameworks. If DI can only be used to create the large components but has to be skipped for the smaller ones, it has not reached its goal.

Whenever people argue that DI is slow, there is somebody else - often somebody who calls himself a seasoned Software Architect - who explains that DI has nothing to do with performance; it is all about abstraction, coupling and composition. I disagree with those architects! As long as I'm not able to abstract (almost) every component initialization with DI, there are always points of tight coupling that will need to be refactored whenever a new type hierarchy is required.

Nevertheless, there are always some massively performance-critical parts of a system where DI will most likely never become the shiny hammer. When it comes to arrays instead of lists, emit instead of reflection, byte buffers and the cost of stack allocations, you are at a point where DI will probably never be the way to go. In those scenarios, abstraction and composition in general are often the wrong decision. But in all other scenarios, performance should not be the reason not to use it.

By default, FluffyContainer is about 20 times faster than the fastest other well-known DI frameworks. On top of this, it provides the possibility to create a factory for a specific mapping. With such a factory the creation of components is about 90 times faster than with other DI frameworks, while it still provides all features of the container. The factory is about 80% (not times) slower than a hardcoded factory, which is a difference I can live with in the majority of a system.

Here are some results of my performance tests against some other full-featured DI frameworks. The X axis shows the count of items created in my tests, the Y axis shows the duration in milliseconds it took to create the instances.
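For reference, a minimal sketch of what such a measurement loop could look like (an assumption, not the original test code; it uses the FluffyContainer API shown later in this post):

using System;
using System.Diagnostics;
using FR.DependencyInjection;

static void Measure(FluffyContainer container, int count) {
   Stopwatch watch = Stopwatch.StartNew();

   // Resolve the same mapping over and over and measure the total duration.
   for (int i = 0; i < count; i++) {
      IMyComponent component = container.Resolve<IMyComponent>();
   }

   watch.Stop();
   Console.WriteLine("{0} instances in {1} ms", count, watch.ElapsedMilliseconds);
}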


If you wonder why you see only 3 lines: the lines for "Hardcoded", "FluffyContainer" and "Fluffy Factory" all overlay each other at the bottom of the diagram.

Since this diagram doesn't tell you very much about the three lines on the bottom, here is another diagram that shows only those three competitors.


As you can see, out of the box FluffyContainer is about 8.5 times slower than a hardcoded factory. If you know that you will need to create larger amounts of the same component, it can be useful to take advantage of the available factories, especially because the creation of a factory is really cheap.

If you prefer numbers over diagrams, here are my test results (durations in milliseconds).

Count        Hardcoded   FluffyContainer   Fluffy Factory   Other Fast DI   Other Slow DI
500,000      10          92                18               1,580           15,106
1,000,000    20          175               35               3,220           30,281
1,500,000    28          252               52               4,638           44,591
2,000,000    47          345               74               6,172           61,123

I'm not going to name the other DI frameworks I used in this test, since this post is not meant to bash any other framework. But trust me, the fast one is one of the fastest serious DI frameworks out there.

Configuration

Here is a quick overview of how to configure the container in XML or in code.

FluffyContainer supports two different kinds of configuration: XML and an expression-based in-code mapping syntax. Here is a short sample of both.

XML
<?xml version="1.0" encoding="utf-8" ?>
<dependencies>
   <assemblies>
      <assembly name="System.Data, 
                      Version=4.0.0.0,
                      Culture=neutral,
                      PublicKeyToken=b77a5c561934e089"/>
      <assembly name="DependencyInjectionSamples"/>
   </assemblies>
   <namespaces>
      <namespace name="System.Data.Common"/>
      <namespace name="System.Data.SqlClient"/>
      <namespace name="DependencyInjectionSamples"/>
   </namespaces>
   <mappings>
      <map from="IMyComponent" to="MyComponent"/>
      <map from="DbConnection" to="SqlConnection" name="db1"/>
      <map from="IMyComponent" toFactory="MyComponentFactory" name="factory"/>

      <map from="IMyComponent" to="MyComponent" 
           singleton="appDomain" name="singleton"/>
      <map from="IMyComponent" to="MyComponent" 
           singleton="thread" name="threadSingleton"/>

      <map from="MyComponent" to="MyComponent" name="fpConst">
         <field name="_foo" value="123"/>
         <property name="Id" value="1"/>
      </map>
 
      <map from="MyComponent" to="MyComponent" name="fpMapped">
         <property name="Database" from="DbConnection"/>
      </map>
   </mappings>
</dependencies>
<!--
Read XML in C#
var resolver = XmlDependencyResolver.FromFile("DependencyInjection.xml");
FluffyContainer container = new FluffyContainer(resolver);
-->
C#
using FR.DependencyInjection;

...

FluffyContainer container = new FluffyContainer();
container.Map<IMyComponent>().To<MyComponent>();
container.Map<DbConnection>().Named("db1").To<SqlConnection>();
container.Map<IMyComponent>().Named("factory").ToFactory<MyComponentFactory>();

container.Map<IMyComponent>().Named("singleton").AsSingleton().To<MyComponent>();
container.Map<IMyComponent>().Named("thread").AsThreadSingleton().To<MyComponent>();

container.Map<IMyComponent>()
         .Named("fpConst")
         .WithField("_foo", 123)
         .WithProperty("Id", 1)
         .To<MyComponent>();

container.Map<IMyComponent>()
         .Named("fpMapped")
         .WithProperty("Database", typeof(DbConnection), null);

When using XML mappings, you are still able to add additional in-code mappings after the container has been initialized from the XML.

Usage

After configuring the container, you can use it in your source code: the Resolve method creates new instances, and the Release method ensures a correct clean-up of components that implement IDisposable, while taking singleton configurations into account.

var comp = container.Resolve<IMyComponent>();
// use comp
container.Release<IMyComponent>(comp);

To simplify error handling for disposable components, especially when working with more than one resource, consider using an ActivationScope when resolving several components. The activation scope automatically disposes all resolved instances when it runs out of scope.
using (ActivationScope scope = container.CreateScope()) {
   var comp2  = scope.Resolve<IMyComponent>();
   // do work
   scope.Release<IMyComponent>(comp2);
   // disposing of comp2 will be skipped when the scope becomes disposed
 
   // ..
 
   var mydb = scope.Resolve<DbConnection>();
   // do work
   // mydb becomes automatically disposed when the scope becomes disposed
}

Conclusion

As I stated at the beginning of this post: yes, this is yet another dependency injection (DI) framework. Nevertheless, it has grown a fairly rich feature set and its performance blows all other frameworks (that I know) out of the building. If you already work successfully with DI in your project, stay with whatever you use. If you start a new project, or if you are completely new to DI, consider giving it a chance.

Don't forget to check the documentation related to this post, which describes all features of the framework in depth.

Download

FluffyContainer is part of my core libraries that I've published in another post, FR.Core Libraries.

Here you can find all files in the version that was used in this post:

Sunday, April 22, 2012

FR.Core Libraries

Okay, this is a special post. First, due to time pressure, I have to start by saying sorry, because this post might be a little sloppily written.

Since the 1st of March I have been working for a software consulting company. I have been working as a software developer, and by now architect, for about 14 years (which slightly shocked me as I realized it). I started with VB6 and C++ and have been part of the .NET development community since version 1.0 - ten years ago.

Since I have always loved my job, I have always spent some of my spare time testing new features and developing components. My focus has always been on framework design and back-end development. Over the years my private frameworks became quite large and I realized more and more often that many requirements in projects at work were already covered by my private libraries. Every now and then I allowed myself to copy parts of my library into my business projects. Now, as a consultant, there will probably be quite a bunch of different projects over the next years, and due to the nature of consulting projects there might be copyright issues if I keep doing as I did in the past.

To avoid copyright issues, I decided to post my private components here. For now I share my "Core" library, which covers the most common and basic requirements of software, like:
  • Compilation of custom C# for scripting purposes in dynamic parts of an application
  • A very fast factory framework that supports different kinds of factories, from Emit code generation to .NET Activator usage, depending on the current environment's trust level
  • Some collections like:
    • A ConcurrentHashSet
    • A RingBuffer (that I've already published a few years ago)
    • A writable version of an ILookup
    • Special dictionaries and others
  • A configuration framework that replaces the .NET configuration and supports different kinds of sources (like XML files, SQL databases, WCF services or stubs for testing)
  • A fast dependency injection framework that works much faster than Microsoft Enterprise Library's (MEL) Unity. Since the whole IoC is based on interfaces, it can easily be replaced by other frameworks like MEL or Ninject
  • A logging framework with custom devices like SQL Server and an integration of log4net
  • Several other components

In addition to this "Core" library, I will most likely share my other libraries, like "Data" and "Enterprise Services", over the next weeks. But for the moment I will start with the "Core".

Disclaimer

I've invested much time and wrote hundreds of unit tests to find potential issues; nevertheless, software is never free of bugs, only free of known bugs. If you decide to use the library, or parts of it, and you find a bug, I'd appreciate feedback about it - but no bashing, thanks. If I find the time, I'll fix the bug, but this is not guaranteed.

License Decision

I decided to put the library under the MIT License to enable everybody to use parts of it, or the whole library, in any project without any license issues. Feel free to merge parts, or even the whole library, into your projects without any restrictions on your own project.

License

Copyright (C) 2012 Florian Reischl

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Open Source Platforms

In the future this library might be moved to one of the common open source platforms like CodePlex. Since I don't know their restrictions and requirements yet, I decided to start with a simple post here.

If I move the source code to one of the open source platforms, I'll add a note here.

Download

Here you can find the currently published components:
  • FR.Core, the main "Core" library
  • FR.Core.Test, the NUnit test cases for the FR.Core.dll
  • FR.Test.Core, a library that provides some unit testing helpers. You will need this library if you want to execute the unit tests of the other libraries.
  • FR.Logging.Log4Net, a log4net based implementation of my logging interfaces
  • FR.Logging.Log4Net.Test, the NUnit tests for the FR.Logging.Log4Net library

Tuesday, March 13, 2012

Performance Comparison of SQL Server Bulk Select Operations

After some mainly architecture related posts it's time to get back into some more technical stuff.

Sometimes we face the requirement to match a larger list of entities, coming from a file or any other data exchange interface, against our database.

Possible scenarios could be loading data from an OLTP database into a data warehouse, synchronizing data between databases in a distributed environment, or getting master data from an external system (like customer or product information).

Whenever we need to import data from another source there are two different kinds of scenarios: either we only need to insert new rows, or we also need to merge updated information into existing rows. In this post we will focus on the second scenario, especially the question of how to select large amounts of data by specific criteria.

While selecting single rows by one or more criteria is straightforward, it can become a performance bottleneck if we need to load hundreds or thousands of rows. In this post we look at different ways to implement bulk load operations and compare their performance.

Test Data

In the following sections I will work with a simple table called "TestData100k" that contains 100,000 rows.

-- Create table
CREATE TABLE TestData100K (
 Id int IDENTITY(1,1) NOT NULL,
 SomeGuid UNIQUEIDENTIFIER NOT NULL,
   PRIMARY KEY CLUSTERED ([Id] ASC)
);

-- Fill 100k rows
INSERT INTO TestData100K
SELECT TOP(100000)
   NEWID()
FROM master.sys.all_columns a
   CROSS JOIN master.sys.all_columns b

For the sake of simplicity we will use a text file containing only IDs that we match against our test table:
1
2
3
...
n
We will select rows by their ID and return the "SomeGuid" column as pseudo information to be processed in subsequent operations.
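The C# samples in the following sections use two small helpers, CreateOpenConnection() and CreateIdReader(), that are not shown in the snippets. Here is a rough sketch of what they could look like (an assumption, not the original implementation; _cnStr holds the connection string and _fileName the path of the ID file):

// Hypothetical sketch of the helpers used by the following samples.
private static SqlConnection CreateOpenConnection() {
   var cn = new SqlConnection(_cnStr);
   cn.Open();
   return cn;
}

private static IdReader CreateIdReader() {
   return new IdReader(_fileName);
}

// A disposable, enumerable reader that streams the IDs line by line.
class IdReader : IEnumerable<int>, IDisposable {
   private readonly StreamReader _reader;

   public IdReader(string fileName) {
      _reader = new StreamReader(fileName);
   }

   public IEnumerator<int> GetEnumerator() {
      string line;
      while (null != (line = _reader.ReadLine()))
         yield return int.Parse(line);
   }

   System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
      return this.GetEnumerator();
   }

   public void Dispose() {
      _reader.Dispose();
   }
}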

Single Select Statements

The simplest solution is to process the data row by row.
Dictionary<int, Guid> guids = new Dictionary<int, Guid>(_count);
 
using (var cn = CreateOpenConnection())
using (var tx = cn.BeginTransaction())
using (var cmd = new SqlCommand("SELECT SomeGuid FROM TestData100k WHERE Id = @id", cn, tx))
using (var ids = CreateIdReader()) {
   cmd.Parameters.Add("@id", SqlDbType.Int);
   foreach (var id in ids) {
      cmd.Parameters[0].Value = id;
      var guid = (Guid)cmd.ExecuteScalar();
      guids.Add(id, guid);
   }
   tx.Commit();
}
However, as we will see in the performance comparison section, this solution might not be the best one when we need to work with larger amounts of data.

Bulk Copy (BCP)

When talking to database developers or DBAs, they will most likely tell you to use BCP. BCP is a powerful SQL Server API that allows copying large amounts of data into SQL Server. BCP does not work like usual INSERT statements; instead, it copies data in large streams to a table. This means you cannot use custom TSQL to implement logic while copying data to the server. Therefore the common approach when working with BCP is to use a staging table to copy all data into, and then use TSQL to implement the business logic that works with the data from the staging table.

Probably the most common way to use BCP is the bcp.exe command line tool that is installed with SQL Server. The usage of this tool is described in thousands of good articles all over the internet, so I will not add the 1,001st description here.

What we are going to do here is talk about the .NET implementation of the BCP API, provided by the SqlBulkCopy class. The nice present we get when using this class is that we are able to reuse existing .NET business logic while importing. The class provides two different ways to write data to the server: we can import an instance of a DataTable or an instance of an IDataReader. My personal advice is to use an IDataReader if you need to import a lot of data, since we usually don't need the considerable overhead of the DataTable, and the DataTable-based approach requires all data to be loaded into the client's memory before it can be written to the server.
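The sample below writes the IDs into a staging table called StageBulkIds, whose definition is not shown in the original snippet. A minimal setup could look like this (a sketch, executed via plain ADO.NET and reusing the CreateOpenConnection helper):

// Hypothetical setup of the staging table used by the BCP sample below.
using (var cn = CreateOpenConnection())
using (var cmd = new SqlCommand(
      "IF OBJECT_ID('StageBulkIds') IS NULL " +
      "CREATE TABLE StageBulkIds (Id INT NOT NULL PRIMARY KEY CLUSTERED);", cn)) {
   cmd.ExecuteNonQuery();
}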

// write data to server
// The IdDataReader represents an implementation of the IDataReader interface that 
// streams over the file and reads the IDs row by row
using (var idReader = new IdDataReader()) {
   SqlBulkCopy bcp = new SqlBulkCopy(_cnStr, SqlBulkCopyOptions.TableLock);
   bcp.BatchSize = _count + 1;
   bcp.DestinationTableName = "StageBulkIds";
   bcp.ColumnMappings.Add(new SqlBulkCopyColumnMapping(0, 0));
   bcp.WriteToServer(idReader);
}
 
// select guids
const string sql = @"SELECT td.Id, td.SomeGuid"
                     + " FROM StageBulkIds ids"
                     + " JOIN TestData100k td ON ids.Id = td.Id";
using (var cn = CreateOpenConnection())
using (var cmd = new SqlCommand(sql, cn)) {
   Dictionary<int, Guid> guids = new Dictionary<int, Guid>(_count);
   using (var reader = cmd.ExecuteReader()) {
      while (reader.Read()) {
         guids.Add((int)reader[0], (Guid)reader[1]);
      }
   }
}

IN Statements (and their Issues)

Another commonly used solution is to build a SQL statement that contains a large list of IN arguments to be matched.
var lines = File.ReadLines(_fileName);
string sql = "SELECT Id, SomeGuid FROM TestData100k WHERE Id IN (" 
             + string.Join(",", lines)
             + ");";
         
using (var cn = CreateOpenConnection())
using (var tx = cn.BeginTransaction())
using (var cmd = new SqlCommand(sql, cn, tx))
using (var reader = cmd.ExecuteReader()) {
   Dictionary<int, Guid> guids = new Dictionary<int, Guid>(_count);
   while (reader.Read()) {
      guids.Add((int)reader[0], (Guid)reader[1]);
   }
   reader.Dispose();
   tx.Commit();
}
Unfortunately this solution brings up some issues.
  • IN statements only work if we need to match a single column. If we need to match data by more than one column, this approach does not work
  • Since it is not possible to send the list of IDs as a parameter, we need to create our SQL statement by string concatenation, which opens the door for SQL injection
  • Since every statement looks different, due to the different arguments within the IN clause, SQL Server cannot reuse existing execution plans and needs to compile each statement
  • Because we send a SQL statement with thousands of arguments within the IN clause, the whole statement becomes really large. This requires a lot of CPU time for SQL Server's compile engine to prepare the execution. If the list becomes too large, we can even get an out-of-memory exception from the compiler (not from SQL Server itself). The screenshot below shows a CPU utilization of almost 2 seconds for a list of 10,000 INT IDs


Server Side String Split Functions

Another approach is to create a list of concatenated IDs, send it as a text parameter to the server and split the IDs into a temp table on the server side. After this, the temp table can be used for further processing.

You can use one of the functions that I posted at High Performance String Split Functions or any other, preferably fast, one from the internet.

For the performance tests in this post I used a CLR-based split function, since these are the fastest split functions, as Jeff Moden and Paul White showed again in a great article on SQLServerCentral.com, "Tally OH! An Improved SQL 8K “CSV Splitter” Function".
string sql = @"CREATE TABLE #Ids (Id INT NOT NULL PRIMARY KEY CLUSTERED);"
 
            + " INSERT INTO #Ids"
            + " SELECT item"
            + " FROM clr_SplitString (@ids, ',');"
 
            + " SELECT td.Id, td.SomeGuid"
            + " FROM #Ids ids"
            + "    JOIN TestData100k td ON ids.Id = td.Id";
 
using (var cn = CreateOpenConnection())
using (var tx = cn.BeginTransaction())
using (var cmd = new SqlCommand(sql, cn, tx)) {
   Dictionary<int, Guid> guids = new Dictionary<int, Guid>(_count);
   var lines = File.ReadLines(_fileName);
   var ids = string.Join(",", lines);
 
   cmd.Parameters.Add("@ids", SqlDbType.VarChar, -1).Value = ids;
   using (var reader = cmd.ExecuteReader()) {
      while (reader.Read()) {
         guids.Add((int)reader[0], (Guid)reader[1]);
      }
   }
   tx.Commit();
}
As a side note: even though this concatenation looks like another open door for SQL injection, the IDs are sent as a VARCHAR parameter that is never executed as a SQL statement on the server side, so this cannot cause SQL injection issues. However, parsing the list of arguments on the server side can, of course, still cause type conversion errors.

One disadvantage of this solution is that, like IN statements, it only supports matching on a single ID column. If we need to match our data by more than one column, this approach becomes tricky, since we would have to send and parse a hierarchical list of IDs.

Table Valued Parameters

As I've already shown in other posts, table valued parameters are a really powerful tool of SQL Server. They not only provide the possibility to send scalar values to the server, they represent a native way to send whole sets of data in one network round trip that can be used directly within TSQL.

The first thing we need to do is create an appropriate user-defined table type on SQL Server.
CREATE TYPE IntIDs AS TABLE (
 Id INT NOT NULL PRIMARY KEY CLUSTERED
)
Second, we need to implement an IEnumerable<SqlDataRecord> that ADO.NET uses to create the rows to be sent to SQL Server. (As with SqlBulkCopy, it is also possible to send DataTables, although with the same issues.)
class IntRecordEnumerator : IEnumerable<SqlDataRecord> {
   public IEnumerator<SqlDataRecord> GetEnumerator() {
      SqlMetaData metaData = new SqlMetaData("Id", SqlDbType.Int);
      using (StreamReader sr = new StreamReader(_fileName)) {
         string line;
         while (null != (line = sr.ReadLine())) {
            SqlDataRecord record = new SqlDataRecord(metaData);
            record.SetInt32(0, int.Parse(line));
            yield return record;
         }
      }
   }
 
   System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
      return this.GetEnumerator();
   }
}
Now we can use a SqlCommand to send the whole data set in one batch as a SqlParameter.
string sql = @"SELECT td.Id, td.SomeGuid"
               + " FROM @Ids ids"
               + "    JOIN TestData100k td ON ids.Id = td.Id";
 
using (var cn = CreateOpenConnection())
using (var tx = cn.BeginTransaction())
using (var cmd = new SqlCommand(sql, cn, tx)) {
   // parameter for our IDs. SqlDbType.Structured tells ADO.NET that we
   // use a table valued parameter
   var p = cmd.Parameters.Add("@ids", SqlDbType.Structured);
   // type of the Table Valued Type on SQL Server
   p.TypeName = "IntIDs";
   // the enumerator that provides the values to be filled 
   // into the table type variable on server side
   p.Value = new IntRecordEnumerator();
 
   Dictionary<int, Guid> guids = new Dictionary<int, Guid>(_count);
   using (var reader = cmd.ExecuteReader()) {
      while (reader.Read()) {
         guids.Add((int)reader[0], (Guid)reader[1]);
      }
   }
   tx.Commit();
}

Performance Comparison

Now that we have seen several different ways to meet our requirements, it's time for some performance considerations.

First, let's compare all options we've seen in this post.


For the sake of correctness I have to admit that the performance of the single statements, compared to the IN statements, might not be 100% valid for a production scenario. I used a local instance of SQL Server, which means that network latency is almost zero. In another environment the single statements might take more time, depending on the network architecture. Nevertheless, the IN statements are by far outperformed by the other approaches - and we should keep in mind all the other issues that come with IN statements.

Since there are three techniques that are so close to each other at the bottom of the above diagram, let's do some more performance tests. This time with higher amounts of IDs and with a focus on those three approaches (String Split, BCP and Table Valued Parameters).


As we can see, while the string split approach is really fast compared with the IN statements or single selects, it reaches its limits when compared with the other solutions. The BCP approach shows, as expected, great performance, even when working with really large amounts of data. However, the surprise (even to me) is that table valued parameters are not just fast, but even 30% faster than using SQL Server's BCP API.

Thursday, February 9, 2012

Tables are no Domain Objects: Data Type Transformations

This is the last post of a design basics series that focuses on circumstances where it can be useful to refine the structure of domain objects (aka entities) when they are dematerialized from a database (or another persistence storage). If you are interested, you can find the whole list of posts at the bottom of the Introduction post.

In the previous part, Field Aggregations, I talked about cases where groups of two or more table columns represent a logical collaboration. In this post we will have a look at the opposite scenario, where single table columns contain structured information. Depending on the purpose of the current project, it can be useful to transform those columns into an object that provides a richer, more natural interface than simple strings or other base data types.

Binary Data

Whenever we find a column that stores binary data, we should think about a transformation into something that is more reasonable in an object model. One reason is that binary data is often not meant to be eager loaded when the rest of the row is retrieved from the database, but lazy loaded when it is really needed. An interface other than a simple byte array can therefore provide a mechanism that loads the data only when it is accessed for the first time.
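A minimal sketch of such a lazy-loading wrapper (illustrative; the class and the loader delegate are my assumptions):

class BinaryContent {
   private readonly Func<byte[]> _loader;
   private byte[] _data;

   public BinaryContent(Func<byte[]> loader) {
      _loader = loader;
   }

   // The column value is fetched from the database only on first access.
   public byte[] Data {
      get { return _data ?? (_data = _loader()); }
   }
}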

Another reason that makes binary data a good candidate for a transformation is the fact that applications rarely need a raw byte array; they usually store and retrieve a specific kind of structured data. Often binary columns store pictures of products or people, documents like PDFs or text documents, and so forth. (If there is another column that describes the document's extension or mime type, you have also found a candidate for a Field Aggregation.)

In the case of pictures we should consider returning a System.Drawing.Image and encapsulating the code that initializes the image from the byte array in a distinct method. This not only provides a much more sophisticated interface but also avoids duplicated code if the image is needed at more than one place in our system.
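A small sketch of such an encapsulation (the helper name is an assumption):

using System.Drawing;
using System.IO;

static class ImageConversion {
   // Converts the raw column value into a System.Drawing.Image.
   public static Image ToImage(byte[] data) {
      // GDI+ requires the underlying stream to stay open for the lifetime
      // of the Image, so the MemoryStream is intentionally not disposed here.
      return Image.FromStream(new MemoryStream(data));
   }
}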

When our column stores documents, it depends on how the system works with them. If the documents are shown or edited by an embedded 3rd party component, we could return something that is natively supported by this library, which is often a System.IO.Stream or a file name. In the case of a file name, we could provide a class that writes the document byte array into a temp file and returns the name of the file. If the documents are meant to be shown or edited by their Windows default program, we will usually need to write them into a temp file and do a shell-execute system call to open the default program.

Last but not least, the size of the data we store can become really large. When working with a byte array interface we can run into major client and server resource issues if we hold, store, and retrieve the whole bulk of data in a single operation. A different interface can implement streaming read/write functionality that helps to minimize those problems. (If you are working with SQL Server you might be interested in my SqlBinaryData class, which provides this functionality.)

Special Types of Strings

Many entities consist of several string properties like names, addresses, descriptions and so forth, and most of them are (technically) nothing more than text data to be stored. However, some strings need special validation or represent structures of sub-information that need to be addressed.

Here is a short list of string types that are candidates to be represented in a different way. (A small usage sketch follows the list.)

  • Email Address - System.Net.Mail.MailAddress: Email addresses can contain the address and a display name, and the host name can be useful for analysis or to aggregate customers or contacts.
  • Web Site - System.Uri: Using the Uri class instead of a simple string not only gives you several additional pieces of information, like the host name, but also the advantage of built-in validation.
  • Phone Number - no existing class (happy RegEx): A phone number can consist of country codes, regional codes, the number and extensions for direct access. Cleaning phone numbers is a common and awkward(!) task when synchronizing systems or loading warehouses. If you find yourself in a project where you need to clean phone numbers, start with a Google search for existing regular expressions.
  • Date/Time Offset or Duration - System.TimeSpan: In our current project we had to store the duration of media content. Depending on the required accuracy, a duration can be stored as seconds/milliseconds/..., but when shown to a user it is usually required to present the time in a richer way than a 10-digit millisecond number.
  • Composite Keys - no existing class: When working with decentralized databases it is a common approach to create object keys (identifiers) that consist of a division/location part and an incremental ID that is only unique within the local database.
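For example, the existing .NET types already add value over plain strings (a small sketch with made-up sample values):

using System;
using System.Net.Mail;

// MailAddress splits the value into its parts and validates it on construction.
MailAddress mail = new MailAddress("John Doe <john.doe@example.com>");
Console.WriteLine(mail.DisplayName); // "John Doe"
Console.WriteLine(mail.Host);        // "example.com"

// Uri validates the value and exposes host, scheme, path and so on.
Uri site = new Uri("http://www.example.com/products?id=42");
Console.WriteLine(site.Host);        // "www.example.com"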

XML Property Bags

These days domain objects tend to get more and more attributes. Ever seen a table with 100+ columns where at least 50 columns don't contain anything in 90% of the rows? Customers and business analysts can be really creative in defining tons of optional attributes for their orders, articles, contacts, employees, and so on. If those attributes are rarely filled and not (or rarely) used as search criteria, they can be moved into a single XML column that represents a property bag. This not only makes the rest of the table much easier to read but can also increase performance and removes the restriction of database page sizes.

However, while XML can be a good solution for the DBMS, it does not represent a very sophisticated interface for a domain object. The consumer of a business layer should not need to care whether a property is a table column or is stored in an XML property bag. Therefore the XML can be held inside an object that is responsible for easy get/set access to the properties.
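A minimal sketch of such a wrapper, assuming the bag is stored as a simple XML string (names are illustrative):

using System.Xml.Linq;

class PropertyBag {
   private readonly XElement _root;

   public PropertyBag(string xml) {
      _root = string.IsNullOrEmpty(xml) ? new XElement("properties") : XElement.Parse(xml);
   }

   // Returns the value of a bag property, or null if it is not set.
   public string Get(string name) {
      var element = _root.Element(name);
      return element == null ? null : element.Value;
   }

   // Creates or updates a bag property.
   public void Set(string name, string value) {
      var element = _root.Element(name);
      if (element == null)
         _root.Add(new XElement(name, value));
      else
         element.Value = value;
   }

   // The XML to be written back to the property bag column.
   public string ToXml() {
      return _root.ToString(SaveOptions.DisableFormatting);
   }
}

An entity property can then simply delegate to Get/Set on the bag, so callers never see the XML.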


Conclusion

As I stated in the introduction post of this series, it is impossible to give a halfway complete picture of things to consider when designing entities in a small blog series.

Nevertheless, I hope I've been able to show you a few things to bear in mind when designing the interface of your entities. The point is: the more natural and easy to use your domain objects are, the more you and your team can concentrate on developing functionality that provides real business value. Because domain objects tend to be reused extensively over broad parts of an information system, an adequate effort in their initial design (or a refactoring) will return its investment quite soon.

Just keep in mind that the effort needs to be appropriate to its expected benefit. Spend more time on often reused types of entities than on less often used ones. Don't waste time on a (never) perfect design if it most likely won't give you any value.

One thing that is often cited as a conflict when talking about rich object designs and abstracted architectures is performance. I say this is (most of the time) incorrect, but that might become part of another post.

Friday, February 3, 2012

Tables Are No Domain Objects: Field Aggregations

This is the fourth part of the series about "Tables Are No Domain Objects". After an Introduction, the other parts covered Relation Transformations and Many To Many Relations.

In this post I want to talk about field aggregations. By nature, table columns are arranged side by side on the same hierarchy level. Sometimes two or more columns are tightly coupled with each other, where one (or more) is useless without the others. Whenever we find such a collaboration of columns in a table, we have found something to consider aggregating into its own class when dematerializing into our object model. This can avoid duplicate code, centralize logic, and lead to more natural object interfaces.

Person Tables

Tables containing person information often contain columns that are more or less related to each other. I'm sure many of you have already seen databases with one or more tables like the ones shown here.


When designing the domain objects, we could create two classes where each contains all columns of the underlying table, but there are two collaborations of columns that should be considered for aggregation into their own classes.


This not only enables us to reuse helper methods like the ones I denoted in the PersonName class, but also enables us to reuse other methods without designing methods that take many parameters (one for each property). Consider a method that creates a letterhead for a payment check (for employees) or a delivery note (for customers).
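Since the class diagrams are not reproduced here, a rough sketch of what such aggregations could look like (member names are assumptions):

using System.Linq;

class PersonName {
   public string FirstName { get; set; }
   public string MiddleName { get; set; }
   public string LastName { get; set; }

   // Helper methods like this can be reused by every entity that
   // aggregates a PersonName.
   public string GetFullName() {
      return string.Join(" ", new[] { FirstName, MiddleName, LastName }
                              .Where(n => !string.IsNullOrEmpty(n)));
   }
}

class Address {
   public string Street { get; set; }
   public string ZipCode { get; set; }
   public string City { get; set; }
   public string Country { get; set; }
}

class Employee {
   public int Id { get; set; }
   public PersonName Name { get; set; }
   public Address Address { get; set; }
}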

Money

At least since Martin Fowler's legendary book Patterns of Enterprise Application Architecture, where he introduced the Money pattern, we know that calculating with money can be really awkward.

Since Mr. Fowler did a great job in his book, I'll not try to explain the whole pattern but only give you a short summary. Apart from all the issues with multiplication and division of money values, the pattern describes the relation between an amount column and a currency column. The following diagram shows a possible table structure.


By applying the Money pattern we get the class structure shown in the diagram below, where all calculation logic is moved into a separate Money class. (BTW: You can find a really good .NET implementation at Code Project: A Money type for the CLR.)


If we needed to calculate with money values of different currencies, we would even need a further class, like a CurrencyCalculator, but this is far out of scope for this post.
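As a rough, simplified sketch of the idea (not Fowler's implementation and not the Code Project one):

using System;

class Money {
   public decimal Amount { get; private set; }
   public string Currency { get; private set; }

   public Money(decimal amount, string currency) {
      Amount = amount;
      Currency = currency;
   }

   // Calculation logic lives in one place and guards against
   // accidentally mixing currencies.
   public Money Add(Money other) {
      if (Currency != other.Currency)
         throw new InvalidOperationException("Cannot add different currencies.");
      return new Money(Amount + other.Amount, Currency);
   }
}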

System Information

Often database tables not only store business data but also contain different kinds of additional system information that might be less (or not at all) business relevant but might be needed for support, analysis or other investigations.

Here is a short list of possible system information columns.

  • CreationDate: The date/time where the row was inserted into the table.
  • LastUpdate: The date/time of the row's last update. If there have not been any updates yet, this value is often either NULL or equal to the creation date.
  • Creator: A key that describes the employee who caused the insertion of the row.
  • LastUpdater: A key that describes the employee who caused the last update of the row.
  • CreationHost: The host computer that caused the insertion of the row.
  • LastUpdateHost: The computer that caused the last update of the row.
  • Version: The current version of the row, if the data in the table are version controlled.

And here are two possible database tables.


Moving the system information into its own class gives us entities that are easier to use, because they really focus on the business, not on the system.


In addition, keeping these properties in each of our entities can cause a remarkable amount of duplicated code or reflection-based solutions that are hard to maintain, slow at runtime and not compiler-checked.
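A small sketch of such an aggregation (the property names follow the list above, the class names are assumptions):

using System;

class RecordInfo {
   public DateTime CreationDate { get; set; }
   public DateTime? LastUpdate { get; set; }
   public int Creator { get; set; }
   public int? LastUpdater { get; set; }
   public string CreationHost { get; set; }
   public string LastUpdateHost { get; set; }
   public int Version { get; set; }
}

class Article {
   public int Id { get; set; }
   public string Name { get; set; }

   // All system columns are aggregated into one reusable class.
   public RecordInfo RecordInfo { get; set; }
}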

Conclusion

I hope this post showed you another reason why it is not always best to design a domain object model with exactly the same structure as the database.

The easiest way to find candidates for field aggregations is to keep reviewing the table columns or class properties and search for columns/properties that are related to each other.