Wednesday, September 28, 2011

Domain Objects And Many To Many Relations

Today I want to discuss different kinds of many to many relations and cases where it can make sense to transform them into different structures when they are loaded from a database into the object graph of a business layer. For the sake of simplicity this post focuses on Microsoft Entity Framework and O/R-Mappers in general. Let me know if you are interested in how to handle many to many relations in native ADO.NET.

This is the third post of a blog series about designing domain objects in a business layer and the second part that focuses on transforming table structures into object structures. The first post "Tables Are No Domain Objects" gave an introduction to this series and showed some reasons why it can make sense to abstract O/R-Mappers away from the layers above the data access layer (DAL). In the second part "Tables Are No Domain Objects: Table Relation Transformations Part 1" we discussed foreign key fields, aggregations and status objects.

Basics

A many to many relation is given when two objects are related to each other and each of them can reference more than one object (row in the database) on the other side.

An example for a many to many relation is the relation between articles and their categories. Each category can be related to many articles, like a category "food" that references apples, pies and meat. On the other hand, an article "apple" can be categorized as "food" and "healthy".

In an object model a many to many relation is exposed by two domain objects where each contains a collection of objects of the other type. Since databases don't provide complex column types like lists, many to many relations are realized by putting an intermediate link table between the two tables.

Simple Many To Many Relations

A simple many to many relation is given whenever the link table consists of nothing but the foreign keys which point to the rows of the two tables to be related to each other.
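
To make the structure concrete, here is a minimal T-SQL sketch of what such a link table could look like (the table and column names are assumptions matching the classes below):

CREATE TABLE ArticleCategory (
   ArticleId INT NOT NULL REFERENCES Article (Id)
   ,CategoryId INT NOT NULL REFERENCES Category (Id)
   ,PRIMARY KEY CLUSTERED (ArticleId, CategoryId)
);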


When working with native ADO.NET, many to many relations are always a bit tricky but can certainly be handled. I'll focus on O/R-Mappers for now.

When working with a common O/R-Mapper, simple many to many relations are usually transformed automatically by the mapper. The link table stays hidden inside of the mapper and each of our two objects can provide a list of objects of the other type.

public partial class Article {
   public IList<Category> Categories { get; set; }
}

public partial class Category {
   public IList<Article> Articles { get; set; }
}
The ORM knows all values to be inserted into our link table and doesn't need to annoy clients of our business layer with this table. If you are at the beginning of a project and your O/R-Mapper does not support simple many to many relations, I'd suggest considering another mapper.

Complex Many To Many Relations

A complex many to many relation is given when the link table contains additional columns beyond the foreign keys of our domain objects' base tables.
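
As a sketch (again with assumed names), the same link table now carries an extra column:

CREATE TABLE ArticleCategory (
   ArticleId INT NOT NULL REFERENCES Article (Id)
   ,CategoryId INT NOT NULL REFERENCES Category (Id)
   ,CreationDate DATETIME NOT NULL   -- additional column beyond the two foreign keys
   ,PRIMARY KEY CLUSTERED (ArticleId, CategoryId)
);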


With this link table an O/R-Mapper like Entity Framework runs into trouble. It is unable to fill our creation date column without an intermediate domain object that does nothing but hold the additional column. Our two domain objects will look like this.

public partial class Article {
   public List<ArticleCategory> ArticleCategories { get; set; }
}

public partial class Category {
   public List<ArticleCategory> ArticleCategories { get; set; }
}
This might be fine for EF, but usually that's not how we want to work with our objects in the main part of our system. Often columns like a creation date are only used for support or reporting purposes and we don't want to think about the odd ArticleCategory object when adding new operational features.

Without some refining of our domain objects we will be forced to implement every access to an article's categories like this.
Article article = GetArticle();
var categories = from acl in article.ArticleCategories
                 select acl.Category;

// process the article and its categories
It is not only unnatural to always have to go through the intermediate object to get what we are really looking for, it also causes a tight coupling between our domain objects and the underlying database table structure. The worst case would be if we started with a simple many to many relation between articles and categories and a new requirement later introduced the creation date column - and the resulting ArticleCategory object. Without some architectural effort we might have to refactor large parts of our existing source code. Luckily, there are a few things we can do.

The easiest way to hide the relation object is to define the ArticleCategories property as private and provide a few methods that let us work directly with the referenced entities.
public partial class Article {
   public IEnumerable<Category> GetCategories() {
      return ArticleCategories.Select(acl => acl.Category);
   }

   public void AddCategory(Category category) {
      ArticleCategory categoryLink = new ArticleCategory();
      categoryLink.CreationDate = DateTime.Now;
      categoryLink.Article = this;
      categoryLink.Category = category;
      ArticleCategories.Add(categoryLink);
   }

   public void RemoveCategory(Category category) {
      var categoryLink = ArticleCategories.Where(
                           item => category.Equals(item.Category)).FirstOrDefault();
      if (categoryLink != null)
         ArticleCategories.Remove(categoryLink);
   }
}
// =========================================
// sample usage
Article article = GetArticle();

var categories = article.GetCategories();
// process categories

article.AddCategory(GetCategory());

Apart from the fact that we provide a more natural access to our categories, this also results in an architecture that is more robust against possible future changes - like additional fields in our link table.

If we want to go one step further we can provide an even more sophisticated interface to access our (indirectly) referenced domain objects. Unfortunately we cannot use a simple List<T> and copy all categories into it, because our ArticleCategories list would not be affected by any add/remove calls. This also makes it impossible to use a simple LINQ query that transforms the ArticleCategory objects into categories.

However, what we can do is implement a custom IList<T> that transforms a list of objects of one type into objects of another type by utilizing a provided delegate. In our case we need to transform a list of ArticleCategory objects into categories.

The following snippet shows how such a list could work.

public class TransformationList<T, TResult> : IList<TResult> {
   private IList<T> _list;
   private Func<T, TResult> _transform;
   private Func<TResult, T> _factory;

   // Constructor that creates a read-only version of the list
   public TransformationList(IList<T> list, 
                             Func<T, TResult> transformation)
      : this(list, transformation, null) {
   }
   // Constructor that creates a writable version of the list
   public TransformationList(IList<T> list, 
                             Func<T, TResult> transformation, 
                             Func<TResult, T> factory) {
      _list = list;
      _transform = transformation;
      _factory = factory;
   }

   // Indexer access
   public TResult this[int index] {
      get { return _transform(_list[index]); }
      set {
         EnsureWritable();
         _list[index] = _factory(value);
      }
   }

   // Count property works like a proxy
   public int Count { get { return _list.Count; } }

   // The list is read-only if no factory method was provided
   public bool IsReadOnly { get { return _factory == null; } }

   // Ensures that the list is writable and uses the factory method to create a new item
   public void Insert(int index, TResult item) {
      EnsureWritable();
      _list.Insert(index, _factory(item));
   }

   // Read-only method uses the transformation method
   public bool Contains(TResult item) {
      return _list.Where(i => item.Equals(_transform(i))).Any();
   }

   // ensure that the list is writable
   private void EnsureWritable() {
      if (IsReadOnly)
         throw new InvalidOperationException("List is read only");
   }

   // and so forth...
}

The second constructor, which takes a second delegate as a factory method, makes the list writable and enables us to add new objects from outside without knowing that another, hidden object is materialized inside of our transformation list.

This (reusable!) class enables us to provide our article's categories through a nice IList<Category> property.
public partial class Article {
   private IList<Category> _categories;

   public IList<Category> Categories {
      get {
         if (_categories == null)
            _categories = 
               new TransformationList<ArticleCategory, Category>(
                     ArticleCategories, 
                     (acl) => acl.Category,
                     (c) => AddCategory(c));
         return _categories;
      }
      set { _categories = value; }
   }

   public ArticleCategory AddCategory(Category category) {
      ArticleCategory acl = new ArticleCategory();
      acl.CreationDate = DateTime.Now;
      acl.Article = this;
      acl.Category = category;
      ArticleCategories.Add(acl);
      return acl;
   }
}
// =========================================
// sample usage
Article article = GetArticle();

foreach (var category in article.Categories) {
   // process categories
}

article.Categories.Add(GetCategory());

Conclusion

Simple many to many relations are usually easy to work with, but even if an O/R-Mapper shows some weakness in its mapping features, we are still able to provide a reasonable interface to clients of our business layer and its domain objects.

Outlook

In the next part of this series we will look at version controlled data, what challenges it can cause, and ways to handle it.

Tuesday, September 6, 2011

Tables Are No Domain Objects: Table Relation Transformations Part 1

This is the second part of the blog series 'Tables Are No Domain Objects'. In this post we will discuss which database relations are good candidates for being accessed in a different manner when they are represented by domain objects in our business layer.

The most obvious kind of database table relation is a one (A) to many (B) relation where rows in table B hold a foreign key column that points to a unique key (usually the primary key) of table A. Most data access layers, whether based on an O/R-Mapper or custom mappings, do a good job of mapping this kind of database relation into objects, but there are some cases where it can make sense to transform those relations into a different structure or to provide a different kind of access than the one given by our database.

Foreign Key Fields And Transparent Database Relations

Before we step into more specific types of relations, there is one very basic thing each of us should think about when starting to design a new business layer. In a database, relations are always represented by foreign keys, but when data are loaded into an object structure we can use object references, so we don't really need those foreign key fields as part of our objects. For instance, a sales order line object does not need to hold the ID of its parent sales order; it can hold an object reference to the sales order. One good reason to keep foreign key fields present in our domain objects is to have some additional logging and debugging information. However, we should never use those fields to implement any business logic; instead, all business logic should always be implemented against the corresponding object references. (Very rare exceptions prove the rule though.)

This was already discussed in the previous blog post but should be recalled for the sake of completeness. O/R-Mappers like Entity Framework or NHibernate provide a powerful query interface to access data, but using those queries in our business layer or presentation layers will cause a tight coupling between our source code and the database structure. Apart from other issues, discussed in the other post, queries like this can become an issue if we ever need to refactor our database structure or domain objects.

var orders = from o in efContext.Orders
             where o.CustomerId == currentCustomer.Id
             select o;

foreach (var order in orders) {
   // process order
}
Instead of this it is usually much safer to provide strongly typed access methods out of our data access layer.
var orders = myDataContext.Orders.GetForCustomer(currentCustomer);

foreach (var order in orders) {
   // process order
}
Please read the previous post (Tables Are No Domain Objects Part 1) to see further issues, especially when using Entity Framework.

Aggregations

In general I'm not as restrictive as other architects who say it is always a bad solution to access related objects through a current object reference, but when it comes to aggregations it can sometimes be dangerous to do them from outside of the class that holds the objects to be aggregated.

One of the most common examples of an aggregation that we should consider encapsulating is when we have to calculate the price of a sales order based on the price of its line items.

SalesOrder order = GetOrder();
decimal orderPrice = 
   order.SalesOrderLines.Sum(line => line.ArticlePrice * line.ItemCount);
From the very beginning of a new system this could work pretty well. The problem is: what if the calculation of the sales order's price ever changes? Salespeople are creative in finding new ways to sell the company's products and usually it is only a matter of time until discount features become required. Discounts can be a special offer for specific articles or article categories, a graduated discount depending on the order's overall price, or many other types. Now we can run into trouble if we calculate a sales order's price from outside. A better solution is to put the aggregation into the sales order class.

public partial class SalesOrder {
   public decimal GetPrice() {
      return SalesOrderLines.Sum(line => line.ArticlePrice * line.ItemCount);
   }
}
For now we only encapsulated the calculation that we previously did from outside (which already avoids duplicating the logic of multiplying by the sales line's item count), but when it comes to discounting we don't need to scan our whole source code to find all places where an order's price is calculated. We only have to adapt the body of our SalesOrder.GetPrice method and the rest of the system doesn't even notice the new calculation.
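
As a hedged sketch of what such a change could look like (the 5% threshold discount is purely made up for illustration), only the method body changes while all callers stay untouched:

public partial class SalesOrder {
   public decimal GetPrice() {
      decimal basePrice =
         SalesOrderLines.Sum(line => line.ArticlePrice * line.ItemCount);
      // hypothetical graduated discount: 5% off for orders above 1000
      decimal discountRate = basePrice > 1000m ? 0.05m : 0m;
      return basePrice * (1m - discountRate);
   }
}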

(How to do money calculations with decimal will be part of a subsequent post in this series.)

Status Objects

Status objects are special kinds of domain objects that describe the current status of their parent objects. They usually exist in a collaboration between their parent domain object and a description object that describes the current status.


In addition to providing the current status of their parent object, status objects are often used to log an operational history of an object, since they are usually not deleted or updated after their first creation. Since each new status object can change the state of its parent, they often play a very important role in the processing of an object.

As an example, say we have a parent SalesOrder domain object that can hold a list of SalesOrderStatus objects where each status is described by a referenced SalesOrderStatusDescription object. Now what if we want to know whether an order's current status is "Closed"? Without some design effort we would have to do something like this.
public partial class SalesOrderStatusDescription {
   // status description code constants
   public const string ClosedCode = "Closed";
   // ...
}
// =========================================
// sample usage
SalesOrder order = GetOrder();

bool isClosed = (from status in order.SalesOrderStatus
                 where status.OrderId == order.Id
                 orderby status.CreationDate descending
                 select status.SalesOrderStatusDescription.Code)
                 .First()
                 == SalesOrderStatusDescription.ClosedCode;
Apart from the fact that this causes a tight coupling between three different domain objects and their base tables, it would be crap if we always had to do this in upper layers just to get an object's current state.

The first thing we can do to prettify this is to introduce an enum that either represents the possible codes of our SalesOrderStatusDescription objects or the possible foreign key values pointing to the SalesOrderStatusDescription primary keys. Since we would always need to load the descriptions to parse the code fields, we will go with the foreign key solution, which causes a lower database utilization. Yes, I know we should try to never base any functionality on foreign key values, but I tend to see this as one of the valid exceptions. Our description IDs are usually immutable and it does not make a big difference whether our source code is coupled to the Code column of the status description or to its primary key.

public enum SalesOrderStatusCode : int {
   Created = 1,
   Approved = 2,
   Delivered = 3,
   Payed = 4,
   Closed = 5,
}
The next step is to add a new property to our sales order status that represents the value of the enum. Unfortunately Entity Framework does not provide native support for enums, so we need to work around this by casting the foreign key's value.
public partial class SalesOrderStatus {
   public SalesOrderStatusCode Code {
      get { return (SalesOrderStatusCode)SalesOrderStatusDescriptionId; }
      set { SalesOrderStatusDescriptionId = (int)value; }
   }
}
Okay, now we are able to shorten the previous snippet a little bit, but without one more method we would still need to traverse the list of all existing status objects whenever we want to know the current one. Since the current status of an object is usually a widely needed piece of information, we should add a method to our sales order that encapsulates the traversal and returns the code of the current status.

public partial class SalesOrder {
   public SalesOrderStatusCode GetCurrentStatusCode() {
      return (from status in SalesOrderStatus
              orderby status.CreationDate descending
              select status.Code)
              .First();
   }
}
// =========================================
// sample usage
SalesOrder order = GetOrder();
bool isClosed = order.GetCurrentStatusCode() == SalesOrderStatusCode.Closed;
This interface is much niftier and will make life much easier in client code.

As an optional last step we could add an IsClosed method to our order. I tend to do this only for the most important states of an object though.
public partial class SalesOrder {
   public bool IsClosed() {
      return GetCurrentStatusCode() == SalesOrderStatusCode.Closed;
   }
}
// =========================================
// sample usage
SalesOrder order = GetOrder();
bool isClosed = order.IsClosed();
Now our sales order provides a really handy interface that helps us to concentrate on other things when implementing features that need to work with the sales order status.

Last but not least, we should add a corresponding method to set a new status for an order.
public partial class SalesOrder {
   public void SetStatus(SalesOrderStatusCode code) {
      SalesOrderStatus status = new SalesOrderStatus();
      status.CreationDate = DateTime.Now;
      status.Code = code;
      status.SalesOrder = this;
      SalesOrderStatus.Add(status);
   }
}
There is one line in this method that could cause problems. Setting the CreationDate using the local host's time is only safe if we are sure that all client PCs are configured with the same time server; otherwise we can get deviations in the creation dates of new status objects. Since the creation date is essential for these objects, this could cause issues in production. One thing we can do is to use the time from a central server, like the database server, instead of trusting the clients.
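
A minimal sketch of reading the time from the database server instead (assuming we accept one extra round trip and have a connection string at hand):

private static DateTime GetServerTime(string connectionString) {
   using (var cn = new SqlConnection(connectionString))
   using (var cmd = new SqlCommand("SELECT GETDATE();", cn)) {
      cn.Open();
      // use the database server's clock instead of the client's
      return (DateTime)cmd.ExecuteScalar();
   }
}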

Since there are usually many more places where we need to know the current status of an object than places where the status is changed, I tend to add fewer strongly typed set methods like SetPayed().

As we have seen, due to their importance and the complicated access, status objects are usually good candidates to be handled in a very different way than they are stored in our database, and some architectural effort to get them into a more fashionable, object-oriented structure can be a good investment.

Performance tuning: since this series concentrates on designing our domain objects I kept this until now, but our current solution requires us to always retrieve all existing status objects from the database server, which causes unneeded network traffic and database utilization. We should consider adding a method to our data access layer that loads only the current status, description ID or description code, instead of loading all status objects if they are not loaded anyway. This is another important reason why we should encapsulate the get method, since we then only need to change one place.
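
A hedged sketch of such a data access method, following the repository style from the first post of this series (the entity set and property names are assumptions):

public SalesOrderStatusCode GetCurrentStatusCode(SalesOrder order) {
   // load only the latest status description id instead of all status rows
   int descriptionId =
      (from status in _efContainer.SalesOrderStatus
       where status.SalesOrder.Id == order.Id
       orderby status.CreationDate descending
       select status.SalesOrderStatusDescriptionId)
       .First();
   return (SalesOrderStatusCode)descriptionId;
}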

Outlook

In the next post we will continue the discussion of table relation transformations.

We will have a look at many to many relations, where we might need to handle some weaknesses of O/R-Mappers.

As the last part of the discussion about table relation transformations we will have a look at versioned data.

Thursday, September 1, 2011

Tables Are No Domain Objects Part 1

This is the first part of a blog series where I want to discuss ways of how to design rich domain objects provided by the Business Logic Layer of a software system.

In most business applications we have to work with data stored in a database or provided by web services. For the sake of simplicity, we will focus on database-backed storage. However, when working with a skinny data web service layer that provides its data almost (or completely) in a database table style, some of the things we'll discuss here might still apply.

Today we are armed with shiny O/R-Mappers like NHibernate, Active Record or Microsoft Entity Framework. Some of them even provide a nice wizard that only needs a database connection to create a full set of domain objects. It seems like we can immediately start to create our first front end dialog/view/page and use those classes.

Cool, this "software architecture" and layering, people are talking about, is a wizard. Now we've got it. Thanks for participating the discussion, we are finished. Or, probably not... ;-)

Sounds funny? Unfortunately quite a few applications are based on this kind of "software architecture" and "business layer". For smaller projects this might even work, but if a system becomes more complex, say with some thousands of lines of code (how about millions of lines?), this approach can quickly become a dead end. A table-like domain object is often not perfectly structured for implementing our business logic or for displaying data in a rich presentation layer.

Sure, we cannot cover everything that might need to be considered when designing domain objects and structuring a business layer in a blog series, so we will concentrate on a few topics that I find most important and discuss some examples for each of them. At the current stage of the software industry I don't even think that anybody is at a point where (s)he knows everything about architecting a business layer - or any other layer. If we look at the last 10 years of the software industry we see way too many changes of mind. Every new pattern, technology or practice preaches itself to be the golden calf. I'd say, as long as there is not a full decade without big changes in architectural and technical approaches, we will not know that we have reached the end - and I guess we are far away from the beginning of this decade.

With this knowledge about imperfection, we should still always do our best to design a powerful and usable business layer. The business layer is the heart of every application. It should encapsulate as much business logic as possible to avoid duplication of source code. After this encapsulation of logic, the main task becomes providing this logic through an easy to use interface for those (us?) who will consume it in a front end application like a Windows UI or web UI, but also in web or Windows services.

Here is a short overview of the topics we will cover in this series.

Before we dig deeper into the design of domain objects, we will start in this post with a look at O/R-Mappers, if one is actually used, and reasons why we should consider keeping them inaccessible to layers other than our Data Access Layer.

In a following post we will talk about Table Relations and Class References. We will look at database relations like aggregates or status objects that might be candidates for being represented in a different manner when they are loaded into a domain object hierarchy.

When talking about Field Aggregations I will try to show groups of database columns that can make sense to arrange in referenced helper classes or structs instead of keeping them directly in our entities.

We will discuss Data Type Transformations and see some cases where it can be helpful when data received from a database are transformed into a different type, or where one database column contains structural information.

In the last part of this series we will talk about Reusable Entities. There are some types of domain objects that appear in many projects. Some of them are good candidates to be implemented in base libraries and reused in other projects; other frequently appearing entity types are harder or (almost) impossible to reuse.

How to Safely Use an O/R-Mapper

I know, this topic is not directly related to designing domain objects. However, since an O/R-Mapper is often used as the central component to access an underlying database, this topic makes much sense to me as a starting point.

We should always consider abstracting the native instance of the O/R-Mapper and I will try to show a few reasons here. The first, when talking about designing domain objects, is that we might run into architectural restrictions caused by restrictions of the ORM. If we directly use the mapper class throughout our whole system, we might not be able to design our objects as we would like to.

Apart from the architectural reasons we should look a little closer at some other reasons why we should avoid a tight coupling to the O/R-Mapper.

(Examples and issues here are based on Microsoft Entity Framework. Some of them will also apply to other ORMs and others might not.)

Let's do a jump start with a very simple domain model that contains only one entity SalesOrder.


Here is a very simple example of using Microsoft Entity Framework with this domain model that already contains a few pitfalls.
using (var ctx = new Model1Container()) {
   DateTime since = DateTime.Today.AddMonths(-1);
   var orders = from o in ctx.SalesOrders
                where o.CreationDate >= since
                select o;

   foreach (var order in orders) {
      // set approved
   }

   // do something else

   foreach (var order in orders) {
      // create invoices
   }
}

The worst issue in this piece of code is one that is especially related to the architecture of Entity Framework. The LINQ query used to get sales orders does not actually return a real result of objects; instead it only returns an instance of an EF implementation of IQueryable<SalesOrder>. This IQueryable can be seen as an object version of a SQL statement: whenever we start to traverse the result we fire another database query. This causes unnecessary database load and (worse) can cause different results for each loop. If a new order is created between the first and the second foreach-loop, we will get this order in our second loop and the order becomes invoiced before it was approved. A simple solution is to put the LINQ query into parentheses and call the IEnumerable<T>.ToList() extension method. This copies the results into a List<T> and all subsequent list operations work on the in-memory copy. The problem is, if we ever forget the call to ToList() we will run into this trap again.
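
Applied to the snippet above, the fix is just the parentheses and the ToList() call:

// materialize the query once; both loops now work on the same in-memory list
var orders = (from o in ctx.SalesOrders
              where o.CreationDate >= since
              select o).ToList();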

A much safer way to use an O/R-Mapper is to create our own data storage class that wraps the EF container and provides custom methods for the needed data. At this point we stand at a crossroads where we have to decide between implementing a Gateway (Martin Fowler, Patterns of Enterprise Application Architecture), which provides all data access methods directly from our data storage class, and implementing Repositories (Martin Fowler, PoEAA) for each of our domain object types. When working on smaller projects a Gateway is usually the better solution because it is easier to implement. The drawback of this pattern is that a Gateway class can become large when there are dozens or hundreds of access methods. The Repository approach needs more effort to set up, since we need one repository for each type of domain object, but it keeps classes better organized. Due to our large domain model I decided to use a Repository.
// Custom data store class that wraps the EF container
public class DataStore : IDisposable {
   private Model1Container _efContainer;
   private SalesOrderRepository _salesOrders;
      
   public DataStore() {
      _efContainer = new Model1Container();
   }

   public SalesOrderRepository SalesOrders {
      get { return _salesOrders 
                   ?? (_salesOrders = new SalesOrderRepository(_efContainer)); }
   }

   public void Dispose() {
      _efContainer.Dispose();
   }
}
// ========================================================================
// custom repository to provide access to sales orders
class SalesOrderRepository {
   Model1Container _efContainer;

   internal SalesOrderRepository(Model1Container efContainer) {
      _efContainer = efContainer;
   }

   public IEnumerable<SalesOrder> GetCreatedSince(DateTime date) {
      return (from o in _efContainer.SalesOrders
               where o.CreationDate >= date
               select o).ToList();
   }
}
// ========================================================================
// ========================================================================
// sample usage
using (var data = new DataStore()) {
   DateTime since = DateTime.Today.AddMonths(-1);
   var orders = data.SalesOrders.GetCreatedSince(since);

   foreach (var order in orders) {
      // set approved
   }

   // will always be the same results
   foreach (var order in orders) {
      // create invoices
   }
}

Now we need to remember only once that we have to add this odd ToList() call to avoid the previously described issues. Unfortunately this was only one, EF related, issue of directly accessing an O/R-Mapper from higher layers.

Another issue is the SalesOrder.IsCanceled property. In most parts of our software we might not want or need to work with canceled orders. Similar fields in other objects could be IsActive, IsDeleted and so forth. By encapsulating our O/R-Mapper we only need to change a few access methods to eliminate unwanted access to canceled orders. Sure, there are still parts of the system that might want to work with those orders, but we should consider providing them through more descriptive methods like GetCreatedWithCanceledSince or simply GetCanceledSince, depending on what is required.
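
A hedged sketch of how the repository from above could hide canceled orders by default (the method names are suggestions, not a fixed API):

public IEnumerable<SalesOrder> GetCreatedSince(DateTime date) {
   // canceled orders are filtered out by default
   return (from o in _efContainer.SalesOrders
           where o.CreationDate >= date && !o.IsCanceled
           select o).ToList();
}

public IEnumerable<SalesOrder> GetCreatedWithCanceledSince(DateTime date) {
   // explicit method for the few places that also need canceled orders
   return (from o in _efContainer.SalesOrders
           where o.CreationDate >= date
           select o).ToList();
}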

Non-abstracted O/R-Mapper access usually also means a tight coupling between our database and our source code. If we ever need to change the structure of tables we can only hope that our ORM is able to handle the new structure without changing the domain model and, if it is able to handle the new mapping, we can hope that this will still give us the advantage that we aimed for with our database changes. There are database changes that will most likely make it impossible for most O/R-Mappers to keep handling them. Say we need to introduce some EAV/CR tables to move some rarely needed columns out of our main tables. When using a custom data store class we can relatively easily adopt an internal hybrid approach of keeping the ORM for some mappings while doing others with native ADO.NET features.

What if not all the data we need to work with come from our own database? I've just seen a thread in the MSDN Software Architecture Forums, "Centralizing Duplicate Functionality and Data", where employee data had to be taken from one single source for all software systems of the company. Most O/R-Mappers support only one database, but we can hold several different instances of them when working with a wrapping data storage class. Even if some of the data we work with are not provided by a database, say by a web service, we can still provide one homogeneous source of our data to the other layers.

As a last reason for now why to abstract an O/R-Mapper, we should keep the possibility of horizontal scaling in mind. With one database server we have only one option to get better performance from it: buying a bigger box. Though, there is a point where vertical scaling reaches its end. Most software systems consist of, say, 80% read and only 20% write operations. With a strong data access layer we are able to set up several read-only database servers to handle all the read operations while all write operations are directed to the master server.
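
A rough, hedged sketch of how the wrapping data store could support this; it assumes the generated container accepts a connection string and that replication keeps the read-only server up to date:

public class DataStore : IDisposable {
   private readonly Model1Container _writeContainer;
   private readonly Model1Container _readContainer;

   public DataStore(string masterConnectionString, string readOnlyConnectionString) {
      // writes always go to the master, reads may be served by a replica;
      // repositories can then be handed the container that fits their purpose
      _writeContainer = new Model1Container(masterConnectionString);
      _readContainer = new Model1Container(readOnlyConnectionString);
   }

   public void Dispose() {
      _writeContainer.Dispose();
      _readContainer.Dispose();
   }
}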

There are still more reasons why we should consider keeping our O/R-Mapper inaccessible to the main parts of the system, but this might become part of another blog entry.

Upcoming Posts

I will add a link to all related posts here.

Monday, August 29, 2011

Streaming SQL Server VARBINARY(MAX) In Chunks With ADO.NET

A few years ago it was unthinkable to store BLOB data like large documents directly inside of a database. The most common way to store those large objects in combination with a database was to save the data to the file system and just store the UNC path in our database. Today, storing BLOB data directly inside of a database has become a common requirement in many projects. This way of saving binary data brings up some advantages and drawbacks to think about.

Probably the most important advantage is security. If we save BLOB data in the file system, we not only have to manage security on the database level but also have to restrict access to the file system to avoid unauthorized access. Another advantage is integrity. A UNC file path stored in a database easily becomes invalid if someone deletes the file. This can happen through malware attacks but also through bugs in software that changes one side but not the other. A file name doesn't provide any way to enforce referential integrity. Sometimes it is tricky to keep things synchronized, especially when working with transactions.

On the other hand there is one drawback in particular when storing BLOBs inside of the database. If data are really large, say larger than 10MB or 100MB, it becomes tricky to save and read them without allocating huge amounts of memory on the server and client side. Where smaller data can easily be pushed or pulled in one chunk, we don't want the client or server to handle 100MB as one big bunch of memory.

Missing Class in ADO.NET Framework?

On SQL Server 2008, when working with VARBINARY(MAX) based on FILESTREAM, we can use the ADO.NET SqlFileStream class to stream data from and to SQL Server. Unfortunately this class does not work if data are stored inside of a VARBINARY(MAX) column, and not everyone wants to activate FILESTREAM - among other reasons because of the integrity problem.

SqlBinaryData Class

To be able to stream data stored in a VARBINARY(MAX) column, I wrote a helper class called SqlBinaryData that provides a simple but powerful interface.
public class SqlBinaryData : IBinaryData {
   //...
   
   /// <summary>
   /// Creates a new readable instance of a <see cref="Stream"/> for the BLOB column
   /// </summary>
   /// <returns>A readable stream.</returns>
   public Stream OpenRead() {
      return new SqlBinaryReader( // ...
   }

   /// <summary>
   /// Creates a new writable instance of a <see cref="Stream"/> for the BLOB column
   /// </summary>
   /// <returns>A writable stream.</returns>
   public Stream OpenWrite(bool append) {
      return new SqlBinaryWriter( // ...
   }

   // ...
}

Since the class works with four additional internal classes (about 700 lines of code) I will only focus on the most important parts here. You can find a ZIP archive containing the whole source code and my unit tests at the end of this post.

The OpenRead method creates an instance of an internal SqlBinaryReader class that implements a readable version of a System.IO.Stream and returns it to the consumer. The OpenWrite method creates an instance of an internal SqlBinaryWriter class that implements a writable version of System.IO.Stream and returns it. The other two classes, SqlBinaryInfo and SqlBinaryMetaData, are primarily used to provide database connections, handle possible SQL injections and provide metadata.

Metadata Evaluation And Caching

Metadata, like primary key column and table names, are retrieved from SQL Server when first needed. The first time this data is needed is when a stream is read or written. Until then no database connection is opened, to avoid unnecessary resource allocation and database utilization.

All metadata are cached for the current application domain. If an instance of SqlBinaryData needs database metadata that have already been retrieved for the same table, binary column and connection string, this information is reused from an internal cache. If metadata for an unknown binary column are requested from two threads, it is possible that one of the threads becomes suspended until the metadata are loaded by the other thread. This will likely not cause any problems but should be known.

If a database schema might change over the lifetime of a process and the primary key column is changed by name or type, it is possible to clear the internal cache by calling the static method SqlBinaryData.ClearMetaDataCache(). A reading or writing stream that is already active while the cache is cleared will keep working with its previously requested metadata, which might cause an error; any newly created stream, even from an already existing SqlBinaryData instance, causes a reload of metadata from the database.

Metadata are retrieved by utilizing the SQL Server INFORMATION_SCHEMA views, so they should be accessible with most common database user rights.

Table and schema name are taken, and thereby validated, from the INFORMATION_SCHEMA.TABLES view.
-- If table schema is specified
SELECT QUOTENAME(TABLE_NAME), QUOTENAME(TABLE_SCHEMA), TABLE_SCHEMA
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = @tableName
   AND TABLE_SCHEMA = @schemaName

-- If table schema is not specified
SELECT QUOTENAME(TABLE_NAME), QUOTENAME(TABLE_SCHEMA), TABLE_SCHEMA
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = @tableName
   AND TABLE_SCHEMA = OBJECT_SCHEMA_NAME(OBJECT_ID(@tableName))
Primary key column information is taken from INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE view.
SELECT QUOTENAME(COLUMN_NAME)
FROM INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE
WHERE TABLE_NAME = @tableName AND TABLE_SCHEMA = @schemaName
AND OBJECTPROPERTY(OBJECT_ID(QUOTENAME(CONSTRAINT_SCHEMA)
         + '.' + QUOTENAME(CONSTRAINT_NAME))
      ,'IsPrimaryKey') = 1
If no primary key is found an exception will be thrown. If the primary key consists of more than one column, an exception will be thrown as well, since this version of the SqlBinaryData class does not support composite primary keys.

Binary column information is taken from INFORMATION_SCHEMA.COLUMNS view.
SELECT QUOTENAME(COLUMN_NAME), DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = @tableName
   AND TABLE_SCHEMA = @schemaName
   AND COLUMN_NAME = @columnName
If the returned column is not of type VARBINARY(MAX) an exception will be thrown.

SQL Injection

To avoid the possibility of SQL injection, all provided table and column names are quoted by utilizing the SQL Server QUOTENAME function, as shown above. All values, like the value of the primary key to stream binary data from or to, are passed to SQL Server as SqlParameter objects to avoid SQL injection and let SQL Server reuse cached execution plans.

SqlBinaryReader Class

This internal class is responsible for streaming read access to the BLOB. As already noted, it implements a System.IO.Stream to provide a well known standard behavior.
class SqlBinaryReader : Stream { //...
The most important public method is the Read method, which copies data into a specified array:
public override int Read(byte[] buffer, int offset, int count) {
   // first read
   if (_buffer == null)
      ReadChunk();

   int done = 0;

   while (count != done) {
      // read buffered data into provided buffer
      done += ReadInternal(buffer, offset + done, count - done);
      // end of DB data reached
      if (_buffer.Length < _info.BufferSize)
         break;
      // read next chunk from database if needed
      if (done < count)
         ReadChunk();
   }

   return done;
}
The called method ReadInternal gets data out of the internal buffer into the specified one. The ReadChunk method handles the database access and is shown here:
private void ReadChunk() {
   SqlBinaryMetaData metaData = GetMetaData();
   // create an internal database connection if not yet available
   if (_cn == null)
      _cn = _info.CreateConnection(GetMetaData().ConnectionString);

   // create an internal data reader
   if (_reader == null) {
      string sql =
         string.Format("SELECT {0} FROM {1} WHERE {2} = @pk",
                        metaData.BinaryColumn,
                        metaData.QualifiedTableName,
                        metaData.PkColumn);
      using (var cmd = new SqlCommand(sql, _cn)) {
         cmd.Parameters.Add(_info.CreatePkParam());
         // open the reader with sequential access behavior to enable 
         // streaming data from database
         _reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess);
         _reader.Read();
      }
   }

   int read = (int)_reader.GetBytes(0, _position, _buffer, 0, _buffer.Length);
   if (read != _buffer.Length)
      Array.Resize(ref _buffer, read);
   _offset = 0;
}
The internal instance of a SqlDataReader is created with the option CommandBehavior.SequentialAccess, which causes ADO.NET to stream data out of SQL Server instead of pulling all data in one chunk. The reader and its parent connection are cleared from memory when the stream instance is disposed.

Data will always be streamed with the specified buffer size, even if the caller requests a larger count of bytes in one call. This is to avoid too high a memory utilization on the server side. So it is still possible to get 10MB of data in one single call of the Read method, even if this might cause 10 database calls if the buffer size is specified as 1MB.

SqlBinaryWriter Class

This internal class is responsible for streaming write access to a VARBINARY(MAX) column of a table's row. Like the SqlBinaryReader, it implements a System.IO.Stream.
class SqlBinaryWriter : Stream {
The two important public methods of the write stream are the Write and Flush methods.

The Write method streams the provided data in chunks into the internal buffer. Whenever the internal buffer is full, it calls the Flush method, which is responsible for writing the current chunk of data to the server.
public override void Write(byte[] buffer, int offset, int count) {
   if (_failedState)
      throw new InvalidOperationException("Stream is in a failed state");

   int done = 0;
   while (done != count) {
      int chunk = Math.Min(_internalBuffer.Length - _internalOffset, count - done);
      // push a chunk of bytes into the internal buffer
      Array.Copy(buffer, offset + done, _internalBuffer, _internalOffset, chunk);
      _internalOffset += chunk;
      // if internal buffer is full, flush to database
      if (_internalOffset == _internalBuffer.Length)
         Flush();
      done += chunk;
   }
}
To avoid unneeded memory copying on the server side, new data are appended to the database field by using the VARBINARY(MAX).WRITE method (see UPDATE (Transact-SQL)), which performs a partial update of existing data instead of a complete reallocation of all data.

public override void Flush() {
   if (_internalOffset == 0)
      return;
   if (_failedState)
      return;

   SqlBinaryMetaData metaData = GetMetaData();

   using (SqlConnection cn = _info.CreateConnection(metaData.ConnectionString))
   using (var tran = cn.BeginTransaction()) {
      try {
         // handle NULL value and "append" configuration
         PrepareValue(cn, tran);
         // UPDATE SchemaName.TableName 
         // SET BinaryColumn.WRITE(@buffer, @offset, @count) 
         // WHERE PkColumn = @pk
         string sql =
            string.Format("UPDATE {0} SET {1}.WRITE(@buffer, @offset, @count) WHERE {2} = @pk",
                          metaData.QualifiedTableName,
                          metaData.BinaryColumn,
                          metaData.PkColumn);
         using (var cmd = new SqlCommand(sql, cn)) {
            cmd.Transaction = tran;

            var bufferParam = cmd.Parameters.Add("@buffer", SqlDbType.VarBinary, _info.BufferSize);
            var offsetParam = cmd.Parameters.Add("@offset", SqlDbType.Int);
            var countParam = cmd.Parameters.Add("@count", SqlDbType.Int);
            cmd.Parameters.Add(_info.CreatePkParam());

            byte[] buffer;
            if (_internalOffset == _internalBuffer.Length)
               buffer = _internalBuffer;
            else {
               // avoid bumping not needed data over network
               buffer = new byte[_internalOffset];
               Array.Copy(_internalBuffer, buffer, _internalOffset);
            }

            bufferParam.Value = buffer;
            // VARBINARY(MAX).WRITE works with a zero based index
            offsetParam.Value = _position;
            countParam.Value = _internalOffset;
            // write chunk
            int affected = cmd.ExecuteNonQuery();
            _info.AssertOneRowAffected(affected);
            _position += _internalOffset;
            _internalOffset = 0;
         }
         tran.Commit();
      }
      catch {
         _failedState = true;
         tran.Rollback();
         throw;
      }
   }
}
The PrepareValue method, called from Flush, handles the preparation of the database field for the first call. If the stream was created with the "append" option set to false, it resets any existing data to the initial value "0x". If the stream was created with "append" set to true, it determines the current length of the data in the binary field or sets it to "0x" if its value is currently NULL.
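
Roughly sketched (not the exact implementation), the statements issued by PrepareValue could look like this:

-- append = false: reset any existing data to the initial value 0x
UPDATE SchemaName.TableName SET BinaryColumn = 0x WHERE PkColumn = @pk;

-- append = true: initialize NULL values and determine the current length
UPDATE SchemaName.TableName SET BinaryColumn = 0x
WHERE PkColumn = @pk AND BinaryColumn IS NULL;
SELECT DATALENGTH(BinaryColumn) FROM SchemaName.TableName WHERE PkColumn = @pk;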

How To Use

Here is an example that shows how to use the class for streaming write and read of data.
SqlBinaryData data = 
   SqlBinaryData.CreateLongPrimaryKey(ConnectionString, "TestBlob", "Data", 3L, 5);

byte[] expected = Guid.NewGuid().ToByteArray();
using (var writer = data.OpenWrite(false)) {
   writer.Write(expected, 0, 4);
   writer.Write(expected, 4, expected.Length - 4);
}

byte[] actual = new byte[expected.Length];
using (var reader = data.OpenRead()) {
   reader.Read(actual, 0, 3);
   reader.Read(actual, 3, actual.Length - 3);
}

Assert.IsTrue(expected.SequenceEqual(actual));

To create a new instance of the stream-providing SqlBinaryData class you can use the constructor. The parameter pkParam is used as the specification of the primary key column.
public SqlBinaryData(string connectionString, string tableName, string tableSchema,
                     string binaryColumn, SqlParameter pkParam, object pkValue, 
                     int bufferSize);
In case of an INT, BIGINT or UNIQUEIDENTIFIER primary key column, you can also use one of the static factory methods of SqlBinaryData (each one with and without specifying a table schema name):
public static SqlBinaryData CreateIntPrimaryKey(string connectionString, 
      string tableName, string tableSchema, string binaryColumn, int pkValue, int bufferSize);

public static SqlBinaryData CreateIntPrimaryKey(string connectionString, 
      string tableName, string binaryColumn, int pkValue, int bufferSize);

public static SqlBinaryData CreateLongPrimaryKey(string connectionString, 
      string tableName, string tableSchema, string binaryColumn, long pkValue, int bufferSize);

public static SqlBinaryData CreateLongPrimaryKey(string connectionString, 
      string tableName, string binaryColumn, long pkValue, int bufferSize);

public static SqlBinaryData CreateGuidPrimaryKey(string connectionString, 
      string tableName, string binaryColumn, Guid pkValue, int bufferSize);

public static SqlBinaryData CreateGuidPrimaryKey(string connectionString, 
      string tableName, string tableSchema, string binaryColumn, Guid pkValue, int bufferSize);

Transactions

Since the classes create their database connections internally, you cannot directly use a SqlTransaction to scope your DML operations. However, since the ADO.NET data provider for SQL Server supports transaction scoping, it is still possible to ensure isolated processing by using a System.Transactions.TransactionScope.
using (TransactionScope tran = new TransactionScope()) {
   using (var cn = CreateConnection()) {
      // do stuff
   }

   var binaryData = CreateBinary(10L, 16);
   using (var stream = binaryData.OpenWrite(false)) {
      byte[] toWrite = Guid.NewGuid().ToByteArray();
      stream.Write(toWrite, 0, toWrite.Length);
   }

   using (var cn = CreateConnection()) {
      // do more stuff
   }
}

Server Side Resource Allocation

Since we can specify the exact buffer size to work with and chunks are always taken in this size, we know that we are fine on the client side. The other side to look at is what resources are allocated on the server side. We will do this by imitating the class's behavior in SQL Server Management Studio and checking the results in SQL Server Profiler.

Here is the setup for the following tests.
SET NOCOUNT ON;

IF (OBJECT_ID('TestBlob') IS NOT NULL)
   DROP TABLE TestBlob;

CREATE TABLE TestBlob (
   Id INT NOT NULL PRIMARY KEY CLUSTERED
   ,Data VARBINARY(MAX)
);

INSERT INTO TestBlob VALUES (1, 0x);
INSERT INTO TestBlob VALUES (2, 0x);
GO
First, let's UPDATE a row with 50MB of binary data in one single batch.
-- one big update with 50MB of data
DECLARE @data VARBINARY(MAX) = 
   CONVERT(VARBINARY(MAX), REPLICATE(CONVERT(VARBINARY(MAX), NEWID()), 65536 * 50));

UPDATE TestBlob SET Data = @data WHERE Id = 1;
A look into SQL Server Profiler shows, as expected, a huge resource allocation.

Now, let's get the same amount of data into SQL Server by using the WRITE method and sending it in 10 chunks of 5MB. To keep the test valid we have to remember the current position to write at. To keep this information available beyond the scope of a batch we can use CONTEXT_INFO(), which allows storing up to 128 bytes of custom information for the current session.

SET CONTEXT_INFO 0x; -- clear the context
GO
-- 10 updates, each with 5MB
DECLARE @offset INT = 
   ISNULL(CONVERT(INT, SUBSTRING(CONTEXT_INFO(), 1, 4)), 0);
--PRINT @offset;
DECLARE @data VARBINARY(MAX) = 
   CONVERT(VARBINARY(MAX), REPLICATE(CONVERT(VARBINARY(MAX), NEWID()), 65536 * 5));
DECLARE @data_length INT = DATALENGTH(@data)

UPDATE TestBlob SET Data.WRITE(@data, @offset, @data_length) WHERE Id = 2;
SET @offset += @data_length;

SET CONTEXT_INFO @offset;
GO 10 -- count of batch execution loops
Another look into Profiler shows we get 10 statements with a really nice resource allocation.

Unfortunately I cannot provide a valid test result for read operations using the SqlDataReader.GetBytes method at the moment. This is caused by an issue in SQL Server Profiler when trying to trace BLOB data (see the MS Connect entry MSFT:EDW - Profiler displays "Trace Skipped Records" for large (>530Kb) batch). I don't want to show test results with too small binary data since the measuring inaccuracy might be too high. If you can tell me a way to get valid test results for this, I would be happy to hear about it.

Restrictions

The current version of this class supports only single column primary keys. An exception will be thrown if a primary key consists of more than one column. Let me know if there is a need for a version that supports composite primary keys.

It is required that the table containing the binary data has a primary key; the class doesn't work with heap tables. An exception is thrown if an accessed table does not have a primary key.

The class supports only columns of type VARBINARY(MAX). If there is a need for a VARCHAR(MAX)/NVARCHAR(MAX) based version, let me know.

The class does not handle possible concurrent client sessions, to avoid unwanted database locks. If data accessed by a current instance of a read or write stream are modified by another user, the processed data will most likely become corrupted on the client or even the server side. To secure write operations, use a TransactionScope surrounding the writing stream.

IBinaryData Interface

The SqlBinaryData class implements an interface called IBinaryData. This interface is not required for using the class, but I use it for unit testing purposes in the business layers of a project, where it allows me to work with files in a directory instead of data from a database. Feel free to remove it.

Attachments

Here you can find a ZIP archive containing all classes of SqlBinaryData and an NUnit based test class.

Tuesday, August 16, 2011

ADO.NET Connection Pooling, Internals and Issues

Description

The creation of a new database connection is a relatively expensive process on the server side. The server needs to allocate new memory, create a new session, allocate all needed client information, authenticate the client and so on. On the other side, it does not make sense to keep connections open for the whole lifetime of an application, since this causes a higher session management overhead on the server side due to many concurrent connections.

To handle these two issues, ADO.NET uses connection pooling. This means the well known SqlConnection object from the .NET Framework is just a wrapper for the real, internal database connection. Whenever a new SqlConnection is opened, it asks the connection pool for an existing, currently unused, internal connection. If an existing, free connection is available, ADO.NET won't create a new database connection but will reuse the existing one. As soon as the connection is not needed any more, it is sent back to the connection pool. If a connection is not used for a longer time, it is automatically closed by the ADO.NET connection pooler to release resources on SQL Server. To determine whether an internal connection can be used for a current request by a SqlConnection, ADO.NET compares several (not all) connection string settings. This includes authentication information, database name, transaction context enlisting configuration and so forth.
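
For completeness, pooling can be influenced through the connection string and pools can be cleared explicitly; a small sketch (the connection strings are just examples):

// pooling is on by default and can be tuned or switched off per connection string
string pooled = "server=(local);database=Sandbox;trusted_connection=sspi;Max Pool Size=50;";
string notPooled = "server=(local);database=Sandbox;trusted_connection=sspi;Pooling=false;";

using (var cn = new SqlConnection(pooled)) {
   cn.Open();
   // work with the pooled connection
}

// drop the idle internal connections of one pool or of all pools of the process
SqlConnection.ClearPool(new SqlConnection(pooled));
SqlConnection.ClearAllPools();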

This blog post focuses on ADO.NET SqlClient and SQL Server. Some of the information might be different when working with other database servers or other data providers.

Myth Close and Dispose

Many books and web resources preach that it is important to always Close and Dispose a connection to ensure that the internal connection is sent back to the connection pool. Until .NET 3.5 even MSDN stated this, but this was fixed in version 4.0. Both ways, calling the Close method or disposing the SqlConnection, actually send the connection back to the pool. Since the using statement provides a language integrated way to dispose a component as soon as it goes out of scope, I usually prefer this over calling the Close method in a try-catch-finally block.

Here's a little sample that shows that both ways cause a correct connection pooling:
string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;";
string sql = "SELECT connect_time FROM sys.dm_exec_connections WHERE session_id = @@SPID";
SqlConnection cn1 = new SqlConnection(cnStr);

// close the connection but don't dispose it
cn1.Open();
using (SqlCommand cmd = new SqlCommand(sql, cn1)) {
   DateTime loginTime = (DateTime)cmd.ExecuteScalar();
   Console.WriteLine("New connection: {0}", loginTime.ToString("HH:mm:ss.fff"));
}
cn1.Close();

// dispose but no close
Thread.Sleep(500);
using (SqlConnection cn = new SqlConnection(cnStr)) {
   cn.Open();
   using (SqlCommand cmd = new SqlCommand(sql, cn)) {
      DateTime loginTime = (DateTime)cmd.ExecuteScalar();
      Console.WriteLine("After Close only: {0}", loginTime.ToString("HH:mm:ss.fff"));
   }
}

// this connection should reuse the internal connection that the previous block released by disposing only
Thread.Sleep(500);
using (SqlConnection cn = new SqlConnection(cnStr)) {
   cn.Open();
   using (SqlCommand cmd = new SqlCommand(sql, cn)) {
      DateTime loginTime = (DateTime)cmd.ExecuteScalar();
      Console.WriteLine("After disposing only: {0}", loginTime.ToString("HH:mm:ss.fff"));
   }
}

cn1.Dispose();
Console.ReadLine();
By selecting the connect_time column from sys.dm_exec_connections for the current session id we retrieve the creation time of the internal connection from SQL Server.

As we see, in all cases the same internal connection to SQL Server is reused.

How a Pooled Connection becomes Reset

Since ADO.NET does not know what happened in the previous session of a pooled connection, it has to reset the connection before it can safely be reused, when it is returned to a new instance of a SqlConnection. When resetting a connection, all created temporary objects are dropped, all previously allocated resources are freed and the initially connected database is restored - if it was changed in the previous session. This reset is done by calling the SQL Server procedure sp_reset_connection.

This can cause some issues. The ADO.NET connection pooler does not reset the connection when it is sent back to the pool, but only when it is handed out to a newly opened SqlConnection, as shown in the next sections.

In addition, the transaction isolation level is not reset if it was changed in a previous session. If you change the transaction isolation level in some of your sessions, you will need to reset it manually whenever you get a pooled connection. At the moment this behavior is by design, due to backward compatibility (see the Microsoft Connect issue "sp_reset_connection doesn't reset isolation level"). (Thanks to Greg who suggested this!)
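
If you are affected by this, a minimal sketch of such a manual reset could look like the following (assuming the same connection string as in the other samples and READ COMMITTED as the level your system treats as its default):
string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;";

using (SqlConnection cn = new SqlConnection(cnStr)) {
   cn.Open();

   // the pooled connection may still run under the isolation level of a
   // previous session, so set it explicitly before doing any real work
   using (SqlCommand cmd = new SqlCommand(
         "SET TRANSACTION ISOLATION LEVEL READ COMMITTED", cn)) {
      cmd.ExecuteNonQuery();
   }

   // ... real work ...
}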

Temporary Tables and Named Constraints

When working with temporary tables in SQL Server there is still an issue: names of temporary tables only need to be unique within the current connection, but names of constraints need to be unique across all current connections. Since SQL Server 2005 it is possible to define constraints without specifying a name for them. Previous versions did not support this feature, and the syntax to specify a name for a table constraint is, for sure, still valid. When executing the following SQL statement in two concurrent connections you get an error message:

CREATE TABLE #t1 (
   Id INT NOT NULL CONSTRAINT PK_TEMP PRIMARY KEY CLUSTERED
)
Only the first connection is able to create the temp table; the second connection gets the following error:
Msg 2714, Level 16, State 5, Line 1
There is already an object named 'PK_TEMP' in the database.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint. See previous errors.

This also affects ADO.NET connection pooling. Say we've got two different processes: the first creates new entries in a database and the second one listens for those entries to process them. When the first process finishes its work without an explicit DROP of the temporary table, the internal connection stays open and the temporary table stays in the database until the connection is reused. If the second process tries to create the same temporary table, it receives the same exception.

Global Temporary Tables

The same problem as described for constraints of usual temporary tables applies to global temporary tables. Their names need to be unique across all concurrent SQL Server connections. If one process creates a global temp table and does not explicitly drop it before closing/disposing the SqlConnection, the table stays on the server until the current process opens another connection to this server (and database) and thereby reuses the pooled connection. This behavior can cause confusing errors in a production system.
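
A minimal sketch that shows the effect (the two different Application Name values are only used to simulate the two processes by forcing two separate pools; in reality these would be separate applications):
string cnStrA = "server=(local);database=Sandbox;trusted_connection=sspi;application name=ProcessA";
string cnStrB = "server=(local);database=Sandbox;trusted_connection=sspi;application name=ProcessB";

// "process A" creates a global temp table and does not drop it explicitly
using (SqlConnection cn = new SqlConnection(cnStrA))
using (SqlCommand cmd = new SqlCommand("CREATE TABLE ##GlobalTemp (Id INT)", cn)) {
   cn.Open();
   cmd.ExecuteNonQuery();
}
// the internal connection now sits idle in its pool and ##GlobalTemp
// still exists on the server

// "process B" uses another pool, gets its own internal connection and fails
using (SqlConnection cn = new SqlConnection(cnStrB))
using (SqlCommand cmd = new SqlCommand("CREATE TABLE ##GlobalTemp (Id INT)", cn)) {
   cn.Open();
   cmd.ExecuteNonQuery();   // throws because ##GlobalTemp already exists
}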

Release of Temporary Allocated Data

For sure, since temporary tables stay available until a connection is reused, this also implies that all data stored in those tables stay allocated until the process exits or reuses the connection.

If we have a process that creates a temporary table and fills it with hundreds of thousands of rows, those data will stay in tempdb probably longer than expected.

One way to ensure that all allocated resources and temp tables are cleaned up as soon as the connection is closed would be to deactivate connection pooling. However, we should always try to avoid this. The cleanest way to avoid all the previously described issues is to always explicitly drop temporary objects, as shown below.
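
For instance, before a connection goes back to the pool, a cleanup like the following could be executed (a sketch; cn is assumed to be the open SqlConnection and the table names are the ones used in the samples above):
// drop the temporary objects explicitly instead of relying on the pool
using (SqlCommand cmd = new SqlCommand(
      "IF OBJECT_ID('tempdb..#t1') IS NOT NULL DROP TABLE #t1; " +
      "IF OBJECT_ID('tempdb..##GlobalTemp') IS NOT NULL DROP TABLE ##GlobalTemp;", cn)) {
   cmd.ExecuteNonQuery();
}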

Pooled Connection became Invalid

If a connection becomes invalid while it is in the connection pool, it will still be returned for the next usage. This could be caused by network problems, a restart of the SQL Server service, a manual KILL of the connection on the database side or several other reasons.

Here is a little example that can be debugged to show this:
string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;";
         
using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand("SELECT @@SPID", cn)) {
   cn.Open();
   Debug.Print("SPID: {0}", cmd.ExecuteScalar());
}

// at this point, restart the service or KILL the connection on server side

using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand("SELECT @@SPID", cn)) {
   cn.Open();
   Debug.Print("SPID: {0}", cmd.ExecuteScalar());
}
When debugging this code and restarting the SQL Server service after the first connection was disposed and before the second connection is opened, you will get an exception. In my case (since I'm working with a local instance of SQL Server):
A transport-level error has occurred when sending the request to the server. (provider: Shared Memory Provider, error: 0 - No process is on the other end of the pipe.)
At the time of this post, there is no built-in support in ADO.NET for SQL Server to verify a pooled connection before it is returned to the SqlConnection. However, if occasional network issues or other problems that cause invalid internal connections happen in your environment, you can implement custom connection validation in a factory method that creates your connections.

string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;";
SqlConnection cn = new SqlConnection(cnStr);
cn.Open();

try {
   using (SqlCommand cmd = new SqlCommand("DECLARE @i INT", cn)) {
      cmd.ExecuteNonQuery();
   }
}
catch (SqlException ex) {
   if (ex.Message.StartsWith("A transport-level error has occurred ")) {
      cn = new SqlConnection(cnStr);
      cn.Open();
   }
   else
      throw;
}
return cn;
As you see, the connection sends a tiny SQL statement to the server before it is returned to the caller. If a transport-level exception occurs, the broken internal connection is removed from the pool and a new connection is created and returned. You can replace the "DECLARE @i INT" SQL statement with a call to an empty stored procedure called usp_VerifyConnection. This might be slightly faster (not tested, though), but will definitely look less confusing in SQL Server traces ;-).

Control the Connection Pool via the Connection String

The SqlConnection.ConnectionString provides several attributes that allow you to control the behavior of ADO.NET pooling.

Connection Lifetime can be used to specify a maximum lifetime for a connection in seconds. When a connection is returned to the pool, this value is compared to the (client side) creation time of the connection. If the connection's lifetime has elapsed, it is closed and not put back into the pool. Attention: this implies that if a connection is sent back into the pool before its lifetime elapsed and then not used for a long time, it will still stay open and be reused one more time!
// connection string with a lifetime of 10 seconds
string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;connection lifetime=10";
string sql = "SELECT connect_time FROM sys.dm_exec_connections WHERE session_id = @@SPID";

// create a new connection with 10 seconds lifetime and send it back to the pool
using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand(sql, cn)) {
   cn.Open();
   DateTime connectTime = (DateTime)cmd.ExecuteScalar();
   Debug.Print("Connect time: {0}", connectTime.ToString("HH:mm:ss.fff"));
}

// wait 15 seconds
Thread.Sleep(15000);
// acquire a new connection and release it
using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand(sql, cn)) {
   cn.Open();
   DateTime connectTime = (DateTime)cmd.ExecuteScalar();
   Debug.Print("Connect time: {0}", connectTime.ToString("HH:mm:ss.fff"));
}

// create another connection
using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand(sql, cn)) {
   cn.Open();
   DateTime connectTime = (DateTime)cmd.ExecuteScalar();
   Debug.Print("Connect time: {0}", connectTime.ToString("HH:mm:ss.fff"));
}
And here is the output of this test:
Connect time: 12:47:07.617
Connect time: 12:47:07.617
Connect time: 12:47:22.837
As we see, even though the specified connection lifetime has elapsed, the second connection still gets the pooled connection. As long as the expired connection has not been sent back to the pool, it stays active. Only the third connection retrieves a newly created connection, since the connection's lifetime had already elapsed when it was sent back to the pool.

The default value for Connection Lifetime is zero, which means an unlimited lifetime.

Min Pool Size and Max Pool Size can be used to specify a minimum/maximum count of pooled connections. The default value for Min Pool Size is zero, the default value of Max Pool Size is 100. When trying to use more concurrent connections than the specified Max Pool Size, you will get a timeout exception.
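
A minimal sketch of how these attributes are specified (the values 5 and 20 are arbitrary):
// connection string with a minimum of 5 and a maximum of 20 pooled connections
string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;"
             + "min pool size=5;max pool size=20";

using (SqlConnection cn = new SqlConnection(cnStr)) {
   // opening the first connection fills the pool up to Min Pool Size;
   // the 21st concurrently open connection would wait for a free pooled
   // connection and finally fail with a timeout exception
   cn.Open();
}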

The Pooling attribute (default 'true') specifies whether connection pooling shall be used for the created connection or not. If you use a SqlConnection with pooling set to 'false', the database connection is always newly created and closed after use.
// connection string with "pooling=false"
string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;pooling=false";
string sql = "SELECT connect_time FROM sys.dm_exec_connections WHERE session_id = @@SPID";

// create a new connection and print its server-side connect time
using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand(sql, cn)) {
   cn.Open();
   DateTime connectTime = (DateTime)cmd.ExecuteScalar();
   Debug.Print("Connect time: {0}", connectTime.ToString("HH:mm:ss.fff"));
}

// create a new connection and print its server-side connect time
using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand(sql, cn)) {
   cn.Open();
   DateTime connectTime = (DateTime)cmd.ExecuteScalar();
   Debug.Print("Connect time: {0}", connectTime.ToString("HH:mm:ss.fff"));
}

And here is the output:
Connect time: 13:41:32.790
Connect time: 13:41:33.050

As we see, the connect time on the server side is different for both connections, whereas it would be equal if the same internal connection had been reused for the second SqlConnection.

The Enlist attribute (default 'true') is more special than the other connection pooling related attributes of the connection string. If you don't work with distributed systems and/or the System.Transactions namespace, you won't need this attribute. The Enlist attribute is important if you need to control how to span transactions over more than one transactional sub-system (like database servers, WCF services, ...). If Enlist is set to 'true' and a parent transaction context is available in the static property System.Transactions.Transaction.Current, ADO.NET automatically registers the SQL connection and its transaction in this transaction context. If you commit your database transaction (implicitly or explicitly), the transaction stays pending until the parent transaction is committed or rolled back. Today, the most common way to manage those parent transactions is the TransactionScope class.

A hypothetical system that might use transaction scopes is an order management system that automatically creates a shipping order by calling a supplier's web-service when an order is created or approved. Here we need to update/create data in a local database and create a shipping order over a web-service within one transaction. If both systems support (and allow) transactions, a transaction scope can be used to ensure that either all data are written or none.

If Enlist is set to 'false', a new connection will not be enlisted in an existing transaction scope and behaves as if no transaction scope was available. All database transactions are committed, independent of whether the parent transaction fails or succeeds.
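
To illustrate the enlisting behavior, here is a minimal sketch using a TransactionScope (assuming a reference to System.Transactions; the ShippingOrder table and the order id are made up and the web-service call is left out):
using (TransactionScope scope = new TransactionScope()) {
   using (SqlConnection cn = new SqlConnection(
         "server=(local);database=Sandbox;trusted_connection=sspi;")) {
      // Enlist defaults to 'true', so the opened connection automatically
      // joins the ambient transaction of the TransactionScope
      cn.Open();

      using (SqlCommand cmd = new SqlCommand(
            "INSERT INTO ShippingOrder (OrderId) VALUES (@orderId)", cn)) {
         cmd.Parameters.Add("@orderId", SqlDbType.Int).Value = 123;
         cmd.ExecuteNonQuery();
      }
   }

   // ... call the supplier's web-service here ...

   // without this call everything done inside the scope is rolled back
   scope.Complete();
}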

Control the Connection Pool via Static SqlConnection Methods

In addition to the control mechanisms provided by the connection string, the SqlConnection class provides two static methods to partially or completely clear the current pools.

SqlConnection.ClearPool(SqlConnection) clears all pooled connections that match the provided connection - by its connection string.

SqlConnection.ClearAllPools clears all currently pooled SQL Server connections.
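
A minimal usage sketch of both methods:
string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;";

using (SqlConnection cn = new SqlConnection(cnStr)) {
   cn.Open();
}   // the internal connection is now idle in the pool

// remove all idle internal connections that belong to the pool of cnStr
using (SqlConnection cn = new SqlConnection(cnStr)) {
   SqlConnection.ClearPool(cn);
}

// remove all idle internal connections of all pools
SqlConnection.ClearAllPools();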

Pool Fragmentation

There are two more possible issues with ADO.NET connection pooling. Though, I find both are well described in SQL Server Connection Pooling (ADO.NET), so I just quote the paragraphs for the sake of completeness.

Pool Fragmentation Due to Integrated Security (Windows Authentication)
Connections are pooled according to the connection string plus the user identity. Therefore, if you use Basic authentication or Windows Authentication on the Web site and an integrated security login, you get one pool per user. Although this improves the performance of subsequent database requests for a single user, that user cannot take advantage of connections made by other users. It also results in at least one connection per user to the database server. This is a side effect of a particular Web application architecture that developers must weigh against security and auditing requirements.
Pool Fragmentation Due to Many Databases
Many Internet service providers host several Web sites on a single server. They may use a single database to confirm a Forms authentication login and then open a connection to a specific database for that user or group of users. The connection to the authentication database is pooled and used by everyone. However, there is a separate pool of connections to each database, which increase the number of connections to the server.

This is also a side-effect of the application design. There is a relatively simple way to avoid this side effect without compromising security when you connect to SQL Server. Instead of connecting to a separate database for each user or group, connect to the same database on the server and then execute the Transact-SQL USE statement to change to the desired database. The following code fragment demonstrates creating an initial connection to the master database and then switching to the desired database specified in the databaseName string variable.

// Assumes that command is a SqlCommand object and that
// connectionString connects to master.
command.Text = "USE DatabaseName";
using (SqlConnection connection = new SqlConnection(
  connectionString))
  {
    connection.Open();
    command.ExecuteNonQuery();
  }
Two short remarks on the second quote. I would not restrict this possible issue to web applications; this kind of pool fragmentation can also happen on application servers that work with many databases on the same database server. And I don't know why MSDN suggests using an explicit SqlCommand to change the database instead of the SqlConnection.ChangeDatabase instance method.
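
For comparison, a minimal sketch of the same switch using ChangeDatabase (connectionString and the database name are the placeholders from the quoted fragment):
// same effect as the USE statement above, without an extra SqlCommand
using (SqlConnection connection = new SqlConnection(connectionString)) {
   connection.Open();
   connection.ChangeDatabase("DatabaseName");
   // ... work with the desired database ...
}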


Wednesday, August 10, 2011

Generic C# Resource Pool

I guess this becomes another almost code-only post.

Definition

A resource pool is a class that manages synchronized access by a (theoretically) unlimited number of callers to a restricted count of resource items.



One reason to use a resource pool could be an integrated sub-system that is restricted in its maximum count of concurrently accessing threads; this could be a web-service with a restricted count of sessions. Another reason is a resource that slows down when it is accessed by too many concurrent connections; this could be a web-service, a database or a farm of processing servers. A third reason could be a resource that takes long to initialize but can be reused, like database connections, XSLT transformation classes, custom XML serializers and many other classes.
Some well known .NET Framework examples of resource pools are the ADO.NET connection pool (discussed in the previous post) and the managed thread pool.

Implementation

A few months ago I wrote myself a generic resource pool class that helps me manage those kinds of bottlenecks.

Here is my implementation of a generic, thread-safe C# resource pool. All source code files, as well as some NUnit tests, are attached as downloadable files at the bottom of this post.

// ===================================================================
// PoolItem<T>
// ===================================================================

/// <summary>
/// Represents an item in the <see cref="ResourcePool{T}"/>
/// </summary>
/// <typeparam name="T">The type of the resource to be hold</typeparam>
public sealed class PoolItem<T> : IDisposable where T : class {
   internal PoolItem(ResourcePool<T> pool, T resource) {
      _pool = pool;
      _resource = resource;
   }

   private T _resource;
   private readonly ResourcePool<T> _pool;

   public static implicit operator T(PoolItem<T> item) {
      return item.Resource;
   }

   /// <summary>
   /// Gets the resource held by this resource pool item
   /// </summary>
   public T Resource { get { return _resource; } }

   /// <summary>
   /// Disposes this instance of a resource pool item and sends the resource back
   /// to the pool
   /// </summary>
   public void Dispose() {
      _pool.SendBackToPool(_resource);
      _resource = null;
   }
}


// ===================================================================
// ResourcePool<T>
// ===================================================================

/// <summary>
/// Generic pool class
/// </summary>
/// <typeparam name="T">The type of items to be stored in the pool</typeparam>
public class ResourcePool<T> : IDisposable where T : class {
   /// <summary>
   /// Creates a new pool
   /// </summary>
   /// <param name="factory">The factory method to create new items to
   ///  be stored in the pool</param>
   public ResourcePool(Func<ResourcePool<T>, T> factory) {
      if (factory == null)
         throw new ArgumentNullException("factory");
      _factoryMethod = factory;
   }

   private readonly Func<ResourcePool<T>, T> _factoryMethod;
   private ConcurrentQueue<PoolItem<T>> _freeItems = 
      new ConcurrentQueue<PoolItem<T>>();
   private ConcurrentQueue<AutoResetEvent> _waitLocks = 
      new ConcurrentQueue<AutoResetEvent>();
   private ConcurrentDictionary<AutoResetEvent, PoolItem<T>> _syncContext =
      new ConcurrentDictionary<AutoResetEvent, PoolItem<T>>();

   public Action<T> CleanupPoolItem { get; set; }

   /// <summary>
   /// Gets the current count of items in the pool
   /// </summary>
   public int Count { get; private set; }

   public void Dispose() {
      lock (this) {
         if (Count != _freeItems.Count)
            throw new InvalidOperationException(
               "Cannot dispose the resource pool while one or more pooled "
               + "items are in use");

         foreach (var poolItem in _freeItems) {
            // use the local copy of the cleanup delegate (avoids a race
            // if CleanupPoolItem is changed while disposing)
            Action<T> cleanMethod = CleanupPoolItem;
            if (cleanMethod != null)
               cleanMethod(poolItem.Resource);
         }

         Count = 0;
         _freeItems = null;
         _waitLocks = null;
         _syncContext = null;
      }
   }

   /// <summary>
   /// Gets a free resource from the pool. If no free items are available, this
   /// method tries to create a new item. If no new item could be created, this
   /// method waits until another thread frees a resource.
   /// </summary>
   /// <returns>A resource item</returns>
   public PoolItem<T> GetItem() {
      PoolItem<T> item;

      // try to get an item
      if (!TryGetItem(out item)) {
         AutoResetEvent waitLock = null;

         lock (this) {
            // try to get an entry in exclusive mode
            if (!TryGetItem(out item)) {
               // no item available, create a wait lock and enqueue it
               waitLock = new AutoResetEvent(false);
               _waitLocks.Enqueue(waitLock);
            }
         }

         if (waitLock != null) {
            // wait until a new item is available
            waitLock.WaitOne();
            _syncContext.TryRemove(waitLock, out item);
            waitLock.Dispose();
         }
      }

      return item;
   }

   private bool TryGetItem(out PoolItem<T> item) {
      // try to get an already pooled resource
      if (_freeItems.TryDequeue(out item))
         return true;

      lock (this) {
         // try to create a new resource
         T resource = _factoryMethod(this);
         if (resource == null && Count == 0)
            throw new InvalidOperationException("Pool empty and no item created");

         if (resource != null) {
            // a new resource was created and can be returned
            Count++;
            item = new PoolItem<T>(this, resource);
         }
         else {
            // no items available to return at the moment
            item = null;
         }

         return item != null;
      }
   }

   /// <summary>
   /// Called from <see cref="PoolItem{T}"/> to free previously taken resources
   /// </summary>
   /// <param name="resource">The resource to send back into the pool.</param>
   internal void SendBackToPool(T resource) {
      lock (this) {
         PoolItem<T> item = new PoolItem<T>(this, resource);
         AutoResetEvent waitLock;

         if (_waitLocks.TryDequeue(out waitLock)) {
            _syncContext.TryAdd(waitLock, item);
            waitLock.Set();
         }
         else {
            _freeItems.Enqueue(item);
         }
      }
   }
}

As you see, the ResourcePool<T> returns PoolItem<T> objects that hold a reference to the pooled resource. The pool does not need to be initialized with any resource items; new resource items are lazily created when (and if) needed. Therefore the pool requires a factory method provided at construction. This method is called whenever a new resource is requested and no free resources are currently available. If the factory method returns null, the requesting thread is suspended until another thread releases a used resource.

Usage

The pool can be used out of the box.
class MyResource {
   public void DoSomething() { }
}

[Test]
public void SampleUsage() {
   ResourcePool<MyResource> pool = new ResourcePool<MyResource>(CreateResource);

   using (var poolItem = pool.GetItem()) {
      MyResource resource = poolItem.Resource;
      resource.DoSomething();
   }
}

private static MyResource CreateResource(ResourcePool<MyResource> pool) {
   return pool.Count < 3 ? new MyResource() : null;
}

However, if it is used at many places in your system, I'd suggest wrapping it in a custom pool class. This pool can return a wrapper around the real resources that provides a more specific interface to the consumer. The wrapper can hold an internal reference to a PoolItem<T> and implement IDisposable to return the resource to the pool when it is not needed anymore.
// ==========================================================
// The real resource
class InternalResource {
   public void DoSomething() { }
}

// ==========================================================
// The external wrapper returned to the consumer
class MyResource : IDisposable {
   private PoolItem<InternalResource> _poolItem;

   public MyResource(PoolItem<InternalResource> poolItem) {
      _poolItem = poolItem;
   }

   public void DoSomething() {
      _poolItem.Resource.DoSomething();
   }

   public void Dispose() {
      _poolItem.Dispose();
      _poolItem = null;
   }
}

// ==========================================================
// The custom pool
class MyPool {
   private ResourcePool<InternalResource> _pool;

   public MyPool() {
      _pool = new ResourcePool<InternalResource>(CreateResource);
   }

   public MyResource GetItem() {
      return new MyResource(_pool.GetItem());
   }

   private InternalResource CreateResource(ResourcePool<InternalResource> pool) {
      return pool.Count < 3 ? new InternalResource() : null;
   }
}

// ==========================================================
// Sample usage
[Test]
public void SampleUsage() {
   MyPool pool = new MyPool();

   using (MyResource resource = pool.GetItem()) {
      resource.DoSomething();
   }
}

Here are the files for download:

I hope the pool can help you solve a few of your issues, as it did for me.

Wednesday, August 3, 2011

Object reference not set to an instance of an object

We all know, and love, this meaningful message from somewhere inside of a system.
Luckily, in my last projects I was able to almost eliminate those failures - without 100% unit test code coverage. (Those out there who say 100% code coverage by unit tests is a must, please don't flame me. I'm a big fan of well tested software, but I guess I'm not yet ready for 100% coverage ;-).)

My solution for this old enemy was much easier than some might think: a simple coding convention. Don't get me wrong, I don't like hundreds of rules about how to write code. In my current project we've got less than 10 project internal coding conventions. Developing large software systems is a kind of creative work and I think too many rules can kill the creativity of developers. I say every experienced developer should be able to write code that others can read and should be able to read code of others.

Our new rule to get rid of null reference exceptions was simply:

Get-methods never return null. They return a valid value or they throw an exception.

This does not mean that our system throws thousands of exceptions per minute. It is still possible to unsuccessfully try to get data from a method without an exception; we just do this differently.

In old projects, code like this was a main part of daily business:
Customer customer = dataSource.GetCustomerById(1234);
if (customer != null) {
   customer.DoSomething();
}

This is nice code and works perfectly. Though, what if someone (me?) ever forgets to check whether the method returned null? In this case it is only a question of time until we see our old friend in a log file or - even worse - in a dialog shown to the user.
To avoid try-catch blocks over and over, and to avoid NullReferenceExceptions, we simply established the TryGet pattern used by .NET classes like Dictionary<TKey,TValue>. Instead of a Get-method that can possibly return null, we use a TryGet-method that returns a bool value indicating whether our try to get something was successful or not.
public bool TryGetCustomerById(int id, out Customer customer) {
   using (var connection = CreateOpenConnection())
   using (var command = CreateProcCommand(connection, "GetCustomerById")) {
      command.Parameters.Add("@id", SqlDbType.Int).Value = id;

      using (var reader = command.ExecuteReader(CommandBehavior.SingleRow)) {
         if (reader.Read()) {
            customer = new Customer();
            customer.Id = id;
            customer.Name = (string)reader["Name"];
         }
         else {
            customer = null;
         }
      }
   }
   return customer != null;
}

Plus, if really needed, a Get-method that utilizes the TryGet-method and throws an exception if the search was unsuccessful.
public Customer GetCustomerById(int id) {
   Customer customer;
   if (!TryGetCustomerById(id, out customer))
      throw new EntityNotFoundException(
         string.Format("Cannot find customer with id {0}", id));
   return customer;
}

However, in many cases we don't need the Get-methods any more, and in a few other cases we don't need the TryGet-methods, since something must be wrong if the requested information is not available and a (meaningful) exception is a must.

This simple change to a TryGet-method turns the consuming code into something like this:
Customer customer;
if (dataSource.TryGetCustomerById(123, out customer)) {
   customer.DoSomething();
}
Some might argue "we still need the check for a result", and they are correct. But, this is a psychological thing. Calling a method that is called Try... clearly tells the consumer, that this call can return without a result. Even more. Sometimes, when it is okay if the method does not return a value, a developer might leave out the if-condition at the point where (s)he calls a TryGet-method. In those cases I saw other developers asking "But you call a Try-method and you don't check for the null value, are you sure??".

In cases where a method can return more than one item, like GetCustomersForCountry, we say it is valid to return an empty list. This is for two reasons: an empty list is not null, and most of the time those calls are followed by a loop that simply will not be entered for an empty list. However, if we call a method that returns a list and we only need the first item, we should consider adding an additional method that returns only one item (like GetFirstCustomerForCountry). Another solution is an extension method IEnumerable<T>.TryGetFirst(out T item), which brings us back to our style; see the sketch below.
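
A minimal sketch of such an extension method (TryGetFirst is not part of the .NET Framework; the usage reuses the hypothetical GetCustomersForCountry method mentioned above and the country value is just an example):
public static class EnumerableExtensions {
   /// <summary>
   /// Returns true and the first item of the sequence, or false and a
   /// default value if the sequence contains no items.
   /// </summary>
   public static bool TryGetFirst<T>(this IEnumerable<T> source, out T item) {
      if (source == null)
         throw new ArgumentNullException("source");

      foreach (T element in source) {
         item = element;
         return true;
      }

      item = default(T);
      return false;
   }
}

// usage
Customer customer;
if (dataSource.GetCustomersForCountry("Germany").TryGetFirst(out customer)) {
   customer.DoSomething();
}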

For sure, the samples I used here, getting data from some kind of data source, are only one of the many places in a system where we find Get-methods. We use this pattern for all methods we write, independent of whether they work with a data source or any other objects. And we are doing well with it.

As I said, every now and then we still see a NullReferenceException, but I guess there have been about 2 or 3 over the last 5 months of intensive development and testing. All in development or test environments, none in production. Plus, if this happens, we know there are only two options to fix it: either add the missing if-clause around the call of a TryGet-method (didn't happen over the last year) or change the null returning Get-method.

Probably I'm the only one who ever had problems with null reference exceptions but, for some reason, it feels like I'm not ;-). I can say we are truly at a point where we can call this old problem (almost) gone.