Monday, July 19, 2010

If You Cannot Use Identities, Cache Your Ids

Sometimes it is not possible to use IDENTITY columns in database tables. Unfortunately, one of the most common approaches to handling new IDs in non-IDENTITY columns is working with MAX(). I just saw this yet again in a blog post I don't want to link to here. In my opinion, generating unique ID values is quite basic but should be done correctly, so this blog post is all about this topic.

If you are able to use IDENTITY columns in all tables, you are fine and you can stop reading here. :-)

Reasons Not To Work With Identity Columns

One reason might be that the framework in use does not support them, like Entity Framework in combination with SQL Server Compact. SQL Server Compact does not support batch execution, so every query allows only one SQL statement, and EF does not support issuing a second query after each INSERT to determine the last generated IDENTITY value.

Another reason not to work with IDENTITY columns is a scenario where computed IDs are required, as in distributed systems. I used them to guarantee a unique PRIMARY KEY across all databases (all over the world), where the first part of the primary key was an incrementing BIGINT value and the second part was a SMALLINT identifying the database where the row was created. Some of you might argue that I could also have used a Guid; however, I still don't like them in databases since they are way too large and too slow.
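
A minimal sketch of such a two-part key might look like the following. The table name Customers, its columns, and the constraint name are assumptions made up for illustration; they are not taken from the system described above.

```sql
-- Hypothetical table: names are illustrative assumptions only.
CREATE TABLE Customers (
   Id BIGINT NOT NULL           -- incrementing value, unique per database
   ,DatabaseId SMALLINT NOT NULL -- identifies the database that created the row
   ,Name NVARCHAR(100) NOT NULL
   ,CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED (Id, DatabaseId)
);

-- two databases can hand out the same Id without colliding,
-- because the second key part differs
INSERT INTO Customers VALUES (1, 1, N'Created in database 1');
INSERT INTO Customers VALUES (1, 2, N'Created in database 2');
```

Together both columns take 10 bytes, compared to 16 bytes for a UNIQUEIDENTIFIER.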

The MAX() Approach

A very common approach to generating new ID values is to select the current maximum value from the table into which new rows have to be inserted.

Here is a plain SQL sample.
-- ======================================================================
CREATE TABLE Orders (
   Id INT NOT NULL PRIMARY KEY CLUSTERED
   ,OrderDate DATETIME2
);

-- ======================================================================
-- some existing rows
INSERT INTO Orders VALUES (1, SYSDATETIME());

-- ======================================================================
-- generate new ID values, using MAX
DECLARE @newId INT;

-- select current maximal ID + 1
SELECT @newId = ISNULL(MAX(Id), 0) + 1 FROM Orders;

INSERT INTO Orders VALUES(@newId, SYSDATETIME());
Unfortunately, this approach has some issues. First, let's have a look at performance. Each call of MAX() causes an index scan of the primary key, as the execution plan for this query shows. SQL Server 2008 does a great job of avoiding a scan of the whole index; however, this scan for each new row should be avoided.


The second, more important issue is concurrency. If we are working with SQL Server Compact, this issue does not exist, since the DBMS supports only one concurrent connection at a time ;-). In every other environment, where we have more than one concurrent user connection, we might (and occasionally will) get an error if there are two concurrent inserts and both connections are, at the same time, between the SELECT that reads MAX(Id) and the INSERT statement of the above sample script. Both will get the same MAX ID value from the table, but only one of them can use the newly calculated ID for its insert. The second one will receive a primary key violation.
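
To see the race, the following sketch widens the window between reading MAX(Id) and inserting. It assumes the Orders table from above, the default READ COMMITTED isolation level, and two separate connections running the same script within a few seconds of each other. The ten-second delay is an arbitrary value chosen for illustration.

```sql
-- Run this script in two separate connections within a few seconds.
DECLARE @newId INT;

-- both connections read the same maximum here
SELECT @newId = ISNULL(MAX(Id), 0) + 1 FROM Orders;

-- artificial pause (assumption: 10 seconds is enough to start the second run)
WAITFOR DELAY '00:00:10';

-- the first connection succeeds; the second one fails with a
-- primary key violation (error 2627) because the ID is already taken
INSERT INTO Orders VALUES(@newId, SYSDATETIME());
```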

Using an ID Table

To avoid both of the issues described above, we can use an ID table that contains the name of the table to create a new ID for and the next available ID.
-- ======================================================================
-- id table
CREATE TABLE Ids (
   Name VARCHAR(30) PRIMARY KEY CLUSTERED
   ,NextId BIGINT NOT NULL
);
-- ======================================================================
-- init ids for Orders table
INSERT INTO Ids VALUES('Orders', 1);
GO
-- ======================================================================
DECLARE @id BIGINT;
-- get the next available "Orders" id from Ids table
UPDATE Ids SET
   @id = NextId
   ,NextId += 1
WHERE Name = 'Orders';

INSERT INTO Orders VALUES(@id, SYSDATETIME());
Usually, we would move the ID allocation into a procedure, which I left out for the sake of brevity. This already looks like a much better approach, doesn't it?

However, as it stands, this solution brings up another issue. Now the whole system has to access the ID table (with a write lock) for every single row inserted into any other table. This usually causes a system-wide performance issue when the system is under load. The solution to this problem is to cache IDs, as explained below.

ID Caching

To avoid the access problem with the ID table, clients can use a caching mechanism. This means each client no longer allocates new IDs one by one, but always allocates a larger number of IDs in one step and uses them for the next rows to be inserted.

The number of IDs to allocate in one step depends on the client and the number of rows it usually creates. A GUI like a web or Windows application is often fine with an ID cache size of 10, whereas a data loader process might need a cache size of 1000.

Of course, this means that some IDs are never used. E.g. if a web application allocates 10 new IDs but the user enters only one new row, 9 IDs are lost. However, this doesn't really matter, since primary key IDs are only needed for uniqueness and should never be shown as a sequence in the client.

The easiest way to allocate new IDs is a simple procedure that takes the name of the table to allocate new IDs for and the number of IDs to allocate, and returns the first ID of the allocated range.
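CREATE PROCEDURE AllocateIds
   @nextId BIGINT OUTPUT
   ,@name VARCHAR(30)
   ,@count INT
AS
SET NOCOUNT ON;

-- read the first free ID and reserve @count IDs in one atomic statement
UPDATE Ids SET
   @nextId = NextId
   ,NextId += @count
WHERE Name = @name;

-- no row updated means the table name was never registered in Ids
IF @@ROWCOUNT = 0
   RAISERROR('No Ids entry found for table "%s".', 16, 1, @name);
GO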
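
As a short usage sketch, a call from T-SQL could look like this. The variable name @firstId and the range size of 10 are arbitrary choices for illustration.

```sql
DECLARE @firstId BIGINT;

-- reserve 10 IDs for the Orders table in a single round trip
EXECUTE AllocateIds @firstId OUTPUT, 'Orders', 10;

-- the caller now owns the IDs @firstId through @firstId + 9
SELECT @firstId AS FirstId, @firstId + 9 AS LastId;
```

No other caller will receive an ID from this range, because the UPDATE inside the procedure reads and advances NextId in one statement.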

Client Side Caching

Caching IDs in a client application is usually quite easy, as long as the client is stateful, like typical web or Windows applications and most web services.

This IdGenerator is a very simple sample of how to cache IDs in the client, using the stored procedure shown above.
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class IdGenerator {
   class IdEntry {
      public long NextId;
      public long MaxId;
   }
   // table dependent cache
   static IDictionary<string, IdEntry> _cache = new Dictionary<string, IdEntry>();
   // guards the cache, since web applications call NextId from several threads
   static readonly object _sync = new object();
   // connection and allocation information
   static string _connectionString;
   static int _allocationSize;
   // initializes the connection and allocation information
   public static void Init(string connectionString, int allocationSize) {
      _connectionString = connectionString;
      _allocationSize = allocationSize;
   }
   // public interface to allocate new ids
   public static long NextId(string tableName) {
      lock (_sync) {
         IdEntry entry = GetEntry(tableName);
         // cached range exhausted (or never filled): fetch a new range
         if (entry.NextId > entry.MaxId)
            AllocateNewIds(entry, tableName);
         return entry.NextId++;
      }
   }
   // get an id entry for a specified table
   private static IdEntry GetEntry(string tableName) {
      IdEntry entry;
      if (!_cache.TryGetValue(tableName, out entry)) {
         entry = new IdEntry { NextId = 1 };
         _cache.Add(tableName, entry);
      }
      return entry;
   }
   // allocate new ids from database
   private static void AllocateNewIds(IdEntry entry, string tableName) {
      Console.WriteLine("Allocating new ids from database");
      using (SqlConnection connection = CreateConnection())
      using (SqlCommand command = new SqlCommand("AllocateIds", connection)) {
         command.CommandType = CommandType.StoredProcedure;

         SqlParameter nextId = command.Parameters.Add("@nextId", SqlDbType.BigInt);
         nextId.Direction = ParameterDirection.Output;
         command.Parameters.Add("@name", SqlDbType.VarChar, 30).Value = tableName;
         command.Parameters.Add("@count", SqlDbType.Int).Value = _allocationSize;
         command.ExecuteNonQuery();
         entry.NextId = (long)nextId.Value;
         entry.MaxId = entry.NextId + _allocationSize - 1;
      }
   }
   // create a new, open database connection
   private static SqlConnection CreateConnection() {
      SqlConnection connection = new SqlConnection(_connectionString);
      connection.Open();
      return connection;
   }
}
... and a sample of how to use the ID generator...
IdGenerator.Init("Server=.;Database=Sandbox;Integrated Security=sspi;", 10);
for (int i = 0; i < 100; i++) {
   long id = IdGenerator.NextId("Orders");
   Console.WriteLine(id);
}

ID Tables And T-SQL

Well, while ID tables and caching are a good way to go in client applications, they are a bit tricky in T-SQL (and probably in most other SQL dialects). The problem is that T-SQL is stateless and therefore does not support caching, since we don't have a "static" place where we can store our cached IDs.

One workaround is to create a temp table that looks like the stateful ID table and represents our "own" cache. But due to the scope handling of SQL Server, it is not possible to move the temp table creation into a procedure (a temp table created inside a procedure is dropped when the procedure returns), so this workaround is quite awkward to maintain, since all scripts/procedures that want client-like caching always have to create the temp table on their own.

However, T-SQL scripts should usually never try to work like clients, in a row-based manner, so they should not really need an ID cache. Whenever new rows have to be inserted into a table, the executing procedure/script should determine the full count of needed rows, allocate the required range of IDs and insert all rows in one operation.

One thing that changes when working with ID tables is that IDs for single-row inserts should now be provided by the client. Why? Because of the missing ability to cache IDs in T-SQL. Where we usually had insert procedures like this...
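
A rough sketch of this temp table workaround is shown below. The table name #IdCache, its MaxId column, and the cache size of 100 are assumptions for illustration. The script relies on the Ids table, the Orders table, and the AllocateIds procedure from above.

```sql
-- session-local cache; must be created by the calling script itself
CREATE TABLE #IdCache (
   Name VARCHAR(30) PRIMARY KEY
   ,NextId BIGINT NOT NULL
   ,MaxId BIGINT NOT NULL
);

-- fill the cache with a range of 100 IDs (assumed cache size)
DECLARE @firstId BIGINT;
EXECUTE AllocateIds @firstId OUTPUT, 'Orders', 100;
INSERT INTO #IdCache VALUES ('Orders', @firstId, @firstId + 99);

-- take one ID from the cache; @id stays NULL when the range is used up
DECLARE @id BIGINT;
UPDATE #IdCache SET
   @id = NextId
   ,NextId += 1
WHERE Name = 'Orders' AND NextId <= MaxId;

IF @id IS NULL
   RAISERROR('ID cache for Orders is exhausted, allocate a new range.', 16, 1);
ELSE
   INSERT INTO Orders VALUES(@id, SYSDATETIME());
```

Every script that wants this behavior has to repeat the CREATE TABLE and the refill logic, which is exactly the maintenance burden described above.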
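
The set-based pattern can be sketched as follows. OrdersStaging is a hypothetical source table with an OrderDate column, assumed not to change while the script runs. The script also relies on the AllocateIds procedure from above.

```sql
DECLARE @count INT;
DECLARE @firstId BIGINT;

-- determine how many IDs the whole operation needs
SELECT @count = COUNT(*) FROM OrdersStaging;

IF @count > 0
BEGIN
   -- one access to the ID table for the complete range
   EXECUTE AllocateIds @firstId OUTPUT, 'Orders', @count;

   -- number the source rows 0..@count-1 and offset them by the first ID
   INSERT INTO Orders (Id, OrderDate)
   SELECT
      @firstId + ROW_NUMBER() OVER (ORDER BY OrderDate) - 1
      ,OrderDate
   FROM OrdersStaging;
END
```

Since Orders.Id is declared as INT in the sample table, the computed BIGINT values are converted implicitly, which only works as long as they fit into the INT range.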
CREATE PROCEDURE InsertOrder
   @id INT OUTPUT
   ,@orderDate DATETIME2
AS
   INSERT INTO Orders (OrderDate)
      VALUES (@orderDate);
   SELECT @id = SCOPE_IDENTITY();
... we would have to act like this...
CREATE PROCEDURE InsertOrder
   @id INT OUTPUT
   ,@orderDate DATETIME2
AS
   EXECUTE AllocateIds @id OUTPUT, 'Orders', 1;
   INSERT INTO Orders (Id, OrderDate)
      VALUES (@id, @orderDate);
... but this would cause heavy traffic on our ID table. So it is usually more efficient to get the IDs from the client, which can easily handle the caching.
CREATE PROCEDURE InsertOrder
   @id INT
   ,@orderDate DATETIME2
AS
   INSERT INTO Orders (Id, OrderDate)
      VALUES (@id, @orderDate);
Some might say that they don't want to give clients the power to handle new primary key values. However, SQL Server enforces the uniqueness for you ;-) and the price to pay is quite small compared to the other option.

Stateless Clients

Stateless clients have almost the same issues as T-SQL procedures and scripts. Due to their statelessness, they are usually not able to cache their IDs. The only additional option we have with stateless clients is to move the ID caching into an external resource like a WCF service. If a system has several stateful and several stateless components, it is an option to keep the stateful applications fast by providing a separate ID cache for those that are stateless.

Of course, you could also consume this service from T-SQL (quite simple with SQLCLR as I showed here), but using SQLCLR to access web services is still rare, and the overhead is usually much larger than directly accessing the ID table. So the service should only be used from the client side to keep the SQL out of the ID generation.
