Saturday, November 7, 2009

Bulk Data Persistence Pattern

"Bulk Data Persistence" (BDP) covers a problem I've seen many times, both in my own work and when speaking with people all over the world. BDP is an extension of the "Unit of Work" (UoW) pattern, defined by Martin Fowler in his great book Patterns of Enterprise Application Architecture (PoEAA). BDP describes a different way for clients to communicate with their persistence layer when saving changes.

A Real-Life Scenario

Imagine it's Sunday morning. You get up, go to the bathroom to brush your teeth and take a shower. After that you get dressed.

Now you leave your house/flat, go to the baker and buy one biscuit (a bap if you live in England or a Semmel if you live in Bavaria - like I do :-D ); back at home you put your biscuit (bap/Semmel) on the kitchen table. Now you leave your house again and go to the baker to buy the biscuit (bap/Semmel) for your husband or wife; back home again, you put it on the kitchen table. Since you have two children and mom is currently visiting, you end up with five trips to the baker and back.

Sure, you'd never do this. You go to the baker and get all the biscuits (baps/Semmeln) for the whole family in one round trip. However, this is a good example of how software often interacts with databases.

A Technical Scenario

Imagine a system for entering orders.

The user creates a new order and adds ten order positions. This might be implemented with a UoW design. When the user hits the save button, the business logic layer (BLL) validates the order and all order positions against the existing business rules. If everything is fine, the client opens a connection to the persistence tier (usually a database). The data access layer (DAL) maps the order object to the "Orders" table and sends a row to the database. Once the order has been inserted, the DAL maps the first order position object to the OrderPositions table and sends a row to the database. When this operation returns, it maps the next order position and sends the next row to the database. And so on...

So this is pretty much the same scenario as the real-life baker example.

Tip to the Balance

Today's systems are becoming larger and larger, and the amount of data to be persisted is growing heavily, too. Scalability is a very important design goal for applications like web applications or service endpoints. An application designed for scalability lets whoever runs it in production use load balancing technologies to host the application on several boxes.

Unfortunately, the operational persistence layer is almost never scalable in the same way. It is possible to use replication or data-warehouse architectures to move reporting processes away from the operational system. Nevertheless, business C(R)UD operations always have to work on the one and only operational database, and this database is a common bottleneck in many systems.

A short look into the Unit of Work (UoW)

Since the Bulk Data Persistence (BDP) pattern is an extension of the Unit of Work (UoW) pattern, I'll start with a short description of the UoW pattern.

The UoW describes a client-side transaction object which registers all object changes during one complete business operation. In our sample order system, an order only makes sense with all of its order positions. The UoW transaction object holds the order and all order positions until everything is stored in the database.
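To make the idea a bit more concrete, here is a minimal sketch of such a transaction object in Python. It is not taken from PoEAA; the names `UnitOfWork`, `register_new`, `commit` and the two sample classes are my own assumptions for illustration, and the sketch only tracks new objects, not updates or deletes.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    order_id: int
    customer: str


@dataclass
class OrderPosition:
    order_id: int
    position: int
    article: str
    quantity: int


@dataclass
class UnitOfWork:
    """Hypothetical UoW: collects new objects until the business operation is complete."""
    new_objects: list = field(default_factory=list)

    def register_new(self, obj):
        # Nothing is sent to the database yet, the object is only remembered.
        self.new_objects.append(obj)

    def commit(self, persist):
        # Hand all registered objects to the persistence callback, then reset.
        persist(list(self.new_objects))
        self.new_objects.clear()


uow = UnitOfWork()
uow.register_new(Order(order_id=1, customer="ACME"))
for n in range(1, 11):
    uow.register_new(OrderPosition(order_id=1, position=n, article=f"A{n}", quantity=1))

saved = []                # stands in for the real persistence tier
uow.commit(saved.extend)
print(len(saved))         # 11: one order plus ten order positions
```

How the `persist` callback talks to the database is left open here on purpose, since that is exactly the part BDP is about.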

The following diagram shows the most important part of the UoW pattern needed to understand BDP.

I abstracted the data persistence part since this is the important part we will look at in the following text.

A classic implementation of the data persistence would issue a single database operation for each object that was changed within the UoW. The following diagram describes this approach.

In our sample order system we would generate 11 single database operations: one to insert a row into the Orders table and ten to insert the rows into the OrderPositions table.
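As a rough illustration of this classic approach, the following sketch uses Python's built-in `sqlite3` module with an in-memory database. The table columns and the `operations` counter are assumptions made for this example only. An in-memory SQLite database has no network in between, so the sketch only shows the number of calls, not their real cost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, Customer TEXT)")
conn.execute(
    "CREATE TABLE OrderPositions ("
    "OrderId INTEGER, Position INTEGER, Article TEXT, Quantity INTEGER)"
)

operations = 0  # counts the persistence calls sent to the database

with conn:  # one transaction for the complete business operation
    conn.execute("INSERT INTO Orders VALUES (?, ?)", (1, "ACME"))
    operations += 1
    for n in range(1, 11):
        # One separate call per order position.
        conn.execute(
            "INSERT INTO OrderPositions VALUES (?, ?, ?, ?)",
            (1, n, f"A{n}", 1),
        )
        operations += 1

print(operations)  # 11
```

Against a real database server, each of these calls would typically be its own round trip between client and server.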

The Bulk Data Persistence Pattern

Bulk Data Persistence means aggregating all data operations of one object type and sending them to the database system as one single database operation.

The idea of BDP is to enable the database system to do what it is really optimized for: working with bulks of data. This usually leads to much faster statement execution and even better responsiveness for concurrent applications, since one larger lock is typically cheaper than many small locks caused by many single persistence operations. Last but not least, the network round trips of the clients are reduced to a minimum, which also leads to faster execution.

In our order system sample there would be just two database operations: one to insert the new Order row and one to insert all the OrderPosition rows.
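A minimal sketch of the two operations, again with Python's built-in `sqlite3` module and an in-memory database, could look like this. The table columns and the `operations` counter are assumptions for illustration. Here the ten positions are packed into one multi-row INSERT statement; a real system would more likely use whatever bulk facility its database and driver offer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, Customer TEXT)")
conn.execute(
    "CREATE TABLE OrderPositions ("
    "OrderId INTEGER, Position INTEGER, Article TEXT, Quantity INTEGER)"
)

positions = [(1, n, f"A{n}", 1) for n in range(1, 11)]

# Build one INSERT with ten row tuples: VALUES (?, ?, ?, ?), (?, ?, ?, ?), ...
row_placeholder = "(?, ?, ?, ?)"
sql = "INSERT INTO OrderPositions VALUES " + ", ".join(
    [row_placeholder] * len(positions)
)
flat_params = [value for row in positions for value in row]

operations = 0  # counts the persistence calls sent to the database

with conn:  # still one transaction for the complete business operation
    conn.execute("INSERT INTO Orders VALUES (?, ?)", (1, "ACME"))
    operations += 1
    conn.execute(sql, flat_params)  # all ten positions in one statement
    operations += 1

print(operations)  # 2
```

The business operation and its transaction stay the same as before; only the number of calls to the database changes.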

In this sample, 11 single statements do not seem to be a problem, but consider the number of concurrent users.

Another scenario is a process which synchronizes data between distributed database systems, or a process which imports data from another system. Both of these scenarios might work with thousands of rows to be saved.

What BDP does not cover

Please don't misunderstand. The idea of BDP is not to move business logic into the database! Don't try to send all changed object types of one UoW to the database in one package. When your database handles complete messages of business objects, it becomes a data service, and that's usually not the intention in today's software development. Instead, send one data package per type. BDP covers the same functionality and contract as any data access layer and persistence tier that persists single rows.
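"One data package per type" can be sketched as a simple grouping step on the client side. The function name `group_by_type` and the two sample classes are hypothetical and only serve this illustration; each resulting package would then be sent to the database as its own bulk operation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Order:
    order_id: int


@dataclass
class OrderPosition:
    order_id: int
    position: int


def group_by_type(changed_objects):
    """Split the changed objects of one UoW into one package per object type."""
    packages = defaultdict(list)
    for obj in changed_objects:
        packages[type(obj).__name__].append(obj)
    return dict(packages)


changed = [Order(1)] + [OrderPosition(1, n) for n in range(1, 11)]
packages = group_by_type(changed)

for type_name, rows in packages.items():
    # One bulk operation per package, never one mixed package for everything.
    print(type_name, len(rows))
```

The database only ever sees rows of one table per operation, so it does not need to know anything about the business object as a whole.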


I'll provide some different samples in the next few days to show how it works and what the benefit is. Links to these samples will be added here.
