Monday, December 21, 2009

A Strange Kind of a Relational EDI


Today I'll speak about a very different kind of data exchange. The following approach might sound strange for the first moment (it did to me!). However, after some days thinking about, I think this might be a new way which probably provides some powerful new solutions.

First as first. Let's start up with a general discussion about different types and directions of EDIs (Electronic Data Interchange).

Many of us work with EDIs to exchange data between different systems and/or locations. I think there are two main types of data interchange directions.

The first direction of EDIs are data-loader routines or processes which load data from 3rd level applications. Especially todays EAI (Enterprise Application Integration) applications are often related to many different other applications. This might be an order system which takes customer information from a CRM tool and some employee information from a HR-software. The important attribute here for this kind of data exchange is the communication with other applications which are not part of your own project.

The second kind of EDIs are synchronization routines or processes which are used to keep systems in sync. These processes are often used to exchange data between different servers (e.g. locations) of a distributed system. Another sample for sync routines is a offline support of a product catalog on a sales notebook. Last but not least synchronization processes are used for staging areas between operational and analysis systems like SSAS. The important attribute of this kind of data exchange in this post is the communication between systems of the same project or company.

As you see, I want to point the difference of communication between your own project modules and external systems.

Next we'll go on with a short listing of the most common available EDI formats and their strengths and weaknesses.

An EDI Format Inventory

Todays most common EDI formats seem to be CSV, XML. Sure there are several other - more specific - EDI formats like EDIFACT, BORD128, binary serialization and so forth. Though, for sake of brevity those formats are not scope of this post.

CSV

CSV files (separated by comma, tab, semicolon or whatever) are especially used to exchange huge amounts of data. The main strength of this format is its compactness. CSV files contain almost no overhead. The files are human readable with tools like Notepad (or any other text editor) or Excel. Especially data storage systems like SQL Server, many other RDBMSs or SSIS provide build in functionality to write and read CSV files.

On the other hand, CSV files have some weaknesses. First, the files are not relational, which means you cannot specify links between different kinds of data. Sure you can use distinct files for each object (or data) type and put everything together into an ZIP archive. Nevertheless, the files are not integrity checked. Another issue of CSV files is a missing search functionality. Microsofts OLEDB Jet-Provider enables simple SQL queries on CSV files but this is not a standard on other operational systems. Finally CSV files are not type safe. Your third column might be specified as an amount of goods, but you can never be sure that data in all rows are really stored in a decimal format and even if they are, you need to know the formatting information of the data. E.g. an amount of "1,234.56" in German format is "1.234,56". With date/time formats it becomes more funny ;-)

XML

Over the last some years XML became more and more the most common format for almost any kind of data exchange. Why? Here just a subset of reasons...

  • It supports references (links) between different objects. This could be direct parent-child relations or foreign-key like references to other XML-objects within or even without the current XML file.
  • With XPath it supports a full featured XML query language.
  • Almost all todays development kits provide build in XML support.
  • XML files can be validated via XSDs for data types, required and optional object properties and required relations.
  • It is human readable with any kind of text editor, web browsers and many IDEs.

A common approach to use XML for data exchange is object serialization and there are two generally different ways to handle XML serialization. In smaller projects it often makes sense to directly serialize the existing Domain Objects. In more complex domain models serialization is often done by creation of DTOs, since the domain model might be too complex and contain too much validation which might be not needed (or desired) while (de-)serialization. DTOs are a simplified and coarse grained simplified versions of fine grained domain objects.

Sure, XML can also be used without any build in serialization methods. Since this slightly eliminates the "easy to use" factor, this is usually only needed for very large amounts of data or very flexible requirements.

Independent of all the strength of XML, it also has some weaknesses which might be important for some requirements.

XML is a very verbose data format. Though, this can be handled by using any kind of compression (like ZIP, GZip and so on).

Another weakness of XML is XPath. Don't get me wrong, XPath is a powerful query language and I (almost) always get what I want. Nevertheless, most times original data storages are relational databases. When using XML you always have to either implement some kind of XPath based Mapper (when working with plain XML) or any object based way to find all de-serialized objects and their relations (when working with serialization).

A Relational EDI

Wouldn't it be cool to work with SQL when working with EDI, eh? In my opinion, yes! So, why don't do this?

Did you ever think about using embedded databases (like SQL Server Compact, SQLite or others)? Sounds strange to you? It did sound strange to me, too. However, who cares? After several days thinking about this approach, this appears to be a probably very new kind of data exchange.

When you work with an O/R-Mapper you might even get this feature as a gift. Even if you don't work with an O/R-Mapper, most usual SQL statements are supported by those new embedded database systems. When working with SQL Server Compact you might think about using Microsofts Sync Framework to load and update your data from your "database interchange".

Embedded databases support constraints primary keys, foreign keys, indexes (and some even more). Armed with these constraints, embedded databases provide a ensured relational integrity. You can also be sure that all your data are stored in correct format.

One potential issue of embedded databases is the file size. A new, empty SQL Server Compact database has a initial size of 20KB (for being empty :-D ). 20KB appear to be not too much for me, but what about file growing when data become loaded? After this initial size it grows less than XML files. I tried to create an XML and a SQL Server Compact database each containing all Customers (91 rows) and Orders (921 rows) from good old Northwind database. The XML file ended up with 437KB, the SQL Server Compact database ended up with 184KB (containing both primary keys and a foreign key from Orders to Customers). To be fair, I also tried to compress both resulting files with WinRAR. The XML file was compressed to 30KB, the database file was compressed to 31KB. Sure, "large" and "small" always depend on your environment, but in my opinion this compression appears to be okay.

Is missing human readability an issue of embedded databases? I'd say, yes ant no. You cannot open the files with a text editor. But you can use tools like SSMS (in case of SQL Server Compact) or any other database query tool. If you work with an O/R-Mapper or any other Data Access Layer which supports different database systems you can probably even use your tailor made business application to browse the data within the "interchange".

Nevertheless, the most important issue of using embedded databases as EDI is the fact that it is completely no standard today. For this reason this should only be used for in-project synchronization between different storages at the moment.

Conclusion

If network traffic is one of the issues you fight with when exchanging EDIs between systems, CSV is definitely the way to go.

If you need a standardized communication to any outside (of your project) module, XML is usually the way to go.

If you have to synchronize larger amounts of data between different parts of your own system, you might think about using embedded databases doing this job. At least, give it a try or some hours to think about ;-).