Technically Speaking: Data Integration

Accessing data doesn’t just mean having a unified view of it all. For practical purposes of crunching all the data, it needs to be in one place where your analytics program can reach it. That involves “moving” data (more like copying it, actually) from one place to another, usually from storage systems into a data warehouse capable of analyzing it. Two common methods are ETL (Extract, Transform and Load) and Data Replication. Replication is often used for tasks like disaster recovery and data migration, but in the Big Data context it serves as a high-performance data movement tool that can quickly synchronize large quantities of data.

To conceptualize the ETL and Data Replication processes, data analysts usually refer to where the data is stored at the start of the process as the source, and where you want to move or copy the data as the target. ETL tools are your basic data movement tools: they extract the selected data files from the source, transform them into a structure readable by the target and by the Business Intelligence applications running on it, and then load the transformed data into the target. ETL tools are good at moving large quantities of data all at once in what is called a batch. They also do a good job when significant transformation of the data is required before loading it into the target.
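The three steps can be sketched in a few lines of Python, using an in-memory SQLite database as a stand-in for the target warehouse. The CSV data, table name, and columns here are all invented for illustration; a real pipeline would read from operational systems and write to an analytics platform:

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV export from an operational system.
SOURCE_CSV = """order_id,amount,order_date
1,19.99,2024-01-05
2,5.00,2024-01-06
3,42.50,2024-01-06
"""

def extract(raw):
    """Extract: pull every selected record out of the source."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transform: cast fields into the structure the target expects."""
    return [(int(r["order_id"]), float(r["amount"]), r["order_date"])
            for r in rows]

def load(conn, rows):
    """Load: write the whole batch into the target table at once."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders "
                 "(order_id INTEGER PRIMARY KEY, amount REAL, order_date TEXT)")
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(warehouse, transform(extract(SOURCE_CSV)))
```

Note that `load` writes the entire batch in one shot, which is exactly the behavior that causes the downtime problem discussed next.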

ETL on its own can have trouble handling certain situations, however. If data in the source is changing in real-time, you may need to analyze that data quickly and perhaps in real-time as well. Because ETL loads everything from the source into the target all at once during a batch transfer, the target can experience downtime for hours while the data is loaded. The more data you need to move, the more downtime. If the target isn’t supposed to be used for long periods of time, like at night, and if you don’t require immediate analysis of new/changed data, then that downtime may not be an issue. However, you could still be wasting time and money if much of the data that your ETL tool is extracting from the source is already in the target from a previous batch load.

So, for those with high-performance requirements and a need to increase the efficiency of data transfer, a Data Replication solution will be a necessary add-on. Most replication solutions contain a Change Data Capture (CDC) module, which captures changes made in source systems and replicates each change to a target system, keeping the databases synchronized. In some cases, the CDC tool is sold separately from the rest of the replication package. Other parts of a Data Replication solution can include schema and DDL replication, an easy-to-manage user interface, and software and hardware architecture designed to move large amounts of data very quickly without creating downtime for your sources or targets or interfering with the ability of your enterprise applications to keep running. This same capability can also ensure that, in the event of a crash, your company has the most up-to-date data with which to pick up the pieces. In addition, good replication solutions should be fully automated in order to optimize IT productivity and reduce spending on professional services.
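The capture-and-apply cycle behind CDC can be sketched with a version-number high-water mark, one simple change-tracking strategy (commercial CDC tools more often read the database transaction log instead). The table, columns, and version scheme below are hypothetical, with two in-memory SQLite databases standing in for the source and target systems:

```python
import sqlite3

# Two in-memory databases stand in for the source system and the target.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE customers "
               "(id INTEGER PRIMARY KEY, name TEXT, version INTEGER)")

def capture_changes(src, high_water_mark):
    """Capture: select only the rows changed since the last sync."""
    return src.execute(
        "SELECT id, name, version FROM customers WHERE version > ?",
        (high_water_mark,)).fetchall()

def apply_changes(tgt, rows):
    """Apply: upsert the captured changes, keeping the target in sync."""
    tgt.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", rows)
    tgt.commit()

# Initial load, then a later change in the source.
source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                   [(1, "Acme", 1), (2, "Globex", 1)])
apply_changes(target, capture_changes(source, 0))  # first sync
high_water_mark = 1

source.execute(
    "UPDATE customers SET name = 'Acme Corp', version = 2 WHERE id = 1")
changed = capture_changes(source, high_water_mark)  # only the one changed row
apply_changes(target, changed)
```

The second sync moves a single row rather than re-extracting the whole table, which is the efficiency gain replication offers over repeated full batch loads.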

Data replication can have drawbacks, however. An enterprise replication solution can cost many tens of thousands of dollars, placing the capability out of reach for many smaller companies. Additionally, many replication solutions are not very good at the transformation task that’s often needed when moving data from sources to targets. The result is that you need to piggyback the replication solution on top of an ETL solution, and not all replication solutions work with all ETL solutions. In fact, it’s best to think of replication not as a replacement for ETL, but as a complementary solution. Both processes will be needed in executing the Data Integration part of your Big Data strategy.

Timothy King