3 Key Data Replication Challenges & What to Do About Them
Solutions Review’s Expert Insights Series is a collection of contributed articles written by industry experts in enterprise software categories. In this feature, Arcion CEO Gary Hagmueller offers commentary on key data replication challenges and solutions for achieving success.
It’s virtually impossible for any organization to combine all of its data into a single database. Transactional requirements typically demand different performance characteristics than operational or analytical ones. Even within the same type of system, data may be dispersed across a number of locations, and some far-flung areas might not have strong connectivity to a main site. In some situations, data consolidation may be desirable, but an organization will still require a way for best-of-breed applications to communicate with one another. In the world of databases, one theme is perpetually true: The need to transfer data between systems or applications is absolute.
While data mobility is a fact of life, moving high-volume, high-velocity data between transactional databases, data warehouses, and cloud platforms is no easy task. Here are three main challenges:
- Normalization: Anyone who has ever needed to integrate or move data can attest that each system, no matter how similar, has slight variances that can make it incredibly hard to synchronize.
- Scale: Modern databases now deal with larger and larger objects in real time, meaning the old approach to scale (e.g., prioritizing smaller objects over larger ones) now results in significant latency and unstable pipelines.
- Integrity: Data integrity really matters in the modern data stack. Real-time data is only real if transactions are delivered in the order they occur; otherwise, you will produce incorrect results by default.
Let’s take a closer look at each one, analyzing the challenge and recommending solutions based on industry best practices and knowledge shared by experts in the field.
Data Replication Challenges
No matter how hard you try, it’s virtually impossible to keep two systems identical in structure. Even backup systems can have slight variances. These differences make it incredibly hard to integrate data and keep it in sync. The problem gets significantly harder when syncing systems that were built for different purposes, as they can have materially different formats for the exact same data. Let’s look at some simple examples to illustrate this point.
In one system, a ZIP code is stored as a five-digit number; in another, it may be a nine-digit string. In one system, a birthdate has six digits; in another, eight. These values represent the same data but are stored in different formats. At a basic level, all these nuances, whether subtle or material, need to be normalized, and doing so can be difficult.
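To make the mismatch concrete, here is a minimal sketch of normalizing the two examples above. The function names, canonical formats, and date layouts (MMDDYY vs. MMDDYYYY) are illustrative assumptions, not from any particular system:

```python
from datetime import datetime

def normalize_zip(value) -> str:
    """Coerce ZIP codes to a canonical five-digit string.
    One source stores 90210 as an integer, another as the
    nine-digit string "90210-1234"; both map to "90210"."""
    text = str(value).strip()
    return text.split("-")[0].zfill(5)

def normalize_birthdate(value: str) -> str:
    """Coerce six-digit (MMDDYY) and eight-digit (MMDDYYYY)
    birthdates to ISO 8601 (YYYY-MM-DD)."""
    fmt = "%m%d%y" if len(value) == 6 else "%m%d%Y"
    return datetime.strptime(value, fmt).date().isoformat()

print(normalize_zip(90210))             # -> 90210
print(normalize_zip("90210-1234"))      # -> 90210
print(normalize_birthdate("07041976"))  # -> 1976-07-04
```

Even this toy version has to make policy decisions (truncate or preserve the nine-digit form? which century for two-digit years?), which is exactly why normalization is harder than it first appears.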
Going beyond data formatting, think of everything that needs to be normalized to integrate data properly: how column names line up, or, trickier still, what happens when one end is a document-based NoSQL system and the other is SQL-based. To turn data into usable input for data-hungry applications or analysis, the list of problems to solve under the “normalization” banner is considerable.
Then there’s one final challenge: once the data is normalized, nothing stays the same for long. Columns are added, object types are changed, schemas evolve, and so on. These changes arrive irregularly, but spotting a change before propagating data into a downstream system is vital to operating effectiveness; failing to deal with them proactively will break downstream applications and uses.
How Do We Normalize Disparate Data?
First and foremost, select a tool or suite of tools that addresses the challenge of normalizing data upfront. I recommend a single tool to simplify deployment and maintenance, though this may not always be possible. Several data integration tools provide an easy-to-use interface to help teams manage data transformations into a shared format. Some go a step further by automating the tedious, repetitive manual tasks that come with data normalization. Whatever solution you choose, it must be able to detect and normalize changes to the data source. Look for solutions that can truly demonstrate the ability to “automatically detect and apply DDL changes.”
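Commercial tools detect such changes automatically; a homegrown pipeline would at minimum need something like the following sketch, which diffs two schema snapshots. The snapshot shape ({column: type}) and the example columns are illustrative assumptions, not any vendor’s API:

```python
def diff_schema(old: dict, new: dict) -> dict:
    """Compare two schema snapshots ({column: type}) and report
    the DDL-style changes a replication pipeline must propagate
    before moving more rows downstream."""
    added   = {c: t for c, t in new.items() if c not in old}
    removed = {c: t for c, t in old.items() if c not in new}
    retyped = {c: (old[c], new[c]) for c in old.keys() & new.keys()
               if old[c] != new[c]}
    return {"added": added, "removed": removed, "retyped": retyped}

# Between snapshots, a column was added and another was widened.
v1 = {"id": "INT", "zip": "CHAR(5)"}
v2 = {"id": "INT", "zip": "CHAR(10)", "email": "TEXT"}
print(diff_schema(v1, v2))
# -> {'added': {'email': 'TEXT'}, 'removed': {},
#     'retyped': {'zip': ('CHAR(5)', 'CHAR(10)')}}
```

A real implementation would also have to decide what each change means for in-flight data, which is why “automatically apply DDL changes” is a meaningful product claim rather than a checkbox.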
Being able to do anything at scale is both necessary and challenging. Once you’ve solved the normalization issues described above, you need to handle the massive amounts of data that will be flowing out of one system and into others. Modern organizations produce data at an exponentially growing scale, so this challenge is persistent and growing; systems must always stay a step ahead of the wave of data expansion. While such growth presents a rich source of inputs for downstream applications, the scalability of a system is a major engineering challenge.
For most older-generation applications, vertically scaling a system meant adding CPU, storage or memory capacity and manually configuring the new hardware to start balancing the load. Many of these legacy systems had predictable, roughly linear scale curves, not the exponential growth we see today. Scaling horizontally meant deploying more servers to spread the load across; automation made this easier to configure, but it presented its own set of challenges. All of these approaches required capital investment that could not be scaled up and down on demand. Once the hardware was bought, there was no way to scale down, at least not easily or cost-effectively.
Modern systems built on the cloud, by contrast, are built to automatically scale up and down based on the needs of the application and the load being experienced. Modern applications and infrastructure systems are designed to align with the cloud infrastructure that runs most modern enterprise applications.
Infrastructure vendors of the pre-cloud era also had to solve scale issues, especially those related to moving data from one platform to another. Migration and replication have always required a lot of computation resources in order to run efficiently and quickly. As little as five years ago, compute was extremely expensive and often out of physical reach. This led to data migration and replication being handled in slow and inefficient ways.
Early vendors dealt with these legacy scalability problems by doing what seemed natural for that era. Since databases a decade or more ago were mostly focused on the read and writes of text or small object-type data, it made sense to prioritize moving smaller objects over larger ones. Other techniques were employed, but the basic approach was the same — optimizing for the flow of the majority of the data, defined as the number of rows. If a few images or larger objects had to sit around for a while, that was an acceptable thing for systems of a decade or more ago.
But, as with any technology, databases and the enterprises that use them did not stand still. Data continued to grow, and those legacy technologies were pushed to the absolute limit. Costs climbed exponentially, and with them the demands placed on data migration and replication techniques. Modern databases must now deal with larger and larger objects in real time, meaning the old approaches now result in significant latency and unstable pipelines that are difficult to maintain. Cloud-native and distributed solutions born in the modern era are free of this highly limiting architecture and have helped usher in easy, cost-effective scalability with minimal latency. The new generation of data replication technologies has made real-time data replication and migration an easily solvable challenge.
How to Achieve Scale Without Deep Pockets?
The solution is aligned to what modern organizations are already moving toward — a cloud-native approach to solutions and applications. We recommend using a solution that is efficient, lightweight and cloud-native. Look for solutions that can demonstrate a microservices architecture and offer the ability to increase or decrease the use of computation resources without the need for a human to configure anything. By moving operations to the cloud or using a cloud-native solution in your VPC, the constraints of compute, RAM and storage can be optimized automatically based on the performance required. So, organizations can autoscale based on real-time use, minimizing spend while ensuring customers get the best experience from such applications at all touchpoints.
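The core of autoscaling is a feedback rule: measure load, compute the capacity needed to keep up, and clamp it to a budget. Here is a minimal sketch of such a rule; the thresholds, throughput figure, and function name are illustrative assumptions, and a real autoscaler (e.g. Kubernetes’ Horizontal Pod Autoscaler) layers smoothing and cooldowns on top of logic like this:

```python
import math

def desired_workers(queue_depth: int, per_worker_throughput: int,
                    min_workers: int = 1, max_workers: int = 64) -> int:
    """Pick a replica count so pending work drains within one
    scaling interval, clamped to a configured floor and ceiling."""
    needed = math.ceil(queue_depth / per_worker_throughput)
    return max(min_workers, min(max_workers, needed))

print(desired_workers(5_000, per_worker_throughput=500))      # -> 10
print(desired_workers(200, per_worker_throughput=500))        # -> 1 (floor)
print(desired_workers(1_000_000, per_worker_throughput=500))  # -> 64 (ceiling)
```

The point of a cloud-native design is that this decision runs continuously and acts automatically, so spend tracks real-time load instead of peak provisioning.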
Once you’ve tackled the monumental tasks of normalization and scalability, you now have the last and perhaps most daunting problem to solve: data integrity. Data integrity is the overall accuracy, completeness and consistency of data. The integrity of data is crucial in the modern, real-time data stack. Real time is only effective if you deliver transactions in the order they occur. If transactional integrity is not maintained, you will by default produce incorrect results. Incorrect results could lead to disaster, in one form or another.
Part of dealing with transactional integrity is dealing with the fact that databases are consistently working with committed and uncommitted transactions. Ask the database servers to send only committed transactions, and you’ll slow your data replication or migration down. Pass uncommitted transactions to the warehouse or data app, and you’ll end up either having to run lots of extra complex queries (that break with the slightest data change) or generate garbage results.
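One common way to forward only committed work without stalling the source is to buffer change events per transaction and release them at commit time. The sketch below illustrates that idea; the event shape (txid, op, row) is an illustrative stand-in for a real change-data-capture feed, not any specific product’s format:

```python
from collections import defaultdict

def replicate_committed(log_events):
    """Buffer change events per transaction and forward them only
    when a COMMIT arrives, preserving commit order. Rolled-back
    transactions are dropped before they reach the destination."""
    pending = defaultdict(list)
    out = []
    for txid, op, row in log_events:
        if op == "COMMIT":
            out.extend(pending.pop(txid, []))   # release in commit order
        elif op == "ROLLBACK":
            pending.pop(txid, None)             # discard uncommitted work
        else:
            pending[txid].append((op, row))
    return out

events = [
    (1, "INSERT", "a"), (2, "INSERT", "b"),
    (2, "COMMIT", None), (1, "ROLLBACK", None),
]
print(replicate_committed(events))  # -> [('INSERT', 'b')]
```

Note the trade-off the article describes: buffering adds latency proportional to transaction length, while skipping the buffer ships uncommitted garbage downstream.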
Outside of these completed and in-flight transactions, what about those pesky network or system outages that interrupt data delivery? How do you guarantee that every record reaches the destination system exactly once: never less, which loses data, and never more, which creates duplicates? The list of such scenarios is nearly endless, and you can probably think of other integrity challenges beyond the ones mentioned here.
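The standard answer to the duplicate-delivery problem is idempotent apply: track which changes have already landed so a retried batch is a no-op. This sketch shows the idea using sequence numbers; the data shapes and function name are illustrative assumptions, and production systems persist the sequence watermark transactionally with the data:

```python
def apply_exactly_once(destination: dict, applied_seqs: set, batch):
    """Apply (seq, key, value) changes idempotently: batches
    replayed after a network retry are skipped by sequence
    number, so the destination sees each change exactly once."""
    for seq, key, value in batch:
        if seq in applied_seqs:
            continue  # duplicate delivery after a retry
        destination[key] = value
        applied_seqs.add(seq)

dest, seen = {}, set()
batch = [(1, "k1", "v1"), (2, "k2", "v2")]
apply_exactly_once(dest, seen, batch)
apply_exactly_once(dest, seen, batch)  # retried delivery: no duplicates
print(dest)  # -> {'k1': 'v1', 'k2': 'v2'}
```

Combined with at-least-once delivery from the source, idempotent apply yields the “100% but never more than 100%” guarantee described above.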
How to Ensure Data Integrity Amid Known & Unknown Unknowns?
Akin to normalizing diverse and disparate data sources, the challenge of guaranteeing data integrity is best handled by deploying a data integration tool that guarantees transactional integrity. But don’t confuse data integrity with high availability (HA); they are distinct things. If any mission-critical applications depend on real-time data flows, make sure the vendor you select offers both high availability (the ability for a second deployment to seamlessly pick up when an online system goes offline) and data integrity (guaranteed exactly-once delivery). Note that true HA should be built into the product you select, not bolted on with brittle DevOps hacks. Finally, ensure that the tool you pick explicitly calls out its support for both committed and uncommitted transactions.
Is It Enough to Consider Only These Top Three Challenges?
By now, you might be thinking that once you solve these issues, you’re pretty much home free! Well, the only things left are to containerize your solution, integrate the deployment with an automation tool like Terraform, and provide instrumentation and dashboards so you know it’s working at all times. Oh, and don’t forget to schedule the DevOps work needed to keep it humming and staff a maintenance team to fix the issues that invariably occur! Replicating data is hard, and building a custom setup that does it in real time with accuracy can seem nearly impossible. This is why tools with out-of-the-box solutions for these challenges exist. To be complete, a data replication solution should handle all three of the challenges above, and more.
Some Food for Thought…
Solving these three challenges requires a lot of engineering resources and big financial investments, and even that doesn’t guarantee the solution can be delivered on time. A purpose-built replication tool is arguably the most efficient way to help with these challenges.
Many organizations and teams have tried homegrown tools and solutions to mitigate these challenges, but it is hard to realistically overcome these problems on your own. Nothing is impossible, of course, but weigh the cost of building, testing and maintaining such a solution from scratch; it may make the price of a tool that already does the job seem like a drop in the bucket, in both budget and effort.
At the same time, adopting a solution that was first architected 10 or 20 years ago may present similar challenges. When evaluating these solutions, make sure that your proof of concept requirements include replicating database tables that contain your largest objects and your highest change velocity. Also, make sure you ask the vendor to do it with a load factor equal to the highest load you currently experience — or better yet, the peak load you anticipate two years from now. If by some chance a legacy vendor gets it to work, have them turn it off abruptly and see how it fails. If you are missing data or see a transaction from 10 minutes ago before you see one from an hour ago, keep shopping.
In summary, define your use case, use the challenges above as a guide for picking a tool, and implement a cloud-native, real-time data replication solution that was built to last. If I make it sound easy, it’s because with the right tool, it is.