This is part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, Arcion Founder and Chief Architect Rajkumar Sen offers a look at Change Data Capture examples, with a complete introduction to CDC technologies.
Data management and utilization have become the differentiating factor for businesses in today’s competitive world. For companies in crowded business markets, data provides the opportunity to gain a competitive edge through a better understanding of customer needs and wants, which leads to strategic product differentiation in the relevant markets.
Moreover, the world is moving toward real-time. One report finds that 80 percent of American consumers are more likely to purchase from a company that personalizes its sales offering. Unfortunately, it’s impossible to deliver real-time personalization if your CRM takes hours or days to update your customer recommendation engine.
Now more than ever, business success depends largely on the data an organization collects, the quality and freshness of that data, and what the organization does with it. A data-driven company that is focused on business success must rethink its data strategy in 2022 and embark on a data modernization journey. One of the most effective ways to modernize your data is to introduce a technology called change data capture (CDC) into your data infrastructure.
This article will take a deep dive into change data capture, its uses, and how it can simplify and improve both your application and data architectures.
Change Data Capture Examples
What Is Change Data Capture?
Change data capture is a methodology or design pattern that identifies and captures changes in your database over time. These changes are recorded, synced near instantly, and sent to a table in the same database, a different database, a streaming platform, or cloud storage. The data can then be queried and acted on at a later time. As soon as the source data changes, CDC syncs the databases, eliminating data silos. CDC allows businesses to make faster and more accurate data-backed decisions with reduced resource expenditure.
Types of CDC
There are several different types of CDC, such as timestamp-based CDC, trigger-based CDC, snapshot-based CDC, and log-based CDC. The two main types, however, are trigger-based CDC and log-based CDC.
Trigger-based CDC has been around for decades. It captures all insert, update, and delete operations performed on tables or databases. For every insert, update, and delete statement, a trigger fires and captures the data manipulation language (DML) statement. This method requires database triggers to be created on the source system to identify the changes that have occurred. Each trigger captures those changes and writes them to another table, typically called a shadow or staging table. Trigger-based CDC is easy to implement, can capture the entire state of the transaction, and is also customizable.
However, there are some known disadvantages to using triggers to enable change data capture. A big disadvantage is the setup of individual triggers for each table. This can lead to costly implementation and management overhead in the case of a large source database. Another challenge with triggers is that for large transactions, there is a significant overhead of doing multiple writes to a database for every insert, update or delete. Also, to apply the changes in the target database, the replication tool needs to connect to the source database at regular intervals; that can put additional load on the source database system and impact performance.
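The trigger-based approach described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production design: the `customers` table, the `customers_changes` shadow table, and the trigger names are all hypothetical, and SQLite stands in for whatever source database is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

-- Shadow table (hypothetical name): one row per captured change.
CREATE TABLE customers_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT NOT NULL,
    id INTEGER,
    name TEXT
);

-- One trigger per DML operation, per table.
CREATE TRIGGER customers_ins AFTER INSERT ON customers
BEGIN
    INSERT INTO customers_changes (op, id, name)
    VALUES ('INSERT', NEW.id, NEW.name);
END;

CREATE TRIGGER customers_upd AFTER UPDATE ON customers
BEGIN
    INSERT INTO customers_changes (op, id, name)
    VALUES ('UPDATE', NEW.id, NEW.name);
END;

CREATE TRIGGER customers_del AFTER DELETE ON customers
BEGIN
    INSERT INTO customers_changes (op, id, name)
    VALUES ('DELETE', OLD.id, OLD.name);
END;
""")

# Ordinary application writes; each one also writes a row to the shadow table.
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

# A replication tool would poll the shadow table in change order.
changes = conn.execute(
    "SELECT op, id, name FROM customers_changes ORDER BY change_id"
).fetchall()
print(changes)
```

Even in this small sketch, three triggers are needed for a single table, and every application write produces a second write to the shadow table, which is the overhead discussed above.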
Log-based CDC, in contrast, works by reading the transaction logs of the source database. By reading the database’s log, you get the complete list of all data changes in their exact order of application. However, every database uses its own custom format to write redo log records to the transaction log, so a customized solution must be built to read those formats and convert them into logical transactions and DML statements that can be written to a target system. Building a CDC solution that reads a transaction log therefore requires significant engineering effort.
Log-based CDC has some clear advantages over a trigger-based solution. First, the transaction log carries additional metadata, e.g., a transaction identifier, which helps the replication process resume after a crash. Second, log-based CDC makes no connection to the source database and runs no extra queries on the source system, making it a zero-impact solution. This is extremely important for large production systems that run at almost full capacity and cannot bear any extra overhead.
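To make the ordering and resume-after-crash ideas concrete, here is a small Python sketch. It assumes the hard part, decoding a database-specific log format, has already been done and the records are available as plain dictionaries. The record shape and the `lsn` (log position) field are hypothetical stand-ins for the metadata a real transaction log would carry.

```python
# Hypothetical, already-decoded log records in commit order.
# Real transaction logs use database-specific binary formats.
LOG = [
    {"lsn": 1, "op": "INSERT", "key": 1, "row": {"name": "Ada"}},
    {"lsn": 2, "op": "INSERT", "key": 2, "row": {"name": "Bob"}},
    {"lsn": 3, "op": "UPDATE", "key": 1, "row": {"name": "Ada L."}},
    {"lsn": 4, "op": "DELETE", "key": 2, "row": None},
]


def apply_log(log, target, last_lsn=0):
    """Apply records newer than last_lsn to target, in log order.

    Returns the position of the last applied record so that a later
    run can resume from it.
    """
    for record in log:
        if record["lsn"] <= last_lsn:
            continue  # already applied before the restart
        if record["op"] == "DELETE":
            target.pop(record["key"], None)
        else:  # INSERT and UPDATE both set the latest row image
            target[record["key"]] = record["row"]
        last_lsn = record["lsn"]
    return last_lsn


target = {}
# Simulate a crash after the first two records were applied.
checkpoint = apply_log(LOG[:2], target)
# On restart, replay the log; records up to the checkpoint are skipped.
checkpoint = apply_log(LOG, target, checkpoint)
print(target, checkpoint)
```

Because the saved position identifies exactly which changes have been applied, the second run neither loses nor repeats a change.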
The Need for Change Data Capture
A business cannot operate on luck or intuition and hope to remain viable in the long run. Businesses need to be strategic in their decisions, and they can only make the right decisions if they have accurate and fresh data. A major benefit of CDC is that it provides fast, fresh, and accurate data so decisions can be made with speed and precision.
Other benefits of CDC include:
- It does not require bulk load updating and batch windows. CDC uses incremental loading or real-time data change streaming to the target repository.
- By sending only incremental changes, CDC reduces the cost of transferring data over the wide area network (WAN).
- With CDC, users can perform zero-downtime database migrations. It also supports real-time analytics, fraud protection, and data synchronization across geographically distributed systems.
- It is ideal for the cloud because it provides an efficient method for incrementally moving data across a WAN.
- CDC ensures data is kept in sync across multiple systems, which is very beneficial if quick decisions need to be made in a high-velocity environment.
- The efficiency of CDC helps to reduce disruptions to production workloads.
These are just a few of the many benefits of change data capture for the average business. Some businesses may be able to extract even more value from it depending on how reliant they are on data.
How Enterprises Leverage Change Data Capture
The following are examples of the different use cases for change data capture:
Use Case #1: Real-Time Data Streaming/Streaming ELT
Traditionally, updates were conducted with extract, transform, load (ETL) operations. This was a long process that involved feeding data to the data warehouse from the operational databases using batch loads. While this process was taking place, operational activities had to be slowed down or stopped altogether.
But with CDC, organizations can continue operations 24/7 without any downtime. High-volume data transfers can be carried out incrementally in real-time without disrupting a company’s operational activities.
Use Case #2: Big Data/Real-Time Analytics
For current business intelligence, CDC can feed changes in data to the analytics platform. This allows organizations to make quick decisions with accurate data. Real-time streaming analytics makes it easier for companies to decide what’s working and what isn’t. This allows for better management of the company’s product and client base.
These benefits have led some companies to further tap into real-time streaming analytics to do the following:
- Fine-tune app features: Companies will push out new features of their apps and act on real-time streaming data to understand customer adoption and ensure success. Real-time analytics also helps with detecting anomalies in the app and even predictive analytics.
- Personalization: Organizations use real-time analytics to help with the personalization of experiences, such as improving search relevance for e-commerce platforms.
- Improve advertising and marketing campaigns: Real-time analytics help companies improve advertising and marketing campaigns to maximize the ROI and stop ineffective spending.
These are just some of the advantages to having real-time analytics and the impact it has on making quick business decisions.
Use Case #3: Database Replications and Migrations
To make sure every resource has the latest version of data, CDC can be used for data replication to various databases, data lakes, or data warehouses. Database replication creates analytic databases as separate copies of the production database. This frees the transactional database from analytical queries while keeping the data in the analytical database fresh and accurate. There are many ways to replicate data, but CDC has become one of the most popular. Instead of copying entire tables during every replication cycle, it copies or updates only the rows that have changed since the last replication.
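The idea of copying only the rows that changed since the last replication cycle can be illustrated with a short Python sketch. It uses a timestamp watermark, one of the CDC types listed earlier, purely for simplicity. The `updated_at` column and the in-memory rows are assumptions, and a log-based tool would obtain the same changes from the transaction log instead.

```python
def replicate_changes(source_rows, target, watermark):
    """Copy rows modified after `watermark` into `target`.

    Returns (rows_copied, new_watermark). Deleted rows are not detected
    by this timestamp approach; that is a known limitation of it.
    """
    copied = 0
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:
            target[row["id"]] = dict(row)
            copied += 1
            new_watermark = max(new_watermark, row["updated_at"])
    return copied, new_watermark


# Hypothetical source table with a last-modified column.
source_rows = [
    {"id": 1, "name": "Ada", "updated_at": 100},
    {"id": 2, "name": "Bob", "updated_at": 105},
    {"id": 3, "name": "Cy", "updated_at": 110},
]
target = {}

# First cycle: everything is new, so all three rows are copied.
copied_first, watermark = replicate_changes(source_rows, target, watermark=0)

# One row changes on the source before the next cycle.
source_rows[1] = {"id": 2, "name": "Robert", "updated_at": 120}

# Second cycle: only the changed row is copied.
copied_second, watermark = replicate_changes(source_rows, target, watermark)
print(copied_first, copied_second, watermark)
```

The second cycle moves one row instead of three, and that gap grows with table size.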
Use Case #4: Event-Driven Architectures
Organizations use event-driven architectures (EDA) to capture insights and communicate changes immediately to help enhance customer experiences or improve organizational efficiency. CDC is inherently event-driven, and any event-driven architecture in an organization can immediately benefit from a CDC feed. Suppose an organization has deployed a modern event-driven architecture with something like Apache Kafka. Databases can then feed CDC events into Kafka topics, and those events can be consumed for event-driven analytics. For example, if an application needs to raise an alert when a particular column value in a source table exceeds a known threshold, it can inspect the CDC events from the source database in Kafka, determine whether the threshold was exceeded, and raise the alert accordingly.
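The threshold-alert example can be sketched in a few lines of Python. To keep the sketch self-contained, a plain list stands in for the stream of events an application would read from a Kafka topic. The event shape (`table`, `op`, `after`) and the `sensors` table are hypothetical, since each CDC tool defines its own event format.

```python
# Hypothetical CDC events; a real consumer would read these from a topic.
events = [
    {"table": "sensors", "op": "INSERT", "after": {"id": 1, "temperature": 70}},
    {"table": "sensors", "op": "UPDATE", "after": {"id": 1, "temperature": 95}},
    {"table": "orders", "op": "INSERT", "after": {"id": 7, "total": 500}},
    {"table": "sensors", "op": "DELETE", "after": None},
]


def find_threshold_alerts(events, table, column, threshold):
    """Return an alert message for each change that pushes `column`
    above `threshold` in the given table."""
    alerts = []
    for event in events:
        after = event.get("after")
        # Skip other tables and deletes (which carry no new row image).
        if event["table"] != table or after is None:
            continue
        value = after.get(column)
        if value is not None and value > threshold:
            alerts.append(
                f"{table}.{column}={value} exceeded {threshold} "
                f"(row id {after['id']})"
            )
    return alerts


alerts = find_threshold_alerts(events, "sensors", "temperature", 90)
print(alerts)
```

The application only inspects the event stream; it never queries the source database to detect the condition.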
The average organization is inundated with enormous amounts of data every day. CDC helps organizations quickly review and analyze data for business insights. Many companies are held back by data that arrives too slowly or is of poor quality because they still use delayed batch processing to sync databases. In today’s landscape, companies need accurate information quickly to remain competitive. Slow processes will not allow an organization to pivot quickly and accurately. The path toward data modernization starts with adopting change data capture.
- Change Data Capture Examples: An Introductory Guide to CDC - September 8, 2022