Expert Reveals Data Quality Trends to be Aware of in 2023


Solutions Review’s Expert Insights Series is a collection of contributed articles written by industry experts in enterprise software categories. In this feature, Arcion Founder and CTO Rajkumar Sen offers commentary on the top data quality trends to be aware of for this year and beyond.

Data quality is a measure of a few essential parameters that describe the state of the data, chiefly its accuracy, completeness, consistency, reliability and freshness. Let’s look at why data quality is foundational to almost every key data initiative in modern organizations.

What makes data quality so relevant?

With the rise of cloud analytics platforms like Databricks, Snowflake, BigQuery and others, the idea of real-time analytics has become very popular. For example, real-time analytics is critical in the financial services industry, where firms use real-time data to improve customer offerings, strengthen fraud detection, and react to market trends faster. Real-time analytics use cases can be descriptive, prescriptive, or predictive. Enterprises have accumulated a great deal of valuable data, and the onus is on the data engineering team to modernize the analytics stack so that real-time insights are derived from fresh and accurate data.

Data Quality Trends

Stale Data Can Impact Data Quality

A survey of more than 500 executives, conducted by analyst firm IDC along with InterSystems, found that more than 75 percent of respondents said untimely data had the potential to limit business opportunities. Untimely data is also slowing the pace of business: around 50 percent of respondents said it massively limited operational efficiency, another 27 percent said it had affected productivity and agility, and around 15 percent said it limited their understanding of customers.

For use cases like real-time analytics, data quality is of utmost importance. Data quality in this case largely depends on data latency, because stale data can lead organizations to make critical decisions based on data that is no longer true or relevant. Data staleness can have massive repercussions for the business, so delivering high-quality data to the consuming application at the right time is critical to staying competitive.
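As a rough illustration of treating latency as a data quality signal, the sketch below flags a dataset as stale when its newest record trails the current time by more than a freshness budget. The function names and the five-minute threshold are assumptions made for this example, not figures from the survey or from any particular product.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def data_lag(last_updated: datetime, now: Optional[datetime] = None) -> timedelta:
    """Return how far the newest record trails the current time."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current - last_updated


def is_stale(last_updated: datetime, max_lag: timedelta,
             now: Optional[datetime] = None) -> bool:
    """True when the data is older than the allowed freshness budget."""
    return data_lag(last_updated, now) > max_lag


# Fixed timestamps keep the example deterministic.
now = datetime(2023, 3, 1, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2023, 3, 1, 11, 59, 30, tzinfo=timezone.utc)  # 30 seconds old
stale = datetime(2023, 3, 1, 6, 0, tzinfo=timezone.utc)        # 6 hours old
threshold = timedelta(minutes=5)  # hypothetical freshness budget

print(is_stale(fresh, threshold, now))  # False
print(is_stale(stale, threshold, now))  # True
```

In practice the right budget depends on the consuming application; a fraud detection feed and a weekly report would likely tolerate very different lags.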

Inaccurate Data Sabotages Data Migrations & Replications

Ensuring data accuracy poses several challenges. Enterprises must move terabytes of data daily across tens, or even hundreds, of systems to support a wide range of uses: real-time applications, ML/AI workflows, data availability across continents, and so on. Data teams inevitably have to deal with data quality issues while that data is being streamed. These problems are compounded by DIY scripts, which are omnipresent. The best way to ensure data quality in a large replication process is to deploy real-time ETL or ELT pipelines that can guarantee zero data loss and scale easily. Daily batch scripts written with a legacy mindset are very brittle and need to be rearchitected for the future.

How to Ensure Data Integrity During Migrations & Replications

The most efficient way to ensure data quality for real-time decision-making is to use real-time data replication. Data replication has existed for decades, but an innovative technique for incremental data replication has become popular in the past few years: change data capture (CDC). Using CDC, enterprises can extract changes from a database immediately after they are made, transform them in real time, and apply them to a target system, making data replication incremental, fast, and real time all at once.
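The extract, transform, and apply flow can be sketched in a few lines. This is a simplified, in-memory illustration rather than how any specific CDC product works: the `ChangeEvent` shape, the `transform` step, and the dictionary standing in for the target system are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One captured change. Field names are illustrative, not a real CDC format."""
    op: str                               # "insert", "update", or "delete"
    key: int                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # new row image; None for deletes


def transform(row: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical in-flight transformation: normalize the email column."""
    out = dict(row)
    if "email" in out:
        out["email"] = out["email"].strip().lower()
    return out


def apply_changes(target: Dict[int, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> None:
    """Apply captured changes in order, touching only the rows that changed."""
    for ev in events:
        if ev.op in ("insert", "update"):
            if ev.row is None:
                raise ValueError(f"{ev.op} event for key {ev.key} has no row image")
            target[ev.key] = transform(ev.row)
        elif ev.op == "delete":
            target.pop(ev.key, None)
        else:
            raise ValueError(f"unknown operation: {ev.op}")


# The target starts in sync with an earlier state of the source.
target = {1: {"email": "a@example.com"}, 2: {"email": "b@example.com"}}

# Changes captured from the source since then, in commit order.
events = [
    ChangeEvent("update", 1, {"email": " A.New@Example.com "}),
    ChangeEvent("delete", 2),
    ChangeEvent("insert", 3, {"email": "c@example.com"}),
]

apply_changes(target, events)
print(target)
```

Only three rows are touched here regardless of how large the target is, which is the property that makes the replication incremental.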

The shift from a batch data pipeline to an incremental data pipeline also has cost benefits. A batch data pipeline recomputes every row, burning massive amounts of CPU cycles, whereas an incremental data pipeline recomputes only the delta, that is, the changes. Typically, less than 20 percent of the data in a database changes on a given day, so an incremental data pipeline is almost guaranteed to reduce compute costs. If these pipelines run in the cloud, for example on AWS, the user can expect lower EC2 costs.
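A back-of-the-envelope comparison makes the saving concrete. The sketch below uses rows processed as a stand-in for compute, which is an assumption: real costs also depend on row width, transformation complexity, and fixed overhead. The table size is hypothetical, and the 20 percent change ratio is the upper bound cited above.

```python
def rows_processed(total_rows: int, daily_change_ratio: float) -> dict:
    """Compare daily rows touched by a full batch run and an incremental run."""
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if not 0.0 <= daily_change_ratio <= 1.0:
        raise ValueError("daily_change_ratio must be between 0 and 1")
    batch = total_rows                                    # batch recomputes every row
    incremental = round(total_rows * daily_change_ratio)  # only the delta
    reduction = 1 - incremental / batch if batch else 0.0
    return {"batch": batch, "incremental": incremental, "reduction": reduction}


# Hypothetical 10-million-row table with 20 percent of rows changing per day.
estimate = rows_processed(10_000_000, 0.20)
print(estimate["batch"])        # 10000000
print(estimate["incremental"])  # 2000000
```

Under these assumptions the incremental pipeline touches one fifth of the rows, and the gap widens as the daily change ratio falls below 20 percent.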

Data Consistency While Data is On the Move

Data consistency is also important to ensure data quality for end users. In modern data platforms, data is constantly moving between systems, and at modern data volumes and levels of complexity, systems go out of sync all the time. It is imperative that the data engineering team always validate data before and after the replication process to check its consistency and integrity. This is not an easy task, and historically teams have relied on paid products to do it. For example, several data teams used Oracle Veridata to run data validation checks after replication.

Recently, some companies have built open-source software that can check for data differences and potentially fix them. One such example is the data-diff tool. Most of these tools are offline in nature, which means they work only if the data in the source and target systems is not changing. If the source data is constantly being updated while the checks run, existing tools will report incorrect mismatches. Validating data while it is constantly changing is an active area of work in data engineering. This area is likely to attract growing interest, and we can hope to see many more data management companies either building or open-sourcing similar tools.
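A post-replication consistency check can be sketched as a key-by-key comparison of row digests. This is a minimal illustration and not how Oracle Veridata or data-diff is implemented; the function names and the in-memory tables are assumptions. Like the offline tools described above, it assumes that neither side changes while the comparison runs.

```python
import hashlib
from typing import Any, Dict, List

Row = Dict[str, Any]


def row_digest(row: Row) -> str:
    """Hash a row in a column-order-independent way."""
    canonical = repr(sorted(row.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_tables(source: Dict[int, Row], target: Dict[int, Row]) -> Dict[str, List[int]]:
    """Report keys missing from the target, extra in the target, or mismatched."""
    missing = sorted(k for k in source if k not in target)
    extra = sorted(k for k in target if k not in source)
    mismatched = sorted(
        k for k in source
        if k in target and row_digest(source[k]) != row_digest(target[k])
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


source = {
    1: {"name": "Ada", "balance": 100},
    2: {"name": "Grace", "balance": 250},
    3: {"name": "Linus", "balance": 75},
}
target = {
    1: {"balance": 100, "name": "Ada"},    # same data, different column order
    2: {"name": "Grace", "balance": 200},  # value drifted during replication
    4: {"name": "Ken", "balance": 10},     # row that no longer exists in source
}

report = diff_tables(source, target)
print(report)
```

If the source kept receiving updates during the comparison, rows that were merely in flight would show up as mismatches, which is the false-positive problem the offline tools face.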

What’s Next?

Data quality will continue to be top-of-mind for data engineering teams. Choosing the right technology and purpose-built tools can help mitigate known challenges in maintaining data consistency, data relevance, and data integrity at all times, especially when large swathes of business-critical data are moved across diverse systems and platforms. Taming these key areas of concern will go a long way toward solving data quality challenges.

Rajkumar Sen