
Four Technological Solutions to Improve Data Quality for AI Initiatives

Solutions Review’s Contributed Content Series is a collection of contributed articles written by thought leaders in enterprise technology. In this feature, Dataddo’s co-founder and CEO Petr Nemeth offers commentary on several solutions for improving data quality for AI initiatives.

In discussions about AI tools and workloads, the emphasis tends to be on optimizing machine learning models, rather than ensuring they are fed high-quality data. But, if garbage in means garbage out, then keeping data quality high is just as important as building and training the models.

Some organizations are learning this the hard way. According to McKinsey, “industry leaders looking to leverage their data to power sophisticated AI models are discovering that poor data quality is a consistent roadblock for the highest-value AI use cases.”

People-focused solutions for data quality, like instituting a comprehensive data governance policy, will remain important, but they need to be supplemented by technological solutions for standardizing and flagging questionable data as early as possible in the AI lifecycle. This is why organizations that lack the appropriate technologies and tools are struggling to move AI initiatives into production.

How data is collected and prepared fundamentally determines the success of AI initiatives.

In the article that follows, I will discuss four technology-based data quality solutions essential to the success of AI initiatives, as well as some of the considerations involved in implementing them.


Data Quality for AI

Technologies, Tooling, Takeaways for AI-Friendly Data Quality

Keep in mind that most data quality tools specialize in one of the following four solutions but also offer one or more of the other three with varying degrees of robustness.

Data Integration

At this point, data integration is a self-evident area of data quality for AI, because most machine learning models require regular, automated input of data from various sources.

What may not be self-evident is how data integration tools, aside from actually integrating data, improve data quality.

Many data integration tools are extract, transform, load (ETL) tools, meaning they transform (or standardize) disparate data (e.g., date formats) before loading it into a destination. This makes the data machine-readable right at the moment of collection.
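As a minimal illustration of the kind of transform step an ETL pipeline might apply, here is a sketch that normalizes mixed date formats to a single ISO 8601 format. The list of known formats is an assumption for the example, not a reference to any particular tool:

```python
from datetime import datetime

# Hypothetical mix of date formats arriving from different sources
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"]

def to_iso_date(raw: str) -> str:
    """Normalize a date string to ISO 8601 (YYYY-MM-DD) during the transform step."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

records = ["2024-01-15", "15/01/2024", "Jan 15, 2024"]
print([to_iso_date(r) for r in records])  # all three become "2024-01-15"
```

Applied at load time, a transform like this means every downstream consumer sees one consistent representation instead of three.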

Moreover, integration tools today tend to offer some capabilities for filtering, labeling, monitoring, or all of the above (these capabilities are discussed below).

Extract, load, transform (ELT) tools also have their place in the data integration space, but they are not ideal for AI workloads since they load raw, unstandardized data into the destination. If an ELT tool is nevertheless used to integrate data for an AI initiative, transformations can, of course, be done later in a data warehouse.

Takeaway: Standardizing data at the earliest point in collection ensures an essential standard of quality.

Data Profiling and Filtering

In addition to integrating and standardizing disparate datasets, it’s important to keep outliers, anomalies, missing values, and duplicate values out of downstream systems because they can mislead machine learning models and produce false outcomes.

This is where profiling and filtering technologies come in. While there are dedicated profiling and filtering tools on the market, remember that data integration tools often have some kind of embedded profiling/filtering technology as well. Many filtering technologies are themselves AI-based.
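As a rough sketch of the kind of filtering described here, the following removes missing values, duplicates, and outliers from a toy dataset. The use of pandas and the interquartile-range (IQR) rule are my assumptions for illustration, not features of any specific product:

```python
import pandas as pd

# Hypothetical training records; 10_000 is an obvious outlier,
# one row is a duplicate, and one value is missing
df = pd.DataFrame({"amount": [100, 102, 98, 10_000, 101, 101, None]})

df = df.dropna()            # remove rows with missing values
df = df.drop_duplicates()   # remove duplicate rows

# Drop outliers using the IQR rule (a robust choice for small samples,
# where a mean/standard-deviation test can be skewed by the outlier itself)
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(df["amount"].tolist())  # [100.0, 102.0, 98.0, 101.0]
```

The order matters: deduplicating before computing the outlier bounds keeps repeated values from distorting the statistics.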

Despite advances in filtering technologies, manually creating visualizations in a traditional business intelligence tool (as opposed to generating them automatically in a modern visualization tool) is still a very effective way to detect and remove anomalies, and is ultimately necessary for thorough validation of training data.

Also, keep in mind that even though filtering can be automated, the filters themselves should be updated manually on a regular basis, so that relevant data is not left out as needs change.

Lastly, it’s extremely important to mention that data collection for AI initiatives can be a major privacy risk. So, whenever possible, personally identifiable information (PII) should be filtered out of any training datasets.
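A very simple form of PII filtering is pattern-based redaction. The sketch below is a deliberately minimal illustration; the two regex patterns are my own examples, and real PII detection requires far broader coverage (names, addresses, phone numbers, and so on):

```python
import re

# Illustrative patterns only; production PII detection needs much more coverage
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace anything matching a PII pattern with a placeholder tag."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(redact_pii("Contact jane.doe@example.com, SSN 123-45-6789"))
# Contact [EMAIL], SSN [SSN]
```

Running redaction like this at the point of collection keeps sensitive values from ever reaching a training dataset.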

Takeaway: Filtering out anomalous and sensitive data should be done upon initial collection (via a data integration tool), and—in many cases—continued downstream using a dedicated filtering tool. Manual filter updates and data validation checks will still be necessary.

Dataset Labeling

Once enough datasets for training have been collected, they need to be labeled. Data labels (or metadata) are extremely important for preserving the context of disparate, yet standardized datasets. For example, standardization of amounts in different currencies across datasets could completely skew AI outcomes if the datasets are not labeled correctly.
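To make the currency example concrete, here is a small sketch of how a label preserves context during standardization. The exchange rates and record structure are invented for illustration:

```python
# Hypothetical exchange rates; values are made up for illustration
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1, "JPY": 0.007}

records = [
    {"amount": 100, "currency": "USD"},  # the "currency" label preserves context
    {"amount": 100, "currency": "EUR"},
    {"amount": 100, "currency": "JPY"},
]

# With labels, amounts can be standardized correctly...
standardized = [r["amount"] * RATES_TO_USD[r["currency"]] for r in records]

# ...without them, 100 JPY and 100 USD would look identical to a model.
print(standardized)
```

If the `currency` label were missing or wrong, all three amounts would be treated as equal, which is exactly the kind of skew the paragraph above warns about.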

Metadata is also important for privacy, as it may contain information about data ownership, data controllers, access rights, usage, and third parties, as well as other details relevant to privacy concerns.

There are dataset labeling tools for machine learning applications, some with embedded AI technology, but these work best if they “learn” from human-labeled datasets. Indeed, fully automated labeling tools tend to generate “weak” (i.e., less accurate) labels, which work best for use cases where a larger volume of data may compensate for the weakness of the labels. Manual labeling and intervention are therefore still necessary, and may be best for AI initiatives involving smaller volumes of data, assuming the labelers are properly trained.

Takeaway: No matter how clean data is, if it’s not labeled correctly during preparation, models will not produce accurate outputs. Dedicated tools do much to help with this, but automation should be used with caution.

Data Monitoring and Lineage

Monitoring technologies go hand in hand with filtering and profiling technologies, because they alert data teams to issues like anomalies and outliers in real or near-real time.
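The core of such a monitor can be sketched in a few lines: keep a window of recent values and alert when a new one deviates too far from it. The window size, warm-up length, and three-sigma threshold below are all illustrative choices, not a description of any particular product:

```python
from collections import deque
import statistics

class StreamMonitor:
    """Toy monitor: alert when a new value deviates from the recent window."""

    def __init__(self, window: int = 50, sigmas: float = 3.0):
        self.values = deque(maxlen=window)  # rolling window of recent values
        self.sigmas = sigmas

    def observe(self, value: float) -> bool:
        """Return True if the value should trigger an alert."""
        alert = False
        if len(self.values) >= 10:  # wait for a warm-up sample before alerting
            mean = statistics.fmean(self.values)
            std = statistics.stdev(self.values)
            if std > 0 and abs(value - mean) > self.sigmas * std:
                alert = True
        self.values.append(value)
        return alert

monitor = StreamMonitor()
readings = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 500]
alerts = [v for v in readings if monitor.observe(v)]
print(alerts)  # [500]
```

In practice, a real monitoring tool would route such alerts to a notification channel rather than a list, but the detection logic is the same idea.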

Without question, the data integration tools, databases, data ops tools, and data visualization tools used for AI initiatives should have embedded monitoring capabilities; however, for cases where more robust monitoring is needed (e.g., more customization, reduced load on database), there is also a range of dedicated database monitoring tools on the market.

Dedicated data lineage tools are essential for understanding how data is sourced, transformed, and consumed from end to end. They help troubleshoot issues (e.g., discovering the source of bias in an AI model), ensure compliance with data governance policies, and provide transparency for auditing and regulatory purposes.
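At its simplest, lineage is a graph of which datasets were derived from which. The sketch below, with invented dataset names, shows how recording each step’s inputs lets you trace any dataset back to its original sources (e.g., to find where bias entered a training set):

```python
# A toy lineage record: each dataset notes the inputs it was derived from
lineage = {
    "raw_sales":     {"inputs": []},
    "raw_customers": {"inputs": []},
    "clean_sales":   {"inputs": ["raw_sales"]},
    "training_set":  {"inputs": ["clean_sales", "raw_customers"]},
}

def trace_sources(dataset: str) -> set:
    """Walk the lineage graph back to the original sources of a dataset."""
    inputs = lineage[dataset]["inputs"]
    if not inputs:                       # no inputs: this is an original source
        return {dataset}
    return set().union(*(trace_sources(i) for i in inputs))

print(sorted(trace_sources("training_set")))  # ['raw_customers', 'raw_sales']
```

Dedicated lineage tools capture this kind of graph automatically, at the level of tables, columns, and transformation jobs, but the traversal they enable is the same.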

Just as with filtering tools, lineage tools increasingly use AI-based technologies to flag anomalies and outliers. Some also offer AI-based visualization functionality; but beware—these can sometimes lead to false positives, so, as mentioned above, manual visualization should still be used.

Takeaway: Make sure all the tools you are using for AI workloads have native monitoring systems in place, and use a data lineage tool to establish observability across all of them.

Half the Battle, All the Advantage

For as long as humans have been collecting data, organizational solutions like policies and methodologies have been essential for maintaining its quality, and they remain essential. By themselves, however, they are decidedly insufficient for AI workloads; they must be implemented alongside the right technologies and tooling.

By the same token, technology-based data quality solutions alone can only win half the battle for machine learning success. But they provide the advantage that many companies today are lacking.
