The Last Mile Problem: Why Your AI Models Stumble Before the Finish Line
Encord’s Ulrik Stig Hansen offers insights on the last mile problem and why your AI models stumble before the finish line. This article originally appeared on Solutions Review’s Insight Jam, an enterprise IT community enabling the human conversation on AI.
In 2023, AI was the buzzword of the year. Enterprises across industries invested heavily in AI proofs of concept (POCs), eager to explore the technology’s potential. Fast-forward to 2024, and companies face a new challenge: moving AI initiatives from prototype to production.
According to Gartner, by 2025, at least 30 percent of GenAI projects will be abandoned after the POC stage. The reasons? Poor data quality, governance gaps, and the absence of clear business value. Companies are realizing that the primary challenge isn’t simply building models; it’s ensuring the quality of the data feeding them. As they push from prototype to production, the biggest roadblock is curating the right data.
The Data Dilemma: Why More Isn’t Always Better
In the early days of AI development, the prevailing belief was that more data leads to better results. As AI systems have grown more sophisticated, however, data quality has come to matter more than quantity, for several reasons. First, large datasets are often riddled with errors, inconsistencies, and biases that can silently skew model outcomes. With an excess of data, it becomes harder to control what the model learns, and the model may overfit to the training set, reducing its effectiveness on new data. Second, the majority concepts in a dataset tend to dominate the training process, diluting the signal from minority concepts and weakening generalization. Third, processing massive datasets slows iteration cycles, so critical decisions take longer as data volume grows. Finally, storing and processing large datasets is costly, especially for smaller organizations or startups.
Organizations must strike a delicate balance between having enough data to train robust models and ensuring that it’s the right data. This means moving beyond data accumulation and focusing on data quality. By investing in practices like cleaning, validation, and enrichment, companies can ensure that their AI models are not only built on a solid foundation of high-quality data but are also well-prepared to scale and perform effectively in real-world production environments.
The Price of Bad Data: The Ripple Effects of Poor Data Quality on AI Innovation
A study by IBM found that poor data quality costs the United States economy around $3.1 trillion annually. Across industries, this issue is the root cause of AI initiatives stalling after proof of concept, draining resources, and blocking companies from achieving full production-scale AI.
Beyond direct financial losses, failed AI projects incur significant indirect costs, including wasted time and computational resources. Most critically, these failures represent missed opportunities for competitive advantage and can damage both internal and external reputations. Repeated failures can also create a culture of risk aversion, stifling the very innovation that AI promises to deliver.
What Makes AI Data High-Quality?
Research indicates that data scientists spend approximately 80 percent of their time preparing and organizing data before they can conduct any meaningful analysis.
To overcome the root challenge of poor data quality, high-performance AI datasets must exhibit five key characteristics: accuracy in reflecting real-world scenarios, consistency in format and structure, diversity to enhance adaptability, relevance to specific objectives, and ethical considerations in data collection and labeling.
To illustrate the importance of these characteristics, consider Automotus, a company that automates payments for vehicle unloading and parking. Automotus struggled with poor data quality, including duplicate and corrupt images, which made it difficult to convert vast amounts of image data into labeled training datasets for its AI models. By using data quality tools to curate the dataset and remove the bad examples, the company achieved a 20 percent improvement in mean Average Precision (mAP) for its object detection models. The leaner dataset also cut labeling costs by 33 percent, demonstrating that investing in data quality can yield both performance improvements and economic benefits.
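The specific tooling Automotus used isn’t detailed here, but a first pass at this kind of curation can be as simple as screening out files that won’t decode and exact byte-for-byte duplicates. The sketch below assumes a hypothetical raw_images/ folder of JPEGs and uses Pillow for the decode check; it is a minimal illustration, not a reconstruction of Automotus’s pipeline.

```python
import hashlib
from pathlib import Path

from PIL import Image, UnidentifiedImageError  # Pillow

def curate_image_dir(src: Path) -> tuple[list[Path], list[Path]]:
    """Split images into keepers and rejects (corrupt files or exact duplicates)."""
    seen: set[str] = set()
    keep: list[Path] = []
    reject: list[Path] = []

    for path in sorted(src.glob("*.jpg")):
        # Reject files Pillow cannot decode (truncated or corrupt images).
        try:
            with Image.open(path) as img:
                img.verify()
        except (UnidentifiedImageError, OSError):
            reject.append(path)
            continue

        # Reject exact byte-for-byte duplicates via a content hash.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            reject.append(path)
            continue

        seen.add(digest)
        keep.append(path)

    return keep, reject

if __name__ == "__main__":
    keep, reject = curate_image_dir(Path("raw_images"))  # hypothetical folder
    print(f"kept {len(keep)} images, removed {len(reject)} corrupt or duplicate files")
```

A more thorough pass would add near-duplicate detection (for example, perceptual hashing) and class-balance checks, but even a simple filter like this keeps obviously bad examples out of the labeling queue.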
Unlocking Success: How to Achieve High-Quality Data
To navigate the challenges of AI development, organizations need to take the following concrete steps to enhance their data practices:
- Establish Clear Data Governance Policies: Organizations should create comprehensive data governance policies that outline roles, responsibilities, and standards for data management. These guidelines ensure uniform data quality throughout the organization, reducing the risk of poor data impacting decision-making.
- Implement Rigorous Data Cleaning Techniques: Employ techniques such as outlier detection, imputation for missing values, and normalization to maintain the integrity of datasets (a minimal sketch follows this list). These practices help ensure that the data used for AI models is accurate and reliable.
- Invest in Accurate Labeling Processes: High-quality labels are essential for model precision. Automated data labeling can offer significant advantages over manual labeling by reducing costs and streamlining the process. However, a hybrid approach that combines automated tools with human oversight can enhance accuracy by leveraging the strengths of both methods (see the routing sketch after this list).
- Source Data from Diverse and Reliable Sources: Companies should seek diverse data sources to reduce bias and improve model performance. Examples include public datasets, industry-specific databases, and third-party data providers. Ensuring these sources are credible is crucial for maintaining data quality.
- Leverage Advanced Data Management Tools: To ensure ongoing AI performance, leverage advanced data management tools to continuously curate and update training datasets. Data distributions can change over time in production environments, and these tools can help companies adapt datasets accordingly.
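To make the cleaning step concrete, here is a minimal sketch of the second item in practice: IQR-based outlier filtering, median imputation, and standardization applied to a small, invented tabular feature set. The column names and values are purely illustrative assumptions; a production pipeline would tune each step to the data at hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

def clean_features(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning pass: drop gross outliers, impute gaps, normalize scale."""
    numeric = df.select_dtypes(include=[np.number]).copy()

    # Outlier detection: drop rows outside 1.5 * IQR on any numeric column.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    in_range = ~((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).any(axis=1)
    numeric = numeric[in_range]

    # Imputation: fill remaining missing values with the column median.
    imputed = SimpleImputer(strategy="median").fit_transform(numeric)

    # Normalization: zero mean, unit variance per column.
    scaled = StandardScaler().fit_transform(imputed)

    return pd.DataFrame(scaled, columns=numeric.columns, index=numeric.index)

if __name__ == "__main__":
    raw = pd.DataFrame({
        "dwell_time_s": [42.0, 38.5, np.nan, 41.2, 900.0],   # 900.0 is a gross outlier
        "vehicle_length_m": [5.1, 4.8, 5.3, np.nan, 5.0],
    })
    print(clean_features(raw))
```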
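Similarly, the hybrid labeling approach in the third item often comes down to a confidence-based routing rule: auto-accept model-proposed labels above a threshold and queue the rest for human annotators. The sketch below is a toy version of that rule; the Prediction structure, the 0.9 threshold, and the example records are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    """Hypothetical record for a model's proposed label on one image."""
    image_id: str
    label: str
    confidence: float  # model score in [0, 1]

def route_predictions(preds, auto_accept_threshold: float = 0.9):
    """Split model-proposed labels into auto-accepted ones and a human review queue."""
    auto_accepted, needs_review = [], []
    for p in preds:
        (auto_accepted if p.confidence >= auto_accept_threshold else needs_review).append(p)
    return auto_accepted, needs_review

if __name__ == "__main__":
    preds = [
        Prediction("img_001", "delivery_van", 0.97),
        Prediction("img_002", "passenger_car", 0.62),  # low confidence -> human review
    ]
    accepted, review_queue = route_predictions(preds)
    print(f"auto-labeled: {len(accepted)}, sent to annotators: {len(review_queue)}")
```

The threshold is the key design choice: set it too low and labeling errors leak into the training set; set it too high and the cost savings of automation disappear.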
The Path Forward: Elevating Data Quality and Scaling AI
The demand for high-quality data will only grow as AI adoption increases. Gartner predicts that by 2025, enterprises will process 75 percent of their data outside traditional data centers or the cloud, highlighting the need for new strategies to maintain data quality in distributed environments. To confront these obstacles, key innovations are emerging in the field of data quality, including automated data checks, machine learning for data cleaning, privacy-preserving methods for training models on distributed data, and the generation of synthetic data to enhance real datasets.
These advancements are making it possible – and easy – for every company to create a data-centric culture. By prioritizing data quality, companies aren’t merely avoiding pitfalls; they’re unlocking AI’s full potential and setting new industry standards. It’s time to rally around the power of quality data—not just for competitive advantage, but to elevate the entire AI ecosystem. As AI continues to mature, the question isn’t “Do we have enough data?” Instead, it’s time to ask, “Do we have the right data to power the AI solutions of tomorrow?”