Data is King: The Role of Data Capture and Integrity in Embracing AI

By Alexandra Anghel
Best Practices

In the world of AI, data serves as the foundation for machine learning models to identify trends and patterns, which they then use to make predictions and decisions on new, unseen data. In general, the more data a model is trained on, the more accurate it can become. Media coverage of AI often notes that certain machine learning models were trained on hundreds of gigabytes of data. However, data quality is more important than size.

Just having a lot of data is not sufficient for training a good model. The saying “garbage in, garbage out” is a well-known concept in computing, indicating that flawed input data or instructions will generate flawed outputs. Data quality concerns are frequently overlooked in ML research and education, with major textbooks focusing on the mathematical foundation of ML and using clean, organized, and pre-labeled “toy” datasets.

Despite this, implementing ML in a particular domain has to take into account that real-world data is flawed. Any ML engineer or data scientist who works on putting ML models into production knows this well, as most of the challenges in building models that output quality results are data-related.

Why do some ML models need a lot of data?

Put simply, an ML model is a combination of a dataset and the algorithm used to train on that particular dataset. It follows, then, that the same algorithm trained on different datasets will produce very different results.

Some machine learning models require thousands of examples, while others need only a handful. Above all, clean data is vital. The saying “bad data in, bad data out” applies fully to machine learning.

In general, the more complex the problem, the more data the model will need to learn and make accurate predictions. Additionally, if the data is noisy or contains many outliers, the model may require more data to filter out these anomalies.

When a model is trained on a limited amount of data, it may not have enough examples to generalize accurately to new data. The result is overfitting, where the model learns the dataset “by heart”, or underfitting, where it fails to capture the underlying patterns in the data. Either way, performance is poor when predictions are generated on new data.
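The difference between the two failure modes can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the synthetic data (y = 2x plus noise), the tiny training set of 8 points, and the three toy "models" (a constant predictor that underfits, a lookup-style memorizer that overfits, and a least-squares line as a reasonable fit).

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def make_data(n):
    """Synthetic data (an assumption for illustration): y = 2x + noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 2) for x in xs]
    return xs, ys


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = make_data(8)    # deliberately tiny training set
test_x, test_y = make_data(200)    # unseen data from the same process

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)


def underfit(x):
    """Ignores x entirely and always predicts the training mean."""
    return mean_y


def overfit(x):
    """Memorizes the training set: returns the label of the nearest stored x."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# Least-squares line through the training points (closed-form solution).
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) / sum(
    (x - mean_x) ** 2 for x in train_x
)
intercept = mean_y - slope * mean_x


def linear(x):
    return slope * x + intercept


for name, model in [("underfit", underfit), ("overfit", overfit), ("linear", linear)]:
    train_err = mse(model, train_x, train_y)
    test_err = mse(model, test_x, test_y)
    print(f"{name:<10} train MSE = {train_err:7.2f}   test MSE = {test_err:7.2f}")
```

The memorizer reproduces its training labels exactly, so its training error is zero, yet it still makes errors on unseen data. The constant predictor is poor on both sets. Comparing training and test error in this way is a common first check for either problem.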

Why is data quality important?

Having more data is not always better, as the quality of the data is equally important. Poor quality data can negatively impact the performance of the model, even if there is a large amount of it.

The accuracy of the model’s predictions is highly dependent on the quality of the data it has been trained on. If the data is noisy, inconsistent, or contains errors, the model is likely to learn and propagate these errors, resulting in inaccurate predictions.

The basic building blocks, such as model architectures, public datasets, and ML training algorithms, have been available in scientific publications for a few years. However, a lot of engineering time went into putting them together to create usable products. One key ingredient has been the quality of the data: removing incorrect and duplicate data collected from the internet, plus adding human annotations where natural labels are not enough.

Data quantity vs. data quality

In machine learning, there is often a trade-off between the quantity and quality of data. More data can lead to better performance of a machine learning model, but only if that data is of high quality. On the other hand, even a small amount of high-quality data can produce a useful machine learning model, for example when techniques such as active learning are used to choose which examples to label. In such cases, you can also derive additional examples from a small, high-quality dataset to enlarge it.
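As a rough illustration of active learning, the sketch below uses uncertainty sampling, one common strategy: from a pool of unlabeled examples, it picks the ones the current model is least sure about and sends only those for human labeling. The function name `select_most_uncertain` and the probability values are hypothetical and stand in for the output of a real binary classifier.

```python
def select_most_uncertain(probabilities, k):
    """Return the indices of the k examples whose predicted probability
    of the positive class is closest to 0.5 (the most uncertain ones)."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Hypothetical model outputs: P(class = 1) for six unlabeled examples.
unlabeled_probs = [0.97, 0.52, 0.08, 0.45, 0.88, 0.61]

to_label = select_most_uncertain(unlabeled_probs, k=2)
print(to_label)  # → [1, 3]
```

Labeling effort then goes to the examples at indices 1 and 3, where the model is close to guessing, rather than to examples it already classifies with confidence.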

A few considerations to keep in mind when searching for the balance between the amount and quality of data:

Collecting and labeling a massive amount of data can be costly and time-consuming.
If the data is low quality, it may lead to a model with poor accuracy.
Data can be validated, cleaned, and preprocessed to fix some errors, for example by removing bad examples or filling in missing values.
If you have a huge dataset, you don’t have to use all of it, as training a model on such a dataset is expensive. Instead, you can experiment by varying the dataset size to measure how much data is required to reach optimal performance.
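The experiment described in the last point can be sketched as follows. This is a minimal illustration under stated assumptions: the data is synthetic (y = 3x + 1 plus noise), the "model" is a simple least-squares line, and the subset sizes are arbitrary. With a real model, the same loop would retrain on growing subsets and record a validation metric each time.

```python
import random

rng = random.Random(42)  # local generator with a fixed seed for repeatability


def make_data(n):
    """Synthetic data (an assumption for illustration): y = 3x + 1 + noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3 * x + 1 + rng.gauss(0, 2) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Closed-form least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model on a dataset."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = make_data(2000)   # the "huge" dataset, scaled down
val_x, val_y = make_data(500)        # held-out data used for every comparison

sizes = [5, 20, 100, 500, 2000]
results = []
for size in sizes:
    # Train on the first `size` examples only, then score on the same validation set.
    model = fit_line(train_x[:size], train_y[:size])
    results.append((size, mse(model, val_x, val_y)))

for size, err in results:
    print(f"{size:5d} training examples -> validation MSE {err:.3f}")
```

Once the validation error stops improving noticeably as the subset grows, adding more data of the same kind is unlikely to justify the extra training cost.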

It is important to consider the specific task and context and determine the appropriate amount and quality of data required for building a successful machine learning model.

Where to start when collecting data for ML models

When going into data collection with the purpose of developing an ML model, start by asking yourself the following questions:

Is the data accurate and error free?
Is the data relevant to the problem you are trying to solve?
Is the data complete, with enough examples to train the machine learning model effectively?
Is the data consistent, or does it contain conflicting or contradictory information?
Does the data reflect a real-world scenario?
Is the data unbiased?
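Some of these questions can be partly automated. The sketch below is a minimal example, assuming a small list of labeled text records. The field names `text` and `label`, the sample records, and the function `quality_report` are all hypothetical. It checks for missing values, exact duplicates, conflicting labels, and label balance. Questions such as relevance and real-world representativeness still need human judgment.

```python
from collections import Counter, defaultdict

# Hypothetical labeled records; the field names are assumptions for illustration.
records = [
    {"text": "great product", "label": "positive"},
    {"text": "great product", "label": "positive"},   # exact duplicate
    {"text": "arrived late", "label": "negative"},
    {"text": "arrived late", "label": "positive"},    # conflicting label
    {"text": None, "label": "negative"},              # missing feature
    {"text": "works fine", "label": "positive"},
]


def quality_report(rows, feature="text", target="label"):
    """Run a few basic completeness and consistency checks on labeled rows."""
    # Completeness: indices of rows with a missing feature or label.
    missing = [
        i for i, r in enumerate(rows) if r[feature] is None or r[target] is None
    ]

    # Duplicates: count repeated (feature, label) pairs beyond the first copy.
    seen = Counter((r[feature], r[target]) for r in rows)
    duplicates = sum(count - 1 for count in seen.values())

    # Consistency: the same feature value should not carry different labels.
    labels_per_feature = defaultdict(set)
    for r in rows:
        if r[feature] is not None:
            labels_per_feature[r[feature]].add(r[target])
    conflicts = sorted(f for f, labels in labels_per_feature.items() if len(labels) > 1)

    # Balance: a heavily skewed label distribution is one rough signal of bias.
    label_counts = dict(Counter(r[target] for r in rows))

    return {
        "missing": missing,
        "duplicates": duplicates,
        "conflicts": conflicts,
        "label_counts": label_counts,
    }


report = quality_report(records)
for check, result in report.items():
    print(f"{check}: {result}")
```

A report like this does not fix anything by itself, but it shows where curation effort is needed before training starts.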

The required volume of data depends on the complexity of the problem you are trying to solve. Quality data is crucial for ensuring the accuracy and fairness of machine learning models. Plan to carefully curate, preprocess, and validate your data so that it meets the standards the problem requires.

This article was written by Alexandra Anghel on September 15, 2023
