3 Data Quality Stages for Preparing Machine Learning Data
This is part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, dotData Founder and CEO Ryohei Fujimaki offers commentary on data quality strategies to get your data machine learning-ready.
As the world embraces machine learning (ML) and artificial intelligence (AI), data leaders are adjusting and perfecting their data quality management frameworks. Traditionally, data quality has had two stages: raw, unprofiled data, and cleansed data that is free of common errors and typically used for business intelligence (BI). But companies at the forefront of data-driven decision-making have realized that data quality needs to level up, and this is where ML-ready data comes in.
Data teams innovating in data quality have created a third level: ML-ready data. This new data phase requires additional preparation to meet the high standards required by ML and AI. Leading companies refer to the three stages of data as Bronze, Silver, and Gold. Additionally, new technologies designed to support, accelerate, and automate the transition of data between these stages are disrupting the industry.
Bronze, Silver, and Gold: The New Data Standards
The concept of Bronze, Silver, and Gold data defines not only the characteristics of the data but also the structure, data layers, data flows, data warehouses, and architectures, as Databricks explains. By creating a series of data layers, teams can better guarantee quality as data flows from one layer to the next. Each layer has its own validation and transformation standards.
Additionally, with this approach, data states are not lost: each layer keeps the data as it was at that stage. This allows for better security, isolation, and verification, because data teams can restart data quality processes when needed.
Data Quality Stages
The Bronze Layer: Raw Data as It Lands
The Bronze layer is where the data used in an organization’s systems begins its journey. In this layer, data is generated, gathered, or ingested, and the first rough standardization and formatting are done. Bronze data is appended incrementally and grows over time. When organizations store raw data, experts recommend keeping its history, metadata, and original format. This allows data teams to recreate the subsequent data states: Silver and, consequently, Gold.
Databricks recommends that users keep Bronze-layer data in Delta format, the open table format used by Delta Lake. Data in JSON or XML formats can be converted to Delta as well.
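As a rough illustration of append-only Bronze ingestion, the Python sketch below wraps each raw payload with ingestion metadata and never modifies rows that have already landed. It is only a sketch under stated assumptions: the function names, the metadata fields, and the "crm" source are hypothetical, and a plain in-memory list stands in for the Bronze table that a real platform would persist in a table format.

```python
import json
from datetime import datetime, timezone


def to_bronze_record(raw_payload: str, source: str, fmt: str) -> dict:
    """Wrap one raw payload with ingestion metadata, leaving the payload untouched."""
    return {
        "raw": raw_payload,   # original content, stored verbatim
        "source": source,     # hypothetical source-system name
        "format": fmt,        # original format, e.g. "json" or "xml"
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }


def append_to_bronze(bronze: list, payloads: list, source: str, fmt: str) -> list:
    """Append-only load: existing Bronze rows are never updated or deleted."""
    bronze.extend(to_bronze_record(p, source, fmt) for p in payloads)
    return bronze


# Two incremental loads; the table only grows.
bronze_table = []
append_to_bronze(bronze_table, ['{"id": 1, "city": "NYC"}'], "crm", "json")
append_to_bronze(bronze_table, ['{"id": 2, "city": "new york"}'], "crm", "json")

# Because the raw payload is kept as-is, later layers can be rebuilt from it.
first_payload = json.loads(bronze_table[0]["raw"])
```

Keeping the payload verbatim alongside its source, format, and load time is what makes it possible to replay the Silver and Gold steps later without going back to the source system.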
The Silver Layer: Enterprise Central Repository
In the Silver layer, data scientists and engineers can accelerate data quality transformations. Data stored at this stage has already been validated, either manually or by automated technologies. After the final steps of the Silver layer, this data is ready for BI analytics, while more advanced BI analytics that leverage AI can use data from the Gold layer.
Silver data is cleaner and filtered to give a more refined view. In this stage, data teams join tables, and constraints are added for better integrity. Then Silver data can be stored in accurate datasets with solid structures. Analytics teams can query the data and put it to work to meet business goals.
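To make the idea of joining tables and adding integrity constraints concrete, here is a minimal Python sketch. The table shapes, column names, and the two constraints (every order must reference a known customer, and amounts must be present and non-negative) are assumptions chosen for illustration, not rules taken from any particular platform.

```python
def build_silver_orders(orders: list, customers: list) -> tuple:
    """Join orders to customers and enforce two simple integrity constraints."""
    customers_by_id = {c["customer_id"]: c for c in customers}
    silver, rejected = [], []
    for order in orders:
        customer = customers_by_id.get(order.get("customer_id"))
        amount = order.get("amount")
        # Constraint 1: the order must reference a known customer.
        # Constraint 2: the amount must be present and non-negative.
        if customer is None or amount is None or amount < 0:
            rejected.append(order)
            continue
        # Join: enrich the order with the customer's name.
        silver.append({**order, "customer_name": customer["name"]})
    return silver, rejected


customers = [{"customer_id": 1, "name": "Acme"}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 250.0},
    {"order_id": 11, "customer_id": 2, "amount": 40.0},  # unknown customer
    {"order_id": 12, "customer_id": 1, "amount": -5.0},  # negative amount
]
silver_orders, rejected_orders = build_silver_orders(orders, customers)
```

Returning the rejected rows separately, rather than silently dropping them, is one possible design choice; it lets a data team inspect what failed validation.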
The Silver layer is where teams first match, merge, transform, and clean the data. Traditionally, this layer was known as a Central Repository or Data Domain, where data “fit for business” is stored. This layer provides an “enterprise view” of data, a single source of truth consolidated from varied sources. While some operations, such as those that require rapid internal business decision-making, are still done with data from the Silver layer, more advanced BI analytics, ML, and other workloads can be executed using more refined data from the Gold layer.
The approach recommended for loading data into the Silver layer is Extract-Load-Transform (ELT), which replaces the traditional Extract-Transform-Load (ETL). ELT accelerates the process because only minimal transformations and enterprise-level rules are applied as the data is loaded.
Additionally, tools that automatically perform data cleansing, such as string value canonicalization, duplicate record removal, missing value imputation, and outlier elimination, support the work of data teams and help them move data through the stages.
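The four cleansing steps named above can be sketched in a few lines of Python. This is an illustrative sketch only: the column names, the choice of median imputation, and the 1.5 × IQR outlier rule are assumptions made for the example, and automated tools may use different techniques.

```python
import statistics


def canonicalize(value: str) -> str:
    """Canonicalize a string: trim, collapse inner whitespace, lowercase."""
    return " ".join(value.split()).lower()


def cleanse(rows: list) -> list:
    # 1. String value canonicalization.
    rows = [{**r, "city": canonicalize(r["city"])} for r in rows]

    # 2. Duplicate record removal, keeping the first occurrence of each id.
    seen, unique = set(), []
    for r in rows:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique.append(r)

    # 3. Missing value imputation with the median of the observed amounts.
    observed = [r["amount"] for r in unique if r["amount"] is not None]
    median = statistics.median(observed)
    unique = [
        {**r, "amount": median if r["amount"] is None else r["amount"]}
        for r in unique
    ]

    # 4. Outlier elimination using the 1.5 * IQR rule (an assumed threshold).
    q1, _, q3 = statistics.quantiles([r["amount"] for r in unique], n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [r for r in unique if low <= r["amount"] <= high]


raw_rows = [
    {"id": 1, "city": "  New   York ", "amount": 100},
    {"id": 1, "city": "new york", "amount": 100},   # duplicate record
    {"id": 2, "city": "Boston", "amount": None},    # missing value
    {"id": 3, "city": "BOSTON", "amount": 110},
    {"id": 4, "city": "Chicago", "amount": 90},
    {"id": 5, "city": "Austin", "amount": 105},
    {"id": 6, "city": "Denver", "amount": 10000},   # outlier
]
clean_rows = cleanse(raw_rows)
```

The order of the steps matters in this sketch: imputing before removing outliers means the extreme value still influences the median, which is one reason the median is used here rather than the mean.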
The Gold Layer: Product Data Standards, ML-Ready
Gold data is used by organizations to share data with their clients, partners, and external sources. It is also the data that powers advanced BI analytics, ML, and AI applications. Gold data is highly refined and aggregated. While Bronze data replicates the source, Gold data is read-optimized and highly structured, displays rapidly, and contains only essential information.
Having a Gold layer also accelerates the work of data scientists and data engineers developing ML models, because they can turn to this layer for their source data. This removes the costly and time-consuming manual process that data teams still follow today, where they scan raw data, profile it, and cleanse it before they can identify features.
Microsoft — also incorporating the Bronze, Silver, and Gold data-layer approach in the Azure Databricks analytics platform — explains that the Gold layer stores aggregate data, which data scientists use for ML model preparation, training, and BI.
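As a small illustration of what "aggregated" and "read-optimized" can mean in practice, the Python sketch below rolls cleansed Silver rows up into one Gold row per city. The column names and the chosen aggregates (count, total, and average) are hypothetical; a real Gold table would be shaped by the specific BI or ML use case.

```python
from collections import defaultdict


def build_gold_revenue_by_city(silver_rows: list) -> list:
    """Aggregate cleansed Silver rows into one summary row per city."""
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for row in silver_rows:
        agg = totals[row["city"]]
        agg["order_count"] += 1
        agg["total_amount"] += row["amount"]
    # Keep only the essential columns, sorted by city for predictable reads.
    return [
        {
            "city": city,
            **agg,
            "avg_amount": agg["total_amount"] / agg["order_count"],
        }
        for city, agg in sorted(totals.items())
    ]


silver_rows = [
    {"id": 1, "city": "boston", "amount": 100.0},
    {"id": 2, "city": "boston", "amount": 110.0},
    {"id": 3, "city": "chicago", "amount": 90.0},
]
gold_rows = build_gold_revenue_by_city(silver_rows)
```

Row-level identifiers are dropped on purpose here: the Gold table carries only what a dashboard or a feature pipeline needs to read quickly.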
Why Should Leaders Shift to the Three-Layer Approach?
The Bronze, Silver, and Gold approach is used by companies looking to break down silos, become data-driven organizations, and develop efficient and productive ML models that leverage the power of data.
Today, data scientists moving data from Bronze to Silver still follow a largely manual process. But it is an essential process since raw data in the Bronze stage is often too dirty, complex, and detailed to be useful for any type of analytics — even for BI.
After ETL or ELT processes, Silver data is clean enough and ready “enough” for traditional BI applications. But it is not yet fit for Machine Learning. Duplicate values, outliers, and missing values are just some examples of data problems that are not that critical in the world of BI but could cause issues for ML.
The data industry continues to prove that the transformation of tools, processes, and technologies is part of its evolution as the role of data and its uses expand. While logical, the Bronze, Silver, and Gold approach requires a shift in mentality, new investments, and resources. Data processes will continue to evolve as innovations demand new standards. It’s essential to understand that data quality management and data lifecycles are tightly interconnected with ML and AI and their efficient performance.