Data Observability Capabilities

- by Sanjeev Mohan, Expert in Data Management

Data observability is part of the DataOps category, which covers the agile development and testing of data products and the operationalization of data management.

Getting data from diverse data producers to data consumers to meet business needs is a complicated and time-consuming task that often traverses many products. The incoming data elements are enriched, correlated, and integrated so that the consumption-ready data products are meaningful, timely and trustworthy.

Figure 3 shows the key components of data observability.

Figure 3. Data observability workflow

Monitor

Continuous monitoring of the data and its accompanying metadata is the most fundamental capability. A data observability product should have connectors to the stack's subsystems so it can profile data characteristics, calculate statistics, and detect patterns. This information is then stored in a persistence layer and used in the subsequent steps.
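
As an illustration of this profiling step, the sketch below computes basic per-column statistics for one table, assuming the data arrives as a pandas DataFrame via a connector; the table name, query, and `store_profile` persistence call are hypothetical placeholders, not any specific product's API.

```python
import pandas as pd

def profile_table(df: pd.DataFrame) -> dict:
    """Compute basic per-column statistics that a monitor would track over time."""
    profile = {}
    for col in df.columns:
        series = df[col]
        profile[col] = {
            "dtype": str(series.dtype),
            "row_count": len(series),
            "null_fraction": float(series.isna().mean()),
            "distinct_count": int(series.nunique(dropna=True)),
        }
        if pd.api.types.is_numeric_dtype(series):
            profile[col].update({
                "min": float(series.min()),
                "max": float(series.max()),
                "mean": float(series.mean()),
                "stddev": float(series.std()),
            })
    return profile

# Hypothetical usage: profile one batch and persist it for later baseline comparisons.
# snapshot = profile_table(pd.read_sql("SELECT * FROM orders", connection))
# store_profile("orders", snapshot)   # the persistence layer is product-specific
```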

A common question is: what data should an organization monitor? The best practice is not to boil the ocean but to identify critical data elements (CDEs) based on your strategic business priorities. These priorities vary across organizations: some may be interested in identifying new sales opportunities, while others may be more concerned with meeting compliance regulations.

Another common question is: which data sources should an organization monitor? To work around architectural issues, organizations often create multiple copies of data to serve different consumers, so it is imperative to get stakeholder engagement to identify the correct data sources. Once there is agreement, monitor the in-scope application logs, traces, and data packets. The scope of monitoring includes the following (a minimal schema-drift check is sketched after the list):

  • Data and schema drift
  • Volume of data
  • Data quality dimensions, such as completeness, missing values, duplicates, and uniqueness. DAMA International’s Netherlands team has identified 60 dimensions that comply with the ISO 704 standard.
  • Resource usage and configurations
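
As a minimal sketch of schema-drift detection, the code below compares a table's current columns and types against the last recorded snapshot; the snapshot format and example columns are illustrative, not a specific product's representation.

```python
def detect_schema_drift(previous: dict, current: dict) -> list:
    """Compare two {column_name: data_type} snapshots and report drift events."""
    events = []
    for col, dtype in current.items():
        if col not in previous:
            events.append(f"added column '{col}' ({dtype})")
        elif previous[col] != dtype:
            events.append(f"type change on '{col}': {previous[col]} -> {dtype}")
    for col in previous:
        if col not in current:
            events.append(f"dropped column '{col}'")
    return events

# Example: the 'amount' column changed type, 'discount' disappeared, 'coupon_code' appeared.
old = {"order_id": "bigint", "amount": "decimal(10,2)", "discount": "decimal(10,2)"}
new = {"order_id": "bigint", "amount": "varchar", "coupon_code": "varchar"}
print(detect_schema_drift(old, new))
```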

Data monitoring is not a new concept. What is new is that, unlike the static, point-in-time approaches of the past, modern tools are proactive and continuous, and they span various systems not only to flag anomalies but also to analyze their root cause. Data monitoring can be an expensive operation, so modern tools may use cheaper cloud spot instances.

Analyze

Monitoring is such a foundational need that some basic data observability products provide only monitoring with visualization. Most new products, however, provide rich statistical analysis of data movement: the profiled data is compared to a baseline and hidden patterns are detected.

Similarly, analysis of data and metadata helps detect whether a pipeline has failed or is taking longer than expected. By analyzing the volume of data, inferences can be made about undetected failures and drift. If these anomalies are not handled in a timely manner, they can cause downstream operations to fail or produce inaccurate results.
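
One simple way to make such inferences is to compare the latest observation of a metric, such as a daily row count or a pipeline run duration, against a rolling baseline. The z-score threshold in the sketch below is an illustrative choice, not a prescribed value.

```python
from statistics import mean, stdev

def is_anomalous(history: list, latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest observation if it deviates strongly from the recent baseline."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Daily row counts for the last week, followed by today's sudden drop.
daily_rows = [1_020_000, 1_005_000, 998_000, 1_012_000, 1_030_000, 1_008_000, 1_015_000]
print(is_anomalous(daily_rows, 240_000))  # True: likely an undetected upstream failure
```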

Increases in data pipeline latency may be caused by underprovisioned resources when data volumes grow; in such cases, the overall analytics SLAs may be impacted. A comprehensive data observability product should analyze the traffic and recommend the right resource types. Conversely, it should detect when resources are overprovisioned and wasting money.
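
A toy version of such a recommendation looks at recent utilization and suggests scaling up under sustained pressure or scaling down when resources sit idle; the thresholds and sizing steps below are illustrative assumptions, not tuned values.

```python
def recommend_cluster_size(cpu_utilization: list, current_nodes: int) -> int:
    """Recommend a node count from recent average CPU utilization (0.0-1.0)."""
    avg = sum(cpu_utilization) / len(cpu_utilization)
    if avg > 0.80:          # sustained pressure: likely underprovisioned, SLAs at risk
        return current_nodes + max(1, current_nodes // 4)
    if avg < 0.30:          # sustained idle: likely overprovisioned, wasting money
        return max(1, current_nodes - max(1, current_nodes // 4))
    return current_nodes    # utilization in a healthy band

print(recommend_cluster_size([0.91, 0.88, 0.95, 0.90], current_nodes=8))   # -> 10
print(recommend_cluster_size([0.12, 0.20, 0.18, 0.15], current_nodes=8))   # -> 6
```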

Data observability products proactively and dynamically detect drift from expected outcomes. They perform time-series analysis of incremental data over multiple periods, such as daily, weekly, monthly, and quarterly, and continuously retrain their ML models in the process.
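
A simplified illustration of period-aware analysis is to build the baseline per weekday, so that a quiet Sunday is not judged against a busy Monday. The sketch below uses synthetic data, and rebuilding the baseline as new data arrives stands in for the continuous retraining a real product would perform.

```python
import numpy as np
import pandas as pd

def weekday_baseline(metric: pd.Series) -> pd.DataFrame:
    """Per-weekday mean and standard deviation from a daily, datetime-indexed series."""
    grouped = metric.groupby(metric.index.dayofweek)
    return pd.DataFrame({"mean": grouped.mean(), "std": grouped.std()})

def deviates(metric: pd.Series, baseline: pd.DataFrame, day: pd.Timestamp, z: float = 3.0) -> bool:
    """Compare one day's value against the baseline for the same weekday."""
    row = baseline.loc[day.dayofweek]
    return abs(metric.loc[day] - row["mean"]) > z * row["std"]

# Synthetic daily volumes with a weekly pattern; the baseline is rebuilt each period.
idx = pd.date_range("2024-01-01", periods=90, freq="D")
rng = np.random.default_rng(0)
volumes = pd.Series(1_000_000 + idx.dayofweek.to_numpy() * 50_000 + rng.normal(0, 10_000, len(idx)),
                    index=idx)
baseline = weekday_baseline(volumes)
print(deviates(volumes, baseline, idx[-1]))
```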

Alert

Data observability products proactively inform impacted teams of inferred anomalies. This helps resolve issues faster and leads to higher pipeline uptime.

One of the biggest problems with observability tools is “alert fatigue”. It happens when the system constantly generates more alerts than the team can consume. As a result, highly critical alerts are lost in an ocean of notifications and go unattended.

Data observability products intelligently handle notifications to reduce alert fatigue. For example, they may aggregate alerts by category or allow the notification rules to be customized.
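
As a minimal illustration, the sketch below collapses raw alerts into one digest entry per dataset and severity; real products offer far richer routing, suppression, and customization, so this only sketches the aggregation idea.

```python
from collections import defaultdict

def aggregate_alerts(alerts: list) -> list:
    """Collapse raw alerts into one digest line per (dataset, severity) group."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["dataset"], alert["severity"])].append(alert["message"])
    digest = []
    for (dataset, severity), messages in sorted(groups.items()):
        digest.append(f"[{severity}] {dataset}: {len(messages)} alert(s), e.g. '{messages[0]}'")
    return digest

raw_alerts = [
    {"dataset": "orders", "severity": "critical", "message": "row count dropped 76%"},
    {"dataset": "orders", "severity": "critical", "message": "null rate spike in amount"},
    {"dataset": "customers", "severity": "warning", "message": "schema drift: new column"},
]
for line in aggregate_alerts(raw_alerts):
    print(line)
```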

Incident Management

Traditionally, when errors are detected, they are fixed in downstream systems. In other words, we remediate the symptoms, not the root cause. Fixing data problems in downstream applications creates technical debt. This approach is not sustainable, but it is common because IT teams cannot detect the source of errors in a complex pipeline.

Data observability detects problems closer to their origin and assesses their impact on downstream applications (“impact analysis”). It uses pipeline lineage to visually inspect hot spots, then kicks off an incident management workflow to quickly remediate the problems.
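
A minimal sketch of lineage-based impact analysis: given a directed lineage graph and the node where a problem was detected, walk downstream to list every asset that could be affected. The adjacency-list representation and asset names are assumptions for illustration; products typically derive lineage from pipeline metadata.

```python
def downstream_impact(lineage: dict, failed_node: str) -> set:
    """Return every asset reachable downstream of the failing node (breadth-first walk)."""
    impacted, queue = set(), [failed_node]
    while queue:
        node = queue.pop(0)
        for child in lineage.get(node, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# Hypothetical lineage: raw table -> staging -> marts -> dashboard.
lineage = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["mart.daily_revenue", "mart.customer_ltv"],
    "mart.daily_revenue": ["dashboard.exec_kpis"],
}
print(downstream_impact(lineage, "raw.orders"))
```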

The incident management process enables collaboration across business units and should automate the necessary steps. However, some data quality remediation happens in the source systems and cannot always be automated; here, the data observability product integrates with the necessary solutions in the stack to ensure quick resolution.

Some problems should be remediated automatically, with no need for human intervention. For example, if Spark clusters are underprovisioned, the product should not only make recommendations but also kick off a process to auto-tune the configuration parameters within the cost threshold.
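
To illustrate auto-remediation bounded by cost, the sketch below applies a recommended scale-up only as far as a cost threshold allows; `apply_cluster_config` is a hypothetical hook standing in for whatever resize API the orchestration layer exposes.

```python
def auto_tune(current_nodes: int, recommended_nodes: int,
              hourly_cost_per_node: float, max_hourly_cost: float) -> int:
    """Scale toward the recommendation, but never past the cost threshold."""
    affordable_nodes = int(max_hourly_cost // hourly_cost_per_node)
    target = min(recommended_nodes, affordable_nodes)
    if target != current_nodes:
        apply_cluster_config(target)   # hypothetical hook into the orchestration layer
    return target

def apply_cluster_config(nodes: int) -> None:
    """Placeholder: in practice this would call the cluster manager's resize API."""
    print(f"resizing cluster to {nodes} node(s)")

print(auto_tune(current_nodes=8, recommended_nodes=12,
                hourly_cost_per_node=3.5, max_hourly_cost=38.0))  # capped at 10 nodes
```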

Feedback

We have seen how anomalies in data, pipelines, configurations, and business operations are detected and remediated expediently. The last step is to ensure that data observability is a continuous process that evolves with the system and is used to maintain SLAs.

Operational feedback, such as latency and missing data, is easy to calculate and analyze, but business feedback is what will drive data observability adoption. Examples include the ability to perform multi-attribute data quality checks, or to deploy data products more frequently because of higher transparency. This feedback will ensure that data observability products are actively deployed and will open doors to even more use cases, as the next section explains.
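
As an example of a multi-attribute check, the rule below validates a condition that spans several fields of a record at once rather than one column in isolation; the field names and tolerance are illustrative.

```python
def check_order_consistency(record: dict, tolerance: float = 0.01) -> list:
    """Multi-attribute rule: fields must be individually present AND mutually consistent."""
    errors = []
    amounts = [record.get(k) for k in ("gross_amount", "discount", "net_amount")]
    if any(a is None for a in amounts):
        errors.append("missing amount fields")
    elif abs(amounts[0] - amounts[1] - amounts[2]) > tolerance:
        errors.append("net_amount does not equal gross_amount - discount")
    if record.get("ship_date") and record.get("order_date") and record["ship_date"] < record["order_date"]:
        errors.append("ship_date precedes order_date")
    return errors

order = {"gross_amount": 120.0, "discount": 20.0, "net_amount": 95.0,
         "order_date": "2024-03-01", "ship_date": "2024-02-28"}
print(check_order_consistency(order))  # both rules fail for this record
```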