Expert Advice for Evaluating Data Observability Software

By Lior Gavish

This is part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, Monte Carlo Data CTO and Co-Founder Lior Gavish provides expert advice for evaluating data observability software.

It’s Wednesday morning, and your phone won’t stop buzzing. You wake up to messages from your CMO saying, “The numbers in this report don’t seem right…again.” You drop what you’re doing and begin to troubleshoot the issue at hand. Meanwhile, other team members across the organization are duplicating your efforts, and your CMO is left in the dark because no updates are going out to the rest of the organization. As all of this is going on, you get texts from John in Finance about an errant table in his spreadsheet, and from Eleanor in Operations about a query that pulled “interesting” results.

If this situation sounds familiar to you, know you’re not alone. We call this phenomenon data downtime, and it’s becoming increasingly common for even the most robust and well-staffed data teams. Data downtime refers to periods of time when data is missing, erroneous, or otherwise inaccurate, and data teams spend upwards of 40 percent of their time tackling it instead of working on revenue-driving projects.

As organizations increasingly rely on data to drive decision-making and power digital products, the need for trustworthy, reliable data at every stage (ingestion, storage, processing, transformation, and analysis) has never been greater. Simply put, organizations can no longer afford for data to be down, i.e., partial, inaccurate, missing, or erroneous.

By applying the same principles of application observability and infrastructure design to data systems, data teams can ensure data is usable, actionable, and most importantly, trustworthy. Modern data teams are investing in data observability platforms to monitor, alert on, and resolve anomalies and other issues in their data.

The Rise of Data Observability

Data teams need a way to seamlessly monitor and alert for issues with the data feeding their dashboards, giving them a holistic view of the health and reliability of their data assets.

To tackle this, data observability automatically monitors key features of your data ecosystem, including data freshness, distribution, volume, schema, and lineage. Without the need for manual threshold setting, data observability answers such questions as:

When was my table last updated?
Is my data within an accepted range?
Is my data complete? Did 2,000 rows suddenly turn into 50?
Who has access to our marketing tables and made changes to them?
Where did my data break? Which tables or reports were affected?

With the right approach to data observability, data teams can trace field-level lineage across entire data workflows, gaining greater visibility into the health of their data and the insights their pipelines deliver. Such functionality allows data engineers, analysts, and scientists to identify why their dashboards aren’t pulling the freshest data for their stakeholders (e.g., is there a missing data set? A null value? Did someone use the CSV file type instead of XLS?).

A data observability platform must be able to monitor and alert for the following five pillars of observability:

Freshness: is the data recent? When was the last time it was generated? What upstream data is included/omitted?
Distribution: is the data within accepted ranges? Is it properly formatted? Is it complete?
Volume: has all the data arrived?
Schema: what is the schema, and how has it changed? Who has made these changes and for what reasons?
Lineage: for a given data asset, what are the upstream sources and downstream assets which are impacted by it? Who are the people generating this data, and who is relying on it for decision-making?
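
To make the first and third pillars concrete, here is a minimal Python sketch of a freshness check and a volume check run against table metadata. The TableSnapshot record, the function names, and the fixed limits (a 24-hour maximum age, a 50 percent maximum drop) are hypothetical and chosen only for illustration; they are not any vendor's API, and an observability platform would typically infer such limits from history rather than hard-code them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TableSnapshot:
    """Hypothetical metadata record for one table (illustrative only)."""
    name: str
    last_updated: datetime
    row_count: int


def check_freshness(snap, now, max_age):
    """Freshness: True if the table was updated within max_age of now."""
    return now - snap.last_updated <= max_age


def check_volume(snap, previous_row_count, max_drop_ratio=0.5):
    """Volume: True unless the row count fell by more than max_drop_ratio."""
    if previous_row_count == 0:
        return True  # nothing to compare against
    drop = (previous_row_count - snap.row_count) / previous_row_count
    return drop <= max_drop_ratio
```

With these definitions, a table last loaded 30 hours ago fails a 24-hour freshness check, and a table that shrank from 2,000 rows to 50 fails the volume check. Both checks read only metadata (timestamps and counts), never the rows themselves.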

An effective, proactive data observability solution will connect to your existing stack quickly and seamlessly, providing end-to-end lineage that allows you to track downstream dependencies. Additionally, it will automatically monitor your data at rest without requiring the extraction of data from your data store. This approach helps you meet the highest levels of security and compliance requirements and scales to the most demanding data volumes.

How to Select the Right Solution for Your Stack

When choosing a data observability vendor for your company’s needs, there are five key features you should look for:

End-to-end visibility

To ensure your data team is the first to know about data downtime through automated monitoring and alerting, your data observability platform should:

Infer information about table operations, such as load patterns and expected volume
Detect anomalies based on historical data and patterns
Track table updates and alert teams when updates don’t occur as expected
Track changes in data volume in individual tables and alert teams to abnormal size changes
Track and alert on schema changes, distribution changes in low-cardinality fields, and changes in null rates, uniqueness, and other value-level metrics within select fields
Allow team members to create custom thresholds, including multiple/dual thresholds, for anomalies
Group related anomalies across tables based on inferred dependencies
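
As a rough illustration of detecting anomalies from historical patterns, the Python sketch below flags a new row count that sits far from the historical mean, measured in standard deviations (a z-score). This is one simple statistical technique chosen for illustration, not a description of how any particular platform works; the function names and the default threshold of 3.0 are assumptions. The second function sketches a user-defined dual threshold.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Return True if latest deviates from history by more than z_threshold sigmas."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # any change from a constant history is unusual
    return abs(latest - mu) / sigma > z_threshold


def outside_custom_bounds(value, lower, upper):
    """Custom dual threshold: True if value falls outside [lower, upper]."""
    return value < lower or value > upper
```

Given daily row counts of 2000, 2010, 1990, 2005, and 1995, a new count of 2003 is treated as normal, while a count of 50 is flagged. Real systems usually also account for trend and seasonality, which this sketch ignores.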

Rapid, ML-based detection and resolution of data downtime

To help your team resolve data quality issues swiftly and automatically, your data observability platform should:

Automatically create data lineages to display upstream and downstream data relations, including BI reports and dashboards
Filter and intelligently route alerts by dataset based on dataset owners
Automatically understand and prioritize issue resolution based on business impact
Enable incident management collaboration in a centralized interface with comprehensive activity logs to speed up root cause analysis across each stage of the pipeline
Offer API access to all information presented in the UI for customization and/or workflow integration
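
One way to picture how lineage supports impact analysis is as a walk over a dependency graph. The Python sketch below assumes lineage is available as a plain mapping from each asset to its direct downstream assets; the mapping format, the asset names, and the function name are hypothetical. It uses a breadth-first traversal to list everything affected by a broken table, including reports and dashboards.

```python
from collections import deque


def downstream_assets(lineage, start):
    """Return every asset reachable downstream of start, nearest first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # guards against cycles and duplicates
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage: asset -> direct downstream assets
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["analytics.revenue", "analytics.churn"],
    "analytics.revenue": ["dashboard.cmo_report"],
}
```

Calling downstream_assets(LINEAGE, "raw.orders") returns the staging table, both analytics tables, and the CMO dashboard, which is the list of owners an alert would need to reach. Reversing the mapping gives the upstream view used for root cause analysis.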

Unified, self-service platform

When it comes to data trust, you should be able to understand the health of your data from a central, all-in-one UI. Long gone are the days of data silos and playing the bad data name game between data engineer and analyst teams. With data observability, all stakeholders are able to collaborate in a single, self-service platform.

This interface should:

Make it easy to search for and explore data assets with a simple UI
Collect and display information required for investigating and resolving issues
Deliver all the relevant information required to conduct root cause analysis, down to the field level
Map out data incidents over time to make it easy to view impacted tables and every action that was taken to manage and resolve an incident
Share comprehensive query logs that reveal periodic ETL queries, ad hoc/backfill queries, changes in query patterns, and other hints that help teams identify the root cause of data incidents
Seamlessly connect to Slack, Opsgenie, PagerDuty, webhooks, email, or your communication channel of choice to alert the individuals who need to know about downtime
Display sample data, to help users immediately understand what data involved in the incidents looks like, and what typical data looks like

Automated data discovery and metadata management

To support the growing demand for data democratization and decentralized data ownership, your data observability platform should:

Dynamically create a data catalog that enables data discoverability and searchability
Include self-service diagnostic tools that perform data profiling and understand data lineage
Provide standard reporting for data quality dimensions on data sets
Deliver value-add insights on table importance, monitor coverage, unused tables, and other information
Provide information on queries with deteriorating performance
Offer a centralized interface for self-service incident analysis, impact assessments, and cleansing requirements
Allow users to track and discover details on any dataset or environment
Automatically update schema metadata and information, without requiring any manual changes
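
Keeping schema metadata current without manual work comes down to comparing successive snapshots of column metadata. The Python sketch below assumes each snapshot is a simple mapping of column name to type name; that representation and the function name are hypothetical, not a specific catalog's format.

```python
def diff_schema(old, new):
    """Compare two {column: type} snapshots and report what changed."""
    added = {c: t for c, t in new.items() if c not in old}
    removed = {c: t for c, t in old.items() if c not in new}
    retyped = {
        c: (old[c], new[c])  # (previous type, current type)
        for c in old
        if c in new and old[c] != new[c]
    }
    return {"added": added, "removed": removed, "retyped": retyped}
```

A non-empty result can update the catalog entry and, where the change looks breaking (a removed or retyped column), raise an alert to the table's owner.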

Security-first architecture

To ensure your data’s full protection and security, your data observability platform should:

Monitor data at rest by extracting query logs, metadata, and statistics about data usage—without exposing your data warehouse, lake, or other infrastructure to external environments
Offer SOC-2 Type II certification
Never extract or store individual records, PII, or other sensitive information outside of your environment
Allow you to comply with HIPAA, PCI, GDPR, CCPA, FINRA, and other compliance frameworks to which you are subject
Allow easy and simple deployment with little to no ongoing operational overhead and frequent automatic upgrades

With these pieces in place, your data observability solution will be able to accelerate the adoption of data at your company and keep your CMO’s ad-hoc messages about “missing data” at bay. And who knows, you may even get a few more hours of precious sleep in the process.

This article was written by Lior Gavish on January 26, 2022

Lior Gavish

Lior Gavish is CTO and Co-Founder of Monte Carlo, a data reliability company backed by Accel, Redpoint Ventures, GGV, ICONIQ Growth, and Salesforce Ventures. Prior to Monte Carlo, Lior co-founded cybersecurity startup Sookasa, which was acquired by Barracuda in 2016. At Barracuda, Lior was SVP of Engineering, launching award-winning ML products for fraud prevention. Lior holds an MBA from Stanford and an MSc in Computer Science from Tel-Aviv University.
