Expert Advice for Evaluating Data Observability Software


This is part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, Monte Carlo Data CTO and Co-Founder Lior Gavish provides expert advice for evaluating data observability software.

It’s Wednesday morning, and your phone won’t stop buzzing. You wake up to messages from your CMO: “The numbers in this report don’t seem right…again.” You drop what you’re doing and begin to troubleshoot, while team members across the organization duplicate your efforts and your CMO is left in the dark because no updates are going out to the rest of the company. Meanwhile, John in finance texts you about an errant table in his spreadsheet, and Eleanor in operations flags a query that pulled “interesting” results.

If this situation sounds familiar to you, know you’re not alone. We call this phenomenon data downtime, and it’s becoming increasingly common for even the most robust and well-staffed data teams. Data downtime refers to periods of time when data is missing, erroneous, or otherwise inaccurate, and data teams spend upwards of 40 percent of their time tackling it instead of working on revenue-driving projects.

As organizations increasingly rely on data to drive decision-making and power digital products, the need for the data being ingested, stored, processed, analyzed, and transformed to be trustworthy and reliable has never been higher. Simply put, organizations can no longer afford for data to be down, i.e., partial, inaccurate, missing, or erroneous.

By applying the same principles of application and infrastructure observability to data systems, data teams can ensure data is usable, actionable, and, most importantly, trustworthy. Modern data teams are investing in data observability platforms that monitor for, alert on, and help resolve anomalies and other issues in their data.

The Rise of Data Observability

Data teams need a way to seamlessly monitor and alert for issues with the data feeding their dashboards, giving them a holistic view of the health and reliability of their data assets.

To tackle this, a data observability platform automatically monitors key attributes of your data ecosystem, including data freshness, distribution, volume, schema, and lineage. Without the need for manual threshold setting, data observability answers questions such as:

  • When was my table last updated?
  • Is my data within an accepted range?
  • Is my data complete? Did 2,000 rows suddenly turn into 50?
  • Who has access to our marketing tables and made changes to them?
  • Where did my data break? Which tables or reports were affected?
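
In practice, several of these questions reduce to simple checks over table metadata. As an illustrative sketch (not any vendor’s implementation), the freshness and volume questions above might look like this in Python, with the thresholds as stand-in assumptions:

```python
from datetime import datetime, timedelta, timezone

# Illustrative sketch of two basic observability checks; the thresholds
# (24 hours, 50 percent deviation) are assumptions, not vendor defaults.

def is_stale(last_updated, max_age_hours=24, now=None):
    """Freshness: was the table updated within the expected window?"""
    now = now or datetime.now(timezone.utc)
    return now - last_updated > timedelta(hours=max_age_hours)

def volume_anomaly(history, latest, tolerance=0.5):
    """Volume: did the row count deviate sharply from its recent
    average? (e.g., 2,000 rows suddenly turning into 50)"""
    baseline = sum(history) / len(history)
    return abs(latest - baseline) > tolerance * baseline
```

A real platform infers the expected update cadence and volume baseline from history rather than hard-coding them, but the underlying comparisons have this shape.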

With the right approach to data observability, data teams can trace field-level lineage across entire data workflows, gaining greater visibility into the health of their data and the insights those pipelines deliver. Such functionality allows data engineers, analysts, and scientists to identify why their dashboards aren’t pulling the freshest data for their stakeholders (i.e., is there a missing data set? A null value? Did someone use the CSV file type instead of XLS?).

A data observability platform must be able to monitor and alert for the following five pillars of observability:

  • Freshness: is the data recent? When was the last time it was generated? What upstream data is included/omitted?
  • Distribution: is the data within accepted ranges? Is it properly formatted? Is it complete?
  • Volume: has all the data arrived?
  • Schema: what is the schema, and how has it changed? Who has made these changes and for what reasons?
  • Lineage: for a given data asset, what are the upstream sources and downstream assets which are impacted by it? Who are the people generating this data, and who is relying on it for decision-making?

An effective, proactive data observability solution will connect to your existing stack quickly and seamlessly, providing end-to-end lineage that allows you to track downstream dependencies. Additionally, it will automatically monitor your data-at-rest without requiring the extraction of data from your data store. This approach ensures that you meet the highest levels of security and compliance requirements and scale to the most demanding data volumes.

How to Select the Right Solution for Your Stack

When choosing the right data observability vendor for your company’s needs, there are five key features you should look for:

End-to-end visibility 

To ensure your data team is the first to know about data downtime through automated monitoring and alerting, your data observability platform should:

  • Infer information about table operations, such as load patterns and expected volume
  • Detect anomalies based on historical data and patterns
  • Track table updates and alert teams when updates don’t occur as expected
  • Track changes in data volume in individual tables and alert teams to abnormal size changes
  • Track and alert on schema changes, distribution changes in low cardinality fields, and null rates, uniqueness, and other changes in values within select fields
  • Allow team members to create custom thresholds, including multiple/dual thresholds, for anomalies
  • Group related anomalies across tables based on inferred dependencies
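
The anomaly-detection and threshold bullets above can be sketched in a few lines, assuming a simple z-score model over historical values; the cutoff and the (low, high) “dual threshold” interface are illustrative assumptions, not a specific product’s logic:

```python
import statistics

# Illustrative sketch: flag anomalies against history without manually
# set thresholds, while still honoring a user-defined dual threshold.

def detect_anomaly(history, value, z_cutoff=3.0, bounds=None):
    """Return True if `value` looks anomalous.

    - If `bounds` (low, high) is given, apply the custom dual threshold.
    - Otherwise infer an expected range from history (mean +/- z * stdev).
    """
    if bounds is not None:
        low, high = bounds
        return not (low <= value <= high)
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0  # guard zero-variance history
    return abs(value - mean) / stdev > z_cutoff
```

Production systems use richer models (seasonality, trend, load patterns), but the interplay between inferred and custom thresholds follows this pattern.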

Rapid, ML-based detection and resolution of data downtime 

To help your team resolve data quality issues swiftly and automatically, your data observability platform should:

  • Automatically create data lineages to display upstream and downstream data relations, including BI reports and dashboards
  • Filter and intelligently route alerts by dataset based on dataset owners
  • Automatically understand and prioritize issue resolution based on business impact
  • Enable incident management collaboration in a centralized interface with comprehensive activity logs to speed up root cause analysis across each stage of the pipeline
  • Offer API access to all information presented in the UI for customization and/or workflow integration
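
Lineage-driven alert routing, as described in the first two bullets, can be sketched as a graph traversal. The table names, owners, and adjacency structure below are hypothetical, purely to show the shape of the logic:

```python
from collections import deque

# Hypothetical lineage: each table maps to its direct downstream assets.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue", "marts.churn"],
    "marts.revenue": ["dashboard.cmo_report"],
}
# Hypothetical dataset owners, used to route alerts.
OWNERS = {"marts.revenue": "finance-team", "dashboard.cmo_report": "analytics-team"}

def downstream_of(table, lineage):
    """Breadth-first walk collecting every downstream asset."""
    seen, queue = set(), deque([table])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def route_alert(table, lineage, owners):
    """Owners to notify for an incident on `table` and its descendants."""
    affected = {table} | downstream_of(table, lineage)
    return sorted({owners[t] for t in affected if t in owners})
```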

Unified, self-service platform 

When it comes to data trust, you should be able to understand the health of your data from a central, all-in-one UI. Long gone are the days of data silos and the bad-data blame game between data engineering and analyst teams. With data observability, all stakeholders are able to collaborate in a single, self-service platform.

This interface should:

  • Make it easy to search for and explore data assets with a simple UI
  • Collect and display information required for investigating and resolving issues
  • Deliver all the relevant information required to conduct root cause analysis, down to the field level
  • Map out data incidents over time, making it easy to view impacted tables and every action taken to manage and resolve an incident
  • Share comprehensive query logs that reveal periodic ETL queries, ad hoc/backfill queries, changes in query patterns, and other hints that help teams identify the root cause of data incidents
  • Seamlessly connect to Slack, Opsgenie, PagerDuty, webhooks, email, or your communication channel of choice to alert about downtime to the individuals who need to know
  • Display sample data to help users immediately understand what the data involved in an incident looks like, compared to typical data
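
Connecting alerts to a channel like Slack typically means posting a JSON payload to an incoming webhook. A hedged sketch of formatting such a payload, with placeholder table and report names:

```python
import json

# Illustrative: format a data-downtime alert as a Slack incoming-webhook
# payload. Table and report names are placeholders; adapt the message
# shape to your alerting channel of choice (Opsgenie, PagerDuty, email).

def build_alert(table, issue, affected_reports):
    text = (
        f":rotating_light: Data incident on `{table}`: {issue}\n"
        f"Affected reports: {', '.join(affected_reports) or 'none detected'}"
    )
    return json.dumps({"text": text})
```

Delivering it is then a single HTTP POST of that payload to your webhook URL.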

Automated data discovery and metadata management 

To support the growing demand for data democratization and decentralized data ownership, your data observability platform should:

  • Dynamically create a data catalog that enables data discoverability and searchability
  • Include self-service diagnostic tools that perform data profiling and understand data lineage
  • Provide standard reporting for data quality dimensions on data sets
  • Deliver value-add insights on table importance, monitor coverage, unused tables, and other information
  • Provide information on queries with deteriorating performance
  • Offer a centralized interface for self-service incident analysis, impact assessments, and cleansing requirements
  • Allow users to track and discover details on any dataset or environment
  • Automatically update schema metadata and information, without requiring any manual changes
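
Automatic schema-metadata updates rest on diffing snapshots of column metadata. A small illustrative sketch, assuming snapshots are captured as {column: type} mappings:

```python
# Illustrative sketch: detect schema changes by diffing the current
# column set against the previously recorded metadata snapshot.

def schema_diff(previous, current):
    """Compare two {column: type} snapshots and report changes."""
    prev_cols, curr_cols = set(previous), set(current)
    return {
        "added": sorted(curr_cols - prev_cols),
        "removed": sorted(prev_cols - curr_cols),
        "retyped": sorted(
            c for c in prev_cols & curr_cols if previous[c] != current[c]
        ),
    }
```

A platform would run this comparison on every metadata refresh and raise a schema alert whenever any of the three buckets is non-empty.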

Security-first architecture 

To ensure your data’s full protection and security, your data observability platform should:

  • Monitor data at rest by extracting query logs, metadata, and statistics about data usage—without exposing your data warehouse, lake, or other infrastructure to external environments
  • Offer SOC-2 Type II certification
  • Never extract or store individual records, PII, or other sensitive information outside of your environment
  • Allow you to comply with HIPAA, PCI, GDPR, CCPA, FINRA, and other compliance frameworks you are subject to
  • Allow easy and simple deployment with little to no ongoing operational overhead and frequent automatic upgrades

With these pieces in place, your data observability solution will be able to accelerate the adoption of data at your company and keep your CMO’s ad-hoc messages about “missing data” at bay. And who knows, you may even get a few more hours of precious sleep in the process.

Lior Gavish