Solutions Review’s Expert Insights Series is a collection of contributed articles written by industry experts in enterprise software categories. In this feature, data.world Co-founder and Chief Product Officer Jon Loyens asks and answers the question: “Business lineage vs. technical lineage; what’s the difference?”
Ask a random sample of people working in data governance to explain data lineage, and you’ll get an answer along the lines of, “lineage shows where data comes from, how it’s been changed, and where it goes.”
In a broad sense, this is true. But explanations from people in your random sample might vary, because the specific definition and makeup of lineage varies depending on the use case.
Is lineage a map of your data and analytics, a graph of nodes and edges that describes (and sometimes visually shows) the journey your data takes from start to finish: from raw source data, to transformed data, to computed metrics, and everything in between?
Or is it the path from a specific metric, backward to what defines that metric, to the part of your enterprise to which it’s related, to who owns it, to the tables, pipelines, and processes that compute it?
The truth is, data lineage means all those things. Here, I’ll explore the difference between and uses for two frequently used, and frequently confused, types of lineage: business lineage and technical lineage.
Business Lineage vs. Technical Lineage
Data lineage delivers a top-down picture of your data and analytics ecosystem. It provides clear visibility into where your data is coming from, where it’s going, and how it’s been changed.
Lineage can also be thought of as the “provenance” of the data, with “provenance” defined as, “the beginning of something’s existence; something’s origin” or “a record of ownership of a work of art or an antique, used as a guide to authenticity or quality.” Both definitions of provenance apply to data lineage.
When creating lineage metadata, there are several great open standards you can adopt. One of these is OpenLineage, which represents the technical provenance of data, along with the systems and processes that move and transform the data into something usable for analysis and data science work. These standards are great if you’re considering only the first definition of provenance. But what if you’re trying to drive trust and understanding along with specifying custody?
Sometimes we need to be more expansive. For these more extensive lineage requirements, the best option is a standard from the Semantic and Linked Data community that can represent both technical and semantic ideas in a single graph. That standard is PROV-O, and it’s particularly well-suited to building a representation of technical lineage and representing semantic and business concepts at the same time.
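To make the single-graph idea concrete, here is a minimal sketch in plain Python that holds technical and business statements side by side as subject-predicate-object triples. The `prov:` terms are taken from the PROV-O vocabulary; every `ex:` name, and the `objects` helper, are hypothetical and exist only for illustration. A real deployment would more likely use an RDF library and a triple store.

```python
# PROV-O style statements as (subject, predicate, object) triples.
# The prov: terms come from PROV-O; every ex: name is hypothetical.
TRIPLES = [
    # Technical lineage: what was made from what, and by which process.
    ("ex:raw_events", "rdf:type", "prov:Entity"),
    ("ex:exceptions_table", "rdf:type", "prov:Entity"),
    ("ex:nightly_load", "rdf:type", "prov:Activity"),
    ("ex:nightly_load", "prov:used", "ex:raw_events"),
    ("ex:exceptions_table", "prov:wasGeneratedBy", "ex:nightly_load"),
    ("ex:exceptions_table", "prov:wasDerivedFrom", "ex:raw_events"),
    # Business lineage: who is responsible for the data and the process.
    ("ex:quality_team", "rdf:type", "prov:Agent"),
    ("ex:data_engineering", "rdf:type", "prov:Agent"),
    ("ex:exceptions_table", "prov:wasAttributedTo", "ex:quality_team"),
    ("ex:nightly_load", "prov:wasAssociatedWith", "ex:data_engineering"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object recorded for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# One graph answers both kinds of question.
made_from = objects("ex:exceptions_table", "prov:wasDerivedFrom")
owned_by = objects("ex:exceptions_table", "prov:wasAttributedTo")
```

Because both kinds of statement live in the same list, the question "what was this table made from?" and the question "who answers for it?" are the same lookup with a different predicate.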
While both of the most commonly used types of lineage help users understand their data’s origin, and build understanding and trust for its authenticity, veracity, and context, each is designed to serve a different audience:
- Business lineage: Provides a summary view of how data flows from its source to where it is consumed. It covers the semantics: What does the data mean? Who owns it? Etc.
- Technical lineage: Provides a much more granular, zoomed-in view of infrastructure and data transformations for data engineers and other technical users: How is this data made?
Say, for example, you are collecting data about exceptions in a manufacturing process. A typical technical lineage flow might look like this:
Here we have the software processes (more specifically, the data processes) that collect manufacturing events and ultimately create a dashboard about exceptions. This view shows the artifact names and the systems where these artifacts reside. However, a lot of context is missing.
Technical lineage is used by the people charged with keeping the data flowing, such as data engineers, data architects, and analytics engineers — the data producers. Lineage helps them observe and maintain an efficient pipeline of accurate data.
Business lineage is the realm of the data consumers: analysts, data scientists, and business users who need context around data in order to understand it and gain insights and knowledge. For these stakeholders, lineage confirms the data they’re using to make business decisions is sound. If we layer business context into the above diagram, we end up with something that looks like the following:
Here we see that ideas exist, like manual overrides for alerts, which may change the shape of the data. We see a chain of custody and ownership. We see who depends on these artifacts downstream. Overall, as a consumer of this data, I now have a much more complete understanding of its overall context and who might be impacted if I start using it.
(Note that there will inevitably be times when data consumers need to see and understand technical lineage, and data producers need to see and understand business lineage. Data people also need to be business literate.)
Lineage Use Cases
Now that we’ve identified the primary personas associated with the two principal types of lineage, let’s talk use cases: when does each type of lineage come into play, and for what purpose?
The purpose of business lineage is to aid in data discovery and establish trust that data is accurate. When you’re analyzing data, business lineage provides context and reveals relationships: it tells you who’s responsible for specific data, who created the data, what parts of your enterprise are associated with the data, and so on. For example, if your shipping department has questions about the address they’ve been provided for an order, tracing the lineage of that data will show them that the address was entered by your sales team. Shipping then knows to contact sales to confirm the accuracy of the address.
The purpose of technical lineage is twofold. It allows data teams to predict the impact on downstream data flows if an upstream change is made. At the same time, it enables easy identification of an upstream problem when downstream data is “out of band,” disconnected from the overall data flow. For example, if a critical sales dashboard stops populating, your data team can trace the flow back from the dashboard to find the error and correct it.
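Both uses amount to walking the same lineage graph in opposite directions. The sketch below assumes lineage is available as a list of (upstream, downstream) pairs. The artifact names and the `downstream` and `upstream` helpers are hypothetical, and real lineage tooling would expose its own query interface.

```python
from collections import deque

# Hypothetical edges: each pair is (upstream artifact, downstream artifact).
EDGES = [
    ("crm.orders", "warehouse.orders_clean"),
    ("warehouse.orders_clean", "warehouse.sales_daily"),
    ("warehouse.sales_daily", "dashboard.sales"),
    ("warehouse.orders_clean", "report.returns"),
]


def _walk(start, neighbors):
    """Breadth-first walk from start; returns reached nodes in visit order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def downstream(node, edges=EDGES):
    """Impact analysis: everything that could be affected by changing node."""
    children = {}
    for up, down in edges:
        children.setdefault(up, []).append(down)
    return _walk(node, children)


def upstream(node, edges=EDGES):
    """Root-cause tracing: everything that feeds node, nearest first."""
    parents = {}
    for up, down in edges:
        parents.setdefault(down, []).append(up)
    return _walk(node, parents)
```

Calling `downstream("warehouse.orders_clean")` lists what a change to that table could touch, while `upstream("dashboard.sales")` lists the places to check, nearest first, when the dashboard stops populating.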
(In both of the above technical lineage use cases, the associated business lineage enables teams to identify and alert stakeholders when changes are being made or errors need to be repaired.)
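As a rough illustration of that pairing, the following sketch joins a technical traversal with a business ownership lookup to work out whom to notify about a change. All artifact names, team names, and the `stakeholders_to_alert` function are hypothetical.

```python
# Hypothetical technical lineage: upstream artifact -> downstream artifacts.
DOWNSTREAM = {
    "warehouse.orders_clean": ["warehouse.sales_daily"],
    "warehouse.sales_daily": ["dashboard.sales"],
}

# Hypothetical business lineage: artifact -> owning team.
OWNERS = {
    "warehouse.orders_clean": "data-engineering",
    "warehouse.sales_daily": "analytics",
    "dashboard.sales": "sales-ops",
}


def stakeholders_to_alert(changed, downstream=DOWNSTREAM, owners=OWNERS):
    """Return the sorted owners of every artifact downstream of `changed`."""
    affected, stack = set(), [changed]
    while stack:
        node = stack.pop()
        for child in downstream.get(node, []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    # Artifacts with no recorded owner are skipped rather than guessed at.
    return sorted({owners[a] for a in affected if a in owners})


teams = stakeholders_to_alert("warehouse.orders_clean")
```

The traversal alone only says which artifacts are affected. The ownership lookup is what turns that list into people who can be told before the change ships.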
Good Lineage Requires Both Business and Technical
Now that you understand business and technical lineage, you can see why both are critically important to any lineage program: business lineage for context and ownership, and technical lineage to ensure your data is flowing as it should. Additionally, you might recognize lineage as a bridge to foster tighter collaboration between your data producers and data consumers, an outcome that would further grow your data-driven culture.
- Business Lineage vs. Technical Lineage; What’s the Difference? - January 24, 2023