Solutions Review highlights the most common data lineage use cases you need to know about so you can select the best software.
Today’s data systems are so complex that even asking a simple question can be complicated unless you have the right augmented data management tools at your disposal. Augmented data management uses maturing AI and machine learning capabilities to make important information management tasks what analyst house Gartner, Inc. calls “self-configuring and self-tuning.” The increasingly complex nature of modern data stacks, combined with a shortage of engineering talent, limits organizations’ ability to adapt to changes in real time, increases the risk of data incidents, and can lead to regulatory compliance headaches.
Data lineage was traditionally used to trace the journey data takes through an organization’s entire collection of data processing systems. It started as a simple way to describe that journey, but it has since evolved into the main tool organizations use to map, understand, and gain insights into their data pipelines. There are several very different views of data lineage and several related approaches to discovering it, each with its own advantages and disadvantages.
With these things in mind, our editors have compiled this list of the most common data lineage use cases you need to know. For an even deeper breakdown of data lineage, we recommend this short MANTA guide, which puts it into the broader perspective of current data management trends.
Common Data Lineage Use Cases
Incident Prevention via Impact Analysis
Incident response refers to the actions an enterprise takes after a hacker or insider threat begins a cyber-attack or data breach. Often, this involves a security operations center’s (SOC) incident response team taking the steps necessary to mitigate and remove the threat. This may include threat hunting (to find the threat or any lingering malicious code), as well as alerting relevant departments (such as legal) of the breach, locking down sensitive databases, tracking the progress and history of the threat, and more.
According to MANTA: “Organizations with better incident prevention strategies achieve higher productivity and significant cost reductions. One key technique of the most successful companies is the extensive use of impact analysis for all planned changes early in the process in the design phase.”
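To make the idea of impact analysis concrete, here is a minimal sketch in Python. It assumes lineage is available as a simple mapping from each data object to the objects it feeds; the object names and the `downstream_impact` helper are hypothetical and are not taken from MANTA or any specific product.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the objects listed as its value.
LINEAGE = {
    "crm.customers": ["staging.customers"],
    "staging.customers": ["warehouse.dim_customer"],
    "warehouse.dim_customer": ["reports.churn", "reports.revenue"],
    "erp.orders": ["warehouse.fact_orders"],
    "warehouse.fact_orders": ["reports.revenue"],
}


def downstream_impact(graph, changed):
    """Return every object reachable downstream of `changed` (breadth-first)."""
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Which objects could break if staging.customers is changed?
print(downstream_impact(LINEAGE, "staging.customers"))
```

Running this kind of check during the design phase lists the tables and reports that a planned change could touch before anything is deployed.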
Data Pipeline Observability
Data lineage expands the impact of traditional data quality and observability tools by focusing on the data infrastructure, not just the data itself. Although questionable data is a common issue, most data incidents don’t originate at the source of that data. In fact, most issues arise from data pipeline problems, such as an API call no longer matching a database column type after a recent change in the system. It’s not just a cost issue, either: dedicated data lineage software also enables organizations to trace issues back to the source with greater speed and accuracy.
According to MANTA: “Thanks to data lineage, these incidents can be prevented in the design phase (see the previous section) or identified in the implementation and testing phase to achieve higher productivity and reduce maintenance costs.”
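As a rough illustration of tracing an issue back to its source, the Python sketch below walks column-level lineage backwards from a report column and flags any hop where the declared types disagree. The edges, type names, and helper functions are all hypothetical, and the sketch assumes each column has a single upstream source and that the lineage contains no cycles.

```python
# Hypothetical column-level lineage as (source column, target column) edges.
EDGES = [
    ("api.orders.amount", "staging.orders.amount"),
    ("staging.orders.amount", "warehouse.fact_orders.amount"),
    ("warehouse.fact_orders.amount", "reports.revenue.total"),
]

# Hypothetical declared type of each column.
TYPES = {
    "api.orders.amount": "string",
    "staging.orders.amount": "decimal",
    "warehouse.fact_orders.amount": "decimal",
    "reports.revenue.total": "decimal",
}


def upstream_path(edges, column):
    """Walk lineage edges backwards from `column` to its original source.

    Assumes one upstream source per column and no cycles.
    """
    parent = {target: source for source, target in edges}
    path = [column]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def type_mismatches(edges, types):
    """Return the edges whose source and target column types differ."""
    return [(s, t) for s, t in edges if types[s] != types[t]]


print(upstream_path(EDGES, "reports.revenue.total"))
print(type_mismatches(EDGES, TYPES))
```

Here the mismatch sits on the hop from the API to the staging table, several steps away from the report where the problem would first be noticed.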
Regulatory Compliance
The growing number of regulations is putting a strain on enterprises, especially organizations that store sensitive customer data. On the whole, these laws require companies in possession of personal information to manage it in a specific way. Companies must also be able to produce the data, as well as its location, for an audit. Compliance requires mapping and identifying data, understanding data processing and its associated risks, and providing data lineage and impact analysis.
According to MANTA: “The number of regulations that require data lineage has increased rapidly over the past few years, and we can suppose that there are more waiting in line, including BASEL, HIPAA, GDPR, CCPA/CPRA, and CCAR, just to name a few.”
Cloud Migration and Modernization
A recent study by SingleStore found that 52 percent of IT professionals say cloud migration is driving them to consider modernization strategies. More than a fifth of companies stated that they have faced six to seven bottlenecks amid the COVID-19 pandemic. The increase in bottlenecks and the greater focus on modernization through cloud migration pushed 72 percent of IT professionals to consider changing their database services in the past year.
According to MANTA: “A successful strategy is to divide the system into smaller chunks of objects (reports, tables, workflows, etc.), which poses other challenges— how to migrate one part without breaking another, and how do we even know what pieces can be grouped together to minimize the number of external dependencies? Successful organizations use data lineage to complete their migration projects 40% faster with 30% fewer resources.”
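One simple way to think about which pieces can be grouped together is to treat lineage as an undirected graph and split it into connected components, so that no dependency crosses a group boundary. The Python sketch below shows that idea only; it is not MANTA's method, the object names and the `migration_groups` helper are hypothetical, and it assumes every dependency refers to an object in the list.

```python
# Hypothetical objects to migrate and the lineage dependencies between them.
OBJECTS = ["sales_table", "sales_report", "hr_table", "hr_workflow", "audit_log"]
DEPENDENCIES = [
    ("sales_table", "sales_report"),
    ("hr_table", "hr_workflow"),
]


def migration_groups(objects, dependencies):
    """Group objects so that no dependency crosses a group boundary."""
    # Treat each dependency as an undirected link between two objects.
    neighbors = {obj: set() for obj in objects}
    for source, target in dependencies:
        neighbors[source].add(target)
        neighbors[target].add(source)

    groups, seen = [], set()
    for obj in objects:
        if obj in seen:
            continue
        # Depth-first walk collects everything connected to `obj`.
        stack, group = [obj], []
        seen.add(obj)
        while stack:
            node = stack.pop()
            group.append(node)
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        groups.append(sorted(group))
    return groups


for group in migration_groups(OBJECTS, DEPENDENCIES):
    print(group)
```

In real environments most objects end up connected to each other, so a practical approach would need to cut large components along the fewest dependencies instead of relying on fully independent groups.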
Data Virtualization
Data virtualization tools are being deployed by organizations that want to light a fire under their data discovery projects. With modern, distributed analytics solutions becoming the new norm, companies crave the ability to obtain a unified view of their data without having to move it. As an added benefit, users can make real-time changes to data sets without disturbing the data where it physically sits, allowing them to virtually integrate disparate data sources quickly.
According to MANTA: “Data continues to grow and increase in complexity. Many enterprises are consolidating their data from multiple sources in one place or exploring data virtualization technologies that make it appear that the data is in one place.”
Self-Service Data Management
Due to the explosion in demand for data engineers (as a result of the complexity of modern data stacks), self-service data management is quickly becoming a necessity. In addition, few organizations are thrilled about spending big bucks on data engineering talent to do routine or manual tasks like chasing data incidents and assessing the impact of planned changes. Data lineage is increasingly being used to handle these tasks via automation, and the result is real self-service.
According to MANTA: “Armed with the right solution, data scientists and other data users have the power to retrieve up-to-date information about all the details surrounding lineage and data origin on their own whenever they need it. A detailed data lineage map also enables faster on-boarding of new data engineers and allows organizations to hire less experienced people for the role without jeopardizing the stability and reliability of their data environment.”