Data Lake vs. Data Mesh: Trending Data Management Strategies Compared
This is part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, ChaosSearch Founder and CTO Thomas Hazel pits data lake vs. data mesh to compare the different trending data management strategies.
As modern organizations struggle to deal with constantly growing quantities of enterprise data, many are reevaluating their data management strategies to determine the optimal approach for delivering business insights and analytics at scale.
With this goal in mind, most organizations are looking for ways to analyze data without having to spend additional time and resources on moving or transforming it. As a result, we’ve seen data lake and data mesh architectures rise in popularity; these approaches promise to fulfill the accessibility, consistency, data quality, and data governance requirements organizations need to achieve data analytics at scale.
But the question remains: which of these two solutions is better? Whether it keeps data distributed in a data mesh or centralizes it within a data lake, every organization should weigh its own set of criteria to determine the best solution for its business.
Setting the Scene: the Data Lake
The rise of big data, and the challenges it posed for traditional enterprise solutions, inspired James Dixon to coin the term “data lake” in 2010. At their core, the best data lake solutions promise to eliminate data silos by serving as a single landing repository that centralizes, organizes, and protects large amounts of data from multiple sources. A data lake follows a schema-on-read approach, meaning structure is applied when the data is queried rather than when it is written, and it can store structured, semi-structured, and unstructured data, typically on cloud object storage such as Amazon S3.
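To make the schema-on-read idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular data lake product works: the sample records, the `ORDER_SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical names invented for this example.

```python
import json

# Raw records land in the "lake" exactly as produced; nothing is validated on write.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99", "ts": "2023-01-05"}',
    '{"user": "bo", "amount": 5}',
    '{"user": "cy", "note": "free-form text, no amount"}',
]

# Hypothetical read-time schema: field name -> (type caster, default value).
ORDER_SCHEMA = {
    "user": (str, ""),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse each raw line and project it onto the schema at read time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            # Cast fields that are present; fall back to the default otherwise.
            row[field] = cast(record[field]) if field in record else default
        rows.append(row)
    return rows


orders = read_with_schema(RAW_EVENTS, ORDER_SCHEMA)
total = sum(row["amount"] for row in orders)
```

The same raw lines could be read again later with a different schema, for example one that keeps `ts` or `note`, without rewriting anything that was stored.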
These flexible storage solutions have become increasingly popular among modern enterprises, but one common misconception is that they inherently include analytic features. In order to perform indexing, transformation, querying, and analytics, the data lake must be connected to a combination of other cloud-based services and software tools. In a typical data lake architecture, a self-service data analytics engine will sit on top of a cloud-based data repository. That’s when an organization can realize the true benefits of a data lake and act on the full value of their data resources.
The Rise of the Data Mesh
Until recently, data warehouses and data lakes represented the two leading solutions for enterprise data management. But a new approach has risen over the last year: the concept of a “data mesh.” It is quickly becoming one of the industry’s most-discussed buzzwords. Thoughtworks defines a data mesh as “a shift in a modern distributed architecture that applies platform thinking to create self-serve data infrastructure, treating data as the product.”
This type of architecture supports the idea of distributed data, where all data is accessible to anyone authorized to access it. One key differentiator between a data lake and a data mesh is that in a data mesh, data does not have to be consolidated into a single data lake and can remain within different databases. Because of this, a data mesh architecture connects various data sources, including data lakes, into a coherent infrastructure.
Data Lake vs. Data Mesh: What’s the Difference?
Data lakes have come a long way since the failure of the first instances built on Hadoop. While many industry pundits still remember the “data swamp” byproduct, there’s been tremendous innovation in this space since then. New data lakes remove constraints inherent to traditional approaches to storage, infrastructure, and access for analytics purposes. Modern data lakes are cloud-native and can be activated to index multiple data types and make this data easily accessible to diverse stakeholders across the business.
A query that spans a data mesh is only as fast as its slowest underlying source, because results from every source must arrive before they can be combined. For organizations that store data in multiple silos but are looking for more efficient queries, it would make more sense to leverage a data lake platform for analytics within the existing data mesh architecture.
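The slowest-source effect can be simulated in a few lines of Python. This is only a sketch: the source names and latencies below are made up, and `query_source` just sleeps rather than calling a real database.

```python
import asyncio
import time

# Hypothetical domain sources with simulated response times in seconds.
SOURCE_LATENCY = {"sales_db": 0.1, "crm_db": 0.2, "logs_lake": 0.3}


async def query_source(name, latency):
    """Stand-in for a remote query; sleeps to simulate the source's latency."""
    await asyncio.sleep(latency)
    return name, f"rows from {name}"


async def federated_query(sources):
    """Fan out to every source concurrently and wait for all of them to finish."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(query_source(name, latency) for name, latency in sources.items())
    )
    elapsed = time.perf_counter() - start
    return dict(results), elapsed


results, elapsed = asyncio.run(federated_query(SOURCE_LATENCY))
slowest = max(SOURCE_LATENCY.values())
```

Even with every source queried in parallel, `elapsed` comes out close to `slowest` rather than to the fastest or the average source, since the combined result cannot be returned until the last source responds.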
Some solutions can mitigate these known challenges of data mesh architectures. For example, cloud data platforms can virtually publish logical data views to query within the data lake without complex extract, transform, and load (ETL) pipelines. This is one approach to improving the democratization of data within an organization without relying on data scientists or data engineers as intermediaries.
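As a rough illustration of a logical view, the sketch below uses an in-memory SQLite database as a stand-in for the lake. That substitution is an assumption made for the sake of a runnable example; the table, view, and column names are hypothetical, and a real cloud data platform would expose its own mechanism for publishing views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw data as it landed, with no transformation step applied.
conn.execute("CREATE TABLE raw_events (service TEXT, level TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        ("checkout", "ERROR", "payment timeout"),
        ("checkout", "INFO", "order placed"),
        ("search", "ERROR", "index unavailable"),
    ],
)

# A view stores only the query definition; no rows are copied or moved.
conn.execute(
    "CREATE VIEW checkout_errors AS "
    "SELECT message FROM raw_events "
    "WHERE service = 'checkout' AND level = 'ERROR'"
)

errors = [row[0] for row in conn.execute("SELECT message FROM checkout_errors")]

# Newly landed raw data shows up through the view without any pipeline run.
conn.execute("INSERT INTO raw_events VALUES ('checkout', 'ERROR', 'card declined')")
errors_after = [row[0] for row in conn.execute("SELECT message FROM checkout_errors")]
```

Because the view is just a saved query over the raw table, consumers get a curated slice of the data while the underlying records stay where they are.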
Looking Ahead
Because data lakes and data mesh architectures take different approaches to problems such as data integration, the two strategies can be viewed as complementary rather than as an either-or choice. However, while everyone likes the vision of ubiquitous data, the truth is that many companies don’t realize what it takes to get there. Data mesh and data democratization are one and the same: you can’t have a decentralized data architecture when there are gatekeepers limiting who has access. Therefore, to achieve the goal of a distributed data mesh, companies first need to enable the free flow of data across the entire organization, which is an intrinsic byproduct of data lakes.
There isn’t a one-size-fits-all solution to becoming a data-driven organization. A data mesh would be helpful for organizations that store their data across multiple databases, while those looking for a solution that enables queries without data movement might benefit more from a data lake. Ultimately, the goal for most organizations leveraging one of these data management solutions is to have a unified platform for analytics that can provide powerful insights without needing complex support behind the scenes from intermediaries. As more organizations develop new approaches to democratize access to data, this space will be an important area to watch in the coming years.