Solutions Review’s Expert Insights Series is a collection of contributed articles written by industry experts in enterprise software categories. In this feature, data.world CTO and co-founder Bryon Jacob offers commentary on constructing a data mesh roadmap including two essential keys to consider.
A lot has been written about data mesh, but even for those who are well-versed in what it is and its myriad benefits to business, there’s still a lot of confusion about how to implement it. Luckily, there are tangible steps and specific technologies that simplify the process and enable a truly product-focused approach.
Starting with the Right Questions
Building your own data mesh starts with taking stock of what you have. What data assets already exist in your enterprise? How are those assets being managed today? What are the key business entities and relationships for your business, and how do they break down across the various domains? What are the regulatory constraints and key dimensions of interoperability that need to be managed across the enterprise, and not left to individual domains to define?
The answers to all these questions should live in a data catalog – the starting point for your data mesh journey. A data catalog captures the state of your existing data assets, serves as the agile roadmap on the journey to a data mesh where the lines of ownership and governance are clear, and becomes the place where data products are discovered and consumed.
Data Mesh Roadmap
Building Your Foundation
Data mesh is a socio-technical system, meaning it involves people and processes at least as much as it does technology. At the foundation of a data mesh is the self-service data platform that is shared across domains. Within that part of your data mesh, you must have a basic technology toolkit for supporting a self-service data platform: a data catalog that sits across your data assets and analytical tools and represents an inventory of those assets and the knowledge of how they interconnect, a clear “gold standard” for publishing new data products, and data virtualization that lets existing systems participate in the mesh without being moved.
But in order to get started, you must have a solid foundational knowledge of your world as it exists today. You need a comprehensive inventory of existing data assets and as much of the knowledge about how things are organized and managed as you can gather. A data catalog captures the knowledge about your data assets, including what assets exist, how they relate to one another, and how they relate to people, processes, and policies within your enterprise.

The data catalog is where you start to draw the lines that define the key data domains within your organization so that you can define ownership and give those domain owners a way to publish their catalog of data products. It’s also where consumers will come to discover, explore, and gain access to data. The secret of a good data catalog is that it’s an iterative and agile data asset. It’s a knowledge graph that can visually represent your data landscape as it exists today, and it’s the place to represent and operationalize the roadmap for where you’re going tomorrow.
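The catalog-as-knowledge-graph idea above can be sketched in a few lines of Python. This is a deliberately minimal, hypothetical model (the class and predicate names are illustrative, not any vendor's API): nodes stand for assets, domains, owners, and policies, and edges capture how they relate.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogNode:
    """One entity in the catalog: an asset, domain, owner, or policy."""
    id: str
    kind: str
    properties: dict = field(default_factory=dict)

@dataclass
class Catalog:
    """A tiny knowledge graph: nodes plus (subject, predicate, object) edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add(self, node: CatalogNode) -> None:
        self.nodes[node.id] = node

    def relate(self, subject: str, predicate: str, obj: str) -> None:
        self.edges.append((subject, predicate, obj))

    def neighbors(self, node_id: str, predicate: str) -> list:
        """Follow edges of one kind out of a node, e.g. asset -> domain."""
        return [o for s, p, o in self.edges if s == node_id and p == predicate]

# Illustrative usage: an "orders" asset owned by the "sales" domain.
catalog = Catalog()
catalog.add(CatalogNode("orders", "asset", {"system": "warehouse"}))
catalog.add(CatalogNode("sales", "domain", {"owner": "sales-team"}))
catalog.relate("orders", "belongs_to", "sales")
```

The useful property of this shape is that the same edge list can answer both today's questions ("which domain owns this asset?") and tomorrow's roadmap questions ("which assets still have no domain?") without changing the model.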
Leveraging Your Data Sources & Analytical Tools
It is also critical to address the collection of data sources and analytical tools that make up your self-service data platform. In all likelihood, you already have many of these technologies in place: your data warehouses, data lakes, SaaS systems, and analytical tools like Tableau and Power BI. These form the backbone of the shared data platform, and you can fit them into the system you’re designing using the data mesh architectural pattern. You may be tempted to leave out technologies that need work themselves. However, if you’re building a data mesh architecture for an enterprise with any existing history, there are data systems already in place, and it’s vital that you make them a part of your data mesh. Even if the intent is to completely replace them eventually, they need to be a part of the system as long as they’re actively serving the business.
At the same time, you don’t want to encourage domain owners to expand the footprint of systems that you hope to eventually sunset. The self-service data platform should provide the basic building blocks for producing and distributing new data products, along with “gold standard” technologies and clear best practices. These can be systems you already use, but by using them in your data mesh, you give the enterprise a path to incrementally deploy more data products onto the preferred technologies. You also make the overall system simpler, easier to consume, and cheaper to manage.
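One concrete way to enforce a “gold standard” is a publishing gate: a data product only lands in the shared catalog if it meets a minimal contract. The sketch below is an assumption about what such a contract might contain (the field names and `publish` helper are hypothetical, not a real platform API).

```python
# Minimal contract every data product must satisfy before publishing.
# The required fields here are illustrative; a real standard would also
# cover SLAs, classification, and access policy.
REQUIRED_FIELDS = {"name", "domain", "owner", "schema", "description"}

def validate_product(product: dict) -> list:
    """Return the sorted list of missing contract fields (empty = publishable)."""
    return sorted(REQUIRED_FIELDS - product.keys())

def publish(registry: dict, product: dict) -> None:
    """Add a product to the shared registry, rejecting incomplete contracts."""
    missing = validate_product(product)
    if missing:
        raise ValueError(f"cannot publish: missing {missing}")
    registry[product["name"]] = product

# Illustrative usage: a sales-domain product that passes the gate.
registry = {}
publish(registry, {
    "name": "daily_orders",
    "domain": "sales",
    "owner": "sales-team",
    "schema": {"order_id": "string", "amount": "decimal"},
    "description": "One row per confirmed order, refreshed daily.",
})
```

The design point is that the gate is cheap for domain teams that follow the preferred path and explicit about what is missing for those that don't, which is how the standard gets adopted incrementally rather than mandated all at once.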
Even with clear standards in place, existing data assets can often be too large, too sensitive, or too tightly interwoven with a history of existing applications to consider moving them into a new system for storage or computation. In these cases, data virtualization is a key component of your data mesh because it puts a new interface on top of an existing data system, enabling it to integrate more easily with the rest of your data mesh. Data virtualization can be a powerful way to implement parts of your data mesh as a facade: the data stays in place and continues to serve other applications, while participating as a fully integrated part of the data mesh.
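The facade approach can be sketched with a classic adapter pattern. Assuming a uniform interface the mesh expects from every data product (the `DataProduct`, `LegacyOrdersSystem`, and `VirtualizedOrders` names below are invented for illustration), the legacy system stays untouched while a thin wrapper speaks the mesh's language:

```python
from abc import ABC, abstractmethod

class DataProduct(ABC):
    """Uniform interface the mesh expects from every data product."""
    @abstractmethod
    def read(self, query: str) -> list: ...

class LegacyOrdersSystem:
    """Stands in for an existing system that can't (yet) be migrated."""
    def fetch_rows(self, table: str) -> list:
        # Stubbed data; a real system would hit a database or API here.
        return [{"order_id": "A1", "amount": 10}]

class VirtualizedOrders(DataProduct):
    """Facade: leaves data in the legacy system but exposes the mesh interface."""
    def __init__(self, legacy: LegacyOrdersSystem):
        self.legacy = legacy

    def read(self, query: str) -> list:
        # Translate the mesh-level request into a legacy-system call.
        return self.legacy.fetch_rows(query)

# Consumers see only the DataProduct interface, not the legacy system.
product: DataProduct = VirtualizedOrders(LegacyOrdersSystem())
rows = product.read("orders")
```

If the legacy system is later replaced, only the facade changes; every consumer built against the `DataProduct` interface keeps working, which is exactly the migration path the article describes.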
Ultimately, the creation of a data mesh should come full circle – it starts with a foundational data catalog that enables an inventory of existing data assets and ends with empowering the owners of data domains to publish their data products into the catalog. Then, the goal of empowering data consumers to discover and utilize those products can run at full speed, demonstrating return on investment for the enterprise through data-backed decision-making.