
Gartner D&A Summit 2023: The Gartner View of the Data Lake & Lakehouse


Independent data and analytics analyst Philip Russom, PhD, offers commentary on the Gartner view of the data lake and lakehouse from the recent Gartner Data & Analytics Summit 2023.

When the Gartner Data & Analytics Summit met recently (March 19-22, 2023, in Orlando, Florida), I was there attending sessions presented by Gartner analysts, vendor sponsors, and other experts. The summit covered most aspects of data and analytics (D&A), plus their best practices, tools, technologies, and team structures. One topic that came up in multiple sessions was the data lake and its variation, the data lakehouse. Allow me to summarize the Gartner view of data lakes and lakehouses, as presented at the Gartner D&A Summit 2023.

Background for Data Lakes and Lakehouses

Data Lakes and Data Lakehouses are drivers of innovation for databases and storage. For example, many D&A programs deploy a data warehouse and a data lake side-by-side, because they are complementary. The warehouse is mostly relational data for business reporting and keeping a history of corporate performance, whereas the lake is mostly about data science and advanced analytics, using data of any structure or file format. The Gartner view is that the warehouse and lake are now converging into the data lakehouse, which is a single data architecture that combines and unifies the architectures and capabilities of lakes and warehouses. The point of the data lakehouse is to enable greater agility for all analytics, but with less data redundancy, a simpler architecture, and a more consistent view of semantics for all analytics data.
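
To make the lakehouse idea concrete, here is a minimal sketch in PySpark with Delta Lake, one common open table format (Apache Iceberg and Apache Hudi are alternatives). The bucket paths, table, and delta-spark setup are my own illustrative assumptions, not details from the summit sessions.

    # Minimal lakehouse sketch: warehouse-style SQL over lake-style object storage.
    # Assumes the delta-spark package is installed and a hypothetical S3 bucket.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("lakehouse-sketch")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Lake side: land semi-structured files in object storage.
    raw = spark.read.json("s3://example-lake/raw/events/")

    # Lakehouse side: write to an ACID table format that SQL tools can query.
    raw.write.format("delta").mode("append").save("s3://example-lake/events")

    # Warehouse side: relational queries over the same storage, no second copy.
    spark.sql(
        "SELECT event_type, count(*) AS n "
        "FROM delta.`s3://example-lake/events` GROUP BY event_type"
    ).show()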

Avoiding Data Lake Failures

Over the fifteen years of the data lake’s existence, Gartner analysts have seen many failures of data lake implementations among its user clients and other end-user organizations. Much-needed guidance for users hoping to succeed with a data lake was offered in the summit session “Avoid Data Lake Failures by Addressing Modern Lake Requirements,” presented by Donald Feinberg, a Distinguished VP Analyst at Gartner Inc.

The primary premise of the presentation is that organizations fail with a data lake when they build it the way early lakes were designed and managed. Instead, organizations need to understand their business requirements and adopt the technologies appropriate to today's data lakes.

For example, the earliest data lakes were simple and raw, because most were built for data scientists who had minimal data management requirements. Most early lakes had little or no support for:

  • Use cases besides data science
  • The relational paradigm
  • Metadata and other data semantics
  • Governance or curation
  • Coherent and repeatable data flows (most data integration was ad hoc)
  • Data zones to organize lake data (each lake was one big sandbox)
  • ACID properties
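
Two of those gaps, data zones and ACID properties, are easy to illustrate. The sketch below shows a zoned lake layout and a transactional upsert using Delta Lake's MERGE; the paths and table names are hypothetical, and it reuses the `spark` session and delta-spark setup from the earlier sketch.

    # Hypothetical zoned layout, instead of "one big sandbox."
    ZONES = {
        "raw":     "s3://example-lake/raw",      # data as it arrived
        "curated": "s3://example-lake/curated",  # cleaned, schema-enforced
        "consume": "s3://example-lake/consume",  # modeled for specific use cases
    }

    # ACID upsert into the curated zone; readers never see a half-applied batch.
    from delta.tables import DeltaTable

    updates = spark.read.parquet(ZONES["raw"] + "/customers_batch/")  # assumed input
    target = DeltaTable.forPath(spark, ZONES["curated"] + "/customers")
    (
        target.alias("t")
        .merge(updates.alias("u"), "t.customer_id = u.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )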

Furthermore, most early data lakes were deployed on Hadoop, on premises. At that time, Hadoop was one of the few data platforms that provided the open-ended and unstructured data environment (at massive scale) that data scientists wanted.

Since then, two trends changed the data lake forever. First, data lakes evolved to support many more use cases beyond data science, resulting in a multi-purpose enterprise data lake. Second, cloud-native storage and data management evolved into data platforms with elastic scale.

Today, almost all data lakes are deployed on object storage on clouds, with the storage managed via cloud-native multi-model database management systems (DBMSs). Indeed, the cloud has replaced Hadoop as the preferred data platform for data scientists and many other user constituencies. And this change has greatly reduced the likelihood of data lake failures. Even so, a few legacy data lakes still linger on Hadoop.
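
As a small illustration of how cloud object storage is consumed today, the sketch below reads a Parquet dataset directly from an S3 bucket with PyArrow; the bucket, dataset, and column names are assumptions for illustration.

    # Query files in cloud object storage directly; no Hadoop cluster required.
    # Assumes pyarrow built with S3 support and a hypothetical Parquet dataset.
    import pyarrow.dataset as ds

    lake = ds.dataset("s3://example-lake/curated/orders/", format="parquet")
    table = lake.to_table(
        columns=["order_id", "region", "amount"],   # column pruning
        filter=ds.field("region") == "northeast",   # predicate pushdown
    )
    print(table.num_rows)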

Presenter Donald Feinberg shared a humorous observation: “According to Gartner user clients, Hadoop adoption has dropped to almost nothing in new implementations of data lakes. Does that mean that Hadoop is dead? Well, that’s one way to put it!”

The secondary premise of this presentation is that the so-called “data swamp” is an all-too-common manifestation of a failed data lake. A swamp results when a data lake lacks documentation and governance.

  • Documentation usually takes the form of metadata and other data semantics; these should be required of all data in the lake.
  • Governance may take the form of curation, where a data curator decides which data will be allowed into the lake, thereby reducing redundant data, minimizing data dumping, and keeping the lake focused on domains and use cases that are high priorities.
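
To show what such curation can look like in practice, here is a minimal, hypothetical admission check that refuses data lacking the required semantics. In a real deployment this policy would live in a data catalog or ingestion pipeline rather than a standalone function.

    # Hypothetical curation gate: no documentation, no entry into the lake.
    REQUIRED_METADATA = {"owner", "description", "domain", "refresh_schedule"}

    def admit_to_lake(dataset_name: str, metadata: dict) -> None:
        """Reject candidate datasets that lack required documentation."""
        missing = REQUIRED_METADATA - metadata.keys()
        if missing:
            raise ValueError(
                f"{dataset_name} rejected; missing metadata: {sorted(missing)}"
            )
        # ...otherwise proceed to land the data in the appropriate zone...

    # Usage: this dataset is rejected because 'owner' and 'domain' are missing.
    try:
        admit_to_lake("web_clickstream",
                      {"description": "raw click events",
                       "refresh_schedule": "hourly"})
    except ValueError as err:
        print(err)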

In short, a data lake is prone to failure when (1) users design and manage it like an old-fashioned Hadoop-based data science implementation and (2) a lack of documentation and governance turns the data lake into a data swamp.

Lakes, Warehouses, and Lakehouses, Oh My!

At the beginning of his presentation “Data Lakes, Data Warehouses, Data Hubs and Now Lakehouses: What Are They and How to Choose,” Donald Feinberg made another humorous observation: “Look at how packed this large room is! Why have so many of you shown up today? It’s because most people still don’t understand the differences among data warehouses, lakes, and hubs. And now it’s worse, because data lakehouses have arrived!”

To cure the confusion, Feinberg spent much of his time (and many slides) comparing warehouses, lakes, lakehouses, and hubs, showing how the capabilities and repository models of each map to specific business use cases and business value. For a summary, see Figure 1 below:

Figure 1. What’s the Difference Between Data Lakes, Data Warehouses, Lakehouses and Data Hubs?

Near the end of the session, the presenter shared an interesting insight: deciding whether to use a data lake, data warehouse, data lakehouse, or data hub is rarely an “either/or” decision. For many data-driven solutions, a user organization may need two or more of these; in some cases, an organization may deploy all four (as illustrated in Figure 2). For example, when an organization has many use cases to support, it may implement a data lake for data science, a data warehouse for business reports and dashboards, a data hub for controlled data-product distribution, and a data lakehouse for analytics that require data from both lake and warehouse.

When multiple implementations are deployed, they are usually “unified” in multiple ways. For example, data flows may synchronize data across implementations; metadata and other semantics may represent data managed in all implementations; and data views can make data distributed across the implementations look as if it is in one consolidated dataset.
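
A data view is the simplest of these unification mechanisms to sketch. Assuming the hypothetical tables below exist in a shared catalog, a single view can make warehouse and lake data read as one consolidated dataset:

    # One logical dataset over two physical stores (hypothetical table names).
    spark.sql("""
        CREATE OR REPLACE VIEW all_orders AS
        SELECT order_id, amount, 'warehouse' AS source
        FROM warehouse.orders          -- curated, relational history
        UNION ALL
        SELECT order_id, amount, 'lake' AS source
        FROM lake.orders_landing       -- newly arrived, lightly processed
    """)
    spark.sql("SELECT source, sum(amount) FROM all_orders GROUP BY source").show()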

Figure 2. Hubs, Lakes and Warehouses Work Together — They Aren’t Exclusive Choices

 

© 2023 Gartner, Inc. and/or its affiliates. All rights reserved. Gartner is a registered trademark of Gartner, Inc. and its affiliates.

The Data Lakehouse the USPS Built

At the Gartner D&A Summit, one of the best sessions from a technical end-user was presented by Ben Joseph, the chief data officer (CDO) for the Office of Inspector General (OIG) within the United States Postal Service (USPS). His presentation was titled “More with Less: How the USPS OIG Delivers Mission Outcomes with Databricks Lakehouse.”

Ben Joseph and his team needed to modernize their legacy data analytics stack to analyze the volume of streaming data generated by 128 billion pieces of mail across 161 million delivery points every year. The analyses focus on detecting and preventing fraud, waste, and abuse (FWA), which is part of the larger mission of assuring efficiency, accountability, and integrity across the USPS. The biggest barrier to achieving these goals was their legacy analytic platform, which suffered from limited scale, heavy maintenance, high cost, and weak support for advanced analytics.

To enable their analytic mission, CDO Joseph and his team decided to deploy a data lakehouse on a modern cloud platform. The cloud-based lakehouse makes sense, given that they needed an integrated mixture of big data analytics (lake) and performance metrics (warehouse).
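
The session did not walk through code, but a lakehouse ingestion path for this kind of workload typically looks like the sketch below: a Spark Structured Streaming job landing event data in a Delta table, from which both warehouse-style SQL metrics and lake-style data science jobs can read. The Kafka broker, topic, and storage paths are hypothetical, not USPS OIG specifics, and the sketch assumes the Kafka connector and Delta Lake are available.

    # Hypothetical streaming ingestion of mail-scan events into a lakehouse table.
    scans = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
        .option("subscribe", "mail-scan-events")            # assumed topic
        .load()
    )

    # Land raw events in a Delta table; SQL dashboards and data science
    # workloads can both query it as it fills.
    (
        scans.writeStream.format("delta")
        .option("checkpointLocation", "s3://oig-example/_chk/scans")
        .start("s3://oig-example/raw/scans")
    )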

The data lakehouse platform that USPS OIG selected gave them:

  • Unified data platform for lake, warehouse, and data integration programming
  • Centralized analytics
  • Workforce productivity
  • Ease of data operations
  • Feature-rich and high-performance support for SQL and Python
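
That last point, one engine serving both SQL and Python, is worth a small illustration. Assuming a hypothetical governed table of model-scored scan events, an analyst can mix the two languages freely:

    # SQL for the warehouse-style aggregation...
    suspicious = spark.sql("""
        SELECT route_id, count(*) AS flagged
        FROM scans_scored               -- hypothetical table of model output
        WHERE anomaly_score > 0.9
        GROUP BY route_id
        ORDER BY flagged DESC
    """)

    # ...then Python (pandas) for downstream analysis on the same result.
    top_routes = suspicious.limit(100).toPandas()
    print(top_routes.head())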

“Our data lake today is managing between 7 and 8 terabytes of data. But that data is only for oversight purposes,” said presenter Ben Joseph. “As we expand to other use cases, we anticipate that the data volume will increase substantially.”

After finishing his description of work done so far, CDO Ben Joseph looked into his crystal ball and said: “Now that our data lake infrastructure is in place and optimized, it’s time to implement the next phase of our future-state vision. That will be all about self-service analytics based on data from the lake.”

At the end of his presentation, Ben Joseph concluded with a few ‘lessons learned,’ each stated as an actionable recommendation:

  • Start with a data strategy
  • Build foundational components before embarking on advanced analytics
  • Establish a data governance framework
  • Depend on the Golden Triangle – people, process, and technology
