Embracing the Open-Source Data Lakehouse: 5 Things to Know

By Girish Baliga

Solutions Review’s Contributed Content Series is a collection of contributed articles written by thought leaders in enterprise technology. In this feature, Presto Foundation Chair Girish Baliga, in collaboration with IBM VP of Hybrid Data Management Vikram Murali, offers commentary on embracing the data lakehouse through open source.

As two active members of an open-source data community, we are often asked where we see the future of data going. Right now, we are witnessing a significant shift in the data landscape, with companies increasingly embracing the data lakehouse architecture. This is no longer just an idea or theory: companies are adopting the data lakehouse to power mission-critical workloads and applications at scale. At Uber and IBM respectively, we see the data lakehouse playing a crucial role in our world, whether it’s powering the data analytics behind Uber Eats or scaling AI workloads for all of your data.

While the data lakehouse architecture has many pieces, we see open source playing a big role in driving this transformation. As the volume, velocity, and variety of data continue to grow exponentially, traditional data warehousing approaches no longer deliver the price performance needed to meet evolving demands. The data lakehouse is a powerful paradigm that combines the best of data lakes and data warehouses, offering better price performance while leveraging open technologies and open data formats.

The Importance of Open-Source in the Data Lakehouse Architecture

Open source has become an integral part of the modern technology landscape, revolutionizing the way software is developed, distributed, and maintained. An open-source community fosters collaboration, innovation, and transparency. When it comes to data processing and analytics, the open-source approach provides numerous advantages. It enables organizations to harness the collective expertise of a vast community of developers on a project and to contribute back to that project. Add to that the absence of vendor lock-in and greater flexibility in choosing the technologies that work best for a specific workload, and we can see why this approach has become mainstream over the past year.

Based on our experience with and belief in open source, these are a few of the critical pieces to consider as you move to the data lakehouse.

Open-Source Query Engine

The engine for the lakehouse enables efficient and high-performance data processing and analysis. That engine should be able to provide:

Unified Data Access for ad-hoc queries across various data sources, including traditional data warehouses, data lakes, and streaming platforms
Scalability and Performance through a distributed architecture that enables parallel query execution across a cluster of nodes, allowing organizations to process large-scale data analytics workloads efficiently
Federated Querying for accessing and analyzing data across disparate sources for a unified and consistent experience
Ecosystem Integration so users can connect and query data using familiar SQL interfaces, business intelligence (BI) tools, and programming languages
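
To make the federated querying idea concrete, here is a minimal sketch in which one SQL statement joins tables that live in two separate databases. It uses Python's built-in sqlite3 module purely as a stand-in for a distributed query engine; the schema names (warehouse, lake), the tables, and the sample rows are all hypothetical and chosen only for illustration.

```python
import sqlite3

# Assumption: two independent in-memory SQLite databases stand in for
# separate data sources (a warehouse and a data lake). A real lakehouse
# engine would reach such sources through connectors instead.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

# Hypothetical tables, one per source.
conn.execute("CREATE TABLE warehouse.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE lake.orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO lake.orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)


def revenue_by_customer(connection):
    """Run a single query that joins data from both attached sources."""
    query = """
        SELECT c.name, SUM(o.amount) AS revenue
        FROM lake.orders AS o
        JOIN warehouse.customers AS c ON c.id = o.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return connection.execute(query).fetchall()


rows = revenue_by_customer(conn)
print(rows)
```

The point of the sketch is the shape of the query: the user writes ordinary SQL and addresses each source by a qualified name, while the engine handles where the data actually lives.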

Open Data Formats

As data is generated and stored in various formats, interoperability becomes a crucial factor. Open data formats, such as Apache Parquet and Apache ORC, have gained prominence due to their ability to store and process data efficiently. These columnar storage formats enable high-performance analytics by leveraging compression techniques and predicate pushdown. A query engine’s compatibility with open data formats allows organizations to seamlessly access and analyze data from diverse sources within the data lakehouse, without the need for time-consuming and resource-intensive data transformations.
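
As a rough illustration of why per-chunk statistics make predicate pushdown effective, the toy sketch below stores one column in chunks, keeps a minimum and maximum for each chunk, and skips any chunk whose statistics rule out a match. This is a simplified model written in plain Python, not the actual Parquet or ORC implementation; the ColumnChunk class, the function names, and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ColumnChunk:
    """A slice of one column plus the min/max statistics kept alongside it."""
    values: List[int]
    min_value: int
    max_value: int


def write_chunks(values: List[int], chunk_size: int) -> List[ColumnChunk]:
    """Split a column into fixed-size chunks and record min/max for each."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        part = values[start:start + chunk_size]
        chunks.append(ColumnChunk(part, min(part), max(part)))
    return chunks


def scan_greater_than(chunks: List[ColumnChunk], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and the number of chunks actually read."""
    matches: List[int] = []
    chunks_read = 0
    for chunk in chunks:
        if chunk.max_value <= threshold:
            # The statistics prove that no value in this chunk can match,
            # so the chunk is skipped without reading its values.
            continue
        chunks_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, chunks_read


# Hypothetical column: order amounts 1..100, stored in four chunks of 25.
amounts = list(range(1, 101))
chunks = write_chunks(amounts, chunk_size=25)
matches, chunks_read = scan_greater_than(chunks, threshold=90)
print(f"read {chunks_read} of {len(chunks)} chunks, found {len(matches)} matches")
```

With this sorted sample data, only the last chunk has to be read to answer the filter. On unsorted data the statistics would overlap more and fewer chunks could be skipped, which is one reason data layout matters for columnar formats.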

Open Table Formats

The data lakehouse requires robust table formats to manage metadata, schema evolution, and transactional consistency. These formats should provide schema evolution and time travel queries to adapt to changing data requirements while ensuring data integrity, as well as transactional capabilities and upsert support for large-scale data ingestion.
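
The sketch below is a toy model of how upserts, time travel, and schema evolution can fit together when every write produces a new snapshot. It is an assumption-laden illustration in plain Python, not the design of any particular table format: real formats track data files and metadata rather than copying rows in memory, and the SnapshotTable class and its methods are hypothetical names.

```python
from typing import Dict, List, Optional


class SnapshotTable:
    """Toy table that keeps every committed version of its contents."""

    def __init__(self, key: str):
        self.key = key
        # Snapshot 0 is the empty table; each upsert appends a new snapshot.
        self.snapshots: List[Dict[object, dict]] = [{}]

    def upsert(self, rows: List[dict]) -> int:
        """Insert new rows or replace rows with a matching key.

        Returns the id of the snapshot created by this write.
        """
        # Copy the latest state so that older snapshots stay untouched.
        current = dict(self.snapshots[-1])
        for row in rows:
            current[row[self.key]] = row
        self.snapshots.append(current)
        return len(self.snapshots) - 1

    def read(self, snapshot_id: Optional[int] = None) -> List[dict]:
        """Read the latest snapshot, or an older one for a time travel query."""
        index = -1 if snapshot_id is None else snapshot_id
        state = self.snapshots[index]
        return [state[k] for k in sorted(state)]


table = SnapshotTable(key="id")
v1 = table.upsert([{"id": 1, "city": "Austin"}, {"id": 2, "city": "Boston"}])
# The second write updates id 2, adds id 3, and introduces a new "tier" field,
# which stands in for schema evolution in this simplified model.
v2 = table.upsert([{"id": 2, "city": "Chicago", "tier": "gold"}, {"id": 3, "city": "Denver"}])

print(table.read(v1))  # the table as it was after the first write
print(table.read())    # the latest version of the table
```

Because each write appends a snapshot instead of modifying data in place, readers of an older snapshot are never affected by later writes, which is the property that makes consistent reads and time travel possible in this model.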

The Building Trend Towards a Data Lakehouse Architecture

The data lakehouse architecture represents a convergence of the best features of data lakes and data warehouses. It combines the flexibility and scalability of data lakes with the structured querying and governance of data warehouses. This trend has been fueled by the increasing need for real-time analytics, machine learning, and AI applications, which require unified access to diverse data sources.

As the industry moves towards a data lakehouse architecture, we see open source and open data formats as key drivers in that transformation. Open source plays a vital role by promoting collaboration, accelerating innovation, and ensuring transparency. Organizations can leverage the collective expertise of the open-source community to build scalable and customizable data architectures.

The growing trend toward data lakehouses represents a transformative shift in how organizations approach data processing and analytics. Open-source technologies and open data formats are essential elements that drive this evolution. By embracing these principles and selecting the right components, organizations can unlock the full potential of their data, accelerate insights, and make data-driven decisions effectively. The increasing adoption of the data lakehouse architecture demonstrates the industry’s commitment to the principles of open source and the collaborative spirit of the open-source community.

This article was written by Girish Baliga on August 3, 2023

Girish Baliga

Girish Baliga is Chair of the Presto Foundation, the governing body behind the open-source Presto project, and director of engineering at Uber, where he manages the core platform that powers search for all of Uber’s products. Girish holds a Ph.D. and a master’s degree in computer science, as well as a master’s degree in math, from the University of Illinois Urbana-Champaign.
