3 Incorrect Assumptions About Open-Source Data Warehousing Software

By Alex Morozov
Best Practices,

This is part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, Datafold CTO and Co-Founder Alex Morozov offers three incorrect assumptions about open-source data warehousing software.

SR Premium Content There are many reasons to use open-source software (OSS) in your data stack, specifically, data warehousing and processing. Today, we have at our disposal dozens of mature OSS technologies with active communities, including:

Apache Spark is a Swiss Army knife data processing engine that is also developer-friendly
Trino provides great performance for SQL ETL and analytical queries while abstracting the user from underlying complexity
Druid and ClickHouse offer subsecond query performance for interactive analytics

However, due to the modularity of the OSS ecosystem (the stack is assembled from multiple OSS projects and standards) and distributed ownership of codebase (and of bugs), adopting an open-source technology is often a very different experience than buying a similar proprietary product from a vendor. It seems that many teams default to open-source without properly considering trade-offs and carefully evaluating their assumptions. In this article, I challenge what I consider the top three erroneous assumptions about OSS data warehousing technologies.

Before we proceed, let’s define the scope for “data warehousing tech” as delivering the following use cases:

Ingest and store all analytical data
Execute data transformations (the “T” of “ELT”)
Serve data to consumers (Dashboards, Ad-hoc analysis, Consuming applications (ML, microservices, etc.)

I often hear from enterprise data teams, “We are choosing open-source data warehousing technology because it’s cheaper than proprietary options, to avoid vendor lock-in, and to be able to extend the system if we need to.” There are actually three statements here, so let’s break down the assumptions.

OSS is Cheaper

It’s difficult to compare pricing for data warehousing tech, especially with structurally different pricing models: for example, pay-per-use (BigQuery, Athena) vs. pay-per-infra time (Snowflake, Databricks). Therefore, I suggest a counterexample to this assumption.

To compare pricing, we first need to establish a common denominator in terms of the amount of work and performance that we are buying for a given price. Let’s consider an often-cited Fivetran TPC benchmark that compared Snowflake, a proprietary DWH product, with Trino (formerly known as PrestoSQL), a popular open-source data processing engine, among others. In that benchmark, Trino shows an average query runtime of 14.78 sec vs. 10.74 sec (38 percent difference) for Snowflake, but the medians are roughly the same, so let us assume the performance is comparable.

Here’s a back-of-the envelope cost calculation: Base Costs

Mean TPC query time

Price/hr

Size

Mean cost per query

Snowflake

10.74 sec

$16.00

Large

$0.047

Trino

14.78 sec

$8.02

4x n2-highmem-32

$0.033

At first glance, Trino seems to be ~30 percent cheaper per query. However, in a real production scenario, we need to consider total cost of ownership and, in case of OSS, factor in expenses such as DevOps; products like Trino are sophisticated distributed systems that require proper deployment, monitoring, tuning and maintenance. And while it may seem easy to spin up a cluster with Kubernetes in a couple of clicks, matching Snowflake’s availability SLA of 99.9 percent is a completely different game.

Let’s consider the DevOps cost factor in two primary scenarios: vendor and in-house managed deployments.

Vendor-Managed Trino Costs

For example, the leading Trino vendor, Starburst, charges ~60 percent markup on top of the AWS infrastructure cost, which makes vendor-hosted Trino more expensive than Snowflake.

Mean TPC query time

Price/hr

Size

Mean cost per query

Snowflake

10.74 sec

$16.00

Large

$0.047

Trino via Starburst Enterprise

14.78 sec*

$12.82

4x n2-highmem-32

$0.053

*Assuming similar performance for plain Trino and Starburst Enterprise distribution.

Trino via Starburst Enterprise becomes 13 percent more expensive.

In-House Trino Costs

If you are up for running Trino in-house, consider the cost of two senior distributed systems engineers (most companies can’t do with just one because they need to be on call to maintain the SLA). That’s $220K * 2 * 2.7 overhead factor = $111 per calendar hour (Silicon Valley pricing).

That would be tremendous overhead for just a 4-node cluster as in the benchmark, so let’s assume a more realistic scenario for a large company: a ~110-node Trino cluster. Even with such a large cluster size, you would be paying a ~100 percent markup on top of infrastructure costs for the DevOps labor, making Trino 40 percent more expensive than Snowflake per query.

Mean TPC query time

Price/hr

Size

Mean cost per query

Snowflake

10.74 sec

$16.00

Large

$0.047

Trino including DevOps

14.78 sec

$16.04

4x n2-highmem-32

$0.066

This should not be surprising because a vendor like Snowflake, with tens of thousands of customers, is able to perfect their technology and distribute their DevOps costs thinly.

We’ve been making quite a few assumptions along the way, so the numbers may differ from case to case, but the key takeaway is that there is no magic 1.5x/2x/3x savings we can expect from going the OSS route for a typical team, unless you have your own hardware infrastructure and are FAANG-big. Note that DoorDash and Block (aka Square), two public and very data-driven companies with hundreds of internal data users, decided against hosting their data infrastructure in favor of Snowflake.

OSS Eliminates Vendor Lock-in

Unless your company is so small that you don’t care about SLAs or so big (Facebook-big) that you can actually pull it off while benefiting from economies of scale, you will probably contract a vendor to host the OSS data warehouse for you instead of running it in-house.

Many vendors run their custom distributions of OSS, which can be significantly behind the master branch of the respective OSS projects. That, in turn, limits your ability to tinker with the OSS and to take advantage of its latest features and optimizations. And now your infrastructure is also integrated with the vendor’s APIs, making it harder to move out. All in all, you end up with a similar vendor lock-in that you tried to avoid by going the OSS route. With one exception: a theoretical ability to integrate some of the community-contributed components. But unless you have very specific requirements here, it’s hardly an advantage as the best proprietary database products have excellent interoperability as well.

OSS Gives More Flexibility

First, consider all of the challenges with vendor-flavored OSS described above. Finally, ask yourself (and your engineering partners) what exactly and why would you want to add to a database solution that isn’t readily available from serverless offerings such as Snowflake or BigQuery? Is that custom feature so critical to your business that you are going to hire expensive engineering talent to develop database tech instead of building your core product?

End-User Experience

It didn’t come up in the “three reasons,” but I am going to mention end-user experience anyway. Even the most mature OSS products in the data warehousing space are still rough around the edges when it comes to user experience in comparison to top proprietary products such as Snowflake and BigQuery. If you are unsure what I am referring to, try optimizing the performance of a couple of Spark jobs that take forever to complete — for no apparent reason. Or try to see inside your Kafka data queue. In a world where analytics powers the business and where talent is the scarcest resource, a data platform’s user experience matters a lot — so don’t compromise on it, especially without a good reason.

Where Does OSS Make Sense in the Data Stack?

While core data processing infrastructure, as we argue, is a tricky one, other parts of the stack — for example, data orchestration and transformation layers — are perfect examples where open-source solutions are most effective. Note the success stories of Airflow, dbt, and Dagster.

If we were to draw a line to distinguish “good for open-source” and “challenging for open-source” places in the stack, it would mostly come down to customization and extensibility. It is unlikely for an average company to tinker with core infra, such as Linux, Docker or their data warehousing solution. But building dbt packages, utils or connectors is very common.

Lastly, even with core data infrastructure, the open-source world is rapidly converging with fully managed. Databricks started its business as a hosted Spark platform and currently offers a serverless SQL data engine that directly challenges Snowflake in both price and performance while supporting open-source file formats.

Open-source software can be profoundly effective if adopted for the right reasons and can turn into a terrible money sink when the fit isn’t right.

This article was written by Alex Morozov on May 3, 2022

Alex Morozov

Alex Morozov is the chief technology officer and co-founder at Datafold. A nuclear physicist by training, Alex has served as CTO at numerous startups and brings over 15 years of experience in delivering sophisticated software systems across data, IoT, embedded and communications.

Best Practices