
The Next Iceberg Bottleneck Is Operations, Not Adoption

Dremio’s Alex Merced offers this commentary on how the next big Iceberg bottleneck is operations, not adoption. This article originally appeared in Insight Jam, an enterprise IT community that enables human conversation on AI.

Apache Iceberg won the format debate, and that part is now over. In the first half of 2026, Snowflake shipped Iceberg v3 to general availability at its June Summit and Databricks pushed Managed, Foreign, and v3 Iceberg support across Unity Catalog. As if that wasn’t proof enough, both Cloudflare R2 and AWS Glue added automatic compaction to their managed catalogs. When every major vendor builds on the same table format, the question stops being whether to adopt Iceberg. It becomes whether you can run it well at scale.

The short answer is that most teams cannot, at least not yet. That is not because Iceberg is hard to start with, but because the work that keeps an Iceberg table healthy never shows up in the demo. You write data, queries run, dashboards load. Then three months later the same dashboard takes 40 seconds instead of 4, and nobody knows why.

The gap between adoption and operation has become a real bottleneck. What it looks like in production, and why it keeps surfacing in practitioner forums instead of conference keynotes, comes down to a few things:

Small Files are the Cost You Do Not See Coming

Every insert into an Iceberg table tends to create new data files rather than rewrite existing ones. That design keeps writes fast and snapshots immutable. It also means a table fed by streaming ingestion or frequent micro-batches piles up thousands of tiny files.

Small files punish you three ways at once. First, query planning slows down because the engine reads more manifest entries. Next, storage costs climb because cloud providers charge per request and scanning ten thousand 2 MB files burns far more GET operations than scanning forty 512 MB files. Finally, metadata bloats because Iceberg tracks every file in manifests, so file count drives manifest size directly.
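The request arithmetic behind that comparison can be sketched in a few lines. This is a rough illustration that assumes one GET per file; real engines usually issue several ranged reads per file, so treat the counts as a lower bound. The function name `scan_requests` is made up for this example.

```python
# Rough request-count comparison for the two layouts described above.
# Assumption: exactly one GET per data file (real scans typically issue more).

def scan_requests(file_count: int, file_size_mb: int) -> dict:
    """Return total data volume and GET count for one full scan."""
    return {
        "total_mb": file_count * file_size_mb,
        "gets": file_count,
    }


small = scan_requests(10_000, 2)   # ten thousand 2 MB files
large = scan_requests(40, 512)     # forty 512 MB files

# Nearly the same data volume, very different request counts.
request_ratio = small["gets"] / large["gets"]
```

Under these assumptions both layouts hold about 20 GB, but the small-file layout needs 250 times as many requests to read it.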

The solution is compaction: read the small files, merge them into larger files in the 128 to 512 MB range, and commit the result. It’s simple to describe. The trouble is that nobody owns it by default. dbt, the tool most teams use to build their tables, writes data and stops there. It does not compact files, expire snapshots, or clean orphans. Someone has to build that layer, and on most teams that someone does not exist until the dashboards already feel slow.
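The planning half of compaction, deciding which small files get merged together, can be sketched as a bin-packing problem. The following is a minimal illustration, not how any particular engine implements it; `DataFile`, `plan_compaction`, and both size thresholds are assumptions chosen for the example. The actual rewrite and atomic commit are left to the engine.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    path: str
    size_bytes: int


def plan_compaction(files, target_bytes=256 * MB, small_threshold_bytes=128 * MB):
    """Group small files into bins of at most target_bytes (first-fit decreasing)."""
    small = sorted(
        (f for f in files if f.size_bytes < small_threshold_bytes),
        key=lambda f: f.size_bytes,
        reverse=True,
    )
    bins = []  # each bin is [remaining_capacity, [files]]
    for f in small:
        for b in bins:
            if b[0] >= f.size_bytes:
                b[0] -= f.size_bytes
                b[1].append(f)
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([target_bytes - f.size_bytes, [f]])
    # A bin holding a single file gains nothing from being rewritten.
    return [group for _, group in bins if len(group) > 1]


# Example: 300 files of 2 MB plus one file that is already large.
example_files = [DataFile(f"part-{i}.parquet", 2 * MB) for i in range(300)]
example_files.append(DataFile("big.parquet", 400 * MB))
example_groups = plan_compaction(example_files)
```

Each returned group would become one output file near the target size. Files already above the threshold are left alone, which keeps the job from rewriting data that is already in good shape.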

Delete Files Quietly Break Read Performance

Compaction gets the attention, yet delete files cause more pain. Iceberg supports row-level deletes through position and equality delete files. This avoids rewriting a 500 MB data file just to remove a handful of rows, which is the right tradeoff for change-data-capture pipelines and frequent updates. The cost lands on reads, because every scan has to reconcile base data against the delete files that apply to it.

Let them accumulate and read amplification turns ugly: a table with 50 data files and 200 delete files can scan three to five times more data than the same table after cleanup. One widely cited claim calls this the single most common performance problem in dbt-plus-Iceberg setups. But what most teams miss is that compacting data files does not fix the problem. You can pack your data into perfect 256 MB blocks and still crawl if every scan applies thousands of delete fragments. Delete compaction is therefore a separate maintenance loop, and if your delete workload is heavy, you need to run two loops instead of one.
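One way to picture the two loops is as two independent checks on the same table statistics. The sketch below is illustrative only: `TableStats`, `loops_needed`, and both thresholds are hypothetical names and values, not Iceberg defaults or recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    data_files: int
    delete_files: int
    avg_data_file_mb: float


def loops_needed(stats, min_avg_mb=128.0, max_delete_ratio=0.5):
    """Decide independently whether each maintenance loop should run.

    Both thresholds are hypothetical placeholders for illustration.
    """
    loops = []
    if stats.avg_data_file_mb < min_avg_mb:
        loops.append("data_compaction")
    # Delete files per data file; guard against an empty table.
    delete_ratio = stats.delete_files / max(stats.data_files, 1)
    if delete_ratio > max_delete_ratio:
        loops.append("delete_compaction")
    return loops


# The table from the text: well-sized data files, but 200 delete files on 50 data files.
cdc_table = TableStats(data_files=50, delete_files=200, avg_data_file_mb=256.0)
cdc_loops = loops_needed(cdc_table)
```

For the example table only the delete loop fires, which is the point: well-sized data files say nothing about how many delete files each scan still has to apply.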

Snapshots, Orphans, and Manifests Add Up

Beyond data and delete files, three more jobs keep a table healthy, and skipping them is how storage bills creep up.

Snapshot expiry removes old table versions you no longer need for time travel, so keep a sensible window, say seven days, then expire the rest. Orphan file cleanup deletes files that no live snapshot references, which collect from failed writes and aborted compactions. Manifest rewrites consolidate the metadata files that track everything else, because thousands of tiny manifests slow query planning the same way thousands of tiny data files do.
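The retention rule for snapshot expiry can be sketched as a simple selection. This is an assumption-laden illustration: `snapshots_to_expire` is a made-up helper, snapshots are modeled as plain `(snapshot_id, committed_at)` tuples, and the safeguard of always keeping the newest snapshot is a design choice of this example.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_expire(snapshots, now, retention=timedelta(days=7), keep_last=1):
    """Return IDs of snapshots that fall outside the retention window.

    snapshots: iterable of (snapshot_id, committed_at) tuples.
    The newest keep_last snapshots are always retained, even if they are old,
    so the table never loses its current state.
    """
    ordered = sorted(snapshots, key=lambda s: s[1], reverse=True)
    protected = {sid for sid, _ in ordered[:keep_last]}
    cutoff = now - retention
    return [sid for sid, ts in ordered if ts < cutoff and sid not in protected]


now = datetime(2026, 7, 1, tzinfo=timezone.utc)
history = [
    (1, now - timedelta(days=30)),
    (2, now - timedelta(days=8)),
    (3, now - timedelta(days=1)),
]
expired = snapshots_to_expire(history, now)
```

With a seven-day window, snapshots 1 and 2 are selected and snapshot 3 stays available for time travel.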

None of these is hard on its own. The hard part is sequencing and automation. Run orphan cleanup against a stale file listing and you can delete live data. The Iceberg docs warn about this directly for file systems where paths change over time. So you need the operations, you need them in the right order, and you need them running on a schedule without a human babysitting Spark jobs every night.
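The risk in orphan cleanup is easier to see in code. The sketch below is a simplified model, not the Iceberg implementation: `find_orphans` is a hypothetical helper, the three-day minimum age is an assumed safety margin, and paths are compared as plain strings, which is exactly where a listing that spells the same file differently would go wrong.

```python
from datetime import datetime, timedelta, timezone


def find_orphans(listed, referenced, now, min_age=timedelta(days=3)):
    """Return paths that are candidates for deletion.

    listed: dict of path -> last-modified time from a storage listing.
    referenced: set of paths reachable from any live snapshot.
    Files younger than min_age are skipped because they may belong to a
    write that has not committed yet.
    """
    cutoff = now - min_age
    return sorted(
        path
        for path, modified in listed.items()
        if path not in referenced and modified < cutoff
    )


now = datetime(2026, 7, 1, tzinfo=timezone.utc)
listing = {
    "data/a.parquet": now - timedelta(days=10),  # live file
    "data/b.parquet": now - timedelta(days=10),  # left behind by a failed write
    "data/c.parquet": now - timedelta(hours=1),  # possibly an in-flight write
}
live = {"data/a.parquet"}
orphans = find_orphans(listing, live, now)
```

Only `data/b.parquet` is returned. If `live` were built from a stale snapshot list, or if the listing used a different path form for the same file, a live file would look like an orphan, which is why the referenced set has to be fresh and the comparison has to be trusted before anything is deleted.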

Why This is the Conversation Right Now

Look at where vendor effort actually went in 2026 and the pattern is obvious. Cloudflare R2 Data Catalog added one-click automatic compaction. AWS Glue built table maintenance into its catalog. Databricks markets Unity Catalog managed tables as automatically optimized. Now a wave of dedicated control planes sells itself entirely on running these loops.

That is not a coincidence. The market spent three years answering “should we use Iceberg?” and is now answering “how do we keep ten thousand Iceberg tables from rotting?” Adoption was the easy half. The expensive half is the maintenance nobody scoped into the migration plan.

Teams that handle this well treat table health as a first-class system, not a cleanup chore. They monitor small-file counts and delete-file ratios through Iceberg’s own metadata tables and automate the five maintenance operations as coordinated loops rather than disconnected cron jobs. They also measure the cost of doing nothing, because a slow table is a tax paid on every single query against it.
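A health check built on those metadata tables might look roughly like this. The sketch assumes rows have already been fetched from the table's `files` metadata table as dictionaries with `content` (0 for data files, 1 for position deletes, 2 for equality deletes) and `file_size_in_bytes`; how you query that table depends on your engine. The function name `table_health` and the 128 MB small-file threshold are assumptions for the example.

```python
MB = 1024 * 1024


def table_health(file_rows, small_file_bytes=128 * MB):
    """Summarize rows shaped like Iceberg's `files` metadata table.

    Each row is a dict with `content` (0 = data, 1 = position deletes,
    2 = equality deletes) and `file_size_in_bytes`.
    """
    data = [r for r in file_rows if r["content"] == 0]
    deletes = [r for r in file_rows if r["content"] in (1, 2)]
    small = [r for r in data if r["file_size_in_bytes"] < small_file_bytes]
    return {
        "data_files": len(data),
        "small_data_files": len(small),
        "delete_files": len(deletes),
        "delete_ratio": len(deletes) / len(data) if data else 0.0,
    }


# Example rows: 50 well-sized data files and 200 small position-delete files.
rows = [{"content": 0, "file_size_in_bytes": 256 * MB} for _ in range(50)]
rows += [{"content": 1, "file_size_in_bytes": 1 * MB} for _ in range(200)]
report = table_health(rows)
```

Tracking these numbers per table over time is what turns maintenance from a reaction to slow dashboards into a scheduled decision.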

The Engine Layer is Part of the Answer

This is where the choice of query engine matters more than people expect. An engine that only reads Iceberg leaves all the operational weight on you. But an engine that understands table state and automates maintenance can carry some of it.

The market moved in this direction across the board in 2026. Managed catalogs from cloud providers now run compaction on a schedule. Dedicated control planes coordinate the full set of maintenance operations so they reinforce each other instead of fighting. Several query platforms contribute to Apache Iceberg, run on it natively, and fold table maintenance and acceleration into the platform rather than leaving it to external Spark jobs. None of these solves operations by itself. What they show is that operational maturity is now a real selection criterion, and tooling that automates maintenance saves you from staffing it by hand.

Why Migration Was Never the Hard Part

Here is the prediction. Over the next year, fewer teams will brag about adopting Iceberg and more will quietly struggle to operate it. The conference slides will keep celebrating format wars and v3 features. However, the actual work will be small-file compaction at 2 a.m., delete-file ratios on CDC tables, and storage bills that no longer make sense.

If you are planning an Iceberg rollout, budget for the second half. Pick tooling that automates maintenance instead of assuming your team will hand-tune Spark forever. Run one experiment before you commit: load a table, hammer it with updates for a week, and watch what happens to query latency with no maintenance at all. That single test tells you more about your future than any benchmark. Adoption is done. Operations is the work.
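The measurement side of that experiment needs very little code. Here is a minimal sketch, assuming you supply your own `run_query` callable that executes a representative query against the test table; `time_query` and `slowdown` are hypothetical helper names.

```python
import statistics
import time


def time_query(run_query, repeats=5):
    """Return the median wall-clock seconds over several runs of run_query()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def slowdown(daily_latencies):
    """Ratio of the latest latency to the first-day baseline."""
    return daily_latencies[-1] / daily_latencies[0]


# Illustrative values only: a query that drifts from 4 seconds to 40 seconds.
example_ratio = slowdown([4.0, 9.0, 40.0])
```

Record one `time_query` result per day during the week of updates, then compare the last day to the first. Using the median of several runs keeps one cold-cache outlier from dominating the trend.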
