When a customer crossed three billion event rows per day last summer, our fleet-wide p95 query latency nearly doubled. The table was not that large — about 180 TB compressed — but metadata was the thing collapsing under its own weight. Manifest list scans had become the single largest wall-clock cost for every read, and our compaction pipeline was slipping by 40 minutes per day.
This is the write-up we wish we had at the start: what we actually changed in the Iceberg tables, what we changed in the dxdata-catalog service that manages them, and what we now consider table stakes for any workload north of one billion rows per day.
The problem: O(partitions) planning
Our largest tenant had a table partitioned by day(event_ts) and bucket(16, tenant_id). That looked fine on paper: sixteen buckets per day, roughly 5,800 active partitions after we rolled off the 365-day hot window. In practice, manifest lists were growing without bound because every streaming micro-batch created a new data file and a new manifest entry.
When the planner handled a 90-day query, the coordinator opened 4,800 manifests just to prune partitions. The scan itself was fast; planning alone took 3.1 seconds.
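A quick back-of-envelope check of those figures, as a sketch. The constants come straight from the text; the last value is only the ratio of manifests opened to partitions in the query range, not a measured per-partition count, since one manifest can cover several partitions.

```python
# Back-of-envelope check of the partition counts quoted in the text.
BUCKETS_PER_DAY = 16      # bucket(16, tenant_id)
HOT_WINDOW_DAYS = 365     # the 365-day hot window
QUERY_DAYS = 90           # time range of the slow query
MANIFESTS_OPENED = 4_800  # manifests the coordinator opened while planning

active_partitions = HOT_WINDOW_DAYS * BUCKETS_PER_DAY
query_partitions = QUERY_DAYS * BUCKETS_PER_DAY
# Ratio only: an illustrative figure, not a measurement.
manifests_per_partition = MANIFESTS_OPENED / query_partitions

print(active_partitions)                  # 5840
print(query_partitions)                   # 1440
print(round(manifests_per_partition, 2))  # 3.33
```

The 5,840 matches the "roughly 5,800" active partitions, and the ratio shows the planner was opening a little over three manifests for every partition the query could touch.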
What we changed
Three changes, in the order we shipped them:
- Manifest rewrites every 30 minutes, not 24 hours. We had inherited the default schedule. For high-ingest tables we now rewrite manifests whenever the count of manifests in a partition exceeds 12, which in practice fires every ~27 minutes.
- Partition evolution from `day` → `hour` for the hot window. Iceberg partition spec evolution lets us keep the `day` spec for historical files and use `hour` for everything in the last seven days. Readers never see the transition.
- Split the streaming sink into two tables joined by a view. Late-arriving fact rows were ruining our file sizing. We now land them in a second table with a looser partition spec and expose both tables through a single UNION ALL view.
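To make the `day` → `hour` change concrete, here is a minimal sketch of the two transforms as the Iceberg spec defines them (whole days and whole hours since the Unix epoch). The `partition_value` helper and the `HOT_WINDOW` constant are hypothetical and exist only for illustration. In a real table each manifest records the spec its files were written with, so this age-based rule only approximates the end state; it is not how Iceberg chooses a spec.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
HOT_WINDOW = timedelta(days=7)  # from the text: hourly spec for the last seven days


def day_transform(ts: datetime) -> int:
    """Whole days since 1970-01-01, matching the Iceberg `day` transform."""
    return (ts - EPOCH) // timedelta(days=1)


def hour_transform(ts: datetime) -> int:
    """Whole hours since 1970-01-01 00:00, matching the `hour` transform."""
    return (ts - EPOCH) // timedelta(hours=1)


def partition_value(ts: datetime, now: datetime) -> tuple[str, int]:
    """Hypothetical helper: granularity a row of this age would sit in."""
    if now - ts <= HOT_WINDOW:
        return ("hour", hour_transform(ts))
    return ("day", day_transform(ts))


now = datetime(2024, 3, 10, tzinfo=timezone.utc)
print(partition_value(now - timedelta(days=2), now)[0])   # hour
print(partition_value(now - timedelta(days=30), now)[0])  # day
```

Because an hour value divided by 24 (rounded down) gives back the day value, every hourly partition nests cleanly inside one daily partition, which is part of why readers can prune across both specs without noticing the switch.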
The compaction loop
Our compaction worker is a deceptively short function. The whole loop fits on one screen:

    # runs every 30 minutes per table
    def maybe_rewrite_manifests(table: IcebergTable) -> Outcome:
        stats = table.current_snapshot().manifest_stats()
        # Skip when the p95 manifest count per partition is below the threshold.
        if stats.count_per_partition_p95 < 12:
            return Outcome.SKIPPED
        plan = RewriteManifestsPlan(
            table=table,
            target_manifest_size_bytes=8 * 1024 * 1024,  # 8 MiB
            cluster_by=["partition_key"],
        )
        # Apply and commit the whole rewrite as a single transaction.
        with table.transactions.begin("rewrite-manifests") as tx:
            tx.apply(plan)
            tx.commit(producer="dxdata-compactor/0.18.4")
        return Outcome.REWROTE(plan.new_manifest_count)

The important part is the transaction boundary. We used to compact in place, which caused a surprising amount of read-side churn whenever a long-running query took its snapshot mid-rewrite. Wrapping the rewrite in a single commit made the before and after snapshots fully consistent.
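The effect of that single commit can be illustrated with a toy in-memory model. Everything here (`Snapshot`, `ToyTable`, `commit_rewrite`, the file names) is a hypothetical stand-in, not the dxdata-catalog or Iceberg API. The only point is that a reader holding the old snapshot keeps seeing the old manifest list while the rewrite lands as a new snapshot.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifests: tuple[str, ...]  # immutable, so a pinned reader's view never changes


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table."""
    snapshots: list[Snapshot] = field(default_factory=list)

    def current_snapshot(self) -> Snapshot:
        return self.snapshots[-1]

    def commit_rewrite(self, merged: tuple[str, ...]) -> Snapshot:
        # The rewrite becomes visible in one step: a new snapshot is appended
        # and every existing snapshot is left untouched.
        new = Snapshot(self.current_snapshot().snapshot_id + 1, merged)
        self.snapshots.append(new)
        return new


table = ToyTable([Snapshot(1, tuple(f"m{i}.avro" for i in range(14)))])
pinned = table.current_snapshot()         # a long-running query starts here
table.commit_rewrite(("merged-0.avro",))  # compaction commits mid-query

print(len(pinned.manifests))                    # 14
print(len(table.current_snapshot().manifests))  # 1
```

An in-place rewrite would correspond to mutating the manifest list of snapshot 1 directly, which is exactly the case where a pinned reader could see a half-rewritten list.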
The numbers, six months later
Planning time for the 90-day query fell from 3.1 s to 180 ms. Our compaction worker now averages 14 minutes of work per hour instead of 50. Total storage went up about 4% because we keep more small rewritten manifests around during the hot window, which we consider a fine trade.
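For readers who prefer ratios, the same figures can be restated with a few lines of arithmetic. This sketch uses only the numbers quoted above; the variable names are ours.

```python
# Ratios derived from the figures in the text.
plan_before_s, plan_after_s = 3.1, 0.180    # 90-day query planning time
busy_before_min, busy_after_min = 50, 14    # compaction minutes per hour

speedup = plan_before_s / plan_after_s
busy_reduction = 1 - busy_after_min / busy_before_min

print(f"{speedup:.1f}x")        # 17.2x
print(f"{busy_reduction:.0%}")  # 72%
```

That is roughly a 17x planning speedup and about 72% less compaction work per hour, bought with about 4% more storage.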
The Iceberg spec gives you many levers. Most of the production wins come from pulling fewer levers more often, not from pulling more levers with more tuning.
What comes next
We are rolling this configuration out to every table above 500 M rows/day as an opinionated default. The next thing to land is adaptive partition spec selection — our catalog service will soon pick the initial spec from the ingest profile instead of asking the user.
Written by
Maya Okafor
Staff Engineer, Query at DXData.