How we run 4.2B rows/day through a single Iceberg catalog

When a customer crossed three billion event rows per day last summer, our fleet-wide p95 query latency nearly doubled. The table was not that large — about 180 TB compressed — but metadata was the thing collapsing under its own weight. Manifest list scans had become the single largest wall-clock cost for every read, and our compaction pipeline was slipping by 40 minutes per day.

This is the write-up we wish we had at the start: what we actually changed in the Iceberg tables, what we changed in the dxdata-catalog service that manages them, and what we now consider table stakes for any workload north of one billion rows per day.

The problem: O(partitions) planning

Our largest tenant had a table partitioned by day(event_ts) and bucket(16, tenant_id). That looked fine on paper — sixteen buckets per day, roughly 5,800 active partitions after we rolled off the 365-day hot window. In practice, manifest lists were growing unbounded because every streaming micro-batch was creating a new data file and a new manifest entry.

By the time planning hit a 90-day query, the coordinator was opening 4,800 manifests just to prune partitions. The scan itself was fast; the plan was taking 3.1 seconds.

What we changed

Three changes, in the order we shipped them:

Manifest rewrites every 30 minutes, not 24 hours. We had inherited the default schedule. For high-ingest tables we now rewrite manifests whenever the count of manifests in a partition exceeds 12, which in practice fires every ~27 minutes.
Partition evolution from `day` → `hour` for the hot window. Iceberg partition spec evolution lets us keep the `day` spec for historical files and use `hour` for everything in the last seven days. Readers never see the transition.
Split the streaming sink into two tables joined by a view. Late-arriving fact rows were ruining our file sizing. We now land them into a second table with a looser partition spec and expose them as a single UNION ALL view.

The compaction loop

Our compaction worker is a deceptively short function. The whole loop fits on one screen:

compaction_worker.pypython

# runs every 30 minutes per table
def maybe_rewrite_manifests(table: IcebergTable) -> Outcome:
    stats = table.current_snapshot().manifest_stats()
    if stats.count_per_partition_p95 < 12:
        return Outcome.SKIPPED

    plan = RewriteManifestsPlan(
        table=table,
        target_manifest_size_bytes=8 * 1024 * 1024,
        cluster_by=["partition_key"],
    )
    with table.transactions.begin("rewrite-manifests") as tx:
        tx.apply(plan)
        tx.commit(producer="dxdata-compactor/0.18.4")
    return Outcome.REWROTE(plan.new_manifest_count)

The important part is the transaction boundary. We used to compact in place, which produced a surprising amount of read-side churn when a long-running query snapshot-isolated mid-rewrite. Wrapping the rewrite in a single commit made the before/after snapshots fully consistent.

The numbers, six months later

Planning time for the 90-day query fell from 3.1 s to 180 ms. Our compaction worker now averages 14 minutes of work per hour instead of 50. Total storage went up about 4% because we keep more small rewritten manifests around during the hot window, which we consider a fine trade.

The Iceberg spec gives you many levers. Most of the production wins come from running fewer levers more often, not more levers with more tuning.
— From our internal postmortem doc

What comes next

We are rolling this configuration out to every table above 500 M rows/day as an opinionated default. The next thing to land is adaptive partition spec selection — our catalog service will soon pick the initial spec from the ingest profile instead of asking the user.

Written by

Maya Okafor

Staff Engineer, Query at DXData.

The problem: O(partitions) planning

By the time planning hit a 90-day query, the coordinator was opening 4,800 manifests just to prune partitions. The scan itself was fast; the plan was taking 3.1 seconds.

What we changed

Three changes, in the order we shipped them:

Manifest rewrites every 30 minutes, not 24 hours. We had inherited the default schedule. For high-ingest tables we now rewrite manifests whenever the count of manifests in a partition exceeds 12, which in practice fires every ~27 minutes.
Partition evolution from `day` → `hour` for the hot window. Iceberg partition spec evolution lets us keep the `day` spec for historical files and use `hour` for everything in the last seven days. Readers never see the transition.
Split the streaming sink into two tables joined by a view. Late-arriving fact rows were ruining our file sizing. We now land them into a second table with a looser partition spec and expose them as a single UNION ALL view.

The compaction loop

Our compaction worker is a deceptively short function. The whole loop fits on one screen:

compaction_worker.pypython

# runs every 30 minutes per table
def maybe_rewrite_manifests(table: IcebergTable) -> Outcome:
    stats = table.current_snapshot().manifest_stats()
    if stats.count_per_partition_p95 < 12:
        return Outcome.SKIPPED

    plan = RewriteManifestsPlan(
        table=table,
        target_manifest_size_bytes=8 * 1024 * 1024,
        cluster_by=["partition_key"],
    )
    with table.transactions.begin("rewrite-manifests") as tx:
        tx.apply(plan)
        tx.commit(producer="dxdata-compactor/0.18.4")
    return Outcome.REWROTE(plan.new_manifest_count)

The numbers, six months later

The Iceberg spec gives you many levers. Most of the production wins come from running fewer levers more often, not more levers with more tuning.
— From our internal postmortem doc

What comes next

Written by

Maya Okafor

Staff Engineer, Query at DXData.

How we run 4.2B rows/day through a single Iceberg catalog

The problem: O(partitions) planning

What we changed

The compaction loop

The numbers, six months later

What comes next

Read next

Branching production tables: a 6-month postmortem

Query federation: what finally made it fast

Why we built Nessie-style branching into DXData

Start building with DXData.

How we run 4.2B rows/day through a single Iceberg catalog

The problem: O(partitions) planning

What we changed

The compaction loop

The numbers, six months later

What comes next

Read next

Branching production tables: a 6-month postmortem

Query federation: what finally made it fast

Why we built Nessie-style branching into DXData

Start building with DXData.