Git-like branching
Branch your data. Ship with confidence.
Write-audit-publish workflows powered by Nessie. Every table, every schema, every dataset — versioned, reversible, and zero-copy.
// the old way
Shipping a schema change used to mean a maintenance window. Not anymore.
The traditional playbook for changing a production table reads like a deploy from 2012: stop the pipeline, copy the table, run the migration, backfill, swap names, pray. Teams burn weekends on it, and that is before anyone asks what happens if a last-minute bug in the transform corrupts a million rows.
DXData inherits the answer from software engineering. Branch the data, do the risky work in isolation, run your tests, open a PR, merge on green. When something goes wrong — and it will — roll back to a tag in seconds instead of restoring from last night's backup. Your pipelines never stop. Your dashboards never blink.
Why data branching
Branch any table
Fork a single table, a schema, or your whole catalog. Every branch has its own working history.
Zero-copy
Branches share underlying Iceberg snapshots until you write. No storage tax for keeping six branches around.
Reversible
Every commit is a pointer. Roll back a bad migration in seconds — no restore job required.
# fork main into an isolated working branch
$ dxdata branch create feature/new-metric --from main
created feature/new-metric at snapshot 8a4f1e9
# list branches
$ dxdata branch list
* main                 8a4f1e9   2h ago
  feature/new-metric   8a4f1e9   just now
  experiment/churn     6d2ac71   3d ago
// branch.create
Create a branch in seconds.
Nessie gives DXData the same primitives Git gave code: named refs, tags, immutable commits, atomic merges. A branch is a pointer — not a copy, not an ETL job, not a new schema you have to garbage-collect next quarter.
Your analysts branch the catalog the way they branch a repo. Your platform team enforces policy on the merge, the same way a GitHub Actions workflow gates a release.
- Fork a single table or the whole catalog
- Works with every source DXData ingests
- Audit log on every branch, tag, and merge
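To make "a branch is a pointer" concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the DXData or Nessie API: `ToyCatalog`, `Commit`, and `create_branch` are hypothetical names chosen for illustration, and the commit id is borrowed from the terminal example above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    """Immutable commit: an id, a parent id, and (table, snapshot) pairs."""
    commit_id: str
    parent: Optional[str]
    tables: Tuple[Tuple[str, str], ...]


@dataclass
class ToyCatalog:
    """Hypothetical in-memory catalog; not the DXData or Nessie API."""
    commits: Dict[str, Commit] = field(default_factory=dict)
    refs: Dict[str, str] = field(default_factory=dict)  # branch or tag -> commit id

    def create_branch(self, name: str, from_ref: str) -> str:
        # Creating a branch copies one pointer, never any table data.
        self.refs[name] = self.refs[from_ref]
        return self.refs[name]


catalog = ToyCatalog()
catalog.commits["8a4f1e9"] = Commit("8a4f1e9", None, (("orders", "snap-1"),))
catalog.refs["main"] = "8a4f1e9"

head = catalog.create_branch("feature/new-metric", "main")
print(head == catalog.refs["main"])  # both refs name the same commit
```

In this model the number of commits does not change when a branch is created; only the `refs` mapping grows by one entry.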
// zero-copy.isolation
Isolated branches that cost nothing.
Under the hood, each branch is a Nessie reference that points at the same Iceberg snapshot as its parent. Until you actually write, both refs resolve to the exact same Parquet files on object storage: 0 bytes of new data, 0 scheduler pressure, 0 copy jobs.
The moment you write, only the changed snapshot metadata and any rewritten files live on the branch. That is what makes it safe to leave a dozen experimental branches running overnight without paging your FinOps team.
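The file-level copy-on-write idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions: a snapshot is treated as an immutable set of file paths, and `rewrite_file` and `unique_files` are hypothetical helpers, not part of Iceberg, Nessie, or DXData.

```python
from typing import Dict, FrozenSet

# Hypothetical model: a snapshot is an immutable set of data file paths.
snapshots: Dict[str, FrozenSet[str]] = {
    "snap-1": frozenset({"orders/part-0.parquet", "orders/part-1.parquet"}),
}
refs = {"main": "snap-1", "feature/new-metric": "snap-1"}  # zero-copy fork


def rewrite_file(branch: str, old_file: str, new_file: str, new_snapshot: str) -> None:
    """Copy-on-write: record a new snapshot that swaps one file and reuses the rest."""
    current = snapshots[refs[branch]]
    snapshots[new_snapshot] = (current - {old_file}) | {new_file}
    refs[branch] = new_snapshot


def unique_files(branch: str, parent: str) -> FrozenSet[str]:
    """Files that exist only on the branch, i.e. its extra storage footprint."""
    return snapshots[refs[branch]] - snapshots[refs[parent]]


before = unique_files("feature/new-metric", "main")
rewrite_file("feature/new-metric", "orders/part-1.parquet",
             "orders/part-1-v2.parquet", "snap-2")
after = unique_files("feature/new-metric", "main")
print(len(before), sorted(after))
```

Before the write the branch owns no files of its own; after the write it owns exactly the one rewritten file, and `main` still points at `snap-1`.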
- Iceberg snapshots are immutable — branches are just new pointers
- Copy-on-write at the file level, not the table level
- Ephemeral branches cost effectively zero
- unique_key on orders.id · 1.2s
- not_null on orders.total · 0.4s
- row count within +/- 2% · 0.8s
- schema contract: v4.2 · 0.1s
// wap.one-command
Write. Audit. Publish.
Write-Audit-Publish is the industry-standard pattern for safely landing data changes — write to a side table, audit it, then swap it in. Great idea, historically a mountain of YAML to implement.
DXData collapses it into one command. The branch is your side table. The quality checks are your audit. Fast-forwarding main is your publish. Green means ship, red means the branch stays put and nothing in production moves.
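As a rough illustration of the pattern itself (not of the DXData command), the write, audit, and publish steps can be modeled over an in-memory dictionary. Everything here is an assumption made for the sketch: the `wap-branch` name, the `write_audit_publish` function, and the two check functions.

```python
from typing import Callable, Dict, List

Rows = List[dict]
Check = Callable[[Rows], bool]


def write_audit_publish(tables: Dict[str, Rows], new_rows: Rows,
                        checks: List[Check], branch: str = "wap-branch") -> bool:
    """Toy Write-Audit-Publish over a dict keyed by branch name."""
    tables[branch] = tables["main"] + new_rows                # write to the side branch
    passed = all(check(tables[branch]) for check in checks)   # audit
    if passed:
        tables["main"] = tables.pop(branch)                   # publish: main moves forward
    return passed                                             # on red, the branch stays put


def unique_ids(rows: Rows) -> bool:
    ids = [r["id"] for r in rows]
    return len(ids) == len(set(ids))


def totals_not_null(rows: Rows) -> bool:
    return all(r["total"] is not None for r in rows)


tables = {"main": [{"id": 1, "total": 10.0}]}
checks = [unique_ids, totals_not_null]
ok = write_audit_publish(tables, [{"id": 2, "total": 5.0}], checks)
bad = write_audit_publish(tables, [{"id": 2, "total": None}], checks)
print(ok, bad, len(tables["main"]))
```

The second run fails its audit, so `main` keeps the two rows from the first run and the failed branch remains available for inspection.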
- dbt tests, Great Expectations checks, and native quality gates are all first-class
- Policy hooks can block a merge on compliance, PII, or lineage rules
- Every merge writes a signed commit to the audit log
- Row count: 1,284,992 rows (+0.4%)
- Null rate on customer_id: 0.00%
- Downstream dashboards: 2 use legacy_flag
// review.like-code
Review data changes the way you review code.
When someone opens a branch for review, DXData renders a diff your team actually understands: columns added, types changed, rows added or deleted, with a side-by-side of quality metrics before and after.
Approvers comment on specific rows or columns, block merges, or request changes — all in the same place. No Slack threads, no screenshots, no guessing which version of the dashboard is current.
- Column-level diff with rename detection
- Quality checks and lineage impact surfaced inline
- Approve, request changes, or require a second reviewer — per-branch policy
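A column-level diff of the kind described above can be approximated with a small Python function. This is a sketch under assumptions: schemas are modeled as plain column-to-type dictionaries, the example columns are hypothetical, and rename detection is deliberately left out.

```python
from typing import Dict, List


def schema_diff(base: Dict[str, str], branch: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two column -> type mappings and bucket the differences."""
    return {
        "added": sorted(c for c in branch if c not in base),
        "removed": sorted(c for c in base if c not in branch),
        "type_changed": sorted(
            c for c in base if c in branch and base[c] != branch[c]
        ),
    }


# Hypothetical schemas for the same table on main and on a review branch.
main_schema = {"id": "bigint", "total": "decimal(10,2)", "legacy_flag": "boolean"}
branch_schema = {"id": "bigint", "total": "decimal(12,2)", "region": "string"}

diff = schema_diff(main_schema, branch_schema)
print(diff)
```

A removed column such as `legacy_flag` is exactly the kind of change a reviewer would want flagged against downstream dashboards before approving.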
# tag the last known-good state
$ dxdata tag prod-2026-04-19-090000 --branch main
tag created
# bad migration shipped, roll back
$ dxdata reset --to prod-2026-04-19-090000
rollback complete · 820ms · 0 bytes moved
- prod-2026-04-19-090000 · live · today · 09:00 · last known good
- prod-2026-04-18-090000 · yesterday · 09:00 · pre-migration
- prod-2026-04-17-090000 · 2 days ago · baseline
- prod-2026-04-16-090000 · 3 days ago · baseline
// rollback.tagged
Rollbacks are a first-class operation.
Tags pin a commit so you can always get back to a known-good state. Cut a tag like prod-2026-04-19-090000 every night or before every production deploy, and a bad migration becomes a one-command undo.
Because rollbacks only rewrite the catalog reference — not the underlying files — they are instant, atomic, and safe to run from the on-call laptop at 2am.
- Automated nightly tags or tag on every merge
- Rollbacks complete in under a second at any table size
- Historical tags remain queryable forever — great for audit
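Why a rollback moves no data is easier to see in a toy model. The sketch below assumes refs and tags are simple name-to-commit mappings; the `tag` and `reset` functions and the commit ids `c2` and `c3` are hypothetical and only mirror the shape of the commands shown earlier.

```python
from typing import Dict

refs: Dict[str, str] = {"main": "c2"}   # hypothetical commit ids
tags: Dict[str, str] = {}


def tag(name: str, branch: str) -> None:
    """Pin the branch's current commit under a name that never moves."""
    if name in tags:
        raise ValueError(f"tag {name} already exists")
    tags[name] = refs[branch]


def reset(branch: str, to_tag: str) -> str:
    """Roll back by repointing the branch; no data files are touched."""
    refs[branch] = tags[to_tag]
    return refs[branch]


tag("prod-2026-04-19-090000", "main")    # pin the known-good commit
refs["main"] = "c3"                       # a bad migration lands on main
restored = reset("main", "prod-2026-04-19-090000")
print(restored)
```

The whole rollback is a single dictionary assignment in this model, which is the intuition behind a reset that does not depend on table size.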
// time-travel
Query any point in history.
Every table in DXData is time-travelable out of the box. Query a past commit for debugging, reconcile yesterday's report against today, or reproduce a bug from last Tuesday without restoring a backup.
The syntax is standard SQL — the DXData query engine resolves the historical snapshot and plans the query against it, so your existing tooling just works.
- FOR VERSION AS OF <commit> and FOR TIMESTAMP AS OF <ts>
- Works across branches and tags — including historical rollbacks
- Zero performance penalty for recent snapshots
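One plausible way to think about resolving a timestamp to a commit is a binary search over the commit log. This Python sketch is an assumption about the general technique, not a description of the DXData query engine; the log entries and the `resolve_as_of` name are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical commit log for one branch, sorted by commit time.
log = [
    (datetime(2026, 4, 17, 9, 0), "c1"),
    (datetime(2026, 4, 18, 9, 0), "c2"),
    (datetime(2026, 4, 19, 9, 0), "c3"),
]


def resolve_as_of(ts: datetime) -> str:
    """Return the latest commit made at or before ts."""
    times = [t for t, _ in log]
    idx = bisect_right(times, ts)
    if idx == 0:
        raise LookupError("no commit exists at or before that timestamp")
    return log[idx - 1][1]


resolved = resolve_as_of(datetime(2026, 4, 18, 15, 30))
print(resolved)
```

A query for the afternoon of April 18 resolves to the commit made that morning, and a timestamp earlier than the first commit is rejected rather than silently mapped to something else.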
# .dxdata/ci.yaml: ephemeral branch per pipeline run
pipeline: orders_daily
on: schedule(hourly)
branch:
  strategy: ephemeral
  name: "ci/orders_daily-${RUN_ID}"
  from: main
audit:
  - run: "dbt test --select tag:critical"
  - run: "dxdata quality check orders"
publish:
  on_success: merge_to_main
  on_failure: discard_branch
// ci.ephemeral
Ephemeral branches for every pipeline run.
Point your CI at a branch, not at main. Each pipeline run forks a fresh branch, writes to it, and runs tests — all in full isolation from what your dashboards are reading.
If every check goes green, DXData merges the branch forward. If anything fails, the branch is discarded and your production catalog never sees the bad write. It is the same testing pattern every engineering team uses for code, applied to data.
- Per-run branches keep concurrent pipelines from stepping on each other
- Failures roll back for free — the branch just disappears
- Works with GitHub Actions, GitLab CI, or any shell-driven runner
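The merge-on-green, discard-on-red lifecycle maps naturally onto a context manager. The Python sketch below is illustrative only: `ephemeral_branch` and the in-memory `refs` store are hypothetical, and the merge is modeled as a plain fast-forward that assumes main has not moved during the run.

```python
from contextlib import contextmanager
from typing import Dict, Iterator

refs: Dict[str, str] = {"main": "c1"}  # hypothetical in-memory ref store


@contextmanager
def ephemeral_branch(pipeline: str, run_id: str) -> Iterator[str]:
    """Fork a per-run branch, merge it on success, discard it on failure."""
    name = f"ci/{pipeline}-{run_id}"
    refs[name] = refs["main"]
    try:
        yield name
    except Exception:
        del refs[name]                  # on failure: the branch just disappears
        raise
    else:
        refs["main"] = refs.pop(name)   # on success: fast-forward main


with ephemeral_branch("orders_daily", "101") as branch:
    refs[branch] = "c2"                 # the run commits new data on its branch

try:
    with ephemeral_branch("orders_daily", "102") as branch:
        refs[branch] = "c3"
        raise RuntimeError("a critical test failed")
except RuntimeError:
    pass

print(refs)
```

After both runs, only `main` remains and it points at the commit from the successful run; the failed run's write never reached it.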
// branching visualized
Your catalog, four weeks of history, one picture.
Four active branches. Fifteen commits. Two tagged releases. No storage bloat, no blocked merges, no sweat.
// in practice
Three shapes of problem, one pattern.
Safe schema migrations
Run a full backfill on a branch, prove the shape with production queries, then fast-forward main. No maintenance window. No frozen dashboards.
A/B test dataset variants
Spin up `variant/churn-model-v2` off main. Route half the inference traffic to it. Keep the winner, delete the loser — all branches, no copies.
Reproduce a customer report
A support ticket cites numbers from last Tuesday. Create a branch from that tag, run the exact report, diff against today. Case closed in five minutes.
// faq
Questions, answered.
Not until you modify data. A fresh branch is a 28-byte Nessie reference that points at the same Iceberg snapshot as main. Only the diff (new snapshot metadata and any rewritten Parquet files) lives uniquely on the branch.
Merges work at the snapshot level, so conflicts are rare and easy to reason about. When both branches have modified the same table since the fork point, the merge is blocked and you get a side-by-side diff. Resolve by picking one side, rebasing the branch, or running a custom merge query. It is the same mental model as a code conflict.
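The conflict rule described here, both sides changing the same table since the fork point, can be sketched as a set intersection. This is a simplified Python model with hypothetical table and snapshot names, not the actual merge implementation.

```python
from typing import Dict, Set


def changed_tables(fork: Dict[str, str], head: Dict[str, str]) -> Set[str]:
    """Tables whose snapshot differs between the fork point and a branch head."""
    return {t for t in fork.keys() | head.keys() if fork.get(t) != head.get(t)}


def conflicting_tables(fork: Dict[str, str], ours: Dict[str, str],
                       theirs: Dict[str, str]) -> Set[str]:
    """A table conflicts only if both sides changed it since the fork point."""
    return changed_tables(fork, ours) & changed_tables(fork, theirs)


fork_point = {"orders": "s1", "customers": "s1"}
main_head = {"orders": "s2", "customers": "s1"}       # main changed orders
branch_a = {"orders": "s3", "customers": "s4"}        # also changed orders
branch_b = {"orders": "s1", "customers": "s5"}        # changed customers only

blocked = conflicting_tables(fork_point, main_head, branch_a)
clean = conflicting_tables(fork_point, main_head, branch_b)
print(blocked, clean)
```

In this model `branch_a` is blocked on `orders`, while `branch_b` has no overlap with main and could merge without manual resolution.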
DXData exposes each branch as a distinct catalog, so dbt can target `catalog_feature_new_metric` the same way it targets `catalog_main`. Our dbt adapter does the wiring automatically from a branch name; your `dbt build` output lands on the branch, your tests run against it, and CI merges on green.
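The two catalog names above suggest a simple naming rule. The Python sketch below guesses at that rule (lowercase, with runs of other characters replaced by underscores); the rule and the `branch_to_catalog` helper are assumptions inferred from the examples, not the documented behavior of the dbt adapter.

```python
import re


def branch_to_catalog(branch: str) -> str:
    """Map a branch name to a catalog identifier (assumed rule, for illustration)."""
    slug = re.sub(r"[^0-9a-z]+", "_", branch.lower()).strip("_")
    return "catalog_" + slug


names = [branch_to_catalog(b) for b in ("main", "feature/new-metric")]
print(names)
```

Under this assumed rule, `feature/new-metric` maps to `catalog_feature_new_metric` and `main` maps to `catalog_main`, matching the names used in the answer above.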
Yes. Everything DXData does on top of Nessie is standard Iceberg REST and the open-source Nessie protocol. Point DXData at your self-hosted Nessie endpoint — branching, tagging, and time-travel keep working, and your data stays in the storage buckets you already manage.
// related capabilities
Branching gets better with the rest.
Treat your data like your code
Branches, reviews, rollbacks. On your data.
Get started in under five minutes. No credit card, no data migration — point DXData at your lakehouse and start branching.