Core concepts

Six ideas make everything else in DXData easier to reason about. Read them in order the first time; come back to any single one when you need a refresher.

Lakehouse architecture

A lakehouse combines the flexibility of a data lake (object storage, open formats, low cost per terabyte) with the semantics of a warehouse (ACID transactions, SQL, governance). DXData assembles established open-source components (Apache Iceberg for tables, Project Nessie for the catalog, and Trino for the query engine) and operates them so you don't have to.

The practical upshot: one query surface covers your warehouse, your data lake, and any operational databases you choose to federate. No more “where does this data live again?” — it's all addressable as catalog.schema.table.

Object storage under one query engine

S3 / GCS / Azure | Parquet + ORC

Iceberg metadata | schemas + snapshots

Trino engine | SQL + optimizer

Iceberg tables

Iceberg is an open table format built for large analytic datasets. A table is a set of immutable data files, typically in a columnar format such as Parquet or ORC, plus a small tree of metadata that describes the current schema, partitioning, and snapshot history. Writes produce new files and a new snapshot; they never overwrite past data.

That snapshot chain is what makes features like time travel, schema evolution, and branches possible. Query any historical state with AS OF, add a column without rewriting the table, or roll back a bad load by pointing HEAD at an earlier snapshot.
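The snapshot chain described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Iceberg metadata layout or a DXData API: the `Table` and `Snapshot` classes, their method names, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    committed_at: str          # ISO-8601 string; same-format strings sort by time
    schema: Tuple[str, ...]    # column names visible in this snapshot
    files: Tuple[str, ...]     # immutable data files that make up the table


class Table:
    """Hypothetical table: every change adds a snapshot, nothing is overwritten."""

    def __init__(self, columns, committed_at: str):
        self.snapshots: Dict[int, Snapshot] = {}
        self.head: Optional[int] = None
        self._commit(tuple(columns), (), committed_at)

    def _commit(self, schema, files, committed_at: str) -> int:
        snapshot_id = len(self.snapshots) + 1
        self.snapshots[snapshot_id] = Snapshot(
            snapshot_id, self.head, committed_at, schema, files
        )
        self.head = snapshot_id
        return snapshot_id

    def current(self) -> Snapshot:
        return self.snapshots[self.head]

    def append(self, new_file: str, committed_at: str) -> int:
        # A write adds a file and a snapshot; earlier snapshots are untouched.
        cur = self.current()
        return self._commit(cur.schema, cur.files + (new_file,), committed_at)

    def add_column(self, name: str, committed_at: str) -> int:
        # Schema evolution is a metadata-only change: the file list is reused.
        cur = self.current()
        return self._commit(cur.schema + (name,), cur.files, committed_at)

    def history(self) -> Iterator[Snapshot]:
        # Walk from HEAD back through parents, newest first.
        snapshot_id = self.head
        while snapshot_id is not None:
            snap = self.snapshots[snapshot_id]
            yield snap
            snapshot_id = snap.parent_id

    def as_of(self, timestamp: str) -> Snapshot:
        # Time travel: the newest ancestor committed at or before the timestamp.
        for snap in self.history():
            if snap.committed_at <= timestamp:
                return snap
        raise LookupError(f"no snapshot at or before {timestamp}")

    def rollback_to(self, snapshot_id: int) -> None:
        # Rollback only moves HEAD; the bad snapshot stays in the log.
        if snapshot_id not in self.snapshots:
            raise KeyError(snapshot_id)
        self.head = snapshot_id


orders = Table(["order_id", "amount"], "2026-04-19T08:00")
orders.append("orders-001.parquet", "2026-04-19T09:12")
good = orders.add_column("region", "2026-04-19T09:30")
bad = orders.append("orders-002-bad.parquet", "2026-04-19T10:04")

earlier = orders.as_of("2026-04-19T09:20")  # state before the column was added
orders.rollback_to(good)                     # undo the bad load
```

After the rollback, `orders.current()` no longer lists the bad file, yet the snapshot that introduced it remains available for inspection. That is the property that makes rollback cheap.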

Every write produces a new snapshot

snapshot-42 | 2026-04-19 09:12

snapshot-43 | 2026-04-19 10:04

snapshot-44 | main · HEAD

Nessie branches — git for data

Nessie brings Git semantics to your data catalog. Every table, schema, and configuration lives at a named reference — main by default — and you can create as many branches or tags as you need. Branches are zero-copy: they point at the same underlying files until you write to them.

This unlocks CI-style workflows (open a “data PR” on a branch, review the diff, merge), safe experimentation on production data, and instant rollback when a pipeline misbehaves.
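To make "zero-copy" concrete, here is a minimal sketch of branches as named pointers to commits. It is an illustration only: the `Catalog` class and its methods are hypothetical and are not the Nessie API, and the merge shown is a plain fast-forward, which is simpler than what a real catalog does.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    commit_id: int
    parent_id: Optional[int]
    tables: Dict[str, Tuple[str, ...]]  # table name -> data files it points at


class Catalog:
    """Hypothetical catalog: a branch is just a name pointing at a commit."""

    def __init__(self):
        self.commits: Dict[int, Commit] = {}
        self.refs: Dict[str, int] = {"main": self._new_commit(None, {})}

    def _new_commit(self, parent_id, tables) -> int:
        commit_id = len(self.commits)
        self.commits[commit_id] = Commit(commit_id, parent_id, tables)
        return commit_id

    def create_branch(self, name: str, from_ref: str = "main") -> None:
        if name in self.refs:
            raise ValueError(f"reference {name!r} already exists")
        # Zero-copy: the new branch shares the source branch's commit.
        self.refs[name] = self.refs[from_ref]

    def write(self, ref: str, table: str, new_file: str) -> None:
        parent = self.commits[self.refs[ref]]
        tables = dict(parent.tables)  # copies pointers only, never data files
        tables[table] = tables.get(table, ()) + (new_file,)
        self.refs[ref] = self._new_commit(parent.commit_id, tables)

    def read(self, ref: str, table: str) -> Tuple[str, ...]:
        return self.commits[self.refs[ref]].tables.get(table, ())

    def _is_ancestor(self, ancestor: int, descendant: int) -> bool:
        current: Optional[int] = descendant
        while current is not None:
            if current == ancestor:
                return True
            current = self.commits[current].parent_id
        return False

    def merge_fast_forward(self, source: str, target: str = "main") -> None:
        # Simplification: only merge when target has not moved since branching.
        if not self._is_ancestor(self.refs[target], self.refs[source]):
            raise RuntimeError("target has diverged; fast-forward not possible")
        self.refs[target] = self.refs[source]


catalog = Catalog()
catalog.write("main", "lake.orders", "orders-001.parquet")
catalog.create_branch("exp/region-cohort")
shared_at_creation = catalog.refs["exp/region-cohort"] == catalog.refs["main"]

catalog.write("exp/region-cohort", "lake.orders", "orders-002.parquet")
main_before_merge = catalog.read("main", "lake.orders")   # unaffected by the branch write
catalog.merge_fast_forward("exp/region-cohort")            # the "data PR" is merged
main_after_merge = catalog.read("main", "lake.orders")
```

In this model, rolling back is the same operation as merging: repoint a reference at a different commit.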

Branches isolate writes without copies

main | production

exp/region-cohort | branched off main · 3 commits

incident/rollback-2026-04-18 | branched off snapshot-41

Federated queries

Federation means the engine queries external systems in place instead of copying data into the lakehouse. DXData ships connectors for Postgres, MySQL, Snowflake, BigQuery, Redshift, Kafka, MongoDB, Elastic, and more. Register a source once and every user in your workspace can query it as a first-class catalog.

The planner pushes projections, predicates, and sometimes partial aggregates down to the source so you pay for the cheapest execution that's still correct. What comes back over the wire is the filtered, narrowed result.
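The effect of pushdown can be illustrated with Python's built-in `sqlite3` module standing in for a remote source. This is a sketch under assumptions, not how the DXData planner is implemented: the `customers` table, its columns, and both scan functions are made up for the example, and "cells transferred" is only a rough proxy for network cost.

```python
import sqlite3

# An in-memory SQLite database plays the role of the external source.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE customers (id INTEGER, name TEXT, region TEXT, notes TEXT)"
)
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Ada", "EU", "long free-text field"),
        (2, "Bob", "US", "long free-text field"),
        (3, "Cy", "EU", "long free-text field"),
        (4, "Dee", "APAC", "long free-text field"),
    ],
)

ALL_COLUMNS = ("id", "name", "region", "notes")


def scan_without_pushdown(conn, region):
    # Pull every row and column, then filter and project in the engine.
    rows = conn.execute(
        "SELECT id, name, region, notes FROM customers ORDER BY id"
    ).fetchall()
    cells_transferred = len(rows) * len(ALL_COLUMNS)
    result = [(r[0], r[1]) for r in rows if r[2] == region]
    return result, cells_transferred


def scan_with_pushdown(conn, columns, region):
    # Projection and predicate are sent to the source; only matches come back.
    unknown = set(columns) - set(ALL_COLUMNS)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    sql = f"SELECT {', '.join(columns)} FROM customers WHERE region = ? ORDER BY id"
    rows = conn.execute(sql, (region,)).fetchall()
    return rows, len(rows) * len(columns)


naive_rows, naive_cells = scan_without_pushdown(source, "EU")
pushed_rows, pushed_cells = scan_with_pushdown(source, ("id", "name"), "EU")
```

Both scans return the same two rows, but the pushed-down version moves 4 cells instead of 16. On a real source the gap grows with table width and selectivity.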

Push work to the source that holds the data

SELECT ... JOIN | DXData planner

Postgres (CRM) | predicate pushdown

Iceberg (events) | parallel scan

Governance & lineage

Every table, column, and branch has an owner, a set of policies, and a full lineage graph built automatically from query history. Fine-grained access control is expressed as policy rules (row filters, column masks, tag-based grants) that are applied when the query is compiled, before it runs, rather than to the results afterward.
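One way to picture policies that take effect before a query runs is a planner step that rewrites the query itself. The sketch below assumes a hypothetical `POLICIES` mapping and `plan_query` function and uses `sqlite3` as the execution engine; it is not DXData's policy language, and the table is named `orders` rather than `lake.orders` only because SQLite has no catalog prefix.

```python
import sqlite3

# Hypothetical policy store: one mask expression per column, one row filter per role.
POLICIES = {
    "analyst": {
        "column_masks": {"customer_email": "'***'"},
        "row_filter": "region = 'EU'",
    },
    "admin": {"column_masks": {}, "row_filter": None},
}

TABLE_COLUMNS = ("order_id", "customer_email", "region")


def plan_query(role, columns):
    """Build the SQL a role is allowed to run; masks and filters are baked in."""
    policy = POLICIES[role]
    select_list = []
    for column in columns:
        if column not in TABLE_COLUMNS:
            raise ValueError(f"unknown column: {column}")
        mask = policy["column_masks"].get(column)
        # A masked column is replaced by its mask expression in the plan,
        # so the raw value is never selected for this role.
        select_list.append(f"{mask} AS {column}" if mask else column)
    sql = f"SELECT {', '.join(select_list)} FROM orders"
    if policy["row_filter"]:
        sql += f" WHERE {policy['row_filter']}"
    return sql + " ORDER BY order_id"


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, customer_email TEXT, region TEXT)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "ada@example.com", "EU"),
        (2, "bob@example.com", "US"),
        (3, "cy@example.com", "EU"),
    ],
)

requested = ("order_id", "customer_email")
analyst_rows = db.execute(plan_query("analyst", requested)).fetchall()
admin_rows = db.execute(plan_query("admin", requested)).fetchall()
```

The analyst and the admin submit the same request, but the analyst's plan never selects the raw email column and never scans non-EU rows.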

Lineage lets you answer “what downstream dashboards break if I drop this column?” or “where did this number come from?” without manual effort. Audit logs are append-only, immutable, and exportable to your SIEM.

Policies and lineage attach to every object

RBAC policy | role: analyst

lake.orders | column: customer_email → masked

Audit log | who · what · when

Observability

Every query emits structured telemetry: plan, per-stage stats, per-split wall time, and memory. That data backs the query profile, slow-query log, and the alerts you'll actually read on-call. We don't bury operators in raw metrics — we surface the bottleneck.
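As a rough illustration of turning per-stage stats into a single answer, the sketch below picks the stage that dominates wall time. The `StageStats` fields and the `find_bottleneck` helper are hypothetical and do not reflect DXData's telemetry schema; summing stage wall times is a simplification, since real stages can overlap.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StageStats:
    stage_id: int
    operator: str
    wall_ms: int
    peak_memory_mb: int


def find_bottleneck(stages: List[StageStats]) -> Tuple[StageStats, float]:
    """Return the slowest stage and its share of the summed stage wall time."""
    if not stages:
        raise ValueError("query profile has no stages")
    total_ms = sum(stage.wall_ms for stage in stages)
    slowest = max(stages, key=lambda stage: stage.wall_ms)
    share = slowest.wall_ms / total_ms if total_ms else 0.0
    return slowest, share


profile = [
    StageStats(stage_id=1, operator="TableScan", wall_ms=100, peak_memory_mb=64),
    StageStats(stage_id=2, operator="HashJoin", wall_ms=600, peak_memory_mb=2048),
    StageStats(stage_id=3, operator="Aggregate", wall_ms=100, peak_memory_mb=128),
]

slowest, share = find_bottleneck(profile)
summary = f"stage {slowest.stage_id} ({slowest.operator}) took {share:.0%} of stage time"
```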

Data quality is a first-class citizen too. Define expectations (NOT NULL, uniqueness, freshness, custom SQL) and DXData runs them on every load, blocks promotion on failure, and publishes results to lineage.
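The load-time gate described above can be sketched as a list of named checks that must all pass before a batch is promoted. The helper names echo the `expect_not_null` and `unique` labels used on this page, but the function signatures and the `load_and_promote` gate are assumptions for illustration, not DXData's expectation syntax.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

Row = Dict[str, object]


class Expectation(NamedTuple):
    name: str
    check: Callable[[List[Row]], bool]


def expect_not_null(column: str) -> Expectation:
    def check(rows: List[Row]) -> bool:
        return all(row.get(column) is not None for row in rows)

    return Expectation(f"expect_not_null({column})", check)


def expect_unique(column: str) -> Expectation:
    def check(rows: List[Row]) -> bool:
        values = [row.get(column) for row in rows]
        return len(values) == len(set(values))

    return Expectation(f"expect_unique({column})", check)


def load_and_promote(
    rows: List[Row], expectations: List[Expectation]
) -> Tuple[bool, Dict[str, bool]]:
    """Run every expectation on the batch; promote only if all of them pass."""
    results = {exp.name: exp.check(rows) for exp in expectations}
    return all(results.values()), results


checks = [expect_not_null("order_id"), expect_unique("order_id")]

good_batch = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 12.5}]
bad_batch = [{"order_id": 1, "amount": 10.0}, {"order_id": 1, "amount": 99.0}]

good_promoted, good_results = load_and_promote(good_batch, checks)
bad_promoted, bad_results = load_and_promote(bad_batch, checks)
```

The per-check results are returned alongside the verdict so that a failed load can report which expectation blocked it.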

Metrics, traces, and quality checks on one pane

Query trace | planner · stages · splits

Resource metrics | cpu · memory · io

Quality checks | expect_not_null · unique
