Core concepts
Six ideas make everything else in DXData easier to reason about. Read them in order the first time; come back to any single one when you need a refresher.
Lakehouse architecture
A lakehouse is exactly what it sounds like: the flexibility of a data lake (object storage, open formats, cheap per-terabyte) with the semantics of a warehouse (ACID transactions, SQL, governance). DXData assembles best-in-class open-source pieces (Apache Iceberg for tables, Project Nessie for the catalog, and Trino for the engine) and operates them so you don't have to.
The practical upshot: one query surface covers your warehouse, your data lake, and any operational databases you choose to federate. No more “where does this data live again?” — it's all addressable as catalog.schema.table.
Iceberg tables
Iceberg is an open table format built for large analytic datasets. A table is a set of immutable data files (typically in a columnar format such as Parquet) plus a small tree of metadata that describes the current schema, partitions, and snapshot history. Writes produce new files and a new snapshot; they never overwrite past data.
That snapshot chain is what makes features like time travel, schema evolution, and branches possible. Query any retained historical state with an AS OF clause, add a column without rewriting the table, or roll back a bad load by pointing HEAD at an earlier snapshot.
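To make the snapshot model concrete, here is a minimal Python sketch. It is a toy, not Iceberg's real metadata layout: the names ToyTable, Snapshot, files_as_of, and rollback are hypothetical and exist only for illustration. It shows that a write adds a snapshot, that time travel reads an older one, and that rollback only moves a pointer.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # the immutable list of data files visible in this snapshot


@dataclass
class ToyTable:
    """Illustrative stand-in for table metadata; not Iceberg's actual format."""
    snapshots: list = field(default_factory=list)
    head: int = -1  # id of the current snapshot; -1 means the table is empty

    def append(self, new_files):
        # A write adds files and records a new snapshot; older snapshots are untouched.
        current = self.snapshots[self.head].files if self.head >= 0 else ()
        snap = Snapshot(len(self.snapshots), current + tuple(new_files))
        self.snapshots.append(snap)
        self.head = snap.snapshot_id
        return snap.snapshot_id

    def files_as_of(self, snapshot_id=None):
        # Time travel: read the file list of the current or any earlier snapshot.
        idx = self.head if snapshot_id is None else snapshot_id
        return self.snapshots[idx].files if idx >= 0 else ()

    def rollback(self, snapshot_id):
        # Rollback only moves the pointer; no data files are rewritten or deleted.
        self.head = snapshot_id


table = ToyTable()
good = table.append(["orders-0001.parquet"])     # snapshot 0
bad = table.append(["orders-0002-bad.parquet"])  # snapshot 1, the "bad load"
table.rollback(good)

current_files = table.files_as_of()      # what readers see after the rollback
history_files = table.files_as_of(bad)   # the bad snapshot is still readable
```

After the rollback, readers see only the first file, while the second snapshot remains available for inspection until it is expired.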
Nessie branches — git for data
Nessie brings Git semantics to your data catalog. Every table, schema, and configuration lives at a named reference — main by default — and you can create as many branches or tags as you need. Branches are zero-copy: they point at the same underlying files until you write to them.
This unlocks CI-style workflows (open a “data PR” on a branch, review the diff, merge), safe experimentation on production data, and instant rollback when a pipeline misbehaves.
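The branch workflow can be sketched with a toy catalog in Python. This is not the Nessie API: ToyCatalog and its methods are hypothetical, and the merge is deliberately simplified (a real merge also detects conflicts). The point is that a branch copies only pointers, so creating one is cheap and main stays untouched until you merge.

```python
class ToyCatalog:
    """Illustrative model of named references; not the Nessie API."""

    def __init__(self):
        # Each reference maps a table name to a snapshot id (a pointer, not a data copy).
        self.refs = {"main": {}}

    def create_branch(self, name, source="main"):
        # Zero-copy: only the small pointer map is duplicated, never the data files.
        self.refs[name] = dict(self.refs[source])

    def commit(self, branch, table, snapshot_id):
        self.refs[branch][table] = snapshot_id

    def diff(self, source, target="main"):
        # Tables whose pointer on source differs from target: {table: (old, new)}.
        src, dst = self.refs[source], self.refs[target]
        return {t: (dst.get(t), src[t]) for t in src if dst.get(t) != src[t]}

    def merge(self, source, target="main"):
        # Simplified merge: adopt every changed pointer from source.
        for table_name, (_, new_id) in self.diff(source, target).items():
            self.refs[target][table_name] = new_id


catalog = ToyCatalog()
catalog.commit("main", "sales.orders", 1)

catalog.create_branch("etl_fix")
catalog.commit("etl_fix", "sales.orders", 2)    # write happens on the branch only

changes = catalog.diff("etl_fix")               # the "data PR" to review
main_before_merge = catalog.refs["main"]["sales.orders"]

catalog.merge("etl_fix")
main_after_merge = catalog.refs["main"]["sales.orders"]
```

Rolling back in this model is the same pointer move in reverse: commit the previous snapshot id to main.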
Federated queries
Federation means the engine queries external systems in place instead of copying data into the lakehouse. DXData ships connectors for Postgres, MySQL, Snowflake, BigQuery, Redshift, Kafka, MongoDB, Elastic, and more. Register a source once and every user in your workspace can query it as a first-class catalog.
The planner pushes projections, predicates, and sometimes partial aggregates down to the source so you pay for the cheapest execution that's still correct. What comes back over the wire is the filtered, narrowed result.
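As a rough illustration of projection and predicate pushdown, the Python sketch below builds the statement a toy planner might send to a remote source. The function build_pushdown_sql and the orders table are hypothetical, and the real planner works on a query plan rather than on strings.

```python
def build_pushdown_sql(table, columns, predicates):
    """Return the SQL a toy planner would send to the remote source.

    columns: only the columns the query needs (projection pushdown).
    predicates: (column, operator, literal) filters (predicate pushdown).
    """
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if predicates:
        # repr() is used only to render literals for this demo;
        # it is not a safe way to quote values in production SQL.
        clauses = [f"{col} {op} {value!r}" for col, op, value in predicates]
        sql += " WHERE " + " AND ".join(clauses)
    return sql


remote_sql = build_pushdown_sql(
    "orders",
    ["id", "total"],
    [("region", "=", "EU"), ("total", ">", 100)],
)
print(remote_sql)
# SELECT id, total FROM orders WHERE region = 'EU' AND total > 100
```

Because the source evaluates the WHERE clause and returns only two columns, the rows that cross the network are already filtered and narrowed.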
Governance & lineage
Every table, column, and branch has an owner, a set of policies, and a full lineage graph that's built automatically from query history. Fine-grained access control is expressed as policy rules (row filters, column masks, tag-based grants) that are enforced when the query is compiled, so restricted rows and columns never enter the result, rather than being filtered out afterward.
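One way to picture compile-time enforcement is as a query rewrite: the policy becomes part of the statement before it runs. The Python sketch below assumes a hypothetical apply_policies helper and a made-up customers table; it is not DXData's policy syntax.

```python
def apply_policies(table, columns, row_filter=None, masks=None):
    """Rewrite a simple SELECT so that policies are part of the query itself."""
    masks = masks or {}
    # A masked column is replaced by its mask expression, keeping the column name.
    select_items = [f"{masks[c]} AS {c}" if c in masks else c for c in columns]
    sql = f"SELECT {', '.join(select_items)} FROM {table}"
    if row_filter:
        # The row filter is added before execution, so hidden rows are never read back.
        sql += f" WHERE {row_filter}"
    return sql


governed_sql = apply_policies(
    "customers",
    ["id", "email", "region"],
    row_filter="region = 'EU'",
    masks={"email": "'***'"},
)
print(governed_sql)
# SELECT id, '***' AS email, region FROM customers WHERE region = 'EU'
```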
Lineage lets you answer “what downstream dashboards break if I drop this column?” or “where did this number come from?” without manual effort. Audit logs are append-only, immutable, and exportable to your SIEM.
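The "what breaks if I drop this column?" question is a graph traversal. Here is a minimal Python sketch, assuming lineage is available as a mapping from each node to its direct dependents; the edge data and node names are invented for the example.

```python
from collections import deque


def downstream(edges, start):
    """Breadth-first walk over lineage edges; returns everything that depends on start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage: column -> derived column -> dashboards.
edges = {
    "orders.total": ["daily_revenue.sum_total"],
    "daily_revenue.sum_total": ["dash:revenue", "dash:exec_summary"],
    "orders.id": ["dash:order_lookup"],
}

impacted = downstream(edges, "orders.total")
```

Walking the same edges in the opposite direction answers the "where did this number come from?" question.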
Observability
Every query emits structured telemetry: plan, per-stage stats, per-split wall time, and memory. That data backs the query profile, slow-query log, and the alerts you'll actually read on-call. We don't bury operators in raw metrics — we surface the bottleneck.
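Surfacing the bottleneck from per-stage stats can be as simple as ranking stages by their share of wall time. The sketch below is illustrative only: the field names name and wall_ms are assumptions, not the actual telemetry schema.

```python
def find_bottleneck(stages):
    """Return (stage name, share of total wall time) for the slowest stage."""
    total = sum(s["wall_ms"] for s in stages)
    slowest = max(stages, key=lambda s: s["wall_ms"])
    share = slowest["wall_ms"] / total if total else 0.0
    return slowest["name"], share


# Hypothetical per-stage stats for one query.
stages = [
    {"name": "scan", "wall_ms": 1200},
    {"name": "join", "wall_ms": 6000},
    {"name": "aggregate", "wall_ms": 800},
]

stage_name, share = find_bottleneck(stages)
print(f"{stage_name} accounts for {share:.0%} of wall time")
# join accounts for 75% of wall time
```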
Data quality is a first-class citizen too. Define expectations (NOT NULL, uniqueness, freshness, custom SQL) and DXData runs them on every load, blocks promotion on failure, and publishes results to lineage.
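The gate described above can be sketched as a set of named checks that must all pass before a load is promoted. This Python example is a toy under stated assumptions: check_not_null, check_unique, and run_expectations are hypothetical helpers, not DXData's expectation syntax, and the rows are sample data.

```python
def check_not_null(rows, column):
    # Passes only if every row has a non-null value in the column.
    return all(row.get(column) is not None for row in rows)


def check_unique(rows, column):
    # Passes only if no value in the column repeats.
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def run_expectations(rows, expectations):
    """Run every named check; promotion is allowed only if all of them pass."""
    results = {name: check(rows) for name, check in expectations.items()}
    return results, all(results.values())


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "c@example.com"},
]

expectations = {
    "id_not_null": lambda r: check_not_null(r, "id"),
    "id_unique": lambda r: check_unique(r, "id"),
    "email_not_null": lambda r: check_not_null(r, "email"),
}

results, can_promote = run_expectations(rows, expectations)
```

Here the duplicate id and the missing email both fail, so can_promote is False and the load would stay off main.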