Auto-discovery
Every table, column, and constraint catalogued the moment it lands.
- Iceberg, Postgres, S3, Kafka — all indexed
- Schema inference with type hints
- No manual YAML, ever
Data catalog
Auto-inferred schemas, column-level lineage, searchable tags, and git-style change history across every table in your stack.
iceberg · sales
// the problem
Every data team has the same ghost story: the dashboard that quietly uses a deprecated column, the pipeline built on a table nobody owns, the compliance review that turns into a month of Slack archaeology. The cost is not just wasted time; it is lost confidence in the numbers.
DXData's catalog closes the loop. Every table is documented automatically, every change is auditable, and every column can be traced end-to-end. Search replaces tribal knowledge.
// three promises
Every table, column, and constraint catalogued the moment it lands.
Trace any column back to the source and forward to every consumer.
Every schema change is a commit, every rollback a revert.
// catalog.auto
Point DXData at Iceberg, Postgres, S3, Kafka, or any of our 100+ connectors and the catalog indexes every table, view, and topic — including partitions, constraints, and inferred types.
Metadata refreshes on a schedule you control. New tables appear within minutes of creation, and dropped tables are preserved in history so nothing gets lost.
// lineage.column-level
DXData parses your SQL, dbt models, and pipeline YAML to build a true column-level lineage graph — no manual annotation, no tag-before-you-ship gate.
Click any column to see every upstream source that feeds it and every downstream dashboard, model, or pipeline that depends on it. Impact analysis becomes a first-class operation.
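As a rough illustration of the impact analysis described above, column-level lineage can be modelled as a directed graph whose nodes are columns. The sketch below is not DXData's API: the `LineageGraph` class and every column name in it are hypothetical, and the edges are added by hand rather than parsed from SQL or dbt models.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy column-level lineage graph (hypothetical, for illustration only)."""

    def __init__(self):
        self._down = defaultdict(set)  # source column -> columns derived from it
        self._up = defaultdict(set)    # derived column -> columns it reads from

    def add_edge(self, source, target):
        """Record that `target` is computed from `source`."""
        self._down[source].add(target)
        self._up[target].add(source)

    @staticmethod
    def _walk(start, edges):
        # Breadth-first traversal collecting every reachable column.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def upstream(self, column):
        """Every column that feeds `column`, directly or transitively."""
        return self._walk(column, self._up)

    def downstream(self, column):
        """Every column or asset that depends on `column`."""
        return self._walk(column, self._down)


# Hypothetical column names in "source.table.column" form.
graph = LineageGraph()
graph.add_edge("cdc.orders_raw.total", "sales.orders.total_cents")
graph.add_edge("sales.orders.total_cents", "finance.orders_daily.revenue_cents")
graph.add_edge("finance.orders_daily.revenue_cents", "dashboard.finance_close.revenue")

impact = graph.downstream("sales.orders.total_cents")
sources = graph.upstream("sales.orders.total_cents")
```

Keeping both edge directions in memory makes "what feeds this?" and "what breaks if I change this?" the same traversal over different maps.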
Canonical orders table — one row per order, refreshed every 15 minutes.
Daily-grain summary of orders used in the finance close report.
Feature-engineered view of orders with 42 derived columns.
// catalog.search
Filter by owner, tag, freshness, tier, or any combination. Chips compose as a typed query — the same query you can save, share, or bookmark as a view.
Trending tables and starred views help new hires find the canonical answer instead of guessing. Search results come back in under 100 ms.
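One way to picture composable filter chips is as fields on an immutable query object, where adding a chip produces a new query that can be saved or shared. This is a minimal sketch under that assumption; `TableEntry`, `Query`, `search`, and the sample catalog entries are all hypothetical and do not reflect DXData's actual query format.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TableEntry:
    name: str
    owner: str
    tags: frozenset
    tier: int
    minutes_since_refresh: int


@dataclass(frozen=True)
class Query:
    """Each field is one optional filter chip; None means 'not applied'."""
    owner: Optional[str] = None
    tag: Optional[str] = None
    max_tier: Optional[int] = None
    max_staleness_minutes: Optional[int] = None

    def matches(self, t: TableEntry) -> bool:
        return (
            (self.owner is None or t.owner == self.owner)
            and (self.tag is None or self.tag in t.tags)
            and (self.max_tier is None or t.tier <= self.max_tier)
            and (self.max_staleness_minutes is None
                 or t.minutes_since_refresh <= self.max_staleness_minutes)
        )


def search(catalog, query):
    """Return the names of all tables that satisfy every applied chip."""
    return [t.name for t in catalog if query.matches(t)]


# Hypothetical catalog entries.
catalog = [
    TableEntry("sales.orders", "data-platform", frozenset({"pii", "canonical"}), 1, 12),
    TableEntry("finance.orders_daily", "finance", frozenset({"canonical"}), 1, 600),
    TableEntry("scratch.orders_tmp", "data-platform", frozenset(), 3, 5),
]

by_owner = Query(owner="data-platform")
tier_one = replace(by_owner, max_tier=1)  # adding a chip yields a new query

owner_hits = search(catalog, by_owner)
tier_one_hits = search(catalog, tier_one)
```

Because the query is a frozen dataclass, it is hashable and comparable, which is what makes it easy to store as a saved view.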
README.md
One row per confirmed order. Refreshed every 15 minutes from the upstream CDC stream. Use finance.orders_daily for daily-grain reporting.
Conventions
// docs.as-code
Every table has an owner badge, tag set, and Markdown README rendered inline. Docs live in your git repo alongside the transforms they describe — so review flows you already use cover catalog changes too.
Tag taxonomies are optional but supported: bring your own ontology, enforce allowed values, or leave it free-form.
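To make the "enforce allowed values" option concrete, here is a small sketch of tag validation against a taxonomy. The `TAXONOMY` structure, its keys and values, and the `validate_tags` helper are assumptions made for illustration; the real taxonomy format is not specified here.

```python
# Hypothetical taxonomy: tag key -> set of allowed values.
TAXONOMY = {
    "sensitivity": {"pii", "internal", "public"},
    "domain": {"sales", "finance"},
}


def validate_tags(tags, taxonomy=None):
    """Return a list of problems; an empty list means the tags are acceptable.

    With no taxonomy supplied, tags are treated as free-form and always pass.
    """
    if taxonomy is None:
        return []
    problems = []
    for key, value in tags.items():
        if key not in taxonomy:
            problems.append(f"unknown tag key: {key}")
        elif value not in taxonomy[key]:
            problems.append(f"{key}={value} is not an allowed value")
    return problems


enforced = validate_tags({"sensitivity": "pii", "domain": "marketing"}, TAXONOMY)
free_form = validate_tags({"anything": "goes"})
```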
Freshness
Completeness
Uniqueness
Schema stability
// quality.scored
Freshness, completeness, uniqueness, and schema stability are scored continuously for every table. Thresholds are configurable per tier — a tier-1 dashboard gets paged, a scratch table does not.
Scores come with trend sparklines so you can see a regression the moment it starts, not the Monday morning it ships.
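The sketch below shows one plausible way to compute three of these scores and apply per-tier thresholds. The formulas, the linear freshness decay, and the threshold values are assumptions chosen for illustration, not DXData's actual scoring; schema stability is left out for brevity.

```python
from datetime import datetime, timedelta, timezone


def completeness(values):
    """Fraction of values that are not null."""
    return sum(v is not None for v in values) / len(values) if values else 1.0


def uniqueness(values):
    """Fraction of distinct values among the non-null ones."""
    present = [v for v in values if v is not None]
    return len(set(present)) / len(present) if present else 1.0


def freshness(last_refresh, now, expected_interval):
    """1.0 if refreshed within the expected interval, falling linearly
    to 0.0 once the lag reaches twice that interval."""
    lag = (now - last_refresh) / expected_interval  # timedelta / timedelta -> float
    return max(0.0, min(1.0, 2.0 - lag))


# Hypothetical thresholds per tier; None means the tier never pages.
THRESHOLDS = {1: 0.99, 2: 0.95, 3: None}


def should_page(tier, score):
    threshold = THRESHOLDS.get(tier)
    return threshold is not None and score < threshold


emails = ["a@x.io", None, "b@x.io", "a@x.io"]
email_completeness = completeness(emails)  # 3 of 4 values present
email_uniqueness = uniqueness(emails)      # 2 distinct of 3 present

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
score = freshness(now - timedelta(minutes=22, seconds=30), now, timedelta(minutes=15))
```

With these thresholds, the same completeness score pages a tier-1 table but is ignored for a tier-3 scratch table.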
  order_id uuid PK
  customer_email varchar(320) PII
+ region varchar(24)
~ total_cents bigint (was: order_total)
- promo_code varchar(40)
  shipped_at timestamptz
// history.commits
When a column is added, renamed, or dropped, the catalog records it as a commit — with author, message, and diff. Rollbacks are a revert, not an incident post-mortem.
Pair this with branches and you get a full write-audit-publish loop for your data shape, not just your data values.
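A schema commit of this kind boils down to a diff between two column maps. The following sketch produces `+`, `~`, and `-` lines in that spirit. The `diff_schema` function is hypothetical, renames are passed in explicitly because they cannot be inferred from names alone, and the prior type of `order_total` is assumed to be `bigint`.

```python
def diff_schema(old, new, renames=None):
    """Diff two schemas given as {column_name: type} dicts.

    `renames` maps new column name -> old column name.
    Returns lines prefixed with '+', '~' or '-'.
    """
    renames = renames or {}
    renamed_from = set(renames.values())
    lines = []
    for name, col_type in new.items():
        if name in renames:
            lines.append(f"~ {name} {col_type} (was: {renames[name]})")
        elif name not in old:
            lines.append(f"+ {name} {col_type}")
        elif old[name] != col_type:
            lines.append(f"~ {name} {col_type} (type was: {old[name]})")
    for name, col_type in old.items():
        # A column that was renamed is not reported as dropped.
        if name not in new and name not in renamed_from:
            lines.append(f"- {name} {col_type}")
    return lines


old = {
    "order_id": "uuid",
    "customer_email": "varchar(320)",
    "order_total": "bigint",  # assumed type; only the old name is known
    "promo_code": "varchar(40)",
    "shipped_at": "timestamptz",
}
new = {
    "order_id": "uuid",
    "customer_email": "varchar(320)",
    "region": "varchar(24)",
    "total_cents": "bigint",
    "shipped_at": "timestamptz",
}

changes = diff_schema(old, new, renames={"total_cents": "order_total"})
```

Swapping the arguments (and inverting the rename map) yields the diff for the corresponding revert.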
// catalog.branches
Branching is powered by Nessie: a table's state on feature/new-metric can differ from its state on main (new columns, new tags, new docs), and every difference is previewable before anything lands in production.
production
+ margin_cents bigint (preview)
// ecosystem
DXData exports and imports OpenMetadata, DataHub, and Amundsen payloads natively. Keep your existing index or adopt DXData as the source of truth — your call.
// use cases
New hires can find the canonical table, its owner, and its README without pinging anyone on Slack.
Before you drop a column, see every downstream dashboard, model, and pipeline that depends on it.
Filter to every column tagged PII across every source — and prove who can see what.
// faq
Most stacks finish a full index pass in under an hour. The catalog streams entries as it goes, so your team can start searching within minutes of connecting the first source.
Yes. Tags are free-form by default, or you can upload a YAML taxonomy that enforces allowed values and ownership. Existing taxonomies from OpenMetadata, Amundsen, or Atlan import cleanly.
Mark columns, schemas, or whole sources as private and they are excluded from indexing entirely: not hidden behind a policy, but never read in the first place. Row-level filters and masked previews are also available when you want data indexed but gated.
Yes. Kafka, Kinesis, and Pulsar topics appear alongside batch tables with their own freshness and throughput signals, and lineage carries through streaming SQL the same way it does for batch.
Ship faster, own your data
Everything in DXData runs on open standards you can walk away with — Iceberg tables, Nessie history, standard SQL.