Data Catalog
The Data Catalog is a metadata layer (Phase 6 of the Lakehouse roadmap) that documents every table, column, relationship, and freshness timestamp in the lakehouse. It answers: what does this column mean, who owns this table, how fresh is the data, and how do tables relate.
Architecture
The catalog lives in a _metadata DuckDB schema, created as part of the DuckDB initialization SQL. It is populated from YAML definitions in shared-core/catalog/entities/ via the seed script.
    _metadata schema
    |
    +-- catalog_tables         # Entity names, descriptions, owners
    +-- catalog_columns        # Field names, types, PII flags
    +-- catalog_relationships  # Foreign key joins between entities
Catalog Tables
_metadata.catalog_tables
One row per lakehouse table:
| Column | Description |
|---|---|
| `table_name` | Table identifier (e.g., orders, users) |
| `description` | Human-readable description |
| `owner` | Team or person responsible |
| `classification` | Data classification level |
| `freshness_sla_minutes` | Expected maximum data staleness, in minutes |
_metadata.catalog_columns
One row per column per table:
| Column | Description |
|---|---|
| `table_name` | Parent table |
| `column_name` | Field name |
| `data_type` | DuckDB data type |
| `semantic_type` | Business meaning (e.g., "currency", "identifier") |
| `is_pii` | Whether the column contains personally identifiable information |
| `is_nullable` | Whether null values are expected |
| `description` | Human-readable field description |
_metadata.catalog_relationships
Foreign key relationships:
| Column | Description |
|---|---|
| `source_table` | Table with the foreign key |
| `source_column` | FK column name |
| `target_table` | Referenced table |
| `target_column` | Referenced column |
| `relationship_type` | e.g., "many-to-one" |
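
As an illustration of how a relationship row can be consumed, the sketch below renders the join that a row implies. The `Relationship` class, the `join_clause` helper, and the `orders.user_id` to `users.id` example row are hypothetical and not part of the catalog code; only the five column names come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One row of _metadata.catalog_relationships (hypothetical wrapper)."""
    source_table: str
    source_column: str
    target_table: str
    target_column: str
    relationship_type: str


def join_clause(rel: Relationship) -> str:
    """Render the JOIN implied by a relationship row."""
    return (
        f"JOIN {rel.target_table} "
        f"ON {rel.source_table}.{rel.source_column} = "
        f"{rel.target_table}.{rel.target_column}"
    )


# Hypothetical example row: orders.user_id references users.id.
rel = Relationship("orders", "user_id", "users", "id", "many-to-one")
print(f"SELECT * FROM {rel.source_table} {join_clause(rel)}")
```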
Quality Fields
The catalog defines data quality expectations declaratively. The lakehouse pipeline auto-generates quality checks from these fields -- no separate configuration is needed.
null_expectation (field-level)
Controls null-rate quality checks for nullable fields:
| Value | Meaning | Threshold |
|---|---|---|
| `never` | Column should never contain nulls | 0% null rate -- alert on any null |
| `rare` | Nulls happen but are uncommon | Up to 5% null rate -- alert above |
| `common` | Nulls are expected and normal | No null-rate check generated |
Fields with is_nullable: false do not need null_expectation -- they are enforced at the database level.
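
The mapping from `null_expectation` to a threshold can be sketched as follows. This is a minimal illustration, not the pipeline's actual check generator: the function names are hypothetical, fields are assumed to be plain dicts, and a missing `is_nullable` is assumed to mean nullable.

```python
from typing import Optional

# Thresholds from the table above; "common" generates no check at all.
NULL_RATE_THRESHOLDS = {"never": 0.0, "rare": 0.05}


def null_rate_threshold(field: dict) -> Optional[float]:
    """Return the maximum allowed null rate, or None if no check applies."""
    if not field.get("is_nullable", True):  # assumption: default is nullable
        return None  # enforced at the database level, no check needed
    expectation = field.get("null_expectation")
    if expectation == "common":
        return None
    if expectation not in NULL_RATE_THRESHOLDS:
        raise ValueError(f"unknown null_expectation: {expectation!r}")
    return NULL_RATE_THRESHOLDS[expectation]


def null_rate_breached(field: dict, null_count: int, row_count: int) -> bool:
    """True when the observed null rate exceeds the field's threshold."""
    threshold = null_rate_threshold(field)
    if threshold is None or row_count == 0:
        return False
    return null_count / row_count > threshold


rare_field = {"is_nullable": True, "null_expectation": "rare"}
print(null_rate_breached(rare_field, null_count=6, row_count=100))
```

With these rules, a `rare` field at exactly 5% nulls does not alert, while a `never` field alerts on a single null.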
freshness (entity-level)
Defines how stale lakehouse data is allowed to become before alerting:
    freshness:
      sla_minutes: 5
      severity: critical

- `sla_minutes`: Maximum age in minutes of the most recent row (based on the field marked `freshness_indicator: true`).
- `severity`: Alert severity when the SLA is breached (`critical`, `warning`, or `info`).
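
A freshness check against this block could look like the sketch below. The function name and signature are hypothetical, and it assumes the timestamp of the most recent row (from the `freshness_indicator` field) has already been fetched as a timezone-aware `datetime`. Treating a row exactly at the SLA boundary as still fresh is also an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_alert(
    latest_row_at: datetime,
    freshness: dict,
    now: Optional[datetime] = None,
) -> Optional[str]:
    """Return the alert severity if the SLA is breached, else None."""
    now = now or datetime.now(timezone.utc)
    age = now - latest_row_at
    if age > timedelta(minutes=freshness["sla_minutes"]):
        return freshness["severity"]
    return None


# The freshness block from the example above, as a parsed dict.
sla = {"sla_minutes": 5, "severity": "critical"}
checked_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(freshness_alert(checked_at - timedelta(minutes=6), sla, now=checked_at))
```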
table_type (in sources.lakehouse)
Classifies whether a lakehouse table is an append-only event stream or a full snapshot:
| Value | Meaning | Quality behavior |
|---|---|---|
| `event` | Append-only, new rows arrive continuously | Row count drop checks are generated |
| `snapshot` | Full table replacement on each sync | No row count drop check (count can legitimately shrink) |
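
The effect of `table_type` on check generation can be sketched like this. The helper name is hypothetical, and "drop" is assumed to mean any decrease between two consecutive counts; the real pipeline may use a tolerance or a different comparison window.

```python
def row_count_drop_alert(
    lakehouse_source: dict, previous_count: int, current_count: int
) -> bool:
    """True when an event table lost rows; snapshot tables never alert."""
    if lakehouse_source.get("table_type") != "event":
        return False  # snapshot counts can legitimately shrink
    return current_count < previous_count  # assumption: any decrease is a drop


print(row_count_drop_alert({"table_type": "event"}, 1000, 900))
print(row_count_drop_alert({"table_type": "snapshot"}, 1000, 900))
```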
constraints (field-level)
Existing constraint definitions (min, max, enum, pattern) auto-generate value range checks at the quality layer.
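
A value check derived from these four constraint keys might look like the following sketch. The function is hypothetical. It assumes `min` and `max` are inclusive bounds and that `pattern` is a regular expression that must match the whole value; the catalog's actual semantics may differ.

```python
import re
from typing import Any, List


def constraint_violations(value: Any, constraints: dict) -> List[str]:
    """Return the names of the constraints that the value violates."""
    violations = []
    if "min" in constraints and value < constraints["min"]:
        violations.append("min")
    if "max" in constraints and value > constraints["max"]:
        violations.append("max")
    if "enum" in constraints and value not in constraints["enum"]:
        violations.append("enum")
    if "pattern" in constraints and (
        re.fullmatch(constraints["pattern"], str(value)) is None
    ):
        violations.append("pattern")
    return violations


# Hypothetical constraint definitions for illustration.
print(constraint_violations(-1, {"min": 0, "max": 100}))
print(constraint_violations("EUR", {"enum": ["USD", "GBP"]}))
```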
YAML Source Definitions
Catalog data is defined in YAML files under shared-core/catalog/entities/. Each file describes one lakehouse entity with its columns, types, semantics, and relationships. The seed script reads these files and populates the _metadata tables.
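
To make the YAML-to-table flow concrete, here is a rough sketch of flattening one parsed entity into rows for `catalog_tables` and `catalog_columns`. It is not the actual seed script. The per-field keys `name` and `type`, the shape of `fields` as a list of dicts, the defaults for `is_pii` and `is_nullable`, and the sample `orders` entity are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def entity_to_rows(entity: dict) -> Tuple[Row, List[Row]]:
    """Flatten one parsed entity definition into catalog rows."""
    freshness = entity.get("freshness") or {}
    table_row: Row = {
        "table_name": entity["entity"],
        "description": entity["description"],
        "owner": entity["owner"],
        "classification": entity["classification"],
        "freshness_sla_minutes": freshness.get("sla_minutes"),
    }
    column_rows: List[Row] = [
        {
            "table_name": entity["entity"],
            "column_name": field["name"],  # hypothetical key name
            "data_type": field["type"],  # hypothetical key name
            "semantic_type": field.get("semantic_type"),
            "is_pii": field.get("is_pii", False),  # assumed default
            "is_nullable": field.get("is_nullable", True),  # assumed default
            "description": field.get("description"),
        }
        for field in entity["fields"]
    ]
    return table_row, column_rows


# Hypothetical entity, shaped like the result of parsing one YAML file.
orders = {
    "entity": "orders",
    "description": "Customer orders",
    "owner": "commerce-team",
    "classification": "internal",
    "freshness": {"sla_minutes": 5, "severity": "critical"},
    "fields": [
        {"name": "order_id", "type": "BIGINT", "is_nullable": False},
        {"name": "email", "type": "VARCHAR", "is_pii": True},
    ],
}
table_row, column_rows = entity_to_rows(orders)
print(table_row["table_name"], len(column_rows))
```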
Adding or Updating Entities
- Create or edit `catalog/entities/<entity_name>.yaml` with required keys: `entity`, `description`, `owner`, `classification`, `sources`, `lineage`, `fields`.
- Add `freshness` if the entity has a lakehouse source.
- Add `table_type` in `sources.lakehouse` if applicable.
- Set `null_expectation` on every field where `is_nullable: true`.
- Run `python catalog/validate.py` -- must show "OK -- N entities validated".
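
The rules in the steps above can be sketched as a small validator. This is an illustration of the checks, not the contents of `catalog/validate.py`: the function names, the error wording, the `name` key on each field, and the assumption that `sources` is a mapping with an optional `lakehouse` key are all hypothetical.

```python
from typing import List

REQUIRED_KEYS = (
    "entity", "description", "owner", "classification",
    "sources", "lineage", "fields",
)


def validate_entity(entity: dict) -> List[str]:
    """Return a list of problems found in one parsed entity definition."""
    errors = [f"missing required key: {k}" for k in REQUIRED_KEYS if k not in entity]
    sources = entity.get("sources")
    has_lakehouse = isinstance(sources, dict) and "lakehouse" in sources
    if has_lakehouse and "freshness" not in entity:
        errors.append("lakehouse source without freshness")
    for field in entity.get("fields", []):
        if field.get("is_nullable") is True and "null_expectation" not in field:
            errors.append(f"field {field.get('name')}: missing null_expectation")
    return errors


def report(entities: List[dict]) -> str:
    """Summarize validation in the style of the expected success message."""
    errors = [e for entity in entities for e in validate_entity(entity)]
    if errors:
        return "\n".join(errors)
    return f"OK -- {len(entities)} entities validated"


# Hypothetical entity that satisfies every rule above.
users = {
    "entity": "users", "description": "Registered users", "owner": "identity-team",
    "classification": "restricted", "lineage": [],
    "sources": {"lakehouse": {"table_type": "snapshot"}},
    "freshness": {"sla_minutes": 60, "severity": "warning"},
    "fields": [{"name": "email", "is_nullable": True, "null_expectation": "rare"}],
}
print(report([users]))
```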
Metabase Dashboard
The "Data Catalog" Metabase dashboard provides a browsable view of:
- Entity registry with descriptions and owners
- Column glossary with types and PII flags
- Relationship diagram
- Data freshness status against SLA thresholds
Key Files
- `shared-core/catalog/entities/` -- YAML entity definitions
- `smackz-lakehouse/metabase/seed/catalog.py` -- Catalog seed logic
- `smackz-lakehouse/docs/Lakehouse-Data-Catalog-FRD.md` -- Full FRD