Data Catalog

The Data Catalog is a metadata layer (Phase 6 of the Lakehouse roadmap) that documents every table, column, relationship, and freshness timestamp in the lakehouse. It answers: what does this column mean, who owns this table, how fresh is the data, and how do tables relate.

Architecture

The catalog lives in a _metadata DuckDB schema, created as part of the DuckDB initialization SQL. It is populated from YAML definitions in shared-core/catalog/entities/ via the seed script.

_metadata schema
     |
     +-- catalog_tables         # Entity names, descriptions, owners
     +-- catalog_columns        # Field names, types, PII flags
     +-- catalog_relationships  # Foreign key joins between entities

Catalog Tables

_metadata.catalog_tables

One row per lakehouse table:

Column                 Description
table_name             Table identifier (e.g., orders, users)
description            Human-readable description
owner                  Team or person responsible
classification         Data classification level
freshness_sla_minutes  Expected maximum data staleness, in minutes

_metadata.catalog_columns

One row per column per table:

Column         Description
table_name     Parent table
column_name    Field name
data_type      DuckDB data type
semantic_type  Business meaning (e.g., "currency", "identifier")
is_pii         Whether the column contains personally identifiable information
is_nullable    Whether null values are expected
description    Human-readable field description

_metadata.catalog_relationships

Foreign key relationships:

Column             Description
source_table       Table with the foreign key
source_column      FK column name
target_table       Referenced table
target_column      Referenced column
relationship_type  e.g., "many-to-one"
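
To illustrate how the three tables fit together, the sketch below builds a tiny copy of them and asks two questions: which columns hold PII and who owns them, and how tables join. It uses Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without extra dependencies. The column types, the sample rows, and the team names are assumptions for illustration, not the real initialization SQL.

```python
import sqlite3

# SQLite stands in for DuckDB here. Attaching an in-memory database under the
# name _metadata lets the queries use the same schema-qualified table names.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS _metadata")

# Column names follow the catalog docs; the column types are assumed.
conn.executescript("""
CREATE TABLE _metadata.catalog_tables (
    table_name TEXT, description TEXT, owner TEXT,
    classification TEXT, freshness_sla_minutes INTEGER
);
CREATE TABLE _metadata.catalog_columns (
    table_name TEXT, column_name TEXT, data_type TEXT, semantic_type TEXT,
    is_pii BOOLEAN, is_nullable BOOLEAN, description TEXT
);
CREATE TABLE _metadata.catalog_relationships (
    source_table TEXT, source_column TEXT,
    target_table TEXT, target_column TEXT, relationship_type TEXT
);
""")

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO _metadata.catalog_tables VALUES (?, ?, ?, ?, ?)",
    [
        ("orders", "Customer orders", "commerce-team", "internal", 5),
        ("users", "Registered users", "identity-team", "restricted", 60),
    ],
)
conn.executemany(
    "INSERT INTO _metadata.catalog_columns VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("orders", "user_id", "BIGINT", "identifier", False, False, "Buyer"),
        ("users", "id", "BIGINT", "identifier", False, False, "User key"),
        ("users", "email", "VARCHAR", "email", True, True, "Contact email"),
    ],
)
conn.execute(
    "INSERT INTO _metadata.catalog_relationships VALUES (?, ?, ?, ?, ?)",
    ("orders", "user_id", "users", "id", "many-to-one"),
)

# Which columns contain PII, and which team owns the parent table?
pii_columns = conn.execute("""
    SELECT c.table_name, c.column_name, t.owner
    FROM _metadata.catalog_columns AS c
    JOIN _metadata.catalog_tables AS t ON t.table_name = c.table_name
    WHERE c.is_pii
    ORDER BY c.table_name, c.column_name
""").fetchall()

# Turn each relationship row into a readable join condition.
join_hints = [
    f"{src}.{src_col} = {dst}.{dst_col} ({kind})"
    for src, src_col, dst, dst_col, kind in conn.execute(
        "SELECT * FROM _metadata.catalog_relationships"
    )
]

print(pii_columns)
print(join_hints)
```

The two SELECT statements are plain SQL and should carry over to the real _metadata schema with little or no change, but that has not been checked against the actual DuckDB setup.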

Quality Fields

The catalog defines data quality expectations declaratively. The lakehouse pipeline auto-generates quality checks from these fields -- no separate configuration is needed.

null_expectation (field-level)

Controls null-rate quality checks for nullable fields:

Value   Meaning                            Threshold
never   Column should never contain nulls  0% null rate -- alert on any null
rare    Nulls happen but are uncommon      Up to 5% null rate -- alert above
common  Nulls are expected and normal      No null-rate check generated

Fields with is_nullable: false do not need null_expectation -- they are enforced at the database level.
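
The threshold rules above can be sketched as a small function. This is a minimal illustration of the documented semantics, not the pipeline's actual check generator; the function name and its arguments are hypothetical.

```python
# Maximum tolerated null rate per null_expectation value. None means that
# no null-rate check is generated for the field.
NULL_RATE_THRESHOLDS = {"never": 0.0, "rare": 0.05, "common": None}


def null_rate_alert(null_expectation: str, null_count: int, row_count: int) -> bool:
    """Return True if the observed null rate breaches the expectation."""
    threshold = NULL_RATE_THRESHOLDS[null_expectation]
    if threshold is None or row_count == 0:
        # "common" fields are never checked; an empty table has no null rate.
        return False
    return null_count / row_count > threshold


print(null_rate_alert("never", 1, 1000))    # a single null breaches "never"
print(null_rate_alert("rare", 5, 100))      # exactly 5% is still within "rare"
print(null_rate_alert("common", 900, 1000)) # no check for "common"
```

Treating an empty table as "no alert" is an assumption; the source does not say how the real checks handle zero rows.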

freshness (entity-level)

Defines how stale lakehouse data is allowed to become before alerting:

freshness:
  sla_minutes: 5
  severity: critical

  • sla_minutes: Maximum age in minutes of the most recent row (based on the field marked freshness_indicator: true).
  • severity: Alert severity when the SLA is breached (critical, warning, or info).
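
As a rough sketch of how such an SLA could be evaluated, the function below compares the age of the most recent row with sla_minutes and returns the configured severity on a breach. The function name and parameters are hypothetical, and the sketch assumes the freshness_indicator field holds timezone-aware timestamps.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_alert(
    latest_row_at: datetime,
    sla_minutes: int,
    severity: str,
    now: Optional[datetime] = None,
) -> Optional[str]:
    """Return the alert severity if the SLA is breached, otherwise None."""
    now = now or datetime.now(timezone.utc)
    age = now - latest_row_at
    if age > timedelta(minutes=sla_minutes):
        return severity
    return None


# Fixed clock so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = freshness_alert(now - timedelta(minutes=7), 5, "critical", now=now)
fresh = freshness_alert(now - timedelta(minutes=3), 5, "critical", now=now)
print(stale, fresh)
```

With the freshness block shown above (sla_minutes: 5, severity: critical), a newest row that is seven minutes old would raise a critical alert, while one that is three minutes old would not.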

table_type (in sources.lakehouse)

Classifies a lakehouse table as either an append-only event stream or a full snapshot:

Value     Meaning                                    Quality behavior
event     Append-only, new rows arrive continuously  Row count drop checks are generated
snapshot  Full table replacement on each sync        No row count drop check (count can legitimately shrink)
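
The difference in quality behavior can be sketched as follows. The function is hypothetical, and it assumes that any decrease in row count between two syncs counts as a drop; the real check may apply a tolerance that the source does not describe.

```python
def row_count_drop_alert(table_type: str, previous_count: int, current_count: int) -> bool:
    """Return True if an event table lost rows between two syncs."""
    if table_type == "snapshot":
        # Snapshots are fully replaced, so a smaller count is legitimate.
        return False
    if table_type != "event":
        raise ValueError(f"unknown table_type: {table_type!r}")
    # Event tables are append-only, so the count should never shrink.
    return current_count < previous_count


print(row_count_drop_alert("event", 1000, 950))
print(row_count_drop_alert("snapshot", 1000, 950))
```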

constraints (field-level)

Existing constraint definitions (min, max, enum, pattern) auto-generate value range checks at the quality layer.
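
A minimal sketch of what a generated value check might do with these four constraint keys is shown below. The function is hypothetical; it assumes min and max are inclusive bounds and that pattern must match the whole value, neither of which is stated in the source.

```python
import re


def constraint_violations(value, constraints: dict) -> list[str]:
    """Return the names of the constraints that the value violates."""
    violations = []
    if "min" in constraints and value < constraints["min"]:
        violations.append("min")
    if "max" in constraints and value > constraints["max"]:
        violations.append("max")
    if "enum" in constraints and value not in constraints["enum"]:
        violations.append("enum")
    if "pattern" in constraints and re.fullmatch(constraints["pattern"], str(value)) is None:
        violations.append("pattern")
    return violations


print(constraint_violations(-1, {"min": 0, "max": 100}))
print(constraint_violations("shipped", {"enum": ["pending", "shipped"]}))
print(constraint_violations("ab-12", {"pattern": r"[A-Z]{2}-\d+"}))
```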

YAML Source Definitions

Catalog data is defined in YAML files under shared-core/catalog/entities/. Each file describes one lakehouse entity with its columns, types, semantics, and relationships. The seed script reads these files and populates the _metadata tables.
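
The sketch below shows one way a parsed entity definition could be flattened into rows for catalog_tables and catalog_columns. It is not the real seed logic: the layout of the entity (a list of fields with name and type keys), the default values, and the function name are all assumptions, and the YAML file is represented as an already-parsed Python dict to avoid a third-party parser.

```python
def entity_to_rows(entity: dict) -> tuple[dict, list[dict]]:
    """Flatten one parsed entity definition into catalog rows."""
    table_row = {
        "table_name": entity["entity"],
        "description": entity["description"],
        "owner": entity["owner"],
        "classification": entity["classification"],
        # Entities without a freshness block get no SLA (assumed behavior).
        "freshness_sla_minutes": entity.get("freshness", {}).get("sla_minutes"),
    }
    column_rows = [
        {
            "table_name": entity["entity"],
            "column_name": field["name"],         # assumed key name
            "data_type": field["type"],           # assumed key name
            "semantic_type": field.get("semantic_type"),
            "is_pii": field.get("is_pii", False),            # assumed default
            "is_nullable": field.get("is_nullable", True),   # assumed default
            "description": field.get("description"),
        }
        for field in entity["fields"]
    ]
    return table_row, column_rows


# Hypothetical entity, as it might look after parsing orders.yaml.
orders = {
    "entity": "orders",
    "description": "Customer orders",
    "owner": "commerce-team",
    "classification": "internal",
    "freshness": {"sla_minutes": 5, "severity": "critical"},
    "fields": [
        {"name": "order_id", "type": "BIGINT", "semantic_type": "identifier",
         "is_nullable": False, "description": "Order key"},
        {"name": "coupon_code", "type": "VARCHAR", "is_nullable": True},
    ],
}

table_row, column_rows = entity_to_rows(orders)
print(table_row)
print([row["column_name"] for row in column_rows])
```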

Adding or Updating Entities

  1. Create or edit catalog/entities/<entity_name>.yaml with required keys: entity, description, owner, classification, sources, lineage, fields.
  2. Add freshness if the entity has a lakehouse source.
  3. Add table_type in sources.lakehouse if applicable.
  4. Set null_expectation on every field where is_nullable: true.
  5. Run python catalog/validate.py; it must print "OK -- N entities validated".
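
The rules in the steps above can be sketched as a small checker. This is only an illustration of those rules and does not reproduce what catalog/validate.py actually does; the entity is shown as an already-parsed dict, and its field layout and sample values are assumptions.

```python
REQUIRED_KEYS = (
    "entity", "description", "owner", "classification",
    "sources", "lineage", "fields",
)


def validate_entity(entity: dict) -> list[str]:
    """Return a list of problems found in one parsed entity definition."""
    errors = [f"missing required key: {key}" for key in REQUIRED_KEYS if key not in entity]

    # Step 2: a lakehouse source requires a freshness block.
    sources = entity.get("sources") or {}
    if "lakehouse" in sources and "freshness" not in entity:
        errors.append("lakehouse source defined without freshness")

    # Step 4: every nullable field needs a null_expectation.
    for field in entity.get("fields") or []:
        if field.get("is_nullable") and "null_expectation" not in field:
            errors.append(f"field {field.get('name')}: missing null_expectation")
    return errors


# Hypothetical entity, as it might look after parsing the YAML file.
entities = [
    {
        "entity": "orders",
        "description": "Customer orders",
        "owner": "commerce-team",
        "classification": "internal",
        "sources": {"lakehouse": {"table_type": "event"}},
        "lineage": [],
        "freshness": {"sla_minutes": 5, "severity": "critical"},
        "fields": [
            {"name": "order_id", "is_nullable": False},
            {"name": "coupon_code", "is_nullable": True, "null_expectation": "common"},
        ],
    }
]

problems = [error for entity in entities for error in validate_entity(entity)]
if problems:
    print("\n".join(problems))
else:
    print(f"OK -- {len(entities)} entities validated")
```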

Metabase Dashboard

The "Data Catalog" Metabase dashboard provides a browsable view of:

  • Entity registry with descriptions and owners
  • Column glossary with types and PII flags
  • Relationship diagram
  • Data freshness status against SLA thresholds

Key Files

  • shared-core/catalog/entities/ -- YAML entity definitions
  • smackz-lakehouse/metabase/seed/catalog.py -- Catalog seed logic
  • smackz-lakehouse/docs/Lakehouse-Data-Catalog-FRD.md -- Full FRD