Data Quality

The Data Quality layer (Phase 8 of the Lakehouse roadmap, now complete) provides automated, catalog-driven quality checks that validate lakehouse data against expectations defined in the Data Catalog. Checks are auto-generated from entity YAML definitions -- there is no separate quality configuration to maintain.

How It Works

  1. Generate checks -- quality.py reads catalog entity YAML definitions and generates QualityCheck instances based on field properties (nullability, primary keys, constraints, freshness SLAs, table types)
  2. Execute checks -- quality_runner.py runs each check against DuckDB and records structured results
  3. Store results -- Results are written as Parquet files to _quality_results/ on R2 (auto-pruned after 30 days)
  4. Surface results -- The _quality.quality_results DuckDB view powers the Data Quality Metabase dashboard
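The generation step (item 1 above) can be sketched as follows. This is a minimal illustration, not the actual implementation in quality.py: the entity dict shape, field keys, and `freshness_sla_hours` name are assumptions for the sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QualityCheck:
    table: str
    column: Optional[str]   # None for table-level checks such as freshness
    check_type: str         # e.g. "null_rate", "uniqueness", "freshness"
    severity: str           # "critical", "warning", or "info"

def generate_checks(entity: dict) -> list:
    """Derive checks from a parsed entity YAML definition (shape assumed)."""
    checks = []
    table = entity["name"]
    for field in entity.get("fields", []):
        # Non-nullable fields get a null_rate check.
        if not field.get("nullable", True):
            checks.append(QualityCheck(table, field["name"], "null_rate", "critical"))
        # Primary-key fields get a uniqueness check.
        if field.get("primary_key"):
            checks.append(QualityCheck(table, field["name"], "uniqueness", "critical"))
    # A freshness SLA on the entity yields one table-level freshness check.
    if entity.get("freshness_sla_hours"):
        checks.append(QualityCheck(table, None, "freshness", "warning"))
    return checks
```

Because checks are derived this way, adding a field to an entity's catalog YAML automatically gives it quality coverage on the next run.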

Check Types

Type          What It Validates
row_count     Minimum row count, and maximum percentage drop between runs
null_rate     Percentage of null values in non-nullable columns
freshness     Time since last record vs. SLA threshold
value_range   Values within expected bounds
uniqueness    No duplicate values in unique/PK columns
schema_drift  Actual schema matches expected schema
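Most of these check types reduce to one aggregate query against DuckDB plus a threshold comparison. As an illustration, a null_rate check might look like the sketch below; the SQL template and default threshold are assumptions, not the exact queries quality_runner.py issues.

```python
def null_rate_sql(table: str, column: str) -> str:
    """Build one aggregate query: fraction of NULLs in the column.
    FILTER and NULLIF are both supported by DuckDB."""
    return (
        f"SELECT COUNT(*) FILTER (WHERE {column} IS NULL) * 1.0 "
        f"/ NULLIF(COUNT(*), 0) AS null_rate FROM {table}"
    )

def evaluate_null_rate(null_rate: float, threshold: float = 0.0) -> bool:
    """Pass when the observed null rate does not exceed the allowed threshold."""
    return null_rate <= threshold
```

For a non-nullable column the threshold is effectively zero: any NULL is a failure.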

Running Quality Checks

# Run all checks
python scripts/run_quality_checks.py

# Filter by table
python scripts/run_quality_checks.py --table orders

# Filter by severity
python scripts/run_quality_checks.py --severity critical

# Preview without executing
python scripts/run_quality_checks.py --dry-run

Exit Codes

Code  Meaning
0     All checks passed
1     Warning-level failures only
2     Critical failures detected
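The exit code is a fold over the run's results: any critical failure wins, otherwise any failure at all means warnings. A sketch of that mapping (the result-dict shape is an assumption):

```python
def exit_code(results: list) -> int:
    """0 = all passed, 1 = only non-critical failures, 2 = any critical failure.
    Each result is assumed to carry a 'passed' flag and a 'severity' string."""
    failures = [r for r in results if not r["passed"]]
    if any(r["severity"] == "critical" for r in failures):
        return 2
    if failures:
        return 1
    return 0
```

This shape makes the CLI easy to gate in CI: fail the pipeline on exit code 2, and surface (but tolerate) exit code 1.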

Severity Levels

Checks are classified by severity:

  • Critical -- Data correctness issues that affect business metrics (e.g., missing order IDs, null prices)
  • Warning -- Degraded quality that should be investigated (e.g., higher-than-expected null rates)
  • Info -- Informational observations (e.g., schema changes, unusual patterns)
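Since checks are generated rather than hand-written, severity is assigned by rule. The rules below are purely illustrative assumptions to show the idea; the real assignment lives in quality.py.

```python
def severity_for(check_type: str, field: dict) -> str:
    """Illustrative severity assignment, NOT the actual rules in quality.py."""
    # Broken keys affect joins and business metrics -> critical.
    if check_type in ("null_rate", "uniqueness") and field.get("primary_key"):
        return "critical"
    # Schema changes are observations until triaged -> info.
    if check_type == "schema_drift":
        return "info"
    # Everything else defaults to warning.
    return "warning"
```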

Results Storage

Results are stored as Parquet files on R2 at _quality_results/{date}/run_{timestamp}.parquet. Files older than 30 days are automatically pruned.
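Because the retention window is encoded in the path's `{date}` segment, pruning needs no object metadata. A sketch of the age test, assuming the date segment is ISO-formatted (`YYYY-MM-DD`):

```python
from datetime import date, timedelta

def is_prunable(key: str, today: date, retention_days: int = 30) -> bool:
    """True when a result file's date segment is past retention.
    Assumes keys shaped like _quality_results/YYYY-MM-DD/run_....parquet."""
    date_segment = key.split("/")[1]
    file_date = date.fromisoformat(date_segment)
    return today - file_date > timedelta(days=retention_days)
```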

Metabase Dashboard

The "Data Quality" dashboard shows:

  • Overall pass rate across all checks
  • Failure counts by severity (critical, warning, info)
  • Per-table drilldown with individual check results
  • Trend analysis over time
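The dashboard's headline numbers come from aggregating the `_quality.quality_results` view. A rough Python equivalent of that aggregation, over result rows with an assumed `passed`/`severity` shape:

```python
from collections import Counter

def dashboard_summary(rows: list) -> dict:
    """Overall pass rate plus failure counts by severity, as the
    dashboard's top-line cards would report them."""
    total = len(rows)
    failures = [r for r in rows if not r["passed"]]
    return {
        "pass_rate": (total - len(failures)) / total if total else None,
        "failures_by_severity": dict(Counter(r["severity"] for r in failures)),
    }
```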

Key Files

  • smackz-lakehouse/metabase/seed/quality.py -- Check generation from catalog
  • smackz-lakehouse/metabase/seed/quality_runner.py -- Check execution engine
  • smackz-lakehouse/scripts/run_quality_checks.py -- CLI entry point
  • smackz-lakehouse/docs/Lakehouse-Data-Quality-FRD.md -- Full FRD