Data Quality

The Data Quality layer (Phase 8 of the Lakehouse roadmap, now complete) provides automated, catalog-driven quality checks that validate lakehouse data against expectations defined in the Data Catalog. Checks are auto-generated from entity YAML definitions -- there is no separate quality configuration to maintain.

How It Works

  1. Generate checks -- quality.py reads catalog entity YAML definitions and generates QualityCheck instances based on field properties (nullability, primary keys, constraints, freshness SLAs, table types)
  2. Execute checks -- quality_runner.py runs each check against DuckDB and records structured results
  3. Store results -- Results are written as Parquet files to _quality_results/ on R2 (auto-pruned after 30 days)
  4. Surface results -- The _quality.quality_results DuckDB view powers the Data Quality Metabase dashboard
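The generation step (item 1 above) can be sketched as follows. This is a minimal illustration, not the actual implementation in quality.py: the entity dict shape, field keys, and `freshness_sla_hours` name are assumptions for the sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QualityCheck:
    table: str
    column: Optional[str]   # None for table-level checks such as freshness
    check_type: str         # e.g. "null_rate", "uniqueness", "freshness"
    severity: str           # "critical", "warning", or "info"

def generate_checks(entity: dict) -> list:
    """Derive checks from a parsed entity YAML definition (shape assumed)."""
    checks = []
    table = entity["name"]
    for field in entity.get("fields", []):
        # Non-nullable fields get a null_rate check.
        if not field.get("nullable", True):
            checks.append(QualityCheck(table, field["name"], "null_rate", "critical"))
        # Primary-key fields get a uniqueness check.
        if field.get("primary_key"):
            checks.append(QualityCheck(table, field["name"], "uniqueness", "critical"))
    # A freshness SLA on the entity yields one table-level freshness check.
    if entity.get("freshness_sla_hours"):
        checks.append(QualityCheck(table, None, "freshness", "warning"))
    return checks
```

Because checks are derived this way, adding a field to an entity's catalog YAML automatically gives it quality coverage on the next run.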

Check Types

Type          What It Validates
row_count     Minimum row count, and maximum percentage drop between runs
null_rate     Percentage of null values in non-nullable columns
freshness     Time since last record vs. SLA threshold
value_range   Values within expected bounds
uniqueness    No duplicate values in unique/PK columns
schema_drift  Actual schema matches expected schema
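Most of these check types reduce to one aggregate query against DuckDB plus a threshold comparison. As an illustration, a null_rate check might look like the sketch below; the SQL template and default threshold are assumptions, not the exact queries quality_runner.py issues.

```python
def null_rate_sql(table: str, column: str) -> str:
    """Build one aggregate query: fraction of NULLs in the column.
    FILTER and NULLIF are both supported by DuckDB."""
    return (
        f"SELECT COUNT(*) FILTER (WHERE {column} IS NULL) * 1.0 "
        f"/ NULLIF(COUNT(*), 0) AS null_rate FROM {table}"
    )

def evaluate_null_rate(null_rate: float, threshold: float = 0.0) -> bool:
    """Pass when the observed null rate does not exceed the allowed threshold."""
    return null_rate <= threshold
```

For a non-nullable column the threshold is effectively zero: any NULL is a failure.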

Running Quality Checks

# Run all checks
python scripts/run_quality_checks.py

# Filter by table
python scripts/run_quality_checks.py --table orders

# Filter by severity
python scripts/run_quality_checks.py --severity critical

# Preview without executing
python scripts/run_quality_checks.py --dry-run

Exit Codes

Code  Meaning
0     All checks passed
1     Warning-level failures only
2     Critical failures detected
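The exit code is a fold over the run's results: any critical failure wins, otherwise any failure at all means warnings. A sketch of that mapping (the result-dict shape is an assumption):

```python
def exit_code(results: list) -> int:
    """0 = all passed, 1 = only non-critical failures, 2 = any critical failure.
    Each result is assumed to carry a 'passed' flag and a 'severity' string."""
    failures = [r for r in results if not r["passed"]]
    if any(r["severity"] == "critical" for r in failures):
        return 2
    if failures:
        return 1
    return 0
```

This shape makes the CLI easy to gate in CI: fail the pipeline on exit code 2, and surface (but tolerate) exit code 1.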

Severity Levels

Checks are classified by severity:

  • Critical -- Data correctness issues that affect business metrics (e.g., missing order IDs, null prices)
  • Warning -- Degraded quality that should be investigated (e.g., higher-than-expected null rates)
  • Info -- Informational observations (e.g., schema changes, unusual patterns)
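Since checks are generated rather than hand-written, severity is assigned by rule. The rules below are purely illustrative assumptions to show the idea; the real assignment lives in quality.py.

```python
def severity_for(check_type: str, field: dict) -> str:
    """Illustrative severity assignment, NOT the actual rules in quality.py."""
    # Broken keys affect joins and business metrics -> critical.
    if check_type in ("null_rate", "uniqueness") and field.get("primary_key"):
        return "critical"
    # Schema changes are observations until triaged -> info.
    if check_type == "schema_drift":
        return "info"
    # Everything else defaults to warning.
    return "warning"
```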

Results Storage

Results are stored as Parquet files on R2 at _quality_results/{date}/run_{timestamp}.parquet. Files older than 30 days are automatically pruned.
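Because the retention window is encoded in the path's `{date}` segment, pruning needs no object metadata. A sketch of the age test, assuming the date segment is ISO-formatted (`YYYY-MM-DD`):

```python
from datetime import date, timedelta

def is_prunable(key: str, today: date, retention_days: int = 30) -> bool:
    """True when a result file's date segment is past retention.
    Assumes keys shaped like _quality_results/YYYY-MM-DD/run_....parquet."""
    date_segment = key.split("/")[1]
    file_date = date.fromisoformat(date_segment)
    return today - file_date > timedelta(days=retention_days)
```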

Metabase Dashboard

The "Data Quality" dashboard shows:

  • Overall pass rate across all checks
  • Failure counts by severity (critical, warning, info)
  • Per-table drilldown with individual check results
  • Trend analysis over time
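The dashboard's headline numbers come from aggregating the `_quality.quality_results` view. A rough Python equivalent of that aggregation, over result rows with an assumed `passed`/`severity` shape:

```python
from collections import Counter

def dashboard_summary(rows: list) -> dict:
    """Overall pass rate plus failure counts by severity, as the
    dashboard's top-line cards would report them."""
    total = len(rows)
    failures = [r for r in rows if not r["passed"]]
    return {
        "pass_rate": (total - len(failures)) / total if total else None,
        "failures_by_severity": dict(Counter(r["severity"] for r in failures)),
    }
```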

Key Files

  • smackz-lakehouse/metabase/seed/quality.py -- Check generation from catalog
  • smackz-lakehouse/metabase/seed/quality_runner.py -- Check execution engine
  • smackz-lakehouse/scripts/run_quality_checks.py -- CLI entry point
  • smackz-lakehouse/docs/Lakehouse-Data-Quality-FRD.md -- Full FRD