# Data Quality
The Data Quality layer (Phase 8 of the Lakehouse roadmap, now complete) provides automated, catalog-driven quality checks that validate lakehouse data against expectations defined in the Data Catalog. Checks are auto-generated from entity YAML definitions -- there is no separate quality configuration to maintain.
## How It Works

- **Generate checks** -- `quality.py` reads catalog entity YAML definitions and generates `QualityCheck` instances based on field properties (nullability, primary keys, constraints, freshness SLAs, table types); see the sketch after this list
- **Execute checks** -- `quality_runner.py` runs each check against DuckDB and records structured results
- **Store results** -- Results are written as Parquet files to `_quality_results/` on R2 (auto-pruned after 30 days)
- **Surface results** -- The `_quality.quality_results` DuckDB view powers the Data Quality Metabase dashboard
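
To make the generation step concrete, here is a minimal sketch of deriving checks from an entity definition. The `QualityCheck` fields, the YAML layout, and the severity assignments are illustrative assumptions, not the actual `quality.py` implementation.

```python
from dataclasses import dataclass

import yaml  # pip install pyyaml


@dataclass
class QualityCheck:
    """Hypothetical shape of a generated check; field names are illustrative."""
    table: str
    check_type: str
    column: str | None
    severity: str


# Illustrative entity definition; the real catalog YAML schema may differ.
ENTITY_YAML = """
name: orders
fields:
  - name: order_id
    primary_key: true
    nullable: false
  - name: price
    nullable: false
"""


def generate_checks(entity_yaml: str) -> list[QualityCheck]:
    """Derive checks from field properties, as quality.py is described to do."""
    entity = yaml.safe_load(entity_yaml)
    table = entity["name"]
    # Every table gets a baseline row_count check.
    checks = [QualityCheck(table, "row_count", None, "warning")]
    for field in entity.get("fields", []):
        if not field.get("nullable", True):
            checks.append(QualityCheck(table, "null_rate", field["name"], "critical"))
        if field.get("primary_key"):
            checks.append(QualityCheck(table, "uniqueness", field["name"], "critical"))
    return checks


for check in generate_checks(ENTITY_YAML):
    print(check)
```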
## Check Types

| Type | What It Validates |
|---|---|
| `row_count` | Minimum rows and drop percentage between runs |
| `null_rate` | Percentage of null values in non-nullable columns |
| `freshness` | Time since last record vs. SLA threshold |
| `value_range` | Values within expected bounds |
| `uniqueness` | No duplicate values in unique/PK columns |
| `schema_drift` | Actual schema matches expected schema |
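
To illustrate how a runner might evaluate two of these check types, the sketch below computes `null_rate` and `uniqueness` over an in-memory DuckDB table. The SQL is one plausible formulation; it is not taken from `quality_runner.py`.

```python
import duckdb  # pip install duckdb

# Toy data standing in for a lakehouse table; the real runner reads R2-backed tables.
con = duckdb.connect()
con.execute(
    "CREATE TABLE orders AS "
    "SELECT * FROM (VALUES (1, 9.99), (2, NULL)) AS t(order_id, price)"
)

# null_rate: percentage of NULLs in a column declared non-nullable in the catalog.
null_rate = con.execute(
    "SELECT 100.0 * SUM(CASE WHEN price IS NULL THEN 1 ELSE 0 END) / COUNT(*) FROM orders"
).fetchone()[0]
print(f"null_rate(orders.price) = {null_rate:.1f}% -> {'pass' if null_rate == 0 else 'fail'}")

# uniqueness: no duplicate values in a primary-key column.
dupes = con.execute(
    "SELECT COUNT(*) FROM ("
    "  SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1"
    ")"
).fetchone()[0]
print(f"uniqueness(orders.order_id) -> {'pass' if dupes == 0 else 'fail'}")
```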
## Running Quality Checks

```bash
# Run all checks
python scripts/run_quality_checks.py

# Filter by table
python scripts/run_quality_checks.py --table orders

# Filter by severity
python scripts/run_quality_checks.py --severity critical

# Preview without executing
python scripts/run_quality_checks.py --dry-run
```
## Exit Codes
| Code | Meaning |
|---|---|
| 0 | All checks passed |
| 1 | Warning-level failures only |
| 2 | Critical failures detected |
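
These exit codes make the CLI straightforward to gate on in a scheduler or CI job. A minimal sketch, assuming the script is invoked from a Python orchestration step:

```python
import subprocess
import sys

# Run only critical checks and gate downstream work on the documented exit codes.
result = subprocess.run(
    [sys.executable, "scripts/run_quality_checks.py", "--severity", "critical"]
)

if result.returncode == 2:
    sys.exit("Critical data quality failures -- blocking downstream jobs.")
elif result.returncode == 1:
    print("Warning-level failures only; proceeding, but worth investigating.")
else:
    print("All checks passed.")
```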
## Severity Levels
Checks are classified by severity:
- **Critical** -- Data correctness issues that affect business metrics (e.g., missing order IDs, null prices)
- **Warning** -- Degraded quality that should be investigated (e.g., higher-than-expected null rates)
- **Info** -- Informational observations (e.g., schema changes, unusual patterns)
## Results Storage

Results are stored as Parquet files on R2 at `_quality_results/{date}/run_{timestamp}.parquet`. Files older than 30 days are automatically pruned.
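
As an illustration of the retention policy, a pruning sweep over that layout could look like the sketch below. It assumes `{date}` directories are named `YYYY-MM-DD` and operates on a local mirror; the actual pruning runs against R2 and may be implemented differently.

```python
import datetime as dt
import pathlib

RETENTION_DAYS = 30  # matches the documented 30-day retention
root = pathlib.Path("_quality_results")

cutoff = dt.date.today() - dt.timedelta(days=RETENTION_DAYS)
for day_dir in root.iterdir():
    try:
        day = dt.date.fromisoformat(day_dir.name)  # assumes {date} is YYYY-MM-DD
    except ValueError:
        continue  # skip anything that isn't a date-partitioned directory
    if day < cutoff:
        for parquet_file in day_dir.glob("run_*.parquet"):
            parquet_file.unlink()
        if not any(day_dir.iterdir()):  # remove the partition only once empty
            day_dir.rmdir()
```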
## Metabase Dashboard
The "Data Quality" dashboard shows:
- Overall pass rate across all checks
- Failure counts by severity (critical, warning, info)
- Per-table drilldown with individual check results
- Trend analysis over time
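
The pass-rate tile, for instance, reduces to a single aggregate over the `_quality.quality_results` view. A minimal sketch, assuming the view exposes a boolean `passed` column and a `lakehouse.duckdb` database file (both names are assumptions, not confirmed):

```python
import duckdb  # pip install duckdb

# Hypothetical database file; substitute however the lakehouse DuckDB is attached.
con = duckdb.connect("lakehouse.duckdb")

# Overall pass rate, as surfaced on the dashboard's top-line tile.
pass_rate, = con.execute(
    """
    SELECT 100.0 * SUM(CASE WHEN passed THEN 1 ELSE 0 END) / COUNT(*)
    FROM _quality.quality_results
    """
).fetchone()
print(f"Overall pass rate: {pass_rate:.1f}%")
```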
## Key Files

- `smackz-lakehouse/metabase/seed/quality.py` -- Check generation from catalog
- `smackz-lakehouse/metabase/seed/quality_runner.py` -- Check execution engine
- `smackz-lakehouse/scripts/run_quality_checks.py` -- CLI entry point
- `smackz-lakehouse/docs/Lakehouse-Data-Quality-FRD.md` -- Full FRD