Writer Service
The Writer is the ingestion component of the SMACKZ Lakehouse. It consumes events from Redis Streams, buffers them in memory, and writes Parquet files to Cloudflare R2.
How It Works
- Consume -- Reads events from Redis Streams using XREADGROUP (consumer group pattern)
- Buffer -- Accumulates events in memory until a flush threshold is met (time-based or count-based); a minimal loop sketch follows this list
- Convert -- Uses PyArrow to convert buffered events into typed Parquet tables
- Write -- Uploads Parquet files to Cloudflare R2 (or MinIO locally)
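The loop below is a rough sketch of that consume/buffer/flush cycle using redis-py. The stream, group, and consumer names, the thresholds, and the write_parquet stub are illustrative assumptions, not the actual code in consumer.py and buffer.py.

```python
import time
import redis

STREAM = "events"          # assumed stream key
GROUP = "writer"           # assumed consumer group name
CONSUMER = "writer-1"      # assumed consumer name
FLUSH_COUNT = 500          # assumed count threshold
FLUSH_SECONDS = 30         # assumed time threshold

def write_parquet(rows: list[dict]) -> None:
    """Placeholder for parquet_writer.py (sketched under Architecture)."""
    ...

r = redis.Redis.from_url("redis://localhost:6379", decode_responses=True)

# Ensure the consumer group exists before reading from it.
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

buffer, pending_ids = [], []
last_flush = time.monotonic()

while True:
    # Fetch new entries routed to this consumer group, blocking up to 1 second.
    resp = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=100, block=1000)
    for _stream, messages in resp or []:
        for msg_id, fields in messages:
            buffer.append(fields)
            pending_ids.append(msg_id)

    # Flush when either threshold is met (count-based or time-based).
    if buffer and (len(buffer) >= FLUSH_COUNT
                   or time.monotonic() - last_flush >= FLUSH_SECONDS):
        write_parquet(buffer)
        # Acknowledge only after a successful write so unflushed events can be re-delivered.
        r.xack(STREAM, GROUP, *pending_ids)
        buffer, pending_ids = [], []
        last_flush = time.monotonic()
```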
Architecture
Redis Streams
|
| XREADGROUP
v
consumer.py # Reads events, dispatches to buffer
|
v
buffer.py # Time/count flush thresholds
|
v
parquet_writer.py # PyArrow -> Parquet -> R2
|
v
schemas.py # PyArrow table schemas for each entity
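As a rough illustration of how schemas.py and parquet_writer.py fit together, the sketch below defines a typed PyArrow schema and writes one batch to an S3-compatible endpoint. The schema fields, bucket, path layout, and endpoint are assumptions for the example, not the real definitions.

```python
import time
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# A typed schema per entity, as schemas.py would define (fields here are assumed).
ORDER_SCHEMA = pa.schema([
    ("order_id", pa.string()),
    ("amount_cents", pa.int64()),
    ("created_at", pa.timestamp("us", tz="UTC")),
])

def flush_to_r2(rows: list[dict]) -> None:
    """Convert buffered events to a Parquet file and upload it to R2/MinIO."""
    table = pa.Table.from_pylist(rows, schema=ORDER_SCHEMA)

    # R2 is S3-compatible, so PyArrow's S3 filesystem works with an endpoint override.
    r2 = fs.S3FileSystem(
        access_key="...",  # supplied by config.py in practice
        secret_key="...",
        endpoint_override="https://<account>.r2.cloudflarestorage.com",
    )
    path = f"orders/batch-{int(time.time())}.parquet"  # assumed key layout
    pq.write_table(table, f"my-bucket/{path}", filesystem=r2)
```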
Configuration
The Writer uses settings from config.py:
- R2/S3 credentials -- Access key, secret key, endpoint, bucket
- Redis connection -- URL for Redis Streams
- Buffer thresholds -- Time interval and record count for flush triggers
- Firebase -- For auth token verification (if needed)
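A hypothetical shape for those settings, read from environment variables; the variable names and defaults below are assumptions rather than the actual config.py.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    # R2/S3 credentials
    s3_access_key: str = os.environ.get("S3_ACCESS_KEY", "")
    s3_secret_key: str = os.environ.get("S3_SECRET_KEY", "")
    s3_endpoint: str = os.environ.get("S3_ENDPOINT", "http://localhost:9000")  # MinIO locally
    s3_bucket: str = os.environ.get("S3_BUCKET", "lakehouse")

    # Redis Streams connection
    redis_url: str = os.environ.get("REDIS_URL", "redis://localhost:6379")

    # Buffer flush thresholds
    flush_interval_seconds: int = int(os.environ.get("FLUSH_INTERVAL_SECONDS", "30"))
    flush_record_count: int = int(os.environ.get("FLUSH_RECORD_COUNT", "500"))

settings = Settings()
```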
Deployment
Runs as a single replica (smackz-lakehouse-writer) on the Control Plane. Only one replica is needed because:
- Redis Streams consumer groups deliver each event to only one consumer in the group, so a single writer processes the full stream without duplicates
- A single writer avoids write conflicts on Parquet files
- The workload is I/O-bound (reading streams, writing files), not CPU-bound
Backfill
For initial data loading or re-ingestion, the backfill/postgres_to_parquet.py script reads directly from the Yum PostgreSQL database and writes Parquet files to R2, bypassing the Redis Streams pipeline.
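A minimal sketch of that backfill path, assuming psycopg2 and PyArrow are available; the table, query, and output path are illustrative and not taken from the real script.

```python
import psycopg2
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

def backfill_orders(dsn: str, bucket: str, r2: fs.S3FileSystem) -> None:
    """Read rows straight from PostgreSQL and write them to R2 as Parquet."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT order_id, amount_cents, created_at FROM orders")
        columns = [desc[0] for desc in cur.description]
        rows = [dict(zip(columns, row)) for row in cur.fetchall()]

    # Schema is inferred here; the real pipeline would reuse the schemas.py definitions.
    table = pa.Table.from_pylist(rows)
    pq.write_table(table, f"{bucket}/backfill/orders.parquet", filesystem=r2)
```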
Key Files
- smackz-lakehouse/writer/main.py -- Entry point
- smackz-lakehouse/writer/consumer.py -- Redis Streams reader
- smackz-lakehouse/writer/buffer.py -- Flush logic
- smackz-lakehouse/writer/parquet_writer.py -- Parquet writing
- smackz-lakehouse/writer/schemas.py -- PyArrow table schemas
- smackz-lakehouse/backfill/postgres_to_parquet.py -- One-time backfill