Writer Service
The Writer is the ingestion component of the SMACKZ Lakehouse. It consumes events from Redis Streams, buffers them in memory, and writes Parquet files to Cloudflare R2.
How It Works
- Consume -- Reads events from Redis Streams using XREADGROUP (consumer group pattern)
- Buffer -- Accumulates events in memory until a flush threshold is met (time-based or count-based); a minimal loop sketch follows this list
- Convert -- Uses PyArrow to convert buffered events into typed Parquet tables
- Write -- Uploads Parquet files to Cloudflare R2 (or MinIO locally)
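The loop below is a rough sketch of that consume/buffer/flush cycle using redis-py. The stream, group, and consumer names, the thresholds, and the write_parquet stub are illustrative assumptions, not the actual code in consumer.py and buffer.py.

```python
import time
import redis

STREAM = "events"          # assumed stream key
GROUP = "writer"           # assumed consumer group name
CONSUMER = "writer-1"      # assumed consumer name
FLUSH_COUNT = 500          # assumed count threshold
FLUSH_SECONDS = 30         # assumed time threshold

def write_parquet(rows: list[dict]) -> None:
    """Placeholder for parquet_writer.py (sketched under Architecture)."""
    ...

r = redis.Redis.from_url("redis://localhost:6379", decode_responses=True)

# Ensure the consumer group exists before reading from it.
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

buffer, pending_ids = [], []
last_flush = time.monotonic()

while True:
    # Fetch new entries routed to this consumer group, blocking up to 1 second.
    resp = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=100, block=1000)
    for _stream, messages in resp or []:
        for msg_id, fields in messages:
            buffer.append(fields)
            pending_ids.append(msg_id)

    # Flush when either threshold is met (count-based or time-based).
    if buffer and (len(buffer) >= FLUSH_COUNT
                   or time.monotonic() - last_flush >= FLUSH_SECONDS):
        write_parquet(buffer)
        # Acknowledge only after a successful write so unflushed events can be re-delivered.
        r.xack(STREAM, GROUP, *pending_ids)
        buffer, pending_ids = [], []
        last_flush = time.monotonic()
```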
Architecture
Redis Streams
|
| XREADGROUP
v
consumer.py # Reads events, dispatches to buffer
|
v
buffer.py # Time/count flush thresholds
|
v
parquet_writer.py # PyArrow -> Parquet -> R2
|
v
schemas.py # PyArrow table schemas for each entity
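As a rough illustration of how schemas.py and parquet_writer.py fit together, the sketch below defines a typed PyArrow schema and writes one batch to an S3-compatible endpoint. The schema fields, bucket, path layout, and endpoint are assumptions for the example, not the real definitions.

```python
import time
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# A typed schema per entity, as schemas.py would define (fields here are assumed).
ORDER_SCHEMA = pa.schema([
    ("order_id", pa.string()),
    ("amount_cents", pa.int64()),
    ("created_at", pa.timestamp("us", tz="UTC")),
])

def flush_to_r2(rows: list[dict]) -> None:
    """Convert buffered events to a Parquet file and upload it to R2/MinIO."""
    table = pa.Table.from_pylist(rows, schema=ORDER_SCHEMA)

    # R2 is S3-compatible, so PyArrow's S3 filesystem works with an endpoint override.
    r2 = fs.S3FileSystem(
        access_key="...",  # supplied by config.py in practice
        secret_key="...",
        endpoint_override="https://<account>.r2.cloudflarestorage.com",
    )
    path = f"orders/batch-{int(time.time())}.parquet"  # assumed key layout
    pq.write_table(table, f"my-bucket/{path}", filesystem=r2)
```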
Configuration
The Writer uses settings from config.py:
- R2/S3 credentials -- Access key, secret key, endpoint, bucket
- Redis connection -- URL for Redis Streams
- Buffer thresholds -- Time interval and record count for flush triggers
- Firebase -- For auth token verification (if needed)
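A hypothetical shape for those settings, read from environment variables; the variable names and defaults below are assumptions rather than the actual config.py.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    # R2/S3 credentials
    s3_access_key: str = os.environ.get("S3_ACCESS_KEY", "")
    s3_secret_key: str = os.environ.get("S3_SECRET_KEY", "")
    s3_endpoint: str = os.environ.get("S3_ENDPOINT", "http://localhost:9000")  # MinIO locally
    s3_bucket: str = os.environ.get("S3_BUCKET", "lakehouse")

    # Redis Streams connection
    redis_url: str = os.environ.get("REDIS_URL", "redis://localhost:6379")

    # Buffer flush thresholds
    flush_interval_seconds: int = int(os.environ.get("FLUSH_INTERVAL_SECONDS", "30"))
    flush_record_count: int = int(os.environ.get("FLUSH_RECORD_COUNT", "500"))

settings = Settings()
```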
Deployment
Runs as a single replica (smackz-lakehouse-writer) on the Control Plane. Only one replica is needed because:
- Redis Streams consumer groups deliver each event to only one consumer in the group, so a single writer processes the full stream without duplicates
- A single writer avoids write conflicts on Parquet files
- The workload is I/O-bound (reading streams, writing files), not CPU-bound
Backfill
For initial data loading or re-ingestion, the backfill/postgres_to_parquet.py script reads directly from the Yum PostgreSQL database and writes Parquet files to R2, bypassing the Redis Streams pipeline.
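A minimal sketch of that backfill path, assuming psycopg2 and PyArrow are available; the table, query, and output path are illustrative and not taken from the real script.

```python
import psycopg2
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

def backfill_orders(dsn: str, bucket: str, r2: fs.S3FileSystem) -> None:
    """Read rows straight from PostgreSQL and write them to R2 as Parquet."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT order_id, amount_cents, created_at FROM orders")
        columns = [desc[0] for desc in cur.description]
        rows = [dict(zip(columns, row)) for row in cur.fetchall()]

    # Schema is inferred here; the real pipeline would reuse the schemas.py definitions.
    table = pa.Table.from_pylist(rows)
    pq.write_table(table, f"{bucket}/backfill/orders.parquet", filesystem=r2)
```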
Key Files
- smackz-lakehouse/writer/main.py -- Entry point
- smackz-lakehouse/writer/consumer.py -- Redis Streams reader
- smackz-lakehouse/writer/buffer.py -- Flush logic
- smackz-lakehouse/writer/parquet_writer.py -- Parquet writing
- smackz-lakehouse/writer/schemas.py -- PyArrow table schemas
- smackz-lakehouse/backfill/postgres_to_parquet.py -- One-time backfill