Writer Service

The Writer is the ingestion component of the SMACKZ Lakehouse. It consumes events from Redis Streams, buffers them in memory, and writes Parquet files to Cloudflare R2.

How It Works

  1. Consume -- Reads events from Redis Streams using XREADGROUP (consumer group pattern)
  2. Buffer -- Accumulates events in memory until a flush threshold is met (time-based or count-based)
  3. Convert -- Uses PyArrow to convert buffered events into typed Parquet tables
  4. Write -- Uploads Parquet files to Cloudflare R2 (or MinIO locally)
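The consume-and-acknowledge half of this loop can be sketched with redis-py's consumer-group calls. This is illustrative only: the stream, group, and consumer names are assumptions, and the real consumer.py may structure its loop differently. In production, `client` would be a `redis.Redis` instance; here it is any object exposing `xreadgroup`/`xack`.

```python
def consume_batches(client, stream, group, consumer, count=100, block_ms=5000):
    """Yield batches of event field-dicts from a Redis Stream consumer group.

    Sketch of the XREADGROUP pattern; `client` needs redis-py-compatible
    xreadgroup/xack methods (e.g. redis.Redis.from_url(REDIS_URL)).
    """
    while True:
        # '>' requests entries never yet delivered to this consumer group
        resp = client.xreadgroup(group, consumer, {stream: ">"},
                                 count=count, block=block_ms)
        if not resp:
            return  # nothing arrived within the block window
        for _stream_name, entries in resp:
            entry_ids = [entry_id for entry_id, _fields in entries]
            yield [fields for _entry_id, fields in entries]
            # Acknowledge only after the batch has been handed downstream,
            # so unacked entries stay in the pending list for redelivery
            client.xack(stream, group, *entry_ids)
```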

Architecture

Redis Streams
     |
     | XREADGROUP
     v
consumer.py        # Reads events, dispatches to buffer
     |
     v
buffer.py          # Time/count flush thresholds
     |
     v
parquet_writer.py  # PyArrow -> Parquet -> R2
     |
     | uses
     v
schemas.py         # PyArrow table schemas for each entity
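The buffer stage's time/count flush thresholds can be sketched as a small class. This is a minimal illustration, not the actual buffer.py API; the class and method names are assumptions.

```python
import time


class EventBuffer:
    """Accumulates events in memory and signals a flush when either the
    record-count or the time threshold is met (sketch; names are illustrative)."""

    def __init__(self, max_records=5000, max_age_seconds=60.0):
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self._events = []
        self._first_event_at = None

    def add(self, event):
        if not self._events:
            # Start the age clock at the first buffered event
            self._first_event_at = time.monotonic()
        self._events.append(event)

    def should_flush(self):
        if not self._events:
            return False
        if len(self._events) >= self.max_records:
            return True  # count-based trigger
        return time.monotonic() - self._first_event_at >= self.max_age_seconds

    def drain(self):
        """Hand the buffered events to the writer and reset the buffer."""
        events, self._events = self._events, []
        self._first_event_at = None
        return events
```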

Configuration

The Writer uses settings from config.py:

  • R2/S3 credentials -- Access key, secret key, endpoint, bucket
  • Redis connection -- URL for Redis Streams
  • Buffer thresholds -- Time interval and record count for flush triggers
  • Firebase -- For auth token verification (if needed)
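A config loader of this shape might look like the following. The field and environment-variable names here are guesses for illustration, not the real config.py contents.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class WriterSettings:
    """Illustrative env-driven settings; the actual config.py fields may differ."""
    redis_url: str
    s3_endpoint: str
    s3_access_key: str
    s3_secret_key: str
    s3_bucket: str
    flush_interval_seconds: float
    flush_max_records: int

    @classmethod
    def from_env(cls):
        return cls(
            redis_url=os.environ.get("REDIS_URL", "redis://localhost:6379/0"),
            # A MinIO endpoint would be used locally instead of the R2 one
            s3_endpoint=os.environ.get("S3_ENDPOINT", "http://localhost:9000"),
            s3_access_key=os.environ.get("S3_ACCESS_KEY", ""),
            s3_secret_key=os.environ.get("S3_SECRET_KEY", ""),
            s3_bucket=os.environ.get("S3_BUCKET", "lakehouse"),
            flush_interval_seconds=float(os.environ.get("FLUSH_INTERVAL_SECONDS", "60")),
            flush_max_records=int(os.environ.get("FLUSH_MAX_RECORDS", "5000")),
        )
```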

Deployment

Runs as a single replica (smackz-lakehouse-writer) on Control Plane. Only one replica is needed because:

  • Redis Streams consumer groups deliver each event to exactly one consumer in the group (with at-least-once redelivery of unacknowledged events on failure)
  • A single writer avoids write conflicts on Parquet files
  • The workload is I/O-bound (reading streams, writing files), not CPU-bound

Backfill

For initial data loading or re-ingestion, the backfill/postgres_to_parquet.py script reads directly from the Yum PostgreSQL database and writes Parquet files to R2, bypassing the Redis Streams pipeline.

Key Files

  • smackz-lakehouse/writer/main.py -- Entry point
  • smackz-lakehouse/writer/consumer.py -- Redis Streams reader
  • smackz-lakehouse/writer/buffer.py -- Flush logic
  • smackz-lakehouse/writer/parquet_writer.py -- Parquet writing
  • smackz-lakehouse/writer/schemas.py -- PyArrow table schemas
  • smackz-lakehouse/backfill/postgres_to_parquet.py -- One-time backfill