Async Inter-Service Communication

This page summarizes the architecture review documented in docs/ARCHITECTURE_ASYNC_MIGRATION.md. The review maps every inter-service call across the platform, identifies sync-vs-async anti-patterns, and proposes a phased migration to an async-first default.

Context -- The 504 Incident

A production 504 on QA login triggered this review. The root cause: yum-API's login flow synchronously awaited the reply to an adapter.create_customer command sent over Redis Streams to the POS adapter. When the adapter's Postgres was hibernated, the command only timed out after ~60s (the library retried silently), while the client's axios gave up at ~5s -- so the user saw a 504.

The code comment at the call site said "best effort, non-blocking", yet the call was awaited. The immediate fix converted that one path to fire-and-forget. The deeper question: which other inter-service interactions share the same anti-pattern?

Current State Findings

The audit covered yum-API, POS Adapters, Loyalty, and Lakehouse. Key findings:

yum-API has ~37 inter-service call sites. Most menu.*, tax.*, order.* events are already fire-and-forget. However, 6 command call sites use synchronous request-reply (adapter.connect_provider, adapter.handle_callback, adapter.complete_onboarding, adapter.fetch_provider_locations, adapter.get_card_detail, adapter.create_customer). Three loyalty commands are also sync (calculate_best_offer, preview_points, and the earn/usage events).

POS Adapters have gold-standard reliability patterns on the pos:webhooks stream (idempotency via SHA-256 dedup, DLQ with exponential backoff, XAUTOCLAIM recovery). However, the smackz:commands and smackz:events streams lack these protections entirely.
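
As an illustration of what content-hash dedup can look like, the sketch below derives a SHA-256 key from a canonicalized payload and checks it against a store. The key layout, the dedup_key and DedupStore names, and the in-memory set are assumptions; the actual pos:webhooks implementation may hash different fields. In Redis, the "first time" check would typically be a single SET with the NX and EX options.

```python
import hashlib
import json


def dedup_key(stream: str, payload: dict) -> str:
    """Derive a stable key: same logical payload -> same SHA-256 digest."""
    # Sorting keys and fixing separators makes the JSON byte-for-byte stable.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{stream}:{canonical}".encode("utf-8")).hexdigest()


class DedupStore:
    """In-memory stand-in for a Redis-backed dedup set (hypothetical)."""

    def __init__(self) -> None:
        self._seen = set()

    def first_time(self, key: str) -> bool:
        """Return True only the first time a key is presented."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


store = DedupStore()
key = dedup_key("pos:webhooks", {"order_id": "o-1", "status": "paid"})
if store.first_time(key):
    pass  # run the handler; a redelivery of the same payload is skipped
```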

Loyalty has 12 command handlers with no idempotency. The order.payment_confirmed and order.cancelled event consumers are Phase 2 stubs -- they log but do not award or reverse points.

Lakehouse is already fully async with proper at-least-once delivery. No changes required.

Gaps and Anti-Patterns

  1. No stream-layer idempotency on commands. Retries re-execute side effects (duplicate POS customers, duplicate payments).
  2. smackz:events silently drops handler failures. No DLQ, no retry. Errors are logged but messages are treated as processed.
  3. No standardized reply-event envelope. The one working example (yum.location.created) is bespoke.
  4. Loyalty Phase 2 incomplete. Point earn/reversal handlers are stubs.
  5. Sync commands in admin onboarding block on Square/Clover APIs whose p99 latency routinely exceeds 5s.
  6. No reconcilers. If a reply event is lost, nothing catches the drift.

Target Architecture Principles

  1. Default to fire-and-forget events with at-least-once delivery, idempotency, DLQ, and a reconciler.
  2. Reply-events for confirmation flows. Caller publishes a command, consumer emits <domain>.<action>.completed / .failed correlated by correlation_id. Caller tracks state in a DB row updated when the reply arrives.
  3. Keep sync only for checkout hot paths that must answer in under 500ms: get_card_detail, calculate_best_offer, preview_points.
  4. Reconcile, don't retry indefinitely. Every async flow gets a companion cron that scans for drift and republishes as needed.
  5. Idempotency + DLQ + Reconciler is the required triad. Each alone is insufficient.
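
Principle 2 can be sketched as a small envelope plus a tracker. This is one possible shape, not the proposed standard: the Command, ReplyEvent, and CommandTracker names, the "customer.create" command name, and the dict standing in for the DB table are all assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Command:
    name: str      # "<domain>.<action>", e.g. "customer.create" (hypothetical)
    payload: dict
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class ReplyEvent:
    type: str      # "<domain>.<action>.completed" or "<domain>.<action>.failed"
    correlation_id: str
    payload: dict


def make_reply(cmd: Command, ok: bool, payload: dict) -> ReplyEvent:
    """Consumer side: echo the correlation_id so the caller can match it."""
    suffix = "completed" if ok else "failed"
    return ReplyEvent(f"{cmd.name}.{suffix}", cmd.correlation_id, payload)


class CommandTracker:
    """Caller side: one row per in-flight command (dict stands in for a table)."""

    def __init__(self) -> None:
        self.rows = {}  # correlation_id -> "pending" | "completed" | "failed"

    def publish(self, cmd: Command) -> None:
        self.rows[cmd.correlation_id] = "pending"
        # ...the command would be appended to the stream here...

    def on_reply(self, event: ReplyEvent) -> bool:
        """Apply a reply once; unknown or duplicate replies are ignored."""
        if self.rows.get(event.correlation_id) != "pending":
            return False
        done = event.type.endswith(".completed")
        self.rows[event.correlation_id] = "completed" if done else "failed"
        return True
```

Rows left in "pending" past some deadline are exactly what a reconciler (principle 4) would scan for.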

Migration Plan

Phase 1 -- Foundation

Extract the pos:webhooks gold-standard pattern (idempotency, DLQ, XAUTOCLAIM) into a shared module in shared-redis-streams. Apply it to all command and event streams. Add OpenTelemetry spans and Prometheus counters. Estimated effort: ~1.5 weeks. No breaking changes.
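
A rough sketch of the handler wrapper such a shared module might expose is shown below. It is an assumption-laden outline, not the shared-redis-streams API: the process and backoff_delay names, the retry count, the base and cap values, and the in-memory seen set and DLQ list are all hypothetical. Acknowledgement and XAUTOCLAIM recovery of stuck pending entries are not modelled.

```python
import time
from typing import Callable


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return min(cap, base * 2 ** attempt)


def process(
    message_id: str,
    payload: dict,
    handler: Callable[[dict], None],
    seen: set,
    dlq: list,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run `handler` at most once per message_id; dead-letter after retries."""
    if message_id in seen:
        return "duplicate"  # idempotency: side effects already applied

    last_error = None
    for attempt in range(max_attempts):
        try:
            handler(payload)
        except Exception as exc:  # handler failures must not be dropped silently
            last_error = exc
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
        else:
            seen.add(message_id)  # mark done only after success
            return "processed"

    # Retries exhausted: park the message for inspection instead of losing it.
    dlq.append({"id": message_id, "payload": payload, "error": repr(last_error)})
    return "dead-lettered"
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays; a production version would more likely reschedule the message than block the consumer loop.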

Phase 2 -- Fire-and-Forget Conversions

Convert remaining adapter.create_customer callers from sync to fire-and-forget. Implement the posCustomerSynced flag so yum tracks sync state via reply events. Estimated effort: ~3-4 days. No breaking changes.

Phase 3 -- Reply-Event Conversions for Admin Wizard (SSE)

Replace sync commands in the admin onboarding wizard with async flows using SSE for progress updates. Admin posts a request, yum creates a job row and returns immediately, the adapter processes asynchronously and emits reply events, yum broadcasts progress via Redis pub/sub to SSE clients. Estimated effort: ~2 weeks. Breaking change: Admin wizard UX (coordinated release). Feature-flagged per tenant via ADAPTER_ASYNC_ONBOARDING.
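
The flow above might look roughly like the following. The OnboardingJobs class, its field names, and the "progress" event name are assumptions for illustration; only the SSE wire framing (an event line, a data line, then a blank line) comes from the SSE specification. The outbox list stands in for the Redis pub/sub channel that feeds SSE clients.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Serialize one Server-Sent Events frame: event line, data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


class OnboardingJobs:
    """Hypothetical job tracker; dict and list stand in for DB and pub/sub."""

    def __init__(self) -> None:
        self.jobs = {}
        self.outbox = []  # frames that would be pushed to SSE clients

    def start(self, job_id: str, tenant_id: str) -> dict:
        """Create the job row and return immediately, before the adapter runs."""
        self.jobs[job_id] = {"tenant_id": tenant_id, "status": "pending"}
        return {"job_id": job_id, "status": "pending"}

    def on_reply(self, job_id: str, event_type: str) -> None:
        """Apply an adapter reply event and broadcast the new status."""
        job = self.jobs[job_id]
        job["status"] = "completed" if event_type.endswith(".completed") else "failed"
        frame = format_sse("progress", {"job_id": job_id, "status": job["status"]})
        self.outbox.append(frame)
```

The admin client would then open an SSE connection for the job id and update the wizard as frames arrive, rather than holding an HTTP request open while Square or Clover responds.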

Phase 4 -- Reconcilers

Cron workers in yum-API that scan for drift: unsynced POS customers (15 min), stuck orders (5 min), menu drift (hourly), loyalty earn gaps (10 min). Estimated effort: ~1 week.
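
One way the unsynced-customer reconciler could work is sketched below. The row shape, the requested_at field, and the 15-minute grace period are assumptions (the review gives 15 minutes as the scan cadence, not necessarily as the staleness threshold); publish is a hypothetical callback that would append a command to the stream.

```python
from datetime import datetime, timedelta
from typing import Callable


def find_unsynced(customers: list, now: datetime, grace: timedelta) -> list:
    """Rows still unsynced after the grace period are treated as drift."""
    return [
        c for c in customers
        if not c["posCustomerSynced"] and now - c["requested_at"] >= grace
    ]


def reconcile(
    customers: list,
    now: datetime,
    publish: Callable[[dict], None],
    grace: timedelta = timedelta(minutes=15),
) -> int:
    """Republish create_customer for drifted rows; return how many were sent."""
    stale = find_unsynced(customers, now, grace)
    for customer in stale:
        publish({"command": "adapter.create_customer", "customer_id": customer["id"]})
        # Reset the timestamp so the next run waits a full grace period
        # before republishing again (avoids a republish storm).
        customer["requested_at"] = now
    return len(stale)
```

Republishing is only safe because the command side is expected to be idempotent; without that, each reconciler pass could create another duplicate POS customer.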

Phase 5 -- Complete Loyalty Phase 2

Replace event consumer stubs with real point earn/reversal logic. Add DB-layer idempotency via unique constraints on loyalty_ledger(order_id, ledger_type). Backfill orders paid during the stub period. Estimated effort: ~2 days.
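
The unique-constraint approach can be demonstrated with an in-memory SQLite database. SQLite is used here only so the sketch runs standalone; the points column, the ledger_type values, and the record_entry helper are assumptions. On Postgres the equivalent of INSERT OR IGNORE would be INSERT ... ON CONFLICT DO NOTHING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE loyalty_ledger (
        id          INTEGER PRIMARY KEY,
        order_id    TEXT    NOT NULL,
        ledger_type TEXT    NOT NULL,   -- e.g. 'earn' or 'reversal' (assumed values)
        points      INTEGER NOT NULL,
        UNIQUE (order_id, ledger_type)
    )
    """
)


def record_entry(db: sqlite3.Connection, order_id: str, ledger_type: str, points: int) -> bool:
    """Insert a ledger row; return False if this (order, type) was already recorded."""
    cur = db.execute(
        "INSERT OR IGNORE INTO loyalty_ledger (order_id, ledger_type, points) "
        "VALUES (?, ?, ?)",
        (order_id, ledger_type, points),
    )
    db.commit()
    return cur.rowcount == 1  # 0 rows affected means the constraint absorbed a duplicate


record_entry(conn, "order-1", "earn", 50)   # first delivery: row written
record_entry(conn, "order-1", "earn", 50)   # redelivery: ignored, balance unchanged
```

Because the constraint lives in the database, the same guard covers live event redelivery, reconciler republishes, and the backfill of orders paid during the stub period.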

Key Risks

  • Data loss if reply events are lost -- mitigated by Phase 4 reconcilers and transactional writes.
  • Duplicate side effects from command retries -- mitigated by Phase 1 idempotency keyed on correlation_id.
  • Retry storms from compounding retries -- mitigated by exponential backoff, cooldown timestamps, and circuit breakers.
  • Observability gaps in async flows -- mitigated by OpenTelemetry correlation and per-stream dashboards.
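
For the retry-storm risk, a circuit breaker with a cooldown timestamp can be sketched in a few lines. This is a generic illustration, not the design from the review: the CircuitBreaker class, the failure threshold, and the cooldown length are assumptions, and time is passed in explicitly to keep the example deterministic.

```python
class CircuitBreaker:
    """Stop calling a failing dependency until a cooldown timestamp passes."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0) -> None:
        self.threshold = threshold    # consecutive failures before opening
        self.cooldown = cooldown      # seconds to stay open
        self.failures = 0
        self.open_until = 0.0         # cooldown timestamp; 0.0 means closed

    def allow(self, now: float) -> bool:
        """Calls are allowed only once the cooldown timestamp has passed."""
        return now >= self.open_until

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            # Open the circuit and start counting afresh after the cooldown.
            self.open_until = now + self.cooldown
            self.failures = 0


breaker = CircuitBreaker(threshold=3, cooldown=30.0)
for _ in range(3):
    breaker.record_failure(now=0.0)
# breaker.allow(now=1.0) is False: callers skip the dependency instead of retrying.
# breaker.allow(now=30.0) is True: one call may probe whether it has recovered.
```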