Async Inter-Service Communication
This page summarizes the architecture review documented in docs/ARCHITECTURE_ASYNC_MIGRATION.md. The review maps every inter-service call across the platform, identifies sync-vs-async anti-patterns, and proposes a phased migration to an async-first default.
Context -- The 504 Incident
A production 504 on QA login triggered this review. The root cause: yum-API's login flow synchronously awaited a Redis Streams adapter.create_customer command against the POS adapter. When the adapter's Postgres instance was hibernated, the command timed out after ~60s (retried silently inside the library), while the client's axios request gave up at ~5s, so the user saw a 504.
The comment at the call site read "best effort, non-blocking", yet the code used await. The immediate fix converted that one path to fire-and-forget. The deeper question: which other inter-service interactions share the same anti-pattern?
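The difference between the two call styles can be sketched with plain asyncio. This is a minimal illustration, not the actual yum-API code: `create_customer`, `login_sync`, and `login_fire_and_forget` are hypothetical names, and a sleep stands in for the Redis Streams round trip.

```python
import asyncio

async def create_customer(payload: dict) -> dict:
    """Stand-in for the adapter.create_customer command over Redis Streams."""
    await asyncio.sleep(0.05)  # simulated adapter latency
    return {"status": "ok", "customer": payload["email"]}

async def login_sync(payload: dict) -> str:
    # Anti-pattern: login blocks on the adapter round trip, so adapter
    # latency (or a hibernated Postgres) surfaces as a login timeout.
    await create_customer(payload)
    return "logged-in"

async def login_fire_and_forget(payload: dict) -> str:
    # Fix: schedule the command and return immediately; a reply event
    # or a reconciler later confirms the customer sync succeeded.
    asyncio.get_running_loop().create_task(create_customer(payload))
    return "logged-in"
```

In the second variant the login response no longer depends on the adapter being healthy, which is exactly what "best effort, non-blocking" implied.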
Current State Findings
The audit covered yum-API, POS Adapters, Loyalty, and Lakehouse. Key findings:
yum-API has ~37 inter-service call sites. Most menu.*, tax.*, order.* events are already fire-and-forget. However, 6 command call sites use synchronous request-reply (adapter.connect_provider, adapter.handle_callback, adapter.complete_onboarding, adapter.fetch_provider_locations, adapter.get_card_detail, adapter.create_customer). Three loyalty commands are also sync (calculate_best_offer, preview_points, and the earn/usage events).
POS Adapters have gold-standard reliability patterns on the pos:webhooks stream (idempotency via SHA-256 dedup, DLQ with exponential backoff, XAUTOCLAIM recovery). However, the smackz:commands and smackz:events streams lack these protections entirely.
Loyalty has 12 command handlers with no idempotency. The order.payment_confirmed and order.cancelled event consumers are Phase 2 stubs -- they log but do not award or reverse points.
Lakehouse is already fully async with proper at-least-once delivery. No changes required.
Gaps and Anti-Patterns
- No stream-layer idempotency on commands. Retries re-execute side effects (duplicate POS customers, duplicate payments).
- smackz:events silently drops handler failures. No DLQ, no retry. Errors are logged but messages are treated as processed.
- No standardized reply-event envelope. The one working example (yum.location.created) is bespoke.
- Loyalty Phase 2 incomplete. Point earn/reversal handlers are stubs.
- Sync commands in admin onboarding block on Square/Clover APIs that routinely exceed 5s p99.
- No reconcilers. If a reply event is lost, nothing catches the drift.
Target Architecture Principles
- Default to fire-and-forget events with at-least-once delivery, idempotency, DLQ, and a reconciler.
- Reply-events for confirmation flows. The caller publishes a command; the consumer emits <domain>.<action>.completed or .failed, correlated by correlation_id. The caller tracks state in a DB row updated when the reply arrives.
- Keep sync only for checkout hot paths under 500ms: get_card_detail, calculate_best_offer, preview_points.
- Reconcile, don't retry indefinitely. Every async flow gets a companion cron that scans for drift and republishes as needed.
- Idempotency + DLQ + Reconciler is the required triad. Each alone is insufficient.
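The reply-event principle above can be sketched as a small envelope builder. The `event`, `emitted_at`, and `payload` field names are assumptions for illustration; only correlation_id and the `<domain>.<action>.completed/.failed` naming come from the review.

```python
import uuid
from datetime import datetime, timezone

def make_reply_event(domain: str, action: str, correlation_id: str,
                     succeeded: bool, payload: dict) -> dict:
    """Build a <domain>.<action>.completed/.failed reply event (sketch)."""
    suffix = "completed" if succeeded else "failed"
    return {
        "event": f"{domain}.{action}.{suffix}",
        "correlation_id": correlation_id,  # caller joins this to its DB row
        "emitted_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
```

A standardized envelope like this is what replaces the bespoke yum.location.created shape.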
Migration Plan
Phase 1 -- Foundation
Extract the pos:webhooks gold-standard pattern (idempotency, DLQ, XAUTOCLAIM) into a shared module in shared-redis-streams. Apply it to all command and event streams. Add OpenTelemetry spans and Prometheus counters. Estimated effort: ~1.5 weeks. No breaking changes.
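The shape of the shared module can be sketched as a consumer wrapper combining SHA-256 dedup with retry-then-DLQ. This is an assumption-laden sketch, not the shared-redis-streams API: an in-memory set stands in for Redis dedup keys and a list stands in for the DLQ stream, and `StreamConsumer` is a hypothetical name.

```python
import hashlib
import json

class StreamConsumer:
    """Sketch of the idempotency + DLQ wrapper extracted from pos:webhooks."""

    def __init__(self, handler, max_attempts: int = 3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.seen: set[str] = set()   # stands in for Redis SETNX dedup keys
        self.dlq: list[dict] = []     # stands in for the DLQ stream

    def _dedup_key(self, message: dict) -> str:
        raw = json.dumps(message, sort_keys=True).encode()
        return hashlib.sha256(raw).hexdigest()

    def process(self, message: dict) -> str:
        key = self._dedup_key(message)
        if key in self.seen:
            return "duplicate"        # side effects already ran; skip
        for _attempt in range(self.max_attempts):
            try:
                self.handler(message)
                self.seen.add(key)
                return "processed"
            except Exception:
                continue              # real code: exponential backoff here
        self.dlq.append(message)      # retries exhausted -> dead-letter
        return "dead-lettered"
```

Applied uniformly, this closes the smackz:commands and smackz:events gap where failures are currently dropped.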
Phase 2 -- Fire-and-Forget Conversions
Convert remaining adapter.create_customer callers from sync to fire-and-forget. Implement the posCustomerSynced flag so yum tracks sync state via reply events. Estimated effort: ~3-4 days. No breaking changes.
Phase 3 -- Reply-Event Conversions for Admin Wizard (SSE)
Replace sync commands in the admin onboarding wizard with async flows using SSE for progress updates. Admin posts a request, yum creates a job row and returns immediately, the adapter processes asynchronously and emits reply events, yum broadcasts progress via Redis pub/sub to SSE clients. Estimated effort: ~2 weeks. Breaking change: Admin wizard UX (coordinated release). Feature-flagged per tenant via ADAPTER_ASYNC_ONBOARDING.
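The Phase 3 flow can be sketched as a tiny job-state machine. Everything here is illustrative: `JOBS`, `start_onboarding`, `on_reply_event`, and the event field names are hypothetical, and an in-memory dict stands in for the job table and the Redis pub/sub fan-out to SSE clients.

```python
import uuid

JOBS: dict[str, dict] = {}  # stand-in for the job table

def start_onboarding(tenant: str) -> str:
    """yum-API side: create a job row and return its id immediately."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"tenant": tenant, "status": "pending", "progress": []}
    return job_id

def on_reply_event(job_id: str, event: dict) -> None:
    """Reply-event consumer: advance the job and record a progress line
    that would be broadcast to SSE clients via Redis pub/sub."""
    job = JOBS[job_id]
    job["progress"].append(event["step"])
    if event["event"].endswith(".failed"):
        job["status"] = "failed"
    elif event.get("final"):
        job["status"] = "completed"
```

The admin UI subscribes to the job's SSE channel instead of blocking on Square/Clover round trips.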
Phase 4 -- Reconcilers
Cron workers in yum-API that scan for drift: unsynced POS customers (15 min), stuck orders (5 min), menu drift (hourly), loyalty earn gaps (10 min). Estimated effort: ~1 week.
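The unsynced-customer scan can be sketched as a pure filter over customer rows; `find_unsynced_customers` and the row shape are assumptions, with the 15-minute cutoff taken from the schedule above.

```python
from datetime import datetime, timedelta, timezone

def find_unsynced_customers(rows, now=None, max_age=timedelta(minutes=15)):
    """Reconciler sketch: customers whose posCustomerSynced flag is still
    false past the cutoff, whose sync commands should be republished."""
    now = now or datetime.now(timezone.utc)
    return [r for r in rows
            if not r["posCustomerSynced"] and now - r["created_at"] > max_age]
```

The same scan-and-republish shape applies to the stuck-order, menu-drift, and loyalty-gap reconcilers on their own intervals.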
Phase 5 -- Complete Loyalty Phase 2
Replace event consumer stubs with real point earn/reversal logic. Add DB-layer idempotency via unique constraints on loyalty_ledger(order_id, ledger_type). Backfill orders paid during the stub period. Estimated effort: ~2 days.
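The DB-layer idempotency can be sketched with sqlite3 standing in for Postgres; the column set beyond loyalty_ledger(order_id, ledger_type) is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE loyalty_ledger (
        order_id    TEXT NOT NULL,
        ledger_type TEXT NOT NULL,
        points      INTEGER NOT NULL,
        UNIQUE (order_id, ledger_type)   -- the idempotency guarantee
    )
""")

def award_points(order_id: str, points: int) -> bool:
    """Idempotent earn: redelivered order.payment_confirmed events hit the
    unique constraint and become no-ops (Postgres: ON CONFLICT DO NOTHING)."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO loyalty_ledger VALUES (?, 'earn', ?)",
        (order_id, points),
    )
    return cur.rowcount == 1  # True only on first delivery
```

This makes the earn handler safe under at-least-once delivery without any stream-layer coordination.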
Key Risks
- Data loss if reply events are lost -- mitigated by Phase 4 reconcilers and transactional writes.
- Duplicate side effects from command retries -- mitigated by Phase 1 idempotency keyed on correlation_id.
- Retry storms from compounding retries -- mitigated by exponential backoff, cooldown timestamps, and circuit breakers.
- Observability gaps in async flows -- mitigated by OpenTelemetry correlation and per-stream dashboards.
Related Pages
- Service Map -- service topology
- Data Flow -- event flow diagrams
- POS Adapters -- POS adapter service details
- Loyalty -- loyalty domain