Client project · API · Jun
Retail E-Commerce Platform
Lambda functions were taking 6–9 seconds to cold start in production. Root cause: the wrong framework for the runtime. Here's the diagnosis and the fix.
Summary
A production retail e-commerce platform — 20+ Go microservices handling products, orders, payments, inventory, promotions, delivery (Shiprocket), and customer engagement. The platform included a real-time order tracking system via WebSocket and an event-driven state machine for order lifecycle management.
The core challenge that emerged in production: Lambda cold starts of 6–9 seconds. The root cause was architectural — Fiber, an excellent server framework, was being used inside Lambda functions. Fiber initialises a full HTTP server, connection pool, and 5–10MB of framework code on every cold start. In a persistent server this cost is paid once. In Lambda it is paid on every cold invocation.
What shipped: 20 microservices, gRPC internal communication, SNS pub/sub for notifications, WebSocket real-time order tracking, Razorpay and Shiprocket integrations, and a diagnosed and documented Lambda optimisation plan.
Architecture Decisions
The Fiber-in-Lambda mistake — diagnosis and fix
What happened: Fiber was chosen for Lambda functions because it was already used across server-based services. Consistent framework, familiar API, copy-paste deployment.
The cost: Fiber overhead: 5–10MB per cold start. MongoDB driver initialisation: 1–2s. Redis client initialisation: 0.5–1s. Total cold start on first invocation: 6–9 seconds. For a checkout flow, this is unacceptable.
Root cause: Fiber is designed for persistent servers. Connection pooling, full HTTP server initialisation, router setup — none of this pays off in a function whose execution environment serves one request at a time and can be frozen or recycled between invocations.
The fix: Replace Fiber in Lambda functions with minimal handlers — plain net/http or a lightweight adapter. Lazy database connections: connect on first request, reuse across warm invocations. Binary size dropped from 21MB to ~5MB.
What I'd change: Establish a rule at project start — Lambda functions get minimal handlers, servers get Fiber. The runtimes are different execution models and the framework choice is not portable between them. The cost of learning this in production was 6–9 seconds of user-visible latency.
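The lazy-connection half of the fix can be sketched in plain Go. This is an illustrative stand-in, not the project's code: `client`, `getDB`, and `handle` are hypothetical names, and a real deployment would pass `handle` to `lambda.Start` from `aws-lambda-go` and call `mongo.Connect` inside the `sync.Once`.

```go
package main

import (
	"fmt"
	"sync"
)

// client stands in for a MongoDB or Redis client; in real code this
// would be *mongo.Client.
type client struct{ connected bool }

var (
	db       *client
	dbOnce   sync.Once
	connects int // counts how many times the connection cost was actually paid
)

// getDB connects on first use and reuses the client across warm
// invocations — the core of the fix described above.
func getDB() *client {
	dbOnce.Do(func() {
		connects++
		db = &client{connected: true} // real code: mongo.Connect(ctx, opts)
	})
	return db
}

// handle is the minimal handler body; with aws-lambda-go it would be
// the function handed to lambda.Start.
func handle(orderID string) string {
	_ = getDB()
	return "processed " + orderID
}

func main() {
	// Simulate a cold start followed by a warm invocation.
	fmt.Println(handle("ord-1"))
	fmt.Println(handle("ord-2"))
	fmt.Println("connections opened:", connects) // 1, not 2
}
```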
Why event-driven state machine for orders
The options considered: A mutable status field on the order document (PENDING, PROCESSING, SHIPPED...), an event log with state derived on read, explicit event types with downstream subscribers.
The constraint: An order touches payment, inventory, delivery, invoicing, and notifications — 5 independent services. A mutable status field means each service either polls for changes or receives direct callbacks. Both approaches create tight coupling between the order service and every downstream service.
The decision: Explicit event types emitted to SNS: payment_confirmed, delivery_shipped, invoice_generation_queued, notification_sent. Each downstream service subscribes to the events it cares about. Adding a new consumer requires no changes to the order service.
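One way the explicit event types might look in Go. The `OrderEvent` shape, the topic naming, and the `Publisher` interface are assumptions for illustration, not the project's actual schema; real code would back `Publisher` with the AWS SNS client.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OrderEvent is a hypothetical shape for events such as
// payment_confirmed or delivery_shipped.
type OrderEvent struct {
	Type    string `json:"type"`
	OrderID string `json:"order_id"`
}

// Publisher abstracts the SNS client so the order service never knows
// who consumes an event; adding a consumer touches only subscriptions.
type Publisher interface {
	Publish(topic string, msg []byte) error
}

// EmitOrderEvent serialises the event and publishes it to a topic
// derived from the event type (the naming scheme is illustrative).
func EmitOrderEvent(p Publisher, ev OrderEvent) error {
	b, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	return p.Publish("orders-"+ev.Type, b)
}

// memPublisher is an in-memory stand-in for the demo below.
type memPublisher struct{ topics map[string][][]byte }

func (m *memPublisher) Publish(topic string, msg []byte) error {
	if m.topics == nil {
		m.topics = map[string][][]byte{}
	}
	m.topics[topic] = append(m.topics[topic], msg)
	return nil
}

func main() {
	p := &memPublisher{}
	EmitOrderEvent(p, OrderEvent{Type: "payment_confirmed", OrderID: "ord-42"})
	fmt.Println(len(p.topics["orders-payment_confirmed"])) // 1
}
```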
The trade-off: Current order state must be derived from the event log or maintained in a separate read model. Debugging requires tracing events across services rather than reading a single document.
What I'd change: Add a materialised view service that consumes all order events and maintains a current-state read model from day one. We relied on the order service for state queries, which created unnecessary load on a service that should only handle writes.
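A minimal sketch of that materialised view, assuming a simple event-type-to-status mapping; the mapping and struct shapes are illustrative, and the statuses follow the PENDING/PROCESSING/SHIPPED values mentioned earlier.

```go
package main

import "fmt"

// Event mirrors the order lifecycle events; the struct shape is an
// assumption for illustration.
type Event struct {
	Type    string
	OrderID string
}

// statusFor maps an event type to the order status it implies
// (illustrative mapping).
var statusFor = map[string]string{
	"payment_confirmed": "PROCESSING",
	"delivery_shipped":  "SHIPPED",
}

// ReadModel is the materialised current-state view: it consumes every
// order event and answers status queries so the write-side order
// service doesn't have to.
type ReadModel struct{ status map[string]string }

func NewReadModel() *ReadModel {
	return &ReadModel{status: map[string]string{}}
}

// Apply folds one event into the current-state view.
func (r *ReadModel) Apply(ev Event) {
	if s, ok := statusFor[ev.Type]; ok {
		r.status[ev.OrderID] = s
	}
}

// Status answers a read query without touching the order service.
func (r *ReadModel) Status(orderID string) string {
	return r.status[orderID]
}

func main() {
	rm := NewReadModel()
	rm.Apply(Event{Type: "payment_confirmed", OrderID: "ord-7"})
	rm.Apply(Event{Type: "delivery_shipped", OrderID: "ord-7"})
	fmt.Println(rm.Status("ord-7")) // SHIPPED
}
```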
Why SNS over SQS for notification fan-out
The options considered: SQS (point-to-point queue), SNS (pub/sub fan-out), direct Lambda invocations from the order service.
The constraint: A single order event needs to trigger multiple independent consumers: notification service, analytics service, inventory service. SQS is point-to-point — one queue, one consumer per message.
The decision: SNS topic per event type. Each interested service subscribes with its own SQS queue behind the SNS subscription. Adding a new consumer requires no changes to the publisher.
The trade-off: At-least-once delivery. Each subscriber must handle duplicate events idempotently — a payment_confirmed event may arrive twice, and processing it twice must produce the same result as processing it once.
What I'd change: Document the idempotency requirement explicitly in the service contracts before any subscriber is built. We discovered duplicate-processing bugs late because the requirement was implicit.
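The idempotency requirement can be made concrete in code. A sketch, assuming each event carries a unique ID; in production the seen-set would live in a persistent store (a Redis SET or a conditional DynamoDB write), not process memory.

```go
package main

import "fmt"

// PaymentEvent carries a unique event ID; at-least-once delivery means
// the same ID can arrive more than once.
type PaymentEvent struct {
	ID      string
	OrderID string
	Amount  int
}

// Consumer credits an order at most once per event ID: the seen-set is
// the dedup store (in-memory here purely for illustration).
type Consumer struct {
	seen    map[string]bool
	credits map[string]int
}

func NewConsumer() *Consumer {
	return &Consumer{seen: map[string]bool{}, credits: map[string]int{}}
}

// Handle is idempotent: a redelivered event is dropped, so processing
// twice produces the same result as processing once.
func (c *Consumer) Handle(ev PaymentEvent) {
	if c.seen[ev.ID] { // duplicate delivery: drop it
		return
	}
	c.seen[ev.ID] = true
	c.credits[ev.OrderID] += ev.Amount
}

func main() {
	c := NewConsumer()
	ev := PaymentEvent{ID: "evt-1", OrderID: "ord-9", Amount: 500}
	c.Handle(ev)
	c.Handle(ev) // redelivered: must be a no-op
	fmt.Println(c.credits["ord-9"]) // 500, not 1000
}
```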
Why WebSocket for order tracking instead of polling
The options considered: Client polling on an interval, server-sent events (SSE), WebSocket.
The constraint: Order status changes (packed → shipped → delivered) need to be visible in near-real-time without page refresh. Polling every 5 seconds means up to 5 seconds of lag and wastes requests when nothing has changed.
The decision: WebSocket connection maintained per active user session. The notification service broadcasts order events to the connected client as they occur.
The trade-off: Persistent connections consume server resources. The system must handle reconnection gracefully and avoid missing events during a brief disconnection.
What I'd change: Add a catch-up mechanism — on reconnect, send missed events since the last known client state. Currently the client must refresh to see changes that occurred while disconnected.
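Such a catch-up mechanism can be sketched with per-event sequence numbers: the client remembers the last sequence it saw, and on reconnect the server replays everything after it. The `Tracker` name and the unbounded buffer are illustrative; production code would cap or expire the buffer.

```go
package main

import "fmt"

// TrackedEvent is a broadcast order update stamped with a
// monotonically increasing sequence number.
type TrackedEvent struct {
	Seq    int
	Status string
}

// Tracker buffers recent events so a reconnecting client can replay
// everything after the last sequence number it acknowledged.
type Tracker struct {
	seq    int
	buffer []TrackedEvent
}

// Broadcast stamps and buffers an update; a real implementation would
// also push it to currently connected WebSocket clients here.
func (t *Tracker) Broadcast(status string) {
	t.seq++
	t.buffer = append(t.buffer, TrackedEvent{Seq: t.seq, Status: status})
}

// CatchUp returns every event the client missed after lastSeen.
func (t *Tracker) CatchUp(lastSeen int) []TrackedEvent {
	var missed []TrackedEvent
	for _, ev := range t.buffer {
		if ev.Seq > lastSeen {
			missed = append(missed, ev)
		}
	}
	return missed
}

func main() {
	t := &Tracker{}
	t.Broadcast("packed")
	t.Broadcast("shipped") // client disconnects before this
	t.Broadcast("delivered")
	for _, ev := range t.CatchUp(1) { // client last saw seq 1
		fmt.Println(ev.Seq, ev.Status)
	}
}
```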