Grocery E-Commerce Platform
How do you keep product feeds consistent across 15 services without a distributed transaction? Event-driven with SQS, eventual consistency by design.
Summary
A production grocery e-commerce platform built as 15 independent microservices, Go for the latency-sensitive core with Node.js where it already existed, handling product catalog, orders, payments, inventory, analytics, delivery, and notifications. Developed for a client and live in production across multiple cities.
The core challenge was inter-service consistency without distributed transactions. Every order touches inventory, payment, delivery, and notifications — all independent services with no shared database. The answer was an event-driven architecture where services emit facts (order_created, payment_confirmed) and downstream services react asynchronously. Eventual consistency for inter-service state; strong consistency within each service boundary.
What shipped: 15 microservices behind HAProxy, gRPC for internal communication, AWS SQS for async event distribution, Razorpay payment integration, real-time order updates via WebSocket, image processing with libvips, structured logging with uber/zap.
Architecture Decisions
Why gRPC over REST for internal communication
The options considered: REST/JSON between services, gRPC with Protocol Buffers, message queue for all inter-service calls.
The constraint: Type safety across 15 services. With REST, a field rename in the product service silently breaks every caller at runtime, not at build time. At 15 services this becomes unmanageable.
The decision: gRPC with .proto contracts. Every service interface is defined in a Protocol Buffer schema. The compiler catches breaking changes before they reach production.
The trade-off: Tooling overhead. Every interface change requires regenerating stubs across affected services. The feedback loop is longer than editing a JSON payload.
What I'd change: Nothing on this decision. The type safety paid off repeatedly — caught two breaking changes at compile time that would have caused production incidents with REST.
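A contract of this kind looks roughly like the sketch below. The message and field names are illustrative, not the platform's actual schema; the point is that every caller compiles against generated stubs, so a renamed or removed field fails the build instead of failing in production.

```proto
syntax = "proto3";

package product.v1;

// Illustrative request/response shapes, not the real product schema.
message GetProductRequest {
  string product_id = 1;
}

message Product {
  string product_id = 1;
  string name       = 2;
  int64  price      = 3; // money as integer minor units
  int32  stock      = 4;
}

service ProductService {
  rpc GetProduct(GetProductRequest) returns (Product);
}
```

If `name` were renamed here, regenerating the stubs changes the generated accessor, and every Go caller still using the old identifier stops compiling. With REST/JSON the same rename typically surfaces as a missing field at runtime.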
Why AWS SQS over Kafka for event distribution
The options considered: Apache Kafka, AWS SQS, RabbitMQ, direct HTTP calls between services.
The constraint: No dedicated ops team. Kafka requires cluster management, partition tuning, consumer group coordination, and ongoing operational attention. At this team size, that cost is not justified.
The decision: SQS. Fully managed, scales automatically, integrates natively with Lambda, zero operational overhead.
The trade-off: No message replay, no consumer groups, no persistent event log. If a consumer fails, the message simply becomes visible again after the visibility timeout and is redelivered; once it exceeds the queue's maxReceiveCount it is dropped unless a dead-letter queue is configured to catch it, and there is no log to re-read later.
What I'd change: Add a dead-letter queue with CloudWatch alerting from day one. We added it after the first missed event in production. Every SQS queue should have a DLQ configured before it receives traffic.
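The wiring is small enough to set up on day one. A minimal sketch of the redrive-policy attribute, assuming an illustrative queue ARN and retry count; the policy string is what you pass as the `RedrivePolicy` attribute in `sqs.CreateQueue` or `sqs.SetQueueAttributes` with the AWS SDK:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// redrivePolicy builds the SQS RedrivePolicy attribute value. After
// maxReceiveCount failed receives, SQS moves the message to the DLQ
// instead of dropping it. The ARN used below is illustrative.
func redrivePolicy(dlqARN string, maxReceiveCount int) (string, error) {
	b, err := json.Marshal(map[string]string{
		"deadLetterTargetArn": dlqARN,
		"maxReceiveCount":     fmt.Sprint(maxReceiveCount),
	})
	return string(b), err
}

func main() {
	policy, _ := redrivePolicy("arn:aws:sqs:ap-south-1:123456789012:orders-dlq", 5)
	fmt.Println(policy)
	// Pair this with a CloudWatch alarm on the DLQ's
	// ApproximateNumberOfMessagesVisible metric so a non-empty DLQ
	// pages someone instead of silently accumulating.
}
```

With the alarm on the DLQ rather than the main queue, "a message failed repeatedly" becomes an explicit signal instead of a missed event discovered later.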
Why Go + Node.js hybrid instead of Go only
The options considered: Go for all services, Node.js for all services, hybrid based on service requirements.
The constraint: Notification and discount services had existing Node.js logic. Rewriting them in Go had no performance or reliability justification — they were not on any hot path.
The decision: Go for all latency-sensitive services (product, cart, order, user). Node.js where it already existed and performance was not the constraint.
The trade-off: Two runtimes to maintain. Docker images differ. The deployment pipeline must handle both build toolchains.
What I'd change: Nothing. Forcing Go everywhere would have been dogma, not engineering. The right language for each service is the one that costs least to maintain at acceptable performance.
Why shared MongoDB over polyglot persistence
The options considered: One MongoDB instance shared across services (logical separation by collection prefix), separate database per service, PostgreSQL for transactional services.
The constraint: Timeline and operational simplicity. A separate database per service is the microservices ideal but adds operational complexity — more connection strings, more backups, more monitoring endpoints.
The decision: Shared MongoDB with logical separation. Redis for hot data (sessions, cart, frequently accessed products).
The trade-off: Services are coupled at the data layer. A slow analytics query can degrade product service response times if they share the same MongoDB instance.
What I'd change: Separate the analytics service onto its own read replica from day one. Analytics queries are expensive and unpredictable — they should never compete with the write path for database resources.
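The Redis hot-data layer follows a standard cache-aside read path: try the cache, fall back to MongoDB on a miss, then populate the cache with a short TTL. A sketch under stated assumptions; in production the `Store` would wrap a Redis client (GET/SET with TTL), and the in-memory stand-in and product data here are purely illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrMiss is returned by a Store when the key is absent.
var ErrMiss = errors.New("cache miss")

// Store abstracts the hot-data cache. In production this would wrap
// a Redis client; here it is an interface so the pattern is testable.
type Store interface {
	Get(key string) (string, error)
	Set(key, val string, ttl time.Duration) error
}

// mapStore is an in-memory stand-in for Redis, for illustration only.
type mapStore struct{ m map[string]string }

func (s *mapStore) Get(key string) (string, error) {
	v, ok := s.m[key]
	if !ok {
		return "", ErrMiss
	}
	return v, nil
}

func (s *mapStore) Set(key, val string, _ time.Duration) error {
	s.m[key] = val
	return nil
}

// getProduct is the cache-aside read path: cache first, database on a
// miss, then write back with a short TTL to bound staleness.
func getProduct(cache Store, loadFromDB func(id string) (string, error), id string) (string, error) {
	key := "product:" + id
	if v, err := cache.Get(key); err == nil {
		return v, nil
	}
	v, err := loadFromDB(id)
	if err != nil {
		return "", err
	}
	cache.Set(key, v, 5*time.Minute)
	return v, nil
}

func main() {
	cache := &mapStore{m: map[string]string{}}
	db := func(id string) (string, error) { return "Organic Bananas", nil }

	v, _ := getProduct(cache, db, "42")
	fmt.Println(v) // first read misses the cache and hits the database
	v, _ = getProduct(cache, db, "42")
	fmt.Println(v) // second read is served from the cache
}
```

The short TTL is the lever that makes the shared-MongoDB trade-off livable: hot product reads mostly stay off the database, so a heavy analytics query degrades fewer requests than the raw coupling would suggest.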