Design a Real-Time E-commerce Notification System

Compare model answers for this System Design benchmark and review scores, judging comments, and related examples.

X f L

Contents

Task Overview

Benchmark Genres

System Design

Task Creator Model The task creator is randomly selected from top task-generation models of supported providers.

Google Gemini 2.5 Pro

Answering Models In this benchmark, models from the same provider as the task creator are excluded from answering.

Answer A OpenAI GPT-5 mini

Answer B Anthropic Claude Opus 4.6

Judge Models Judging uses exactly 3 judge models, excluding the answering models. At least 1 judge is selected from flagship models, lightweight models are not selected as judges, and the 3 judges come from 3 distinct providers.

OpenAI GPT-5.2 Anthropic Claude Sonnet 4.6 Google Gemini 2.5 Pro

Task Prompt

Show more ▼

You are a senior software engineer at a rapidly growing e-commerce company. Your task is to design a real-time notification system. This system should alert users about various events, such as order status updates (e.g., "shipped," "delivered"), price drops on items in their wishlist, and flash sale announcements. Design a high-level architecture for this system. Your design should address the following requirements: 1. **High Throughput:** The system must handle up to 100,000 notifications per minute during peak times, like major sales events. 2. **Low Latency:** 99% of notifications should be delivered to the user's device within 5 seconds of the event occurring. 3. **Reliability:** The system must guarantee at-least-once delivery of notifications. No critical notification (like an order update) should be lost. 4. **Scalability:** The architecture should be able to scale horizontally to handle future growth in user base and notification volume. 5. **Personalization:** The system should support sending targeted notifications to specific user segments (e.g., users interested in a particular product category). Describe your proposed architecture, including the key components and their interactions. Explain your choice of technologies (e.g., message queues, databases, push notification services). Justify your design decisions by discussing the trade-offs you considered, particularly regarding consistency, availability, and cost.

Task Context

The e-commerce platform has 50 million active users. Users can access the platform via a mobile app (iOS and Android) and a web browser. The existing infrastructure is primarily cloud-based (assume AWS, GCP, or Azure).

Judging Policy

Show more ▼

A high-quality response will present a clear, coherent, and well-reasoned system design. The evaluation will focus on: - **Architecture Coherence:** The proposed components (e.g., API gateway, message queue, worker services, database) should work together logically to meet the requirements. - **Scalability and Reliability:** The design must convincingly explain how it will handle high load and prevent data loss, using concepts like message queues, retries, and dead-letter queues. - **Technology Choices:** The justification for choosing specific technologies (e.g., Kafka vs. RabbitMQ, NoSQL vs. SQL database) should be sound and relevant to the problem's constraints. - **Trade-off Analysis:** The response should demonstrate an understanding of system design trade-offs, such as latency vs. reliability, or cost vs. performance. Vague statements without justification will be penalized. - **Completeness:** The design should address all the specified requirements, including throughput, latency, reliability, scalability, and personalization.

Answers

Answer A: OpenAI GPT-5 mini

Overview
I propose a horizontally scalable, streaming-first notification platform built from decoupled microservices and managed cloud services. Core ideas: use a durable event log for ingestion and replay (Kafka/Kinesis/Pub/Sub), stream processors to enrich and personalize events, a delivery pipeline that persists delivery state and retries for at-least-once guarantees, and device-specific connectors for push (APNs/FCM) and real-time web (WebSocket). The design meets 100k notifications/min (≈1.7k/sec) peaks, targets sub-5s delivery for 99% of messages, and supports personalization and reliable delivery.

High-le...

Show Full Answer ▼

High-level architecture (components and interactions)

Event Producers

Sources: Order service (order updates), Pricing service (price changes), Marketing/CRM (flash sales). Each service emits lightweight events to the ingestion layer whenever a relevant change occurs. Events include event_id, event_type, payload, timestamp, and metadata (user_ids or product_ids).

Ingestion / Durable Event Log

Managed partitioned log: Apache Kafka (self-managed or Confluent Cloud) or cloud equivalents (AWS Kinesis Data Streams, GCP Pub/Sub). Producers publish events to topic(s) organized by event type and partition key (user_id or product_id) to preserve ordering where required (e.g., order updates per order).
Why durable log: provides replayability, retention for retries, and backpressure smoothing.

Stream Processing / Enrichment Layer

Stateless/Stateful stream processors (Apache Flink, Kafka Streams, or managed Dataflow) subscribe to event topics to: validate events, enrich with user profile and preferences, join with product/segment data, and decide notification eligibility and priority (e.g., critical order update vs marketing).
Output: normalized Notification Tasks (task_id, user_id(s), payload, type, priority, ttl, dedup_key) published to a Notification Task topic.

Personalization & Segmentation

Personalization rules live in a service combining: feature store / profile DB (DynamoDB/Cassandra/Postgres + Redis cache for hot reads), and a rule engine or ML model. Stream processors call this service or use local cached lookups to determine targeted recipients and content variants.
For broad segmentation events (flash sale to segment), use precomputed segments stored in a fast store (Redis, Druid, or BigQuery/ElastiCache lookup) to expand to user lists or to apply filter logic within streaming jobs.

Delivery Orchestration / Fan-out

A Delivery Orchestrator service subscribes to Notification Task topic, evaluates device registrations, throttling rules, and fan-out strategy. For single-user notifications (order update) it creates a delivery job per device; for segment-based broadcast it fans out into many delivery jobs via a partitioned queue.
Delivery jobs are placed into persistent per-shard delivery queues (Kafka topics, Redis Streams, or SQS with FIFO for ordering where needed). Jobs include retry counters and idempotency/dedup keys.

Delivery Workers / Connectors

Stateless worker fleet autoscaled by queue lag. Each worker pulls jobs, attempts delivery via the connector appropriate for the device channel:
- Mobile push: FCM (Android) and APNs (iOS) using device tokens stored in Device Registry.
- Web/Browser: Web Push (VAPID) or persistent WebSocket connections (managed via a connection service like AWS API Gateway WebSocket or self-managed socket clusters behind ELB).
- Fallback channels: Email (SES/SendGrid) or SMS (Twilio) for critical undelivered notifications.
Workers persist delivery attempts (success/fail) to a Delivery Status store and emit completion or retry events to the log for monitoring and further retries.

Device Registry & User Preferences

Durable store of user_id -> devices (token, platform, last_seen, preferences, opt-in flags). Use DynamoDB/Cassandra for high write throughput; cache active devices in Redis for low-latency lookups.

Delivery State & Replayability

All notification tasks and delivery attempts logged in durable stores (Kafka + archival to S3) and a Delivery Status DB. This enables at-least-once delivery, auditing, and reconciliation. Unacked/failed deliveries are retried by a retry orchestrator with exponential backoff.

Monitoring, Observability, and SLA Enforcement

Metrics: ingestion rate, processing latency, queue lag, delivery success rate. Traces for path-level latency (OpenTelemetry), and alerts for SLA breaches. Dashboards to monitor 99p latency and per-channel failure rates.

Key design choices and justifications

Durable log (Kafka/Kinesis/PubSub): provides high throughput and replayability which is essential for at-least-once semantics and debugging. Partitioning by user_id/product_id preserves per-entity ordering (critical for order updates). Managed cloud streaming reduces operational overhead.
Stream processing (Flink/Kafka Streams/Dataflow): enables sub-second enrichment and segmentation close to ingestion. Stateful streaming supports windowed joins (e.g., match price drop events to wishlists) with low latency.
Device Registry in NoSQL + cache: DynamoDB/Cassandra scales horizontally for tens of millions of users; Redis handles hot-path lookups for low-latency decisions.
Delivery queues and autoscaled workers: decouples heavy fan-out from upstream processing, enabling graceful scaling during flash sales while controlling downstream push provider rate limits.
Push connectors (APNs/FCM) + WebSockets: push services minimize client polling and achieve low latency. WebSockets are used for real-time in-app/web delivery; if WebSocket unavailable, fall back to push or pull.
At-least-once, idempotency and deduplication: store task-level dedup_key and make delivery idempotent on the client or use SDK acknowledgements where possible. On server-side, dedupe by task_id/dedup_key before creating user-visible notifications.

Meeting the requirements

High Throughput: Partitioned log and autoscaling workers support horizontal scaling; Kafka/Kinesis can handle millions of events/sec with multiple partitions. 100k/min is modest for such systems; the architecture can scale to much higher volumes by adding partitions and workers.
Low Latency: Streaming enrichment and direct push/WebSocket connectors are low-latency paths. Targeting <5s 99p: keep processing pipeline under 1–2s (streaming jobs), delivery queue lag low via autoscaling workers, and use device caches to avoid DB lookups in the hot path.
Reliability: Durable event log + persisted delivery states + retry orchestrator ensures at-least-once delivery. For critical notifications (order updates), enable stronger guarantees: synchronous acknowledgement from downstream services and storing a confirmed delivery receipt (e.g., device ack or fallback channel confirmation). Use exponential backoff and escalation to alternate channels.
Scalability: All stateful pieces use horizontally scalable stores (Kafka, DynamoDB/Cassandra, Redis clusters). Workers and streamers are stateless containers that autoscale. Use partitioning and sharding for growth.
Personalization: Real-time joins in stream processors plus cached profile store enable per-user personalization. Precomputed segments accelerate large fan-outs (flash sales) by avoiding per-user evaluate on the fly.

Trade-offs (Consistency, Availability, Cost)

Consistency vs Availability: We favor availability and eventual consistency for marketing notifications (acceptable if a promo arrives slightly out of order). For order-critical events, we use stronger ordering and persistence (partitioning and synchronous persistence) to ensure correct ordering and reliable delivery. This hybrid approach balances user experience and system resilience.
At-least-once vs Exactly-once: Achieving exactly-once across the whole pipeline adds complexity and cost (transactional Kafka, two-phase commit, or end-to-end idempotency). We choose at-least-once with idempotent handlers and dedup keys to avoid duplicate-visible notifications while keeping system simpler and more scalable.
Managed services vs self-hosted: Managed streaming (Kinesis/PubSub) and push infrastructure reduce operational burden and increase availability but cost more. For speed to market and reliability at scale, managed services are recommended. If cost becomes dominant, consider self-hosted Kafka with strong automation.

Operational considerations

Rate limiting / throttling: Per-user and per-provider quotas to avoid overload and provider rate-limit rejections.
Backfill and replay: Use durable log to replay missed notifications (e.g., after outage) and to rebuild delivery status.
Testing: Chaos testing for partitions and provider failures, load testing to validate 99p latency under peak.
Security & privacy: Encrypt tokens at rest, rotate push credentials, honor opt-outs, and PII controls.

SLA and KPIs to monitor

Throughput (events/min), stream processing latency (p50/p90/p99), end-to-end delivery latency (p50/p90/p99), delivery success rate per channel, retry rate, queue lag, and consumer group lag.

Summary
This design uses a durable, partitioned event log plus stream processing for enrichment/personalization, a persistent delivery pipeline with autoscaling workers for fan-out, and channel-specific connectors (APNs/FCM/WebSocket) for fast delivery. It prioritizes availability and replayability to meet at-least-once delivery, uses caching and partitioning to meet low latency and high throughput, and relies on managed cloud components to reduce operational burden while allowing horizontal scaling to support future growth.

Result

Winning Votes

0 / 3

Average Score

Judge Models Anthropic Claude Sonnet 4.6

Total Score

Overall Comments

Answer A presents a solid, well-structured high-level architecture covering all major components: event producers, durable log, stream processing, personalization, delivery orchestration, and monitoring. The technology choices are reasonable and the justifications are present. However, the answer tends to be somewhat abstract and list-heavy, often presenting options (Kafka/Kinesis/PubSub, DynamoDB/Cassandra/Postgres) without committing to a specific design, which weakens the decisiveness of the architecture. The trade-off analysis exists but is relatively brief and surface-level. Latency estimates are mentioned but not quantified with concrete numbers. The personalization and segmentation discussion is adequate but lacks depth on the staleness vs. accuracy trade-off. The answer is competent but reads more like a survey of options than a definitive design.

View Score Details ▼

Architecture Quality

Weight 30%

Answer A covers all major architectural layers and their interactions logically. However, it frequently lists multiple technology options without committing to one, which reduces the clarity and decisiveness of the design. The fan-out strategy and delivery orchestration are described but at a high level without concrete implementation details like priority partitioning or dual-write patterns.

Completeness

Weight 20%

Answer A addresses all five requirements (throughput, latency, reliability, scalability, personalization) and includes operational considerations, security, and monitoring. However, some areas like the in-app notification offline handling and status tracking are underdeveloped compared to Answer B.

Trade-off Reasoning

Weight 20%

Answer A discusses consistency vs. availability, at-least-once vs. exactly-once, and managed vs. self-hosted trade-offs. However, the analysis is relatively brief and lacks specific quantification or concrete examples tied to the system's requirements. The segmentation trade-off is not discussed.

Scalability & Reliability

Weight 20%

Answer A correctly identifies horizontal scaling mechanisms (partitioning, autoscaling workers, NoSQL stores) and reliability mechanisms (durable log, retry orchestrator, dedup keys). However, it lacks specifics like replication factor settings, priority partitioning for critical notifications, or concrete retry policies.

Clarity

Weight 10%

Answer A is well-organized with clear section headers and bullet points. However, the frequent listing of multiple technology alternatives without selection makes it harder to follow as a definitive design. The writing is clear but the lack of commitment reduces overall clarity of intent.

Judge Models OpenAI GPT-5.2

Total Score

Overall Comments

Answer A presents a strong streaming-first architecture with a durable event log, stream processing for enrichment/personalization, a delivery orchestration and worker model, and good reliability mechanisms (retries, DLQ conceptually, dedup keys). It is broadly cloud-agnostic and hits all the major building blocks, with solid discussion of ordering, replay, autoscaling, and observability. However, some parts stay at a more generic level (e.g., segmentation expansion strategy and state stores are listed as options without a crisp choice), and a few claims are a bit hand-wavy (e.g., “synchronous acknowledgement” for critical notifications without specifying where/how this is achieved with third-party push systems). Trade-offs are present but less concrete than B’s (e.g., fewer specific operational/cost levers and fewer precise failure-handling workflows like offset commit rules/DLQ handling).

View Score Details ▼

Architecture Quality

Weight 30%

Well-structured event-log + stream processing + delivery pipeline with appropriate stores and connectors; some components are described as interchangeable options rather than a crisp reference design, and a few flows (critical notification stronger guarantees) are not fully nailed down.

Completeness

Weight 20%

Addresses throughput, latency, reliability, scalability, personalization, monitoring, and security; segmentation and delivery receipts/fallback are mentioned but not as concretely specified as in B.

Trade-off Reasoning

Weight 20%

Includes CAP posture, at-least-once vs exactly-once, and managed vs self-hosted; reasoning is sound but relatively high-level with fewer concrete alternatives and cost levers.

Scalability & Reliability

Weight 20%

Good use of partitioning, autoscaling workers, retries, dedup keys, and durable log; reliability story is strong but less explicit on consumer semantics (commit/ack) and DLQ handling details.

Clarity

Weight 10%

Clear narrative and component breakdown, but many technology choices are presented as lists of options, which slightly blurs the final architecture.

Judge Models Google Gemini 2.5 Pro

Total Score

Overall Comments

Answer A presents an outstanding, textbook-perfect design for a streaming-first notification system. Its architecture is clean, with a logical separation of concerns into distinct layers like ingestion, stream processing, and delivery orchestration. It correctly identifies key technologies and principles like durable logs, autoscaling workers, and idempotency. The answer is comprehensive and clearly written. Its main weakness, when compared to Answer B, is a slightly lower level of implementation detail and less specific trade-off analysis.

View Score Details ▼

Architecture Quality

Weight 30%

The architecture is excellent, featuring a clean, logical separation of concerns with distinct layers for ingestion, stream processing, and delivery orchestration. It represents a modern, best-practice approach to this problem.

Completeness

Weight 20%

The answer thoroughly addresses all five requirements from the prompt, providing solid solutions for throughput, latency, reliability, scalability, and personalization.

Trade-off Reasoning

Weight 20%

The trade-off analysis is very good, covering the standard, important considerations like consistency vs. availability and at-least-once vs. exactly-once. The reasoning is sound and well-justified.

Scalability & Reliability

Weight 20%

The design is fundamentally scalable and reliable, built on a durable event log, autoscaling stateless services, and horizontally scalable databases. The principles for achieving at-least-once delivery are clearly explained.

Clarity

Weight 10%

The response is very clearly written and well-structured. The use of numbered lists and distinct sections makes the complex architecture easy to follow and understand.

Answer B: Anthropic Claude Opus 4.6

Designing a Real-Time E-commerce Notification System

Overview

The proposed system is an event-driven, microservices-based architecture that ingests events from multiple upstream services, processes and routes them through a message broker, and delivers notifications to 50 million users across mobile and web channels with at-least-once delivery guarantees, sub-5-second latency at the 99th percentile, and the ability to sustain 100,000 notifications per minute during peak events. The design assumes a cloud-native deployment on AWS, though the principles translate directly to GCP or Azure equivalents.

High-Level...

Show Full Answer ▼

Designing a Real-Time E-commerce Notification System

Overview

High-Level Architecture

The architecture consists of six logical layers: Event Producers, Event Ingestion, Processing and Routing, User Preference and Segmentation, Delivery, and Observability.

Event Producers

Multiple upstream services generate events that trigger notifications. These include the Order Service (emitting order-placed, shipped, out-for-delivery, delivered events), the Pricing Service (emitting price-drop events when a watched item's price falls below a threshold), and the Campaign Service (emitting flash-sale-start events created by the marketing team). Each producer publishes a structured event envelope containing an event type, a timestamp, a payload, and an idempotency key to a central ingestion layer. Producers use an asynchronous, fire-and-forget publish call so their own latency is unaffected.

Event Ingestion Layer — Apache Kafka

All events flow into Apache Kafka (Amazon MSK in AWS). Kafka is chosen for several reasons. First, it provides durable, append-only logs with configurable replication (replication factor of 3, min.insync.replicas of 2), which means an acknowledged write survives the loss of any single broker, directly supporting the reliability requirement. Second, Kafka's partitioned topic model allows horizontal scaling: we partition the order-events topic by user ID so that all events for a given user are processed in order, while the overall topic can be spread across dozens of partitions to absorb peak throughput. At 100,000 notifications per minute, the average rate is roughly 1,667 events per second, well within a modest Kafka cluster's capacity, but partitioning lets us scale to 10x or more without architectural change. Third, Kafka decouples producers from consumers, so a temporary slowdown in the delivery layer does not back-pressure the order service.

Topic design: separate topics for order-events, price-events, and campaign-events. This allows independent consumer groups with different scaling and retry policies.

Trade-off considered: We evaluated Amazon SQS plus SNS as a simpler managed alternative. SQS would satisfy at-least-once delivery and is operationally lighter, but it lacks Kafka's ordering guarantees per partition and its replay capability, which is valuable for reprocessing after a bug. The added operational cost of Kafka is justified by the ordering and replay benefits at our scale.

Processing and Routing Layer — Notification Service

A horizontally scaled microservice, the Notification Service, consumes from Kafka topics. It performs several tasks. First, it resolves the target audience. For order events, the target is a single user (extracted from the payload). For price-drop events, it queries the Wishlist Service or a precomputed materialized view to find all users watching that item. For flash-sale events, it queries the Segmentation Service to resolve a user segment. Second, it enriches the notification by fetching the user's display name, the product image URL, and the localized message template from a Template Service backed by a Redis cache over a PostgreSQL store. Third, it applies user preferences. Each user has a preference document (stored in DynamoDB for low-latency key-value access) specifying opted-in channels (push, email, SMS, in-app), quiet hours, and category interests. The service filters and respects these preferences, which directly supports the personalization requirement. Fourth, it fans out delivery tasks. For each resolved user-channel pair, the service produces a delivery message onto a per-channel Kafka topic (push-delivery, email-delivery, sms-delivery, in-app-delivery). This fan-out step is critical: it converts one logical event into potentially millions of individual delivery tasks (for a flash sale targeting all users), and Kafka absorbs this burst.

Scaling: The Notification Service runs as a Kubernetes Deployment (Amazon EKS) with a Horizontal Pod Autoscaler keyed on Kafka consumer lag. During a flash sale, consumer lag spikes, and additional pods spin up within seconds.

Idempotency: Every delivery message carries the original event's idempotency key combined with the user ID. Downstream delivery workers use this composite key to deduplicate, ensuring that even though Kafka provides at-least-once semantics, users do not receive duplicate notifications.

User Preference and Segmentation

User preferences are stored in Amazon DynamoDB, partitioned by user ID. DynamoDB is chosen for its single-digit-millisecond read latency and seamless horizontal scaling, which is important because every single notification requires a preference lookup. A DAX (DynamoDB Accelerator) cache sits in front for hot keys.

For segmentation (targeting users interested in a category, or users in a geographic region), we maintain precomputed segment membership lists. A nightly batch job (Apache Spark on EMR) and a real-time stream processor (Kafka Streams or Flink) keep these lists updated in a separate DynamoDB table keyed by segment ID, with the value being a list of user IDs stored in S3 for very large segments. When the Campaign Service creates a flash sale targeting the electronics segment, the Notification Service reads the segment membership and iterates through it, producing delivery tasks in batches.

Trade-off: Precomputing segments trades storage and staleness (a user who changed preferences an hour ago might still be in the old segment) for query speed. Real-time segment evaluation at delivery time would be more accurate but would require scanning millions of user records under time pressure, violating the latency requirement. The hybrid approach (nightly batch plus real-time stream updates) keeps segments fresh within minutes.

Delivery Layer

Separate consumer groups handle each channel.

Push Notifications (Mobile): Workers consume from the push-delivery topic and call Firebase Cloud Messaging (FCM) for Android and Apple Push Notification Service (APNs) for iOS. We use FCM as the unified gateway for both platforms where possible. Workers batch requests to FCM's HTTP v1 API (up to 500 messages per batch call) to maximize throughput. Failed deliveries (invalid tokens, rate limits) are retried with exponential backoff. Permanently failed tokens (unregistered devices) trigger an asynchronous event to purge the token from the User Device Registry.

Web Push: Workers send Web Push Protocol messages using the VAPID standard. The user's push subscription (endpoint URL and keys) is stored in the User Device Registry (DynamoDB). This channel reuses the same push-delivery topic with a channel sub-type field.

In-App Notifications: For users who are currently online, we maintain persistent WebSocket connections via an API Gateway WebSocket API (AWS API Gateway WebSocket or a self-managed service using Socket.IO on EKS). A Connection Registry (Redis Cluster) maps user IDs to WebSocket connection IDs. The in-app delivery worker looks up the connection, and if the user is connected, pushes the notification in real time. If not connected, the notification is written to an In-App Inbox (DynamoDB table partitioned by user ID, sorted by timestamp) and delivered when the user next opens the app. This dual-write ensures no notification is lost even if the user is offline.

Email and SMS: Lower-priority channels. Workers consume from email-delivery and sms-delivery topics and call Amazon SES and Amazon SNS (SMS) respectively. These channels have higher acceptable latency (30 seconds to minutes), so they can be scaled more conservatively.

Delivery Guarantees: At-least-once delivery is achieved end-to-end. Kafka consumers commit offsets only after the downstream delivery API acknowledges receipt (or the message is persisted to the in-app inbox). If a worker crashes before committing, the message is redelivered. The idempotency key at each stage prevents user-visible duplicates.

Notification Status Tracking

A lightweight Status Service records the lifecycle of each notification: created, sent, delivered, read. Delivery receipts from FCM/APNs and read receipts from the client app are ingested via a Kafka topic and written to a time-series store (Amazon Timestream or ClickHouse) for analytics, and to DynamoDB for per-user status queries (so the app can show read/unread badges). This data also feeds retry logic: if a push notification is not acknowledged as delivered within 30 seconds, the system can fall back to email.

Observability

Prometheus and Grafana monitor Kafka consumer lag, delivery latency percentiles, error rates per channel, and throughput. Alerts fire if p99 delivery latency exceeds 4 seconds (giving a 1-second buffer before the 5-second SLA). Distributed tracing (OpenTelemetry with Jaeger) tracks an event from producer through to delivery, enabling rapid debugging. Structured logs are shipped to Amazon OpenSearch for search and audit.

Addressing the Requirements

High Throughput: Kafka's partitioned topics and horizontally scaled consumer groups handle 100K notifications per minute comfortably. The fan-out for large campaigns is absorbed by Kafka's buffering, and delivery workers auto-scale based on lag.

Low Latency: The critical path for a single-user notification (e.g., order shipped) is: producer publishes to Kafka (less than 50ms), Notification Service consumes and enriches (less than 100ms including DynamoDB and Redis lookups), delivery worker sends to FCM/APNs (less than 500ms typical). Total well under 5 seconds. For large fan-out campaigns, the system parallelizes across many delivery workers; Kafka partitioning ensures no single worker is a bottleneck.

Reliability: Kafka replication, consumer offset management, and idempotent delivery ensure at-least-once semantics. A dead-letter queue (DLQ) topic captures messages that fail after a configurable number of retries, and an ops team is alerted to investigate. Critical order notifications are flagged with a priority field and routed to a dedicated high-priority partition with its own consumer group, ensuring they are never starved by a flood of promotional notifications.

Scalability: Every component scales horizontally. Kafka partitions can be increased. Kubernetes pods auto-scale. DynamoDB scales on demand. There are no single points of failure or single-threaded bottlenecks.

Personalization: User preferences in DynamoDB control channel selection and quiet hours. Precomputed segments enable targeted campaigns. The Template Service supports per-locale, per-category message templates, and A/B test variants can be selected at enrichment time.

Key Trade-offs and Justifications

Consistency vs. Availability: The system favors availability (AP in CAP terms). If the preference store is momentarily unreachable, the Notification Service uses a cached or default preference rather than blocking. This means a user might occasionally receive a notification on a channel they just opted out of, but no notification is lost or delayed. For an e-commerce notification system, availability and low latency are more valuable than strict consistency of preferences.

At-Least-Once vs. Exactly-Once: True exactly-once delivery across external systems (FCM, APNs) is impractical. Instead, we guarantee at-least-once and use idempotency keys to suppress duplicates at the application level. This is simpler, cheaper, and sufficient.

Managed Services vs. Self-Hosted: We use managed Kafka (MSK), managed Kubernetes (EKS), and managed databases (DynamoDB) to reduce operational burden. The trade-off is higher per-unit cost compared to self-hosted, but the engineering time saved and the reliability of managed services justify this at our scale.

Push vs. Pull for In-App: WebSockets (push) provide the lowest latency for online users but require maintaining persistent connections and a connection registry. Polling would be simpler but would add seconds of latency and generate unnecessary load. The hybrid approach (WebSocket when online, inbox when offline) balances latency and resource usage.

Cost Considerations: The largest cost drivers are Kafka throughput (MSK pricing), DynamoDB read capacity for preference lookups, and FCM/APNs API calls (free for FCM, but infrastructure to call them at scale is not). To control costs, we batch FCM calls, use DynamoDB on-demand pricing (which is cost-effective for spiky workloads), and use spot instances for non-critical delivery workers (email, SMS).

Conclusion

This architecture provides a robust, scalable, and low-latency notification system suitable for a 50-million-user e-commerce platform. By leveraging Kafka as the central nervous system, DynamoDB for fast user data access, and channel-specific delivery workers that auto-scale independently, the system meets all stated requirements while remaining operationally manageable and cost-effective. The design is extensible: adding a new notification channel (e.g., WhatsApp) requires only a new delivery worker consuming from a new Kafka topic, with no changes to the upstream event producers or the core processing logic.

Result

#1 | Winner

Winning Votes

3 / 3

Average Score

Judge Models Anthropic Claude Sonnet 4.6

Total Score

Overall Comments

Answer B is a notably stronger response. It commits to specific technology choices (Kafka on MSK, DynamoDB with DAX, EKS, FCM/APNs, API Gateway WebSocket) and justifies each decision with concrete reasoning tied directly to the requirements. The latency budget is explicitly broken down (50ms Kafka publish, 100ms enrichment, 500ms FCM delivery), making the sub-5s SLA claim credible and verifiable. The trade-off analysis is more thorough and specific, covering CAP theorem implications, at-least-once vs. exactly-once, managed vs. self-hosted, and push vs. pull with clear justifications. The segmentation design (precomputed batch + real-time stream updates) is well-reasoned with an explicit acknowledgment of the staleness trade-off. The dead-letter queue, priority partitioning for critical notifications, and the dual-write in-app inbox pattern demonstrate deeper reliability thinking. The extensibility note at the end adds practical value. Overall, Answer B is more complete, more concrete, and demonstrates stronger system design reasoning.

View Score Details ▼

Architecture Quality

Weight 30%

Answer B presents a coherent, committed architecture with specific technology choices at each layer. The fan-out design, priority partitioning for critical notifications, dual-write in-app inbox, and connection registry for WebSockets demonstrate concrete architectural thinking. The components interact logically and the design is internally consistent.

Completeness

Weight 20%

Answer B addresses all five requirements with dedicated sections and concrete mechanisms. It additionally covers notification status tracking, delivery receipts, fallback logic (push to email if unacknowledged), and extensibility. The in-app offline handling via dual-write inbox is a notable completeness addition.

Trade-off Reasoning

Weight 20%

Answer B provides more thorough trade-off analysis including Kafka vs. SQS/SNS with specific reasoning, CAP theorem implications with a concrete example (user who just opted out), precomputed vs. real-time segmentation with staleness quantification, push vs. pull for in-app, and cost considerations with specific cost drivers identified. Each trade-off is tied directly to the system's constraints.

Scalability & Reliability

Weight 20%

Answer B specifies concrete reliability settings (replication factor 3, min.insync.replicas 2), priority partitioning for critical order notifications to prevent starvation, dead-letter queues with ops alerting, and offset commit timing tied to downstream acknowledgment. The HPA keyed on Kafka consumer lag is a concrete and appropriate scaling mechanism.

Clarity

Weight 10%

Answer B is clearly written with a logical flow from ingestion through delivery. The commitment to specific technologies makes the design easier to follow and evaluate. The latency budget breakdown and the explicit mapping of each component to requirements enhance clarity significantly.

Judge Models OpenAI GPT-5.2

Total Score

Overall Comments

Answer B provides a very coherent end-to-end architecture anchored on Kafka/MSK with clear topic design, consumer groups, offset commit strategy tied to downstream acknowledgements, DLQ, and explicit idempotency keys to control user-visible duplicates. It gives concrete personalization/segmentation mechanisms (DynamoDB+DAX for preferences, hybrid batch+stream segment maintenance, segment storage strategy), detailed delivery channel handling including online/offline in-app (WebSocket plus inbox), and a status/analytics pipeline. It also offers stronger trade-off analysis with specific alternatives (SQS/SNS vs Kafka, push vs pull, managed vs self-hosted) and cost controls (batching, on-demand capacity, spot for non-critical). Minor weaknesses include some potentially questionable implementation details (e.g., “FCM unified gateway for both platforms” is not universally accurate; segment membership as lists can be impractical without careful chunking/streaming), but overall it is more concrete and operationally actionable than A while meeting all requirements.

View Score Details ▼

Architecture Quality

Weight 30%

Very coherent layered design with clear Kafka topic/consumer group model, routing/fan-out, channel workers, online/offline in-app handling, and status tracking; overall more end-to-end and implementation-oriented.

Completeness

Weight 20%

Covers all requirements plus adds DLQ, priority isolation, status lifecycle, inbox for offline, and detailed observability; personalization and segmentation are described with update paths and storage choices.

Trade-off Reasoning

Weight 20%

Provides specific comparisons (Kafka vs SQS/SNS, push vs pull), explicit consistency/availability handling for preferences, and concrete cost considerations (batching, on-demand, spot), with clear justification.

Scalability & Reliability

Weight 20%

Strong at-least-once story with explicit offset commit after downstream ack/persist, DLQ, priority isolation, and horizontal scaling throughout; more actionable reliability mechanisms.

Clarity

Weight 10%

Well-organized, easy-to-follow flow with numbered layers and concrete technology mappings; explains interactions and responsibilities clearly.

Judge Models Google Gemini 2.5 Pro

Total Score

Overall Comments

Answer B provides an exceptionally detailed and practical system design. It builds upon a similar, robust event-driven architecture as Answer A but elevates it with more concrete technology choices (specifically within the AWS ecosystem), deeper implementation details (e.g., Kafka replication settings, latency budget breakdown), and more nuanced trade-off analyses (e.g., Kafka vs. SQS, push vs. pull for in-app). The design for handling in-app notifications (WebSocket with an inbox fallback) is particularly well-thought-out and robust. This is a top-tier response that demonstrates deep expertise.

View Score Details ▼

Architecture Quality

Weight 30%

The architecture is outstanding and highly practical. It's well-structured into logical layers and made concrete with specific service choices (MSK, EKS, DynamoDB). The flow from producer to delivery is clear and robust.

Completeness

Weight 20%

This answer is exceptionally complete. It addresses all requirements with significant detail, including a concrete latency budget breakdown and a very robust design for in-app notifications that handles both online and offline users.

Trade-off Reasoning

Weight 20%

The trade-off reasoning is a key strength of this answer. It goes beyond the standard points to include highly specific and well-argued comparisons, such as Kafka vs. SQS, precomputing segments, and a push vs. pull strategy for in-app messages.

Scalability & Reliability

Weight 20%

The explanation of scalability and reliability is excellent and detailed. It includes specifics like Kafka replication factors, the use of priority partitions for critical notifications, a clear idempotency strategy, and a well-defined DLQ process, making the reliability claims very credible.

Clarity

Weight 10%

The answer is exceptionally clear and well-organized. The use of specific service names (e.g., MSK, DAX) and concrete examples makes the abstract design principles tangible and easier to grasp for the reader.

Comparison Summary

Final rank order is determined by judge-wise rank aggregation (average rank + Borda tie-break). Average score is shown for reference.

Judges: 3

OpenAI GPT-5 mini

Winning Votes

0 / 3

Average Score

View this answer

Winner Anthropic Claude Opus 4.6

Winning Votes

3 / 3

Average Score

View this answer

View head-to-head record for this model pair

Judging Results

Judge Models Google Gemini 2.5 Pro

GPT-5 mini 90

Claude Opus 4.6 Winner 95

Why This Side Won

While both answers are of exceptional quality and propose very similar, industry-standard architectures, Answer B is the winner due to its superior depth and practicality. Answer B provides more concrete implementation details, such as specific Kafka configurations and a latency budget breakdown, which makes the design more tangible. Furthermore, its trade-off analysis is more specific and insightful, discussing choices like Kafka vs. SQS and the hybrid approach for in-app notifications. These details demonstrate a deeper level of consideration and make the proposed solution more convincing and robust.

Judge Models OpenAI GPT-5.2

GPT-5 mini 74

Claude Opus 4.6 Winner 85

Why This Side Won

Answer B wins because it is more complete and operationally specific: it clearly defines ingestion/topic strategy, offset-commit and DLQ behavior for at-least-once delivery, detailed per-channel delivery flows (including offline inbox), a status tracking subsystem, and concrete personalization/segmentation mechanics with explicit trade-offs and cost controls. Answer A is strong and correct at a high level, but is more option-oriented and less precise about key operational semantics and implementation details needed to justify latency/reliability guarantees under peak load.

Judge Models Anthropic Claude Sonnet 4.6

GPT-5 mini 71

Claude Opus 4.6 Winner 84

Why This Side Won

Answer B wins because it provides a more concrete, committed, and well-justified design. It quantifies the latency budget with specific numbers, commits to definitive technology choices with clear rationale, provides a more thorough and specific trade-off analysis (including CAP theorem, cost considerations, and push vs. pull), and demonstrates deeper reliability thinking through mechanisms like priority partitioning, dead-letter queues, and the dual-write in-app inbox pattern. Answer A, while covering similar ground, remains too abstract and presents multiple options without committing, which weakens the overall design quality and makes it harder to evaluate feasibility.

Design a Real-Time E-commerce Notification System

Task Overview

Task Prompt

Answers

Answer A: OpenAI GPT-5 mini

Answer B: Anthropic Claude Opus 4.6

Comparison Summary

Judging Results

Related Tasks

Summarize the History and Impact of the Printing Press

Eulogy for a Sentient Toaster

Creative Solutions for Household Food Waste Reduction

Advice for Dealing with a Critical Coworker

Implement a Token Bucket Rate Limiter Class

Summarize the History of the Suez Canal

Master Woodworker's Advice

Brainstorming Strategies for a Local Bookstore to Thrive

Related Links