Orivel

Design a Scalable Real-Time Notification System

Compare model answers for this System Design benchmark and review scores, judging comments, and related examples.


Benchmark Genre: System Design

Task Prompt


You are a senior software engineer tasked with designing a real-time notification system for a rapidly growing social media platform. The system must be able to deliver notifications (e.g., 'new like', 'new comment', 'friend request') to users who are currently online.

**System Requirements:**

* **Functional:**
  1. Users can subscribe to different notification topics (e.g., updates on their own posts, updates from specific friends).
  2. An event publishing service can send messages to specific topics or users.
  3. Subscribed, online users receive relevant notifications in real-time.
* **Non-Functional (Constraints):**
  1. **Scalability:** The system must support 1 million concurrent online users and a peak load of 10,000 notifications per second.
  2. **Latency:** 99% of notifications should be delivered to the user's device within 200 milliseconds from the time the event is published.
  3. **Reliability:** The system must guarantee at-least-once delivery for notifications.
  4. **Availability:** The system should have 99.95% uptime.

**Your Task:** Provide a high-level system design. Your response should cover:

1. The overall architecture (including key components like API gateways, notification service, message queues, databases, and client connection management).
2. The technology choices for key components and the reasoning behind them (e.g., WebSockets vs. Long Polling, Kafka vs. RabbitMQ, NoSQL vs. SQL).
3. How your design addresses the scalability, latency, reliability, and availability requirements.
4. A discussion of the potential trade-offs you made in your design.

Task Context

A social media company is experiencing explosive growth. Its existing notification system, which relies on periodic polling from the client app, is failing under the load: it is slow, inefficient, and causes a poor user experience. A new, robust, and scalable real-time system is needed to support the platform's future growth.

Judging Policy


A high-quality answer will present a clear, logical, and well-reasoned system design. The evaluation will focus on the following:

* **Completeness:** Does the design address all the specified functional and non-functional requirements?
* **Coherence:** Is the proposed architecture sound? Do the components work together logically?
* **Justification:** Are the technology choices (e.g., communication protocol, message broker, database) well-justified with respect to the problem's constraints?
* **Trade-off Analysis:** Does the response demonstrate an understanding of engineering trade-offs (e.g., consistency vs. availability, cost vs. performance)?
* **Practicality:** Is the proposed solution realistic and feasible for the given scale? The design should not be overly simplistic or unnecessarily complex.

Answers


Design: Scalable Real-Time Notification System

1. OVERALL ARCHITECTURE

The system is composed of several distinct layers that work together to ingest events, route them, and push them to connected clients with minimal latency.

Client Layer: Mobile and web clients maintain persistent WebSocket connections to a fleet of Connection Gateway servers. Each client authenticates on connect and registers its user ID with the gateway.

API Gateway / Load Balancer: A Layer-7 load balancer (e.g., AWS ALB or NGINX) sits in front of the Connection Gateways. It routes new WebSocket upgrade requests using consistent hashing on the user ID so that reconnects tend to land on the same gateway node, reducing state churn. It also exposes a REST endpoint for internal services to publish events.

Event Publishing Service: Internal platform services (like service, comment service, friend service) publish events to a central message broker. They call a thin Publishing API that validates the payload, enriches it with metadata (timestamp, notification ID), and writes it to the broker.

Message Broker (Kafka): Events are written to Kafka topics partitioned by user ID. This ensures ordered delivery per user and allows horizontal scaling of consumers. Kafka's durable log also provides the replay capability needed for at-least-once delivery guarantees.

Notification Fanout Service: A pool of stateless consumer workers reads from Kafka. For each event, the worker looks up the target user's subscription preferences in a fast cache (Redis), determines which users should receive the notification, and then routes the message to the correct Connection Gateway. For high-fanout events (e.g., a celebrity post), a separate async fanout job is triggered to avoid blocking the hot path.

Connection Gateway (WebSocket Servers): These are stateful servers that maintain the open WebSocket connections. Each gateway holds an in-memory map of user ID to connection handle.
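The consistent hashing that pins reconnecting users to the same gateway node can be sketched as follows. This is an illustrative Python sketch, not part of the answer's actual implementation; the class name, node names, and virtual-node count are hypothetical.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps user IDs to gateway nodes. The same user always routes to
    the same node, and removing a node only remaps that node's users."""

    def __init__(self, nodes, vnodes=100):
        # Each node appears at `vnodes` points on the ring for balance.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, user_id: str) -> str:
        # Walk clockwise to the first virtual node at or after the hash.
        h = self._hash(user_id)
        keys = [k for k, _ in self._ring]
        idx = bisect.bisect(keys, h) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["gw-1", "gw-2", "gw-3"])
assert ring.node_for("user-42") == ring.node_for("user-42")  # stable
assert ring.node_for("user-42") in {"gw-1", "gw-2", "gw-3"}
```

In practice the load balancer's built-in consistent-hash policy would do this rather than application code; the sketch only shows why reconnects tend to land on the same node.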
When a routed notification arrives (via an internal pub/sub channel like Redis Pub/Sub or a direct gRPC call), the gateway pushes it down the appropriate WebSocket connection. If the user is not connected, the gateway discards the push and relies on the persistence layer for later delivery.

Presence & Routing Service: A Redis cluster stores a mapping of user ID to gateway node ID with a short TTL, refreshed by heartbeats. The Fanout Service queries this to know which gateway to route a notification to. If no entry exists, the user is offline.

Notification Storage (Cassandra): All generated notifications are written to Cassandra, keyed by user ID and sorted by timestamp. This serves two purposes: it powers the notification inbox UI (users can scroll back through past notifications), and it enables at-least-once delivery — when a user comes online, the client fetches unread notifications from this store.

Delivery Acknowledgment: Clients send an ACK message over the WebSocket after receiving a notification. The gateway writes this ACK to Kafka, and a consumer marks the notification as delivered in Cassandra. Unacknowledged notifications older than a threshold are re-queued for delivery.

2. TECHNOLOGY CHOICES AND REASONING

WebSockets over Long Polling or SSE: WebSockets provide full-duplex, low-overhead persistent connections. Long polling wastes server resources with repeated HTTP handshakes and adds latency. Server-Sent Events (SSE) are unidirectional and less suitable for the ACK flow. At 1 million concurrent connections, WebSockets are the most resource-efficient choice. Each connection consumes roughly 10–50 KB of memory, making 1 million connections feasible across a moderately sized gateway fleet.

Kafka over RabbitMQ: Kafka is chosen for its high throughput (millions of messages per second), durable log storage, consumer group semantics, and the ability to replay messages.
RabbitMQ is a good broker for task queues, but its message model is less suited to the fan-out and replay patterns needed here. Kafka's partitioning by user ID also naturally parallelizes consumption. At 10,000 notifications per second, Kafka handles the load with significant headroom.

Redis for Presence and Pub/Sub: Redis provides sub-millisecond reads for the presence lookup (user ID → gateway node). Redis Pub/Sub or Redis Streams can be used for the internal channel between the Fanout Service and the Connection Gateways, adding minimal latency to the delivery path.

Cassandra over MySQL/PostgreSQL: Notification history is a write-heavy, time-series workload with high cardinality (one partition per user). Cassandra's wide-column model, tunable consistency, and linear horizontal scalability make it ideal. A relational database would require complex sharding and would struggle with the write throughput. Cassandra's eventual consistency is acceptable here since notification history is not a transactional record.

Stateless Fanout Workers: Keeping the fanout workers stateless allows them to scale horizontally by simply adding more Kafka consumer instances within the consumer group.

3. HOW THE DESIGN MEETS EACH REQUIREMENT

Scalability (1M concurrent users, 10K notifications/second): The Connection Gateways are horizontally scalable. A single modern server can hold 50,000–100,000 WebSocket connections, so 10–20 gateway nodes handle 1 million users. The load balancer distributes new connections. Kafka partitions scale the fanout workers. Cassandra scales writes linearly with nodes. Redis Cluster shards the presence data. No single component is a bottleneck.

Latency (P99 < 200ms): The critical path is: event published → Kafka write (~5ms) → fanout worker consumes and looks up presence in Redis (~5ms) → routes to gateway via Redis Pub/Sub or gRPC (~5ms) → gateway pushes over WebSocket (~10ms network). The total is well under 50ms in the median case.
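Summing the hot-path estimates above confirms the median-case claim. This is illustrative arithmetic over the answer's rough per-hop figures, not measured latencies:

```python
# Rough per-hop estimates from the critical-path description (ms).
budget_ms = {
    "kafka_write": 5,
    "fanout_consume_and_presence_lookup": 5,
    "route_to_gateway": 5,
    "websocket_push": 10,
}

median_total = sum(budget_ms.values())
assert median_total == 25       # well under the claimed 50ms median
assert median_total < 200       # comfortably inside the P99 budget
```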
The 200ms P99 budget accommodates Kafka consumer lag under peak load and network jitter. Keeping the fanout worker logic simple and the Redis lookups cached ensures the hot path stays fast.

Reliability (at-least-once delivery): Notifications are persisted to Cassandra before or concurrently with the push attempt. If the WebSocket push fails or the client does not ACK, the notification remains in the unread state in Cassandra. On reconnect, the client fetches unread notifications. Kafka's consumer offset commit is done only after the fanout worker has successfully routed the message, ensuring no event is silently dropped. This provides at-least-once semantics end to end.

Availability (99.95% uptime): All components are deployed in multiple availability zones. The load balancer, Kafka brokers, Redis Cluster nodes, Cassandra nodes, and fanout workers all run with N+1 or N+2 redundancy. Gateway node failures cause clients to reconnect (WebSocket reconnect logic with exponential backoff) and land on a healthy node within seconds. A Kafka replication factor of 3 ensures broker failures do not cause data loss. Cassandra's replication factor of 3 with quorum reads/writes tolerates node failures. This architecture comfortably achieves 99.95% uptime.

4. TRADE-OFFS

Complexity vs. Simplicity: This design has many moving parts — Kafka, Redis, Cassandra, WebSocket gateways, fanout workers, presence service. This is significantly more complex to operate than a simple polling system or a single-broker setup. The trade-off is justified by the scale requirements, but it demands a mature DevOps practice, good observability (distributed tracing, metrics per component), and on-call expertise.

At-Least-Once vs. Exactly-Once: Exactly-once delivery would require distributed transactions across Kafka, Cassandra, and the gateway, adding significant latency and complexity. At-least-once is chosen instead, meaning a user might occasionally see a duplicate notification.
This is handled on the client side by deduplicating on notification ID. For a social media notification (a like or comment), a duplicate is a minor UX annoyance, not a critical failure — an acceptable trade-off.

Stateful Gateways: The WebSocket gateways are stateful (they hold live connections). This makes rolling deployments and failure recovery slightly more complex than for a fully stateless service. Clients must implement reconnect logic. The trade-off is that stateful gateways are necessary for real-time push; there is no way to push to a client without a persistent connection.

Fanout at Read vs. Write Time: This design fans out at write time (the fanout worker pushes to each recipient's gateway immediately). For users with millions of followers, this creates a hot-path problem (a celebrity posting triggers millions of fanout operations). The mitigation is to use a separate high-fanout job queue for accounts above a follower threshold, processing them asynchronously. The trade-off is that high-follower users may experience slightly higher notification latency, which is generally acceptable.

Redis Pub/Sub Reliability: Redis Pub/Sub does not persist messages. If a gateway is briefly disconnected from Redis when a message is published, it misses that message. This is mitigated by the Cassandra persistence layer — the client will fetch the notification on reconnect — but it means the real-time push is best-effort, with the persistent store as the safety net. Using Redis Streams instead of Pub/Sub would add persistence at the cost of slightly more complexity.

Cost: Running Kafka, Redis Cluster, Cassandra, and a large WebSocket gateway fleet is expensive. A smaller platform might use a managed service like AWS SNS/SQS or Firebase Cloud Messaging to reduce operational burden. At the scale described (1M concurrent users), the control and performance of a custom stack justify the cost.
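The client-side deduplication on notification ID mentioned in the trade-offs could look like the following bounded cache. This is an illustrative sketch; the class name and capacity are hypothetical:

```python
from collections import OrderedDict

class Deduplicator:
    """Client-side dedup cache keyed by notification ID. A bounded LRU
    keeps memory constant under at-least-once redelivery."""

    def __init__(self, capacity: int = 1000):
        self._seen = OrderedDict()
        self._capacity = capacity

    def should_display(self, notification_id: str) -> bool:
        if notification_id in self._seen:
            self._seen.move_to_end(notification_id)
            return False  # redelivered duplicate: suppress
        self._seen[notification_id] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict least-recently seen
        return True

dedup = Deduplicator()
assert dedup.should_display("n-1") is True
assert dedup.should_display("n-1") is False  # duplicate suppressed
```

An eviction horizon this small is safe because duplicates arrive close together in time (retries and reconnect replays), not arbitrarily later.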

Result

#2

Winning Votes

1 / 3

Average Score

84

Total Score

79

Overall Comments

Answer A presents a well-structured and coherent system design that covers all major components and requirements. It provides clear technology choices with solid justifications, a well-articulated critical path latency analysis, and a thoughtful trade-off section. The architecture is sound with WebSocket gateways, Kafka, Redis presence, and Cassandra storage. The trade-off analysis is particularly strong, covering complexity, at-least-once vs exactly-once, stateful gateways, fanout strategies, Redis Pub/Sub reliability, and cost considerations. The writing is clear and well-organized. However, it lacks some operational details like capacity planning numbers, failure mode analysis, backpressure mechanisms, security considerations, and batching/coalescing strategies.


Architecture Quality

Weight 30%
80

Answer A presents a clean, well-structured architecture with clearly defined components and data flows. The critical path is well-articulated, and the interaction between components (Kafka -> Fanout Workers -> Redis Presence -> Gateway -> WebSocket) is logical and sound. The consistent hashing on user ID for load balancing is a nice detail.

Completeness

Weight 20%
75

Answer A covers all four required areas (architecture, technology choices, requirement mapping, trade-offs) thoroughly. However, it lacks capacity planning numbers, explicit failure mode analysis, security considerations, backpressure mechanisms, and batching strategies that would make the design more complete.

Trade-off Reasoning

Weight 20%
80

Answer A's trade-off section is one of its strongest aspects. It covers six distinct trade-offs with clear reasoning: complexity vs simplicity, at-least-once vs exactly-once, stateful gateways, fanout at read vs write time, Redis Pub/Sub reliability, and cost. Each trade-off is well-explained with practical implications. The Redis Pub/Sub reliability discussion is particularly insightful.

Scalability & Reliability

Weight 20%
75

Answer A addresses scalability and reliability requirements clearly, with good estimates for WebSocket connections per server (50-100k) and a clear critical path latency breakdown. The at-least-once delivery mechanism via Cassandra persistence and client ACKs is well-explained. However, it lacks explicit capacity planning numbers and failure mode analysis.

Clarity

Weight 10%
85

Answer A is exceptionally well-written with clear, concise prose. The structure flows logically from architecture to technology choices to requirement mapping to trade-offs. Each section is focused and easy to follow. The latency breakdown with specific millisecond estimates is particularly clear and effective.

Judge Model: OpenAI GPT-5.4

Total Score

86

Overall Comments

Answer A presents a coherent end-to-end design with clear component responsibilities, concrete data flow, and stronger linkage between requirements and implementation details. It gives specific choices such as Kafka, Redis, Cassandra, WebSockets, ACK flow, presence routing, and unread recovery, and it discusses practical concerns like high-fanout users, Redis Pub/Sub reliability, and duplicate handling. Its main weakness is that some guarantees are a bit loosely specified at the gateway-to-client path, and a few sizing claims are optimistic, but overall it is concrete, practical, and well argued.


Architecture Quality

Weight 30%
86

Strong end-to-end architecture with clear publish, fanout, presence, gateway, storage, and ACK flows. Components interact logically, and the routing path for online users is well defined. Minor weakness: internal routing via Redis Pub/Sub is acknowledged as lossy, leaving some ambiguity in the hot path reliability.

Completeness

Weight 20%
84

Covers architecture, technologies, requirements, and trade-offs well. It addresses online delivery, offline persistence, ACKs, availability, and high-fanout cases. Slightly less complete on observability, security, and operational controls than the other answer.

Trade-off Reasoning

Weight 20%
88

Trade-offs are specific and grounded in this design: at-least-once versus exactly-once, stateful gateways, write-time fanout versus high-fanout mitigation, and Redis Pub/Sub persistence trade-offs. The discussion is concrete and tied to user experience and operational cost.

Scalability & Reliability

Weight 20%
85

Scalability approach is convincing with partitioned Kafka, sharded Redis, scalable gateways, and Cassandra for writes. Reliability is thoughtfully handled with durable storage, ACKs, unread recovery, and multi-AZ deployment. Small concern: the real-time gateway delivery path relies on a best-effort mechanism before fallback recovery.

Clarity

Weight 10%
87

Clear structure and readable prose. The answer moves from architecture to choices, requirements, and trade-offs in a straightforward way, making it easy to follow the system behavior.

Total Score

88

Overall Comments

Answer A presents a very strong, clear, and correct system design. It follows a logical structure, makes sound technology choices with good justifications, and addresses all the core requirements of the prompt. Its primary strength is its clarity and conciseness. However, it lacks the exceptional depth and operational detail seen in Answer B, particularly concerning failure modes and advanced optimization strategies.


Architecture Quality

Weight 30%
88

The proposed architecture is excellent, featuring a standard and robust set of components (Kafka, Redis, Cassandra, WebSocket gateways). The data flow is logical and well-explained. It represents a solid, industry-standard solution.

Completeness

Weight 20%
85

The answer is very complete, addressing all four sections requested in the prompt thoroughly and effectively. It meets all the specified functional and non-functional requirements.

Trade-off Reasoning

Weight 20%
88

The trade-off analysis is strong and covers key decisions like at-least-once vs. exactly-once delivery and the stateful nature of gateways. The specific point about the reliability of Redis Pub/Sub is particularly insightful.

Scalability & Reliability

Weight 20%
85

The design clearly explains how each component scales horizontally and how at-least-once delivery is achieved. The reasoning is sound and directly addresses the NFRs.

Clarity

Weight 10%
95

The answer is exceptionally clear, concise, and well-structured. It follows the prompt's format exactly, making it very easy to read and digest the information.


High-level summary

I propose a horizontally scalable, low-latency real-time notification platform built around: (1) edge/gateway infrastructure to accept events and client connections, (2) a durable pub/sub backbone for routing events to interested subscribers, (3) connection managers / WebSocket workers that maintain live client connections and perform fan-out, and (4) a fast presence store and durable notification store for reliability and offline delivery. The system uses partitioning and replication to meet throughput and availability goals, with end-to-end instrumentation, backpressure, and idempotence to enforce at-least-once delivery and low latency.

1) Overall architecture (components and request flow)

Core components

- API Gateway / Ingress: Accepts event publishes (from internal services) and client connection upgrades. Exposes REST/gRPC endpoints for event publishers and handles authentication and rate limiting. Deployed behind global LBs and per-region endpoints.
- Connection Layer (WebSocket / gRPC streams): A fleet of stateless connection servers (WebSocket/HTTP2 or gRPC) that maintain persistent connections to clients. They perform lightweight routing and acknowledgment handling and forward subscription changes to the presence store.
- Presence & Routing Store: A low-latency key-value store (Redis cluster) that tracks which connection server(s) currently host each online user and which topics they subscribe to. Used to route notifications to the correct connection worker.
- Pub/Sub Backbone: A durable, partitioned message bus (Kafka or Apache Pulsar) used for distributing events from publishers to notification workers. Topics are partitioned by logical keys (user ID, topic ID) to ensure ordered processing per key and scalable throughput.
- Notification Service / Worker Pool: Consumers of the pub/sub topic that perform enrichment, filtering, and delivery routing.
Workers look up the presence store to find target connection servers and push delivery tasks into fast delivery paths.

- Delivery Layer / Fan-out Engine: The connection servers receive delivery requests (directly or via fast RPC) and push notifications over the persistent connection to the client. They handle per-connection flow control, batching, and ACKs from clients.
- Durable Notification Store (NoSQL): A write-optimized, replicated store (e.g., Cassandra / DynamoDB) to persist notifications for offline users, retries, and audit. Stores notification payloads, delivery attempts, timestamps, and TTLs.
- Deduplication & Idempotence Store: A small key-value store (Redis or RocksDB) to record recent message IDs for deduplication when at-least-once semantics cause duplicates.
- Monitoring & Control Plane: Metrics, tracing, SLO/alerting, circuit breakers, and throttling.

Typical flows

- Publish flow (event producer -> user):
  1) Publisher posts an event to the API Gateway (REST/gRPC).
  2) The Gateway writes the message to a pub/sub topic (partitioned by target topic/user ID).
  3) Notification workers consume events, enrich and resolve subscription lists (or query the subscription DB), consult the presence store to find online recipients, and for each online connection push to the appropriate connection server(s).
  4) Connection servers push over the WebSocket/gRPC stream to the client and await a lightweight ACK.
  5) If the user is offline (presence miss), write to the durable Notification Store for offline retrieval.
- Subscribe flow: Clients send subscribe/unsubscribe messages over their persistent connection. The connection server updates the presence store atomically so the worker routing sees updated subscriptions quickly.
- Recovery & replay: The durable pub/sub and Notification Store allow replay of missed events; connection servers re-register presence on reconnect.

2) Technology choices and rationale

- Client connection: WebSockets or HTTP/2 (gRPC) streams
  - Choice: WebSockets for broad client compatibility (mobile/web).
Consider HTTP/2 gRPC streams for internal/mobile apps that support them. Both keep persistent TCP/TLS connections to meet the <200ms latency goal.
  - Reason: Polling is too slow and inefficient. Long polling increases latency and resource usage. WebSockets provide push, low RTT, and the ability to deliver binary messages with small overhead.
- Pub/Sub backbone: Apache Kafka or Apache Pulsar
  - Choice: Kafka (managed, e.g., Confluent Cloud, or self-hosted), or Pulsar if multi-tenancy and geo-replication needs dominate.
  - Reason: Both provide partitioned, durable, high-throughput messaging. Kafka has mature tooling, strong throughput, predictable latency, and exactly-once semantics support in producers. Partitioning allows scaling to 10k msg/s easily. Pulsar offers built-in geo-replication if multi-region is required.
- Presence store & routing: Redis Cluster (in-memory) with replication
  - Reason: Sub-1ms reads/writes to map user -> connection servers, subscriptions, and connection metadata. Redis supports clustering, persistence (AOF/RDB), and the fast lookups needed to route events within the 200ms budget.
- Persistent notification store: Cassandra / DynamoDB (NoSQL)
  - Choice: Cassandra or a DynamoDB-like key-value store.
  - Reason: High write throughput, linear scalability, and configurable TTLs, ideal for storing many writes per second and serving offline reads. SQL systems struggle at this write scale and with horizontal sharding complexity.
- Connection servers & worker communication: gRPC/internal RPC + protobuf
  - Reason: Low-overhead binary messages, backpressure support, strong typing, and high performance.
- Load balancing & routing: L4 (TCP) load balancers for WebSockets plus an L7 LB for REST APIs. Use sticky routing via consistent hashing or session affinity to route reconnections to the same region.
- Client payload format: Compact binary (protobuf/flatbuffers) for smaller payloads and faster serialization.
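The presence mapping described above (user ID -> connection server, expiring unless refreshed by heartbeats) can be simulated in a few lines. This is an in-memory stand-in for the Redis cluster, not the production code; the class name and TTL value are hypothetical:

```python
import time

class PresenceStore:
    """In-memory stand-in for the Redis presence mapping:
    user_id -> connection server, expiring unless heartbeats refresh it."""

    def __init__(self, ttl_seconds: float = 30.0):
        self._ttl = ttl_seconds
        self._entries = {}  # user_id -> (server, expires_at)

    def heartbeat(self, user_id: str, server: str, now: float = None):
        now = time.time() if now is None else now
        self._entries[user_id] = (server, now + self._ttl)

    def lookup(self, user_id: str, now: float = None):
        now = time.time() if now is None else now
        entry = self._entries.get(user_id)
        if entry is None or entry[1] < now:
            return None  # offline: fall back to the durable store
        return entry[0]

store = PresenceStore(ttl_seconds=30)
store.heartbeat("user-42", "conn-7", now=1000.0)
assert store.lookup("user-42", now=1010.0) == "conn-7"
assert store.lookup("user-42", now=1031.0) is None  # heartbeat lapsed
```

In Redis itself this would be roughly a `SET presence:{user} {server} EX 30` on each heartbeat and a plain `GET` on lookup, so entries expire automatically without a sweeper.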
3) How the design meets non-functional requirements

- Scalability (1M concurrent users, 10k notif/s peak)
  - Stateless connection servers: Horizontal scale-out; each server handles N concurrent WebSocket connections. With modern machines (e.g., 100k sockets per machine with proper tuning), a few dozen to a few hundred servers handle 1M concurrent sockets. An autoscaling group plus container orchestration manages capacity.
  - Partitioned pub/sub (Kafka): Scale producers and consumers by adding partitions/workers. Partitioning by target user ID or topic ensures even load distribution. 10k msg/s is modest for an appropriately sized Kafka cluster.
  - Sharded presence store: Redis Cluster shards the user space so lookups remain O(1) and scale.
  - Partitioned durable store: Cassandra scales linearly by adding nodes for writes and reads.
- Latency (99% within 200ms)
  - Persistent connections eliminate TCP handshake overhead. Once established, the push from event publish to client is: API Gateway -> Kafka append (sub-ms to tens of ms depending on tuning) -> worker consume -> presence lookup (sub-ms) -> RPC to connection server -> push over WebSocket (sub-ms to a few ms). With careful placement (deploy workers and connection servers in the same region/AZ), network latency remains small. Tune batching for near-instant delivery (avoid batching delays), e.g., a low linger.ms on producers and low fetch-wait settings on consumers, to keep end-to-end latency low.
  - Edge/regional deployment: Place connection servers near clients (region-level) so last-mile latency is minimized.
  - Use binary serialization and small payloads to reduce serialization/network time.
- Reliability (at-least-once delivery)
  - Durable writes: Publishers write events to Kafka (durable); workers track offsets; messages remain durable until acknowledged by consumer processing.
  - At-least-once: If delivery fails (connection server crash, network), the worker can re-enqueue or retry because the pub/sub retains messages and the worker commits offsets only after successful enqueue/confirmation. Clients ACK receipts, but ACKs only confirm client acceptance; server-side persistence allows retries.
  - Deduplication: Since at-least-once can produce duplicates, include unique message IDs and use client-side dedupe or server-side dedupe with recent-ID caches to suppress duplicates.
  - Durable Notification Store: When the user is offline, or when persistent confirmation is required, write the notification to Cassandra before acknowledging the event so it can be delivered later.
- Availability (99.95% uptime)
  - Multi-AZ and multi-region deployment: Replicate critical components. Use Kafka with a replication factor >2, Redis with master-replica failover, and Cassandra with multiple replicas and tunable consistency.
  - Stateless frontends & autoscaling: If any connection server fails, clients reconnect to the next available server. Use health checks and fast failover.
  - Graceful degradation: If delivery overload occurs, accept events and persist them for later delivery rather than dropping them. Throttling and backpressure protect the system.
  - Observability & automation: Auto-restart, circuit breakers, and runbooks for operator intervention. SLOs and alerting tuned to maintain 99.95% availability.

4) Operational details and optimizations

- Partitioning keys & routing: Partition pub/sub by user ID for user-targeted messages and by topic ID for topic broadcasts. For fanouts to many users (e.g., a post liked by thousands), perform hierarchical fan-out: 1) determine recipients; 2) group recipients by connection server; 3) send a grouped delivery message to each connection server to avoid N individual RPC calls.
- Fan-out strategies
  - Fan-out on write (push): Preferred for real-time online delivery — workers push only to online connections found via the presence store.
  - Fan-out on read (pull): Use for large offline fanout or for infrequently online users; store the notification and let clients fetch it when online.
- Batching and coalescing: For high-volume similar events, coalesce multiple events into one compact notification when safe (e.g., “3 people liked your post”) to reduce load and improve UX.
- Backpressure and smoothing: If connection servers or the client cannot accept messages, apply backpressure: slow down consumers, buffer to the durable store, and retry. Implement per-client rate limits.
- Client ACK model: Use a lightweight ACK for successful receipt. If an ACK is not received within a timeout, retry delivery. Maintain delivery attempt counters and send to a dead-letter store after N attempts.
- Security and privacy: End-to-end auth at the gateway, verify publisher entitlement to publish, encrypt transport (TLS), and sanitize payloads.

5) Trade-offs and discussion

- Complexity vs Simplicity
  - Trade-off: A Kafka-based, multi-component system is operationally more complex than a simple push server but necessary for scale and reliability. Operational cost and engineering complexity increase (managing Kafka, the Redis cluster, and Cassandra), but in return provide durability, replay, and fanout control.
- At-least-once vs Exactly-once
  - Choice: At-least-once delivery with deduplication is chosen. Exactly-once end-to-end is very expensive (coordination across services and clients) and often unnecessary for notifications, where duplicates are tolerable. Deduplication caches and client-side idempotence provide practical mitigation.
- WebSockets vs HTTP/2 gRPC
  - Trade-off: WebSockets maximize compatibility but require more LB and connection tuning. gRPC offers better flow control and type safety for clients that support it. Supporting both adds complexity but gives the best coverage.
- Real-time push vs storing-first
  - Push-first is prioritized for online users to meet latency.
  However, store-first (persisting to the DB before delivery) adds durability at the cost of additional write latency. We strike a balance: write to Kafka (durable) quickly, and optionally persist to Cassandra depending on the reliability policy.
- Redis presence eventual consistency: Presence may be slightly stale in edge cases; a small window of false negatives/positives could cause a single missed notification, which will be recovered from the durable notification store. This design favors low-latency presence lookups over strictly synchronous global presence values.

6) Capacity planning & numbers (back-of-envelope)

- 1M concurrent users: If each connection server handles ~10k concurrent WebSockets, we need ~100 connection servers. Add a safety factor (2x) for peaks -> ~200 instances across regions.
- 10k notifications/s: A Kafka cluster with 10–50 partitions is sufficient depending on message size and replication; target around 5–10 brokers with replication factor 3. Consumer groups scale horizontally.
- Storage: At 10k notifications/s, ~864M notifications per day if every message is persisted; use TTLs and aggregation to reduce storage, and size Cassandra nodes accordingly.

7) Failure modes and mitigations

- Connection server crash: Clients reconnect to other servers; the presence store is updated on reconnect. Buffered messages in the worker are retried. Use sticky reconnection strategies to reduce state churn.
- Kafka broker failure: Partition replication ensures no data loss with replication factor >= 3 and ISR configured. Monitor under-replicated partitions.
- Redis failover: Use Redis Cluster with replicas and automated failover (Sentinel or a managed service). Rebuild presence from the connection servers after a large failover.
- Large fan-outs (viral events): Use hierarchical fan-out and batching; offload to background workers and persist notifications for later delivery if real-time delivery would breach the latency target.
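The back-of-envelope figures in section 6 can be sanity-checked with a few lines of arithmetic; the per-server connection capacity, the 2x safety factor, and the ~1 KB per-notification storage footprint are the stated or assumed inputs:

```python
# Sanity check for the section-6 capacity estimates.
CONCURRENT_USERS = 1_000_000
CONNS_PER_SERVER = 10_000      # assumed WebSocket capacity per connection server
SAFETY_FACTOR = 2              # headroom for peaks and rolling deploys

base_servers = CONCURRENT_USERS // CONNS_PER_SERVER   # 100 servers at steady state
peak_servers = base_servers * SAFETY_FACTOR           # 200 instances across regions

NOTIFS_PER_SEC = 10_000
SECONDS_PER_DAY = 86_400
notifs_per_day = NOTIFS_PER_SEC * SECONDS_PER_DAY     # 864,000,000 per day

# Rough storage if every message is persisted, assuming ~1 KB per
# notification (payload plus index overhead; an illustrative figure).
bytes_per_day = notifs_per_day * 1_024
print(base_servers, peak_servers, notifs_per_day, bytes_per_day / 1e12)
```

This confirms the answer's ~100 base servers, ~200 peak instances, and ~864M notifications per day, and shows why TTLs and aggregation matter: naive persistence approaches a terabyte of raw writes daily.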
Conclusion

This design uses persistent connections (WebSockets/gRPC), a durable partitioned pub/sub (Kafka/Pulsar), a low-latency presence store (Redis), and a scalable NoSQL store (Cassandra/DynamoDB) to meet the goals of 1M concurrent connections, 10k notifications/s, <200ms p99 latency, at-least-once reliability, and 99.95% availability. The main trade-offs are increased operational complexity and the need to handle duplicates. With careful partitioning, replication, monitoring, and per-component tuning (consumer configs, batching, backpressure), this architecture will scale and provide the low-latency, reliable real-time notifications the platform needs.

Result

#1 | Winner

Winning Votes: 2 / 3
Average Score: 87

Total Score: 82

Overall Comments

Answer B provides a comprehensive and thorough system design that goes beyond the core requirements. It covers all the same fundamental components as Answer A but adds significant depth in several areas: capacity planning with back-of-envelope calculations, explicit failure mode analysis with mitigations, backpressure and flow control mechanisms, security considerations, batching and coalescing strategies, hierarchical fanout for viral events, binary serialization choices, and a deduplication store as a separate component. The trade-off analysis is solid though slightly less focused than Answer A's. The answer is well-organized with clear sections, though the additional detail sometimes makes it slightly more verbose. The inclusion of gRPC as an alternative to WebSockets and the discussion of regional deployment add practical value.


Architecture Quality (Weight 30%): 85

Answer B presents a similarly sound architecture with the same core components but adds more depth. The inclusion of a dedicated deduplication store, explicit mention of backpressure mechanisms, hierarchical fanout for large fan-outs, and the distinction between L4 and L7 load balancing demonstrate more architectural sophistication. The publish and subscribe flows are clearly articulated.

Completeness (Weight 20%): 85

Answer B is notably more complete, covering all required areas plus capacity planning with back-of-envelope calculations, explicit failure modes and mitigations, security and privacy considerations, backpressure and flow control, batching and coalescing strategies, and a deduplication component. The additional sections on operational details and failure modes add significant completeness.

Trade-off Reasoning (Weight 20%): 75

Answer B covers trade-offs adequately but with slightly less depth per trade-off. It discusses complexity vs simplicity, at-least-once vs exactly-once, WebSockets vs gRPC, push vs store-first, and Redis presence consistency. The trade-offs are valid but some feel more surface-level compared to Answer A's more detailed analysis. The push-first vs store-first trade-off is a good addition not found in Answer A.

Scalability & Reliability (Weight 20%): 85

Answer B provides more thorough scalability and reliability coverage. It includes explicit capacity planning (100-200 connection servers, 5-10 Kafka brokers, storage estimates), a dedicated failure modes section covering multiple scenarios, backpressure mechanisms, and graceful degradation strategies. The connection server estimate of 10k per server is more conservative but includes a 2x safety factor, showing practical engineering judgment.

Clarity (Weight 10%): 75

Answer B is well-organized with clear section headers and logical flow. However, the additional detail and breadth sometimes make it more verbose and slightly harder to follow quickly. Some sections feel like they could be more concise. The numbered sections (1-7) provide good structure but the sheer volume of content slightly reduces readability compared to Answer A's more focused approach.

Judge Models: OpenAI GPT-5.4

Total Score: 81

Overall Comments

Answer B is comprehensive and organized, covering most required areas including architecture, technology options, operations, failure modes, and capacity planning. It shows good awareness of backpressure, batching, observability, and hierarchical fanout. However, it is less decisive in several key technology choices, mixes options instead of committing to a crisp design, and contains a notable inconsistency by calling connection servers stateless while also saying they maintain persistent client connections. Some reliability details are more generic than exact, which weakens the overall design precision.


Architecture Quality (Weight 30%): 78

Good layered architecture that includes most major components and operational concerns. However, the design is less crisp because it keeps multiple alternatives open and has an internal inconsistency in describing connection servers as stateless while they maintain live sockets.

Completeness (Weight 20%): 87

Very complete response covering architecture, technology rationale, reliability, availability, fanout strategies, failure modes, capacity planning, and operational details. It touches more auxiliary concerns such as security, backpressure, and monitoring.

Trade-off Reasoning (Weight 20%): 82

Shows solid awareness of engineering trade-offs such as push-first versus store-first, WebSockets versus gRPC, and at-least-once versus exactly-once. The reasoning is good but sometimes stays generic and less tightly connected to one final chosen design.

Scalability & Reliability (Weight 20%): 80

Strong discussion of scaling techniques, partitioning, backpressure, replay, and multi-AZ resilience. Reliability mechanisms are present, including dedupe and retries, but the exact commit and delivery semantics are described more abstractly, and the online delivery path is less concretely nailed down.

Clarity (Weight 10%): 81

Well organized with headings and bullet points, but somewhat verbose and occasionally diffuse. The inclusion of many options and side notes reduces decisiveness and makes the final design slightly harder to pin down.

Total Score: 97

Overall Comments

Answer B provides an exceptionally detailed and comprehensive system design that demonstrates a senior-level understanding of building and operating systems at scale. It not only covers all the requirements but goes significantly beyond the prompt by including sections on operational details, capacity planning, and failure modes. The level of detail, from discussing hierarchical fan-out to backpressure mechanisms, is outstanding. Its only minor drawback is that its density makes it slightly less concise than Answer A.


Architecture Quality (Weight 30%): 95

The architecture is exceptionally well-defined and slightly more detailed than A's. It explicitly includes a deduplication store and a monitoring/control plane, and considers alternatives like Pulsar and gRPC streams, showing a broader perspective on the problem space.

Completeness (Weight 20%): 100

This answer is exemplary in its completeness. It not only addresses all parts of the prompt but goes significantly beyond by adding highly relevant sections on operational details, capacity planning, and failure modes, which are critical for a real-world system of this scale.

Trade-off Reasoning (Weight 20%): 95

The trade-off discussion is excellent and well-integrated with the rest of the design. It covers the same key points as A but also adds nuances like WebSockets vs. gRPC and push-first vs. store-first strategies, linking them back to the overall system goals.

Scalability & Reliability (Weight 20%): 100

This answer demonstrates a masterful understanding of scalability and reliability. The dedicated sections on failure modes, mitigations, and operational details (like hierarchical fan-out and backpressure) provide a much deeper and more practical explanation of how the system would handle the required scale and remain resilient.

Clarity (Weight 10%): 92

The answer is very well-structured with clear headings and subheadings that help navigate the large amount of information. While extremely clear, its sheer depth and detail make it slightly denser and less immediately scannable than Answer A.

Comparison Summary

Final rank order is determined by judge-wise rank aggregation (average rank + Borda tie-break). Average score is shown for reference.

Judges: 3

Winning Votes: 1 / 3 | Average Score: 84

Winning Votes: 2 / 3 | Average Score: 87

Judging Results

Why This Side Won

Answer B is the clear winner due to its superior depth and comprehensiveness. While both answers propose excellent and viable architectures, Answer B demonstrates a more profound and practical understanding of real-world system design. Its inclusion of detailed sections on operational concerns, failure modes and mitigations, and capacity planning provides a much more robust and production-ready vision. These additional details, such as discussing hierarchical fan-out for the 'celebrity problem' and backpressure mechanisms, directly address the complexities of running such a system at the required scale, making it a more complete and expert-level response.

Judge Models: OpenAI GPT-5.4

Why This Side Won

Answer A wins because it is more coherent and concrete as a high-level system design. It makes firmer architectural decisions, ties them directly to the latency and at-least-once requirements, and explains the end-to-end delivery path more precisely. Answer B is broad and thoughtful, but its use of multiple alternatives and a few architectural inconsistencies make it slightly less practical and less internally consistent for this specific prompt.

Why This Side Won

Answer B wins because it provides meaningfully more comprehensive coverage of the system design. While both answers share the same core architecture and technology choices, Answer B adds substantial value through: (1) explicit capacity planning with back-of-envelope calculations for connection servers, Kafka partitions, and storage, (2) a dedicated failure modes section covering connection server crashes, Kafka broker failures, Redis failover, and viral events, (3) operational details like backpressure mechanisms, batching/coalescing, security considerations, and binary serialization choices, (4) a dedicated deduplication store component, and (5) hierarchical fanout strategy for large fan-outs. These additions demonstrate deeper practical engineering knowledge and make the design more actionable. Answer A's trade-off section is slightly more polished and focused, but Answer B's broader coverage across all dimensions gives it the edge overall.
