Answer A: Anthropic Claude Opus 4.6
Real-Time Notification Service — High-Level System Design

================================================================================
0. BACK-OF-THE-ENVELOPE ESTIMATION
================================================================================

- 10M DAU × 20 notifications/user/day = 200M notifications/day
- 200M / 86,400s ≈ 2,315 notifications/second (average)
- Peak load (assume 5× average) ≈ 11,575 notifications/second
- Each notification payload ~1 KB → ~200 GB/day raw data ingestion
- Latency budget: < 2 seconds end-to-end (ingestion → delivery to device)

================================================================================
1. CORE ARCHITECTURE
================================================================================

The system follows an event-driven, microservices architecture with the following key components:

┌──────────────┐
│  Producers   │  (Post Service, Like Service, Comment Service, Follow Service, etc.)
└──────┬───────┘
       │  Notification Event (gRPC / async message)
       ▼
┌──────────────────┐
│  API Gateway /   │  Rate limiting, authentication, routing
│  Load Balancer   │  (e.g., AWS ALB / Envoy / Kong)
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  Notification    │  Stateless service (horizontally scalable)
│  Service (NS)    │  - Validates & deduplicates events
│                  │  - Enriches with user preferences
│                  │  - Fans out to per-channel queues
└──────┬───────────┘
       │  Writes notification record to DB
       │  Publishes to Message Queue
       ▼
┌──────────────────────────────────────────────────┐
│              Message Queue / Broker              │
│   (Apache Kafka — partitioned by user_id hash)   │
│                                                  │
│   Topics:  push_notifications                    │
│            email_notifications                   │
│            in_app_notifications                  │
└──┬──────────────┬─────────────────┬──────────────┘
   │              │                 │
   ▼              ▼                 ▼
┌────────┐   ┌──────────┐   ┌────────────────┐
│ Push   │   │ Email    │   │ In-App         │
│ Worker │   │ Worker   │   │ Worker         │
│ Pool   │   │ Pool     │   │ Pool           │
└───┬────┘   └────┬─────┘   └───────┬────────┘
    │             │                 │
    ▼             ▼                 ▼
┌────────┐   ┌──────────┐   ┌────────────────┐
│ APNs / │   │ SES /    │   │ WebSocket      │
│ FCM    │   │ SendGrid │   │ Gateway        │
└────────┘   └──────────┘   │ (persistent    │
                            │  connections)  │
                            └────────────────┘

Component Descriptions:

A) API Gateway / Load Balancer
   - Entry point for internal producer services and external API calls (e.g., mark-as-read).
   - Handles rate limiting, authentication, and request routing.
   - Distributes traffic across multiple Notification Service instances.

B) Notification Service (NS)
   - Stateless microservice deployed in multiple replicas behind the load balancer.
   - Receives notification events, validates them, performs deduplication (idempotency key check).
   - Looks up user notification preferences from a cache (Redis) or database.
   - Determines which channels to deliver on (push, email, in-app) based on preferences.
   - Persists the notification record to the database.
   - Publishes channel-specific messages to Kafka topics.

C) Message Queue (Apache Kafka)
   - Decouples notification creation from delivery, absorbing traffic spikes.
   - Partitioned by user_id hash to preserve per-user ordering.
   - Provides durability (replication factor = 3) and replay capability.
   - Separate topics per channel allow independent scaling of consumers.
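To make the publishing step concrete, below is a minimal sketch of how the Notification Service might produce a channel-specific message, assuming the Python kafka-python client; broker addresses, topic names, and payload fields are illustrative, not fixed parts of the design.

    # Sketch only: kafka-python client assumed; keying by user_id so the default
    # partitioner keeps all of one user's notifications on the same partition (in order).
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["kafka-1:9092", "kafka-2:9092", "kafka-3:9092"],  # illustrative brokers
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        acks="all",  # matches the durability settings in section 4
    )

    def publish_notification(channel_topic: str, user_id: str, event: dict) -> None:
        # The message key drives partition assignment, preserving per-user ordering.
        producer.send(channel_topic, key=user_id, value=event)

    # Example: fan out one enriched event to the push channel.
    publish_notification("push_notifications", "user-123", {
        "notification_id": "f2b7c0de",   # doubles as the idempotency / dedup key
        "type": "like",
        "actor_id": "user-456",
        "message": "Alice liked your post",
    })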
D) Channel Workers (Consumer Groups)
   - Push Worker Pool: Consumes from push_notifications topic, batches requests, and sends to APNs (iOS) and FCM (Android). Handles token management and retry logic.
   - Email Worker Pool: Consumes from email_notifications topic, renders templates, and sends via SES/SendGrid. Implements exponential backoff for transient failures.
   - In-App Worker Pool: Consumes from in_app_notifications topic, pushes to WebSocket Gateway for real-time delivery. Falls back to polling if WebSocket is disconnected.

E) WebSocket Gateway
   - Maintains persistent WebSocket connections with online users.
   - Horizontally scaled; uses Redis Pub/Sub or a shared message bus so any gateway node can deliver to any connected user.
   - Connection registry stored in Redis (user_id → gateway_node mapping).

F) Notification Read API
   - Separate read path for fetching notification history, marking as read, managing preferences.
   - Served from read replicas / cache to avoid impacting the write path.

================================================================================
2. DATABASE SCHEMA
================================================================================

Primary data store: PostgreSQL (for preferences and metadata) + Cassandra (for notification storage at scale).

--- PostgreSQL (User Preferences & Metadata) ---

TABLE: users
   user_id      UUID PRIMARY KEY
   email        VARCHAR(255)
   phone        VARCHAR(50)
   created_at   TIMESTAMP
   updated_at   TIMESTAMP

TABLE: notification_preferences
   user_id            UUID PRIMARY KEY REFERENCES users(user_id)
   push_enabled       BOOLEAN DEFAULT TRUE
   email_enabled      BOOLEAN DEFAULT TRUE
   in_app_enabled     BOOLEAN DEFAULT TRUE
   quiet_hours_start  TIME NULLABLE
   quiet_hours_end    TIME NULLABLE
   email_digest_freq  ENUM('instant', 'hourly', 'daily') DEFAULT 'instant'
   -- Per-category overrides
   likes_push         BOOLEAN DEFAULT TRUE
   likes_email        BOOLEAN DEFAULT FALSE
   comments_push      BOOLEAN DEFAULT TRUE
   comments_email     BOOLEAN DEFAULT TRUE
   follows_push       BOOLEAN DEFAULT TRUE
   follows_email      BOOLEAN DEFAULT FALSE
   mentions_push      BOOLEAN DEFAULT TRUE
   mentions_email     BOOLEAN DEFAULT TRUE
   updated_at         TIMESTAMP

TABLE: device_tokens
   token_id      UUID PRIMARY KEY
   user_id       UUID REFERENCES users(user_id)
   platform      ENUM('ios', 'android', 'web')
   device_token  VARCHAR(512)
   is_active     BOOLEAN DEFAULT TRUE
   created_at    TIMESTAMP
   updated_at    TIMESTAMP
   INDEX idx_device_user (user_id)

--- Cassandra (Notification Storage — optimized for time-series reads) ---

TABLE: notifications
   user_id             UUID       -- Partition key
   created_at          TIMEUUID   -- Clustering key (DESC)
   notification_id     UUID
   type                TEXT       -- 'like', 'comment', 'follow', 'mention', 'system'
   actor_id            UUID
   target_entity_type  TEXT       -- 'post', 'comment', 'profile'
   target_entity_id    UUID
   message             TEXT
   is_read             BOOLEAN
   channels_delivered  SET<TEXT>  -- {'push', 'email', 'in_app'}
   metadata            TEXT       -- JSON blob for extensibility
   PRIMARY KEY ((user_id), created_at)
   WITH CLUSTERING ORDER BY (created_at DESC)
   AND default_time_to_live = 7776000   -- 90-day TTL

TABLE: notification_counts (materialized/counter table)
   user_id       UUID PRIMARY KEY
   unread_count  COUNTER

--- Redis (Caching Layer) ---

   - User preferences cache:         Key = pref:{user_id},          TTL = 10 min
   - Unread count cache:             Key = unread:{user_id},        TTL = 5 min
   - Deduplication set:              Key = dedup:{idempotency_key}, TTL = 24 hours
   - WebSocket connection registry:  Key = ws:{user_id} → gateway_node_id
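As a rough illustration of the deduplication flow from section 1(B), the Notification Service could claim the dedup key atomically with a single Redis SET NX call. This is a sketch only, assuming the redis-py client; the key format and 24-hour TTL come from the Redis section above, while the host name and helper names are made up for illustration.

    # Sketch only: redis-py assumed; key naming and TTL follow the caching layer above.
    import redis

    r = redis.Redis(host="redis-cache", port=6379)  # illustrative host

    DEDUP_TTL_SECONDS = 24 * 60 * 60  # 24 hours

    def claim_event(idempotency_key: str) -> bool:
        # SET key 1 NX EX <ttl>: only the first caller succeeds, so a retried or
        # duplicated producer event is processed once and silently dropped afterwards.
        return bool(r.set(f"dedup:{idempotency_key}", 1, nx=True, ex=DEDUP_TTL_SECONDS))

    def handle_event(event: dict) -> None:
        if not claim_event(event["idempotency_key"]):
            return  # duplicate — drop
        # ... validate, enrich with preferences, persist, publish to Kafka ...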
================================================================================
3. SCALING STRATEGY
================================================================================

A) Horizontal Scaling of Stateless Services
   - Notification Service, all Worker Pools, and WebSocket Gateway are stateless and horizontally scalable.
   - Auto-scaling groups (Kubernetes HPA) based on CPU, memory, and queue lag metrics.
   - Target: each NS instance handles ~500 req/s; need ~25 instances at peak.

B) Kafka Partitioning
   - Start with 64 partitions per topic (allows up to 64 consumers per group).
   - Partition by user_id hash for ordering guarantees per user.
   - Scale partitions as throughput grows (re-partition with care).
   - Separate consumer groups per channel allow independent scaling.

C) Database Scaling
   - Cassandra: Naturally horizontally scalable. Start with a 6-node cluster (RF=3). Add nodes as data volume grows. Partition key = user_id distributes load evenly.
   - PostgreSQL: Vertical scaling initially for preferences (relatively small dataset: 10M rows). Add read replicas for read-heavy preference lookups. Consider sharding by user_id if needed.
   - Redis Cluster: 3+ nodes with automatic sharding for cache and connection registry.

D) WebSocket Gateway Scaling
   - Each gateway node handles ~100K concurrent connections.
   - 10M DAU with ~30% concurrent online = 3M connections → ~30 gateway nodes.
   - Redis Pub/Sub or a lightweight message bus (e.g., NATS) for cross-node message routing.
   - Consistent hashing for user-to-node assignment with graceful rebalancing.

E) Rate Limiting and Backpressure
   - Rate limit notification producers to prevent abuse (e.g., max 1000 events/sec per producer).
   - Kafka consumer lag monitoring with alerts; auto-scale consumers when lag exceeds threshold.
   - Circuit breakers on external services (APNs, FCM, SES) to prevent cascade failures.

F) Future Growth Path
   - Move to a multi-region deployment with Kafka MirrorMaker 2 for cross-region replication.
   - Introduce a notification aggregation/batching layer (e.g., "X and 5 others liked your post").
   - Add a priority queue for time-sensitive notifications (e.g., direct messages vs. like notifications).

================================================================================
4. RELIABILITY AND FAULT TOLERANCE
================================================================================

A) Data Durability
   - Kafka: Replication factor = 3, min.insync.replicas = 2, acks = all. Guarantees no data loss even if one broker fails.
   - Cassandra: Replication factor = 3, write consistency = QUORUM (2 of 3). Tolerates single-node failure without data loss.
   - PostgreSQL: Synchronous replication to at least one standby.

B) At-Least-Once Delivery
   - Kafka consumers commit offsets only after successful processing.
   - Workers implement idempotent delivery using notification_id as dedup key.
   - If a push/email delivery fails, the message stays in Kafka for retry.
   - Dead Letter Queue (DLQ) for messages that fail after N retries (e.g., 5 retries with exponential backoff).

C) High Availability (99.9% target = max 8.76 hours downtime/year)
   - All services deployed across 3 Availability Zones (AZs).
   - Kubernetes with pod anti-affinity rules ensures replicas are spread across AZs.
   - Database clusters span multiple AZs with automatic failover.
   - Load balancer health checks with automatic removal of unhealthy instances.
   - Blue-green or canary deployments to minimize deployment-related downtime.

D) Graceful Degradation
   - If push notification service (APNs/FCM) is down, notifications are queued in Kafka and retried.
   - If WebSocket gateway is overloaded, clients fall back to polling the notification read API.
   - If Cassandra is temporarily unavailable, notifications are buffered in Kafka (which has multi-day retention).
   - Circuit breaker pattern on all external dependencies with fallback behavior.
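The retry-with-backoff and DLQ hand-off described in 4(B) and 4(D) could look roughly like the sketch below. The attempt count, base delay, and the publish_to_dlq placeholder are assumptions chosen for illustration; a real worker would route the failed message to a dead-letter Kafka topic via its own client.

    # Sketch of at-least-once delivery with exponential backoff and a DLQ hand-off.
    import random
    import time

    MAX_ATTEMPTS = 5          # assumed retry budget
    BASE_DELAY_SECONDS = 1.0  # assumed base delay

    def publish_to_dlq(message: dict) -> None:
        # Placeholder: in the real worker this would publish to a dead-letter Kafka topic.
        print("DLQ:", message.get("notification_id"))

    def deliver_with_retry(send_fn, message: dict) -> bool:
        for attempt in range(MAX_ATTEMPTS):
            try:
                send_fn(message)  # e.g., call APNs/FCM or SES here
                return True
            except Exception:
                if attempt == MAX_ATTEMPTS - 1:
                    break
                # Exponential backoff with jitter (1s, 2s, 4s, ... plus noise) so
                # retries from many workers do not synchronize into a thundering herd.
                time.sleep(BASE_DELAY_SECONDS * (2 ** attempt) + random.uniform(0, 0.5))
        publish_to_dlq(message)
        return False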
E) Monitoring and Alerting
   - End-to-end latency tracking: timestamp at creation → timestamp at delivery (P50, P95, P99).
   - Kafka consumer lag monitoring per topic and consumer group.
   - Error rate dashboards per channel (push failure rate, email bounce rate).
   - Alerting on: latency > 2s at P95, consumer lag > 10K messages, error rate > 1%, availability < 99.9%.

F) Disaster Recovery
   - Kafka topic data retained for 7 days, allowing replay from any point.
   - Cassandra snapshots taken daily, stored in S3 with cross-region replication.
   - PostgreSQL WAL archiving to S3 for point-in-time recovery.
   - Runbook for full cluster recovery with target RTO < 1 hour, RPO < 5 minutes.

================================================================================
5. KEY TRADE-OFFS
================================================================================

Trade-off 1: Availability vs. Strict Consistency (AP over CP)

Decision: We chose eventual consistency for notification delivery and read status.
   - Cassandra with QUORUM writes provides strong-enough consistency for notifications while prioritizing availability and partition tolerance.
   - A user might briefly see a stale unread count (cached in Redis with 5-min TTL), but this is acceptable for a notification system where perfect real-time accuracy of counts is not critical.
   - The alternative — using a strongly consistent database like PostgreSQL for all notification storage — would create a scaling bottleneck at 200M writes/day and risk availability during network partitions.
   - Impact: Users may occasionally see a notification count that is off by 1-2 for a few seconds. This is a minor UX issue compared to the risk of the entire notification system becoming unavailable.

Trade-off 2: Push Model (WebSockets) vs. Pull Model (Polling) for In-App Notifications

Decision: We chose a push-first model using WebSockets with polling as a fallback.
   - Push via WebSockets delivers notifications in real-time (sub-second) meeting our <2s latency requirement.
   - However, maintaining millions of persistent connections is resource-intensive (~30 gateway nodes for 3M concurrent connections) and adds operational complexity (connection management, heartbeats, reconnection logic).
   - The alternative — pure polling — would be simpler to operate but would either increase latency (if polling interval is long) or increase server load dramatically (if polling interval is short). At 3M concurrent users polling every 2 seconds = 1.5M req/s just for polling, which is far more expensive than maintaining WebSocket connections.
   - Compromise: We use WebSockets for online users and fall back to polling (with longer intervals) when WebSocket connections cannot be established (e.g., restrictive firewalls). Push notifications via APNs/FCM handle the case when the user is offline.
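To ground the push path chosen in Trade-off 2 (and the WebSocket Gateway of section 1E), here is a rough sketch of how an In-App Worker might route a message to whichever gateway node holds the user's connection, using the ws:{user_id} registry and Redis Pub/Sub described earlier. The redis-py client, the gateway:<node_id> channel naming, and the offline fallback are assumptions for illustration only.

    # Sketch only: look up the gateway node for a user and publish on that node's
    # Pub/Sub channel; if the user has no live connection, skip real-time delivery
    # and rely on APNs/FCM plus the notification read API / polling fallback.
    import json
    import redis

    r = redis.Redis(host="redis-registry", port=6379)  # illustrative host

    def deliver_in_app(user_id: str, notification: dict) -> bool:
        node_id = r.get(f"ws:{user_id}")  # connection registry: user_id -> gateway node
        if node_id is None:
            return False                  # user offline
        # Each gateway node subscribes to its own channel (assumed "gateway:<node_id>"),
        # so any worker can reach any connected user without knowing socket details.
        r.publish(f"gateway:{node_id.decode()}", json.dumps({
            "user_id": user_id,
            "payload": notification,
        }))
        return True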
Trade-off 3: Single Notification Store (Cassandra) vs. Polyglot Persistence

Decision: We use Cassandra for notification storage and PostgreSQL for user preferences/metadata rather than a single database for everything.
   - Cassandra excels at high-throughput time-series writes (200M/day) and partition-key-based reads (fetch notifications for a user), which is the dominant access pattern.
   - PostgreSQL is better suited for the relational nature of user preferences (joins with user table, transactional updates to preferences).
   - The trade-off is increased operational complexity (two database systems to maintain, monitor, and back up) and the need to keep data consistent across systems (e.g., if a user is deleted, notifications must be cleaned up in Cassandra).
   - We mitigate this with an event-driven cleanup process: user deletion events trigger async Cassandra cleanup jobs.

Trade-off 4: At-Least-Once vs. Exactly-Once Delivery

Decision: We chose at-least-once delivery semantics with application-level deduplication.
   - Exactly-once delivery across distributed systems is extremely expensive and complex (requires distributed transactions or Kafka's exactly-once semantics with significant performance overhead).
   - At-least-once is simpler and more performant. Duplicate notifications are handled at the application layer using idempotency keys (notification_id checked before delivery).
   - The risk is that in rare failure scenarios, a user might receive a duplicate push notification. This is a minor annoyance compared to the alternative of lost notifications (at-most-once) or the complexity/performance cost of exactly-once.

================================================================================
SUMMARY
================================================================================

This design handles 200M notifications/day across three channels with sub-2-second delivery latency. The event-driven architecture with Kafka as the central nervous system provides durability, decoupling, and natural backpressure handling. Cassandra provides horizontally scalable notification storage, while the WebSocket gateway enables real-time in-app delivery. The system is designed for 99.9% availability through multi-AZ deployment, replication at every layer, and graceful degradation patterns.

The key trade-offs favor availability, scalability, and operational simplicity over strict consistency and exactly-once semantics, which are appropriate choices for a notification system where occasional staleness or rare duplicates are acceptable.
Result
Winning Votes
2 / 3
Average Score
Total Score
Overall Comments
Answer A is a comprehensive, well-structured system design that covers all five required aspects with exceptional depth and precision. It opens with a back-of-the-envelope calculation that grounds the design in concrete numbers, then walks through each component with clear ASCII diagrams, specific technology choices with justifications, and detailed schema definitions using appropriate data types and indexing strategies. The trade-off section is particularly strong, offering four well-reasoned trade-offs with quantitative comparisons (e.g., polling at 3M users × every 2s = 1.5M req/s vs. WebSocket connections). The reliability section is thorough, covering Kafka configuration parameters (acks=all, min.insync.replicas=2), multi-AZ deployment, DLQ, circuit breakers, and disaster recovery with specific RTO/RPO targets. Minor weaknesses: the formatting is slightly verbose, and the schema could spell out Cassandra indexing strategies more explicitly.
Architecture Quality
Weight 30%
Answer A provides a detailed, well-structured architecture with a clear ASCII diagram, specific technology choices (Kafka with partition strategy, Redis Pub/Sub for WebSocket routing, APNs/FCM), and precise component descriptions including stateless deployment, idempotency key handling, and connection registry design. Each component's role and interaction is clearly articulated with concrete implementation details.
Completeness
Weight 20%
Answer A covers all five required aspects thoroughly: architecture with component descriptions, a detailed dual-database schema (PostgreSQL + Cassandra) with proper data types and TTL, scaling strategy with specific numbers (64 Kafka partitions, 30 WebSocket nodes), reliability with specific Kafka/Cassandra configuration parameters, and four well-developed trade-offs. The back-of-the-envelope section adds valuable context.
Trade-off Reasoning
Weight 20%
Answer A's trade-off section is outstanding. Each trade-off includes the decision, the reasoning, quantitative comparisons (e.g., 3M users polling every 2s = 1.5M req/s), the impact on user experience, and mitigation strategies. The four trade-offs cover distinct dimensions: consistency vs. availability, push vs. pull, polyglot persistence, and delivery semantics.
Scalability & Reliability
Weight 20%
Answer A provides specific scaling numbers (64 Kafka partitions, 25 NS instances at peak, 30 WebSocket gateway nodes for 3M concurrent connections), specific Kafka configuration (RF=3, min.insync.replicas=2, acks=all), Cassandra quorum settings, multi-AZ deployment with pod anti-affinity, circuit breakers, DLQ with exponential backoff, and disaster recovery with RTO < 1 hour and RPO < 5 minutes.
Clarity
Weight 10%
Answer A is exceptionally well-organized with clear section headers, ASCII diagrams, and consistent formatting. The numbered sections, component labels, and summary at the end make it easy to navigate. The back-of-the-envelope section at the start sets clear context. Minor weakness: the length and density could be slightly reduced.
Total Score
Overall Comments
Very detailed and concrete end-to-end design with solid sizing math, clear event-driven architecture (Kafka + per-channel workers), and strong reliability mechanisms (replication, DLQ, retries, multi-AZ). The schema is reasonably aligned to access patterns (Cassandra per-user time-series) and includes useful operational elements (cache, websocket registry, monitoring). Weaknesses: some choices are a bit over-specified or slightly questionable (e.g., Redis Pub/Sub for cross-node delivery at very large scale, Cassandra counter table/unread counts complexity, some assumptions like 100k connections/node), and the design is heavier (polyglot persistence) than strictly necessary. Trade-offs are good but somewhat verbose and occasionally mix CAP framing a bit loosely for the specific operations.
Architecture Quality
Weight 30%
Clear event-driven architecture with Kafka, per-channel workers, websocket gateway, caching, and distinct read/write paths; good ordering and decoupling. Slightly over-prescriptive and includes a few scale-sensitive choices (e.g., Redis Pub/Sub as suggested backbone, counters) that could be problematic without more nuance.
Completeness
Weight 20%
Covers all requested sections thoroughly with concrete components, schema, scaling, reliability, monitoring, DR, and multiple trade-offs plus estimates.
Trade-off Reasoning
Weight 20%
Multiple trade-offs are discussed with rationale (consistency, push vs pull, polyglot persistence, delivery semantics). Some framing is a bit generic and verbose, and a couple of points (CAP mapping) are not tightly tied to specific operations.
Scalability & Reliability
Weight 20%
Good multi-AZ, replication settings, buffering via Kafka, retries/DLQ, and monitoring. Some scalability claims are optimistic/hand-wavy (websocket node capacity), and unread counters/caches can be tricky without more detail; upstream event loss prevention (e.g., outbox) is not explicitly addressed.
Clarity
Weight 10%
Highly structured with diagrams and labeled sections; very readable though long and occasionally overly detailed for “high-level.”
Total Score
Overall Comments
Answer A provides an outstanding system design that is both technically sound and exceptionally clear. Its strengths lie in the detailed, practical approach, starting with back-of-the-envelope calculations to frame the problem, using a clear ASCII diagram to visualize the architecture, and providing specific, well-justified technology choices (e.g., PostgreSQL + Cassandra). The discussion on scaling, reliability, and trade-offs is deep and demonstrates a senior-level understanding of distributed systems.
Architecture Quality
Weight 30%
The proposed architecture is excellent, featuring a clear event-driven flow, well-defined components, and a helpful ASCII diagram. The choice to use Kafka with separate, partitioned topics for each channel is a strong and scalable design pattern.
Completeness
Weight 20%
This answer is extremely complete. It addresses all five required sections in great detail and includes a valuable 'back-of-the-envelope' calculation section, which effectively sets the context and constraints for the entire design.
Trade-off Reasoning
Weight 20%
The trade-off analysis is exceptional, discussing four key decisions with deep insight. The reasoning is well-supported, for example, by quantifying the cost of polling vs. WebSockets and clearly explaining the user experience impact of choosing AP over CP.
Scalability & Reliability
Weight 20%
The strategies for scaling and reliability are comprehensive and practical. The design specifies concrete details like replication factors and consistency levels, and provides quantitative estimates for scaling components, which adds significant credibility.
Clarity
Weight 10%
The clarity is outstanding. The combination of a logical structure, initial calculations, a visual diagram, and a concise summary makes this complex design exceptionally easy to read and understand.