
Design a Real-Time Notification Service

Compare model answers for this System Design benchmark and review scores, judging comments, and related examples.




Task Prompt


Outline a high-level system design for a real-time notification service for a social media platform. The service must meet the following requirements:

- **Scale:** 10 million Daily Active Users (DAU).
- **Volume:** Each user receives an average of 20 notifications per day.
- **Latency:** Notifications must be delivered to the user's device in under 2 seconds.
- **Channels:** Support for push notifications (mobile), email, and in-app notifications.
- **Reliability:** 99.9% availability and no loss of notification data.

Your design should cover the following aspects:

1. **Core Architecture:** Describe the key components (e.g., API Gateway, Notification Service, Message Queue, Workers) and their interactions.
2. **Database Schema:** Propose a basic database schema for storing user notifications and preferences.
3. **Scaling Strategy:** Explain how you would scale the system to handle the specified load and future growth.
4. **Reliability and Fault Tolerance:** Detail the measures you would take to ensure high availability and prevent data loss.
5. **Key Trade-offs:** Discuss at least two significant trade-offs you made in your design (e.g., consistency vs. availability, choice of database, push vs. pull model).

Task Context

You are a senior software engineer tasked with designing a scalable and reliable real-time notification service for a rapidly growing social media platform. The platform's success depends on engaging users with timely updates, making this service a critical piece of the infrastructure.

Judging Policy


A high-quality answer will provide a clear, coherent, and technically sound system design that addresses all five specified aspects. Evaluate the response based on the following criteria:

- **Completeness:** The design must cover architecture, database schema, scaling, reliability, and trade-offs.
- **Technical Soundness:** The proposed architecture and technology choices (e.g., type of message queue, database selection) should be appropriate for a large-scale, real-time system.
- **Practicality:** The scaling and reliability strategies should be practical and well-reasoned, using standard industry practices like load balancing, replication, and asynchronous processing.
- **Depth of Analysis:** The discussion of trade-offs should be insightful, demonstrating a clear understanding of the engineering compromises involved in the design.

Answers

Answer A

Real-Time Notification Service — High-Level System Design
================================================================================

0. BACK-OF-THE-ENVELOPE ESTIMATION
================================================================================
- 10M DAU × 20 notifications/user/day = 200M notifications/day
- 200M / 86,400s ≈ 2,315 notifications/second (average)
- Peak load (assume 5× average) ≈ 11,575 notifications/second
- Each notification payload ~1 KB → ~200 GB/day raw data ingestion
- Latency budget: < 2 seconds end-to-end (ingestion → delivery to device)

1. CORE ARCHITECTURE
================================================================================
The system follows an event-driven, microservices architecture with the following key components:

┌──────────────┐
│  Producers   │ (Post Service, Like Service, Comment Service, Follow Service, etc.)
└──────┬───────┘
       │ Notification Event (gRPC / async message)
       ▼
┌──────────────────┐
│  API Gateway /   │ Rate limiting, authentication, routing
│  Load Balancer   │ (e.g., AWS ALB / Envoy / Kong)
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  Notification    │ Stateless service (horizontally scalable)
│  Service (NS)    │ - Validates & deduplicates events
│                  │ - Enriches with user preferences
│                  │ - Fans out to per-channel queues
└──────┬───────────┘
       │ Writes notification record to DB
       │ Publishes to Message Queue
       ▼
┌──────────────────────────────────────────────────┐
│              Message Queue / Broker              │
│   (Apache Kafka — partitioned by user_id hash)   │
│                                                  │
│   Topics: push_notifications                     │
│           email_notifications                    │
│           in_app_notifications                   │
└──┬──────────────┬─────────────────┬──────────────┘
   │              │                 │
   ▼              ▼                 ▼
┌────────┐   ┌──────────┐   ┌────────────────┐
│ Push   │   │ Email    │   │ In-App         │
│ Worker │   │ Worker   │   │ Worker         │
│ Pool   │   │ Pool     │   │ Pool           │
└───┬────┘   └────┬─────┘   └───────┬────────┘
    │             │                 │
    ▼             ▼                 ▼
┌────────┐   ┌──────────┐   ┌────────────────┐
│ APNs / │   │ SES /    │   │ WebSocket      │
│ FCM    │   │ SendGrid │   │ Gateway        │
└────────┘   └──────────┘   │ (persistent    │
                            │ connections)   │
                            └────────────────┘

Component Descriptions:

A) API Gateway / Load Balancer
- Entry point for internal producer services and external API calls (e.g., mark-as-read).
- Handles rate limiting, authentication, and request routing.
- Distributes traffic across multiple Notification Service instances.

B) Notification Service (NS)
- Stateless microservice deployed in multiple replicas behind the load balancer.
- Receives notification events, validates them, performs deduplication (idempotency key check).
- Looks up user notification preferences from a cache (Redis) or database.
- Determines which channels to deliver on (push, email, in-app) based on preferences.
- Persists the notification record to the database.
- Publishes channel-specific messages to Kafka topics (a sketch of this dedup-and-fan-out step follows at the end of this section).

C) Message Queue (Apache Kafka)
- Decouples notification creation from delivery, absorbing traffic spikes.
- Partitioned by user_id hash to preserve per-user ordering.
- Provides durability (replication factor = 3) and replay capability.
- Separate topics per channel allow independent scaling of consumers.

D) Channel Workers (Consumer Groups)
- Push Worker Pool: Consumes from push_notifications topic, batches requests, and sends to APNs (iOS) and FCM (Android). Handles token management and retry logic.
- Email Worker Pool: Consumes from email_notifications topic, renders templates, and sends via SES/SendGrid. Implements exponential backoff for transient failures.
- In-App Worker Pool: Consumes from in_app_notifications topic, pushes to WebSocket Gateway for real-time delivery. Falls back to polling if WebSocket is disconnected.

E) WebSocket Gateway
- Maintains persistent WebSocket connections with online users.
- Horizontally scaled; uses Redis Pub/Sub or a shared message bus so any gateway node can deliver to any connected user.
- Connection registry stored in Redis (user_id → gateway_node mapping).

F) Notification Read API
- Separate read path for fetching notification history, marking as read, managing preferences.
- Served from read replicas / cache to avoid impacting the write path.
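To make the Notification Service's dedup-and-fan-out step concrete, here is a minimal Python sketch, assuming redis-py and kafka-python as clients. The dedup:{idempotency_key} key (24-hour TTL) and the three channel topics come from the answer; the hosts, event fields, and preferences dict are illustrative assumptions.

```python
# Sketch: Notification Service dedup + fan-out (assumes redis-py, kafka-python).
import json

import redis
from kafka import KafkaProducer

r = redis.Redis(host="localhost", port=6379)  # host is illustrative
producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],         # illustrative address
    acks="all",                               # matches the durability settings in section 4
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Channel topics from the architecture diagram above.
CHANNEL_TOPICS = {
    "push": "push_notifications",
    "email": "email_notifications",
    "in_app": "in_app_notifications",
}

def handle_event(event: dict, preferences: dict) -> None:
    """Deduplicate one event, then fan out to the enabled channel topics."""
    # Idempotency check: SET NX with a 24-hour TTL, mirroring the
    # dedup:{idempotency_key} Redis key in the schema section.
    if not r.set(f"dedup:{event['idempotency_key']}", 1, nx=True, ex=86400):
        return  # duplicate event; drop it

    # Keying by user_id keeps per-user ordering within a Kafka partition.
    key = event["user_id"].encode("utf-8")
    for channel, topic in CHANNEL_TOPICS.items():
        if preferences.get(f"{channel}_enabled", True):
            producer.send(topic, key=key, value=event)
    producer.flush()
```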
2. DATABASE SCHEMA
================================================================================
Primary data store: PostgreSQL (for preferences and metadata) + Cassandra (for notification storage at scale).

--- PostgreSQL (User Preferences & Metadata) ---

TABLE: users
    user_id     UUID PRIMARY KEY
    email       VARCHAR(255)
    phone       VARCHAR(50)
    created_at  TIMESTAMP
    updated_at  TIMESTAMP

TABLE: notification_preferences
    user_id            UUID PRIMARY KEY REFERENCES users(user_id)
    push_enabled       BOOLEAN DEFAULT TRUE
    email_enabled      BOOLEAN DEFAULT TRUE
    in_app_enabled     BOOLEAN DEFAULT TRUE
    quiet_hours_start  TIME NULLABLE
    quiet_hours_end    TIME NULLABLE
    email_digest_freq  ENUM('instant', 'hourly', 'daily') DEFAULT 'instant'
    -- Per-category overrides
    likes_push         BOOLEAN DEFAULT TRUE
    likes_email        BOOLEAN DEFAULT FALSE
    comments_push      BOOLEAN DEFAULT TRUE
    comments_email     BOOLEAN DEFAULT TRUE
    follows_push       BOOLEAN DEFAULT TRUE
    follows_email      BOOLEAN DEFAULT FALSE
    mentions_push      BOOLEAN DEFAULT TRUE
    mentions_email     BOOLEAN DEFAULT TRUE
    updated_at         TIMESTAMP

TABLE: device_tokens
    token_id      UUID PRIMARY KEY
    user_id       UUID REFERENCES users(user_id)
    platform      ENUM('ios', 'android', 'web')
    device_token  VARCHAR(512)
    is_active     BOOLEAN DEFAULT TRUE
    created_at    TIMESTAMP
    updated_at    TIMESTAMP
    INDEX idx_device_user (user_id)

--- Cassandra (Notification Storage — optimized for time-series reads) ---

TABLE: notifications
    user_id             UUID       -- Partition key
    created_at          TIMEUUID   -- Clustering key (DESC)
    notification_id     UUID
    type                TEXT       -- 'like', 'comment', 'follow', 'mention', 'system'
    actor_id            UUID
    target_entity_type  TEXT       -- 'post', 'comment', 'profile'
    target_entity_id    UUID
    message             TEXT
    is_read             BOOLEAN
    channels_delivered  SET<TEXT>  -- {'push', 'email', 'in_app'}
    metadata            TEXT       -- JSON blob for extensibility
    PRIMARY KEY ((user_id), created_at)
    WITH CLUSTERING ORDER BY (created_at DESC)
    AND default_time_to_live = 7776000  -- 90-day TTL

TABLE: notification_counts (materialized/counter table)
    user_id       UUID PRIMARY KEY
    unread_count  COUNTER

--- Redis (Caching Layer) ---
- User preferences cache: Key = pref:{user_id}, TTL = 10 min
- Unread count cache: Key = unread:{user_id}, TTL = 5 min
- Deduplication set: Key = dedup:{idempotency_key}, TTL = 24 hours
- WebSocket connection registry: Key = ws:{user_id} → gateway_node_id
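The pref:{user_id} cache key above implies a cache-aside lookup in the Notification Service. Here is a minimal sketch, assuming redis-py; fetch_preferences_from_db is a hypothetical stand-in for the PostgreSQL query and is not part of the original answer.

```python
# Sketch: cache-aside lookup for the pref:{user_id} key (10-minute TTL).
import json

import redis

r = redis.Redis()
PREF_TTL_SECONDS = 600  # 10 minutes, per the Redis caching layer above

def fetch_preferences_from_db(user_id: str) -> dict:
    # Hypothetical stand-in for a PostgreSQL query against
    # notification_preferences; returns defaults here.
    return {"push_enabled": True, "email_enabled": True, "in_app_enabled": True}

def get_preferences(user_id: str) -> dict:
    """Return preferences from Redis if cached, else load and cache them."""
    cached = r.get(f"pref:{user_id}")
    if cached is not None:
        return json.loads(cached)
    prefs = fetch_preferences_from_db(user_id)
    r.setex(f"pref:{user_id}", PREF_TTL_SECONDS, json.dumps(prefs))
    return prefs
```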
3. SCALING STRATEGY
================================================================================

A) Horizontal Scaling of Stateless Services
- Notification Service, all Worker Pools, and WebSocket Gateway are stateless and horizontally scalable.
- Auto-scaling groups (Kubernetes HPA) based on CPU, memory, and queue lag metrics.
- Target: each NS instance handles ~500 req/s; need ~25 instances at peak.

B) Kafka Partitioning
- Start with 64 partitions per topic (allows up to 64 consumers per group).
- Partition by user_id hash for ordering guarantees per user.
- Scale partitions as throughput grows (re-partition with care).
- Separate consumer groups per channel allow independent scaling.

C) Database Scaling
- Cassandra: Naturally horizontally scalable. Start with a 6-node cluster (RF=3). Add nodes as data volume grows. Partition key = user_id distributes load evenly.
- PostgreSQL: Vertical scaling initially for preferences (relatively small dataset: 10M rows). Add read replicas for read-heavy preference lookups. Consider sharding by user_id if needed.
- Redis Cluster: 3+ nodes with automatic sharding for cache and connection registry.

D) WebSocket Gateway Scaling
- Each gateway node handles ~100K concurrent connections.
- 10M DAU with ~30% concurrent online = 3M connections → ~30 gateway nodes.
- Redis Pub/Sub or a lightweight message bus (e.g., NATS) for cross-node message routing.
- Consistent hashing for user-to-node assignment with graceful rebalancing.

E) Rate Limiting and Backpressure
- Rate limit notification producers to prevent abuse (e.g., max 1000 events/sec per producer).
- Kafka consumer lag monitoring with alerts; auto-scale consumers when lag exceeds threshold.
- Circuit breakers on external services (APNs, FCM, SES) to prevent cascade failures.

F) Future Growth Path
- Move to a multi-region deployment with Kafka MirrorMaker 2 for cross-region replication.
- Introduce a notification aggregation/batching layer (e.g., "X and 5 others liked your post").
- Add a priority queue for time-sensitive notifications (e.g., direct messages vs. like notifications).

4. RELIABILITY AND FAULT TOLERANCE
================================================================================

A) Data Durability
- Kafka: Replication factor = 3, min.insync.replicas = 2, acks = all. Guarantees no data loss even if one broker fails.
- Cassandra: Replication factor = 3, write consistency = QUORUM (2 of 3). Tolerates single-node failure without data loss.
- PostgreSQL: Synchronous replication to at least one standby.
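As one concrete rendering of the durability settings in 4A and the partition count in 3B, here is a sketch that creates a channel topic with kafka-python's admin client; the broker address is an assumption.

```python
# Sketch: create a channel topic with the durability settings from 4A
# (RF=3, min.insync.replicas=2) and the partition count from 3B (64).
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers=["kafka:9092"])  # illustrative address
admin.create_topics([
    NewTopic(
        name="push_notifications",
        num_partitions=64,
        replication_factor=3,
        topic_configs={"min.insync.replicas": "2"},
    )
])
# Producers then publish with acks="all", so a write is acknowledged only
# once it reaches the in-sync replicas: no data loss if one broker fails.
```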
B) At-Least-Once Delivery
- Kafka consumers commit offsets only after successful processing.
- Workers implement idempotent delivery using notification_id as dedup key.
- If a push/email delivery fails, the message stays in Kafka for retry.
- Dead Letter Queue (DLQ) for messages that fail after N retries (e.g., 5 retries with exponential backoff); a worker sketch follows after this section.

C) High Availability (99.9% target = max 8.76 hours downtime/year)
- All services deployed across 3 Availability Zones (AZs).
- Kubernetes with pod anti-affinity rules ensures replicas are spread across AZs.
- Database clusters span multiple AZs with automatic failover.
- Load balancer health checks with automatic removal of unhealthy instances.
- Blue-green or canary deployments to minimize deployment-related downtime.

D) Graceful Degradation
- If push notification service (APNs/FCM) is down, notifications are queued in Kafka and retried.
- If WebSocket gateway is overloaded, clients fall back to polling the notification read API.
- If Cassandra is temporarily unavailable, notifications are buffered in Kafka (which has multi-day retention).
- Circuit breaker pattern on all external dependencies with fallback behavior.

E) Monitoring and Alerting
- End-to-end latency tracking: timestamp at creation → timestamp at delivery (P50, P95, P99).
- Kafka consumer lag monitoring per topic and consumer group.
- Error rate dashboards per channel (push failure rate, email bounce rate).
- Alerting on: latency > 2s at P95, consumer lag > 10K messages, error rate > 1%, availability < 99.9%.

F) Disaster Recovery
- Kafka topic data retained for 7 days, allowing replay from any point.
- Cassandra snapshots taken daily, stored in S3 with cross-region replication.
- PostgreSQL WAL archiving to S3 for point-in-time recovery.
- Runbook for full cluster recovery with target RTO < 1 hour, RPO < 5 minutes.
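A minimal sketch of the at-least-once worker loop described in 4B, assuming kafka-python: offsets are committed only after processing, retries use exponential backoff, and messages that exhaust their retries go to a DLQ. The send_push stub and the DLQ topic name are hypothetical.

```python
# Sketch: at-least-once push worker with manual commits, exponential
# backoff, and a DLQ after 5 attempts (per 4B).
import time

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "push_notifications",
    bootstrap_servers=["kafka:9092"],
    group_id="push-workers",
    enable_auto_commit=False,  # commit only after processing succeeds
)
dlq = KafkaProducer(bootstrap_servers=["kafka:9092"], acks="all")
MAX_RETRIES = 5

def send_push(payload: bytes) -> None:
    """Stub: a real worker calls APNs/FCM here."""
    pass

for message in consumer:
    for attempt in range(MAX_RETRIES):
        try:
            send_push(message.value)
            break
        except Exception:
            time.sleep(2 ** attempt)  # backoff: 1s, 2s, 4s, 8s, 16s
    else:
        # Retries exhausted: park the message for inspection and replay.
        dlq.send("push_notifications_dlq", message.value)
    consumer.commit()  # offsets advance only after the message is handled
```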
5. KEY TRADE-OFFS
================================================================================

Trade-off 1: Availability vs. Strict Consistency (AP over CP)

Decision: We chose eventual consistency for notification delivery and read status.
- Cassandra with QUORUM writes provides strong-enough consistency for notifications while prioritizing availability and partition tolerance.
- A user might briefly see a stale unread count (cached in Redis with 5-min TTL), but this is acceptable for a notification system where perfect real-time accuracy of counts is not critical.
- The alternative — using a strongly consistent database like PostgreSQL for all notification storage — would create a scaling bottleneck at 200M writes/day and risk availability during network partitions.
- Impact: Users may occasionally see a notification count that is off by 1-2 for a few seconds. This is a minor UX issue compared to the risk of the entire notification system becoming unavailable.

Trade-off 2: Push Model (WebSockets) vs. Pull Model (Polling) for In-App Notifications

Decision: We chose a push-first model using WebSockets with polling as a fallback.
- Push via WebSockets delivers notifications in real time (sub-second), meeting our <2s latency requirement.
- However, maintaining millions of persistent connections is resource-intensive (~30 gateway nodes for 3M concurrent connections) and adds operational complexity (connection management, heartbeats, reconnection logic).
- The alternative — pure polling — would be simpler to operate but would either increase latency (if the polling interval is long) or increase server load dramatically (if the polling interval is short). At 3M concurrent users polling every 2 seconds = 1.5M req/s just for polling, which is far more expensive than maintaining WebSocket connections.
- Compromise: We use WebSockets for online users and fall back to polling (with longer intervals) when WebSocket connections cannot be established (e.g., restrictive firewalls). Push notifications via APNs/FCM handle the case when the user is offline.

Trade-off 3: Single Notification Store (Cassandra) vs. Polyglot Persistence

Decision: We use Cassandra for notification storage and PostgreSQL for user preferences/metadata rather than a single database for everything.
- Cassandra excels at high-throughput time-series writes (200M/day) and partition-key-based reads (fetch notifications for a user), which is the dominant access pattern.
- PostgreSQL is better suited for the relational nature of user preferences (joins with the user table, transactional updates to preferences).
- The trade-off is increased operational complexity (two database systems to maintain, monitor, and back up) and the need to keep data consistent across systems (e.g., if a user is deleted, notifications must be cleaned up in Cassandra).
- We mitigate this with an event-driven cleanup process: user deletion events trigger async Cassandra cleanup jobs.

Trade-off 4: At-Least-Once vs. Exactly-Once Delivery

Decision: We chose at-least-once delivery semantics with application-level deduplication.
- Exactly-once delivery across distributed systems is extremely expensive and complex (it requires distributed transactions or Kafka's exactly-once semantics with significant performance overhead).
- At-least-once is simpler and more performant. Duplicate notifications are handled at the application layer using idempotency keys (notification_id checked before delivery).
- The risk is that in rare failure scenarios a user might receive a duplicate push notification. This is a minor annoyance compared to the alternative of lost notifications (at-most-once) or the complexity/performance cost of exactly-once.

SUMMARY
================================================================================
This design handles 200M notifications/day across three channels with sub-2-second delivery latency. The event-driven architecture with Kafka as the central nervous system provides durability, decoupling, and natural backpressure handling. Cassandra provides horizontally scalable notification storage, while the WebSocket gateway enables real-time in-app delivery. The system is designed for 99.9% availability through multi-AZ deployment, replication at every layer, and graceful degradation patterns. The key trade-offs favor availability, scalability, and operational simplicity over strict consistency and exactly-once semantics, which are appropriate choices for a notification system where occasional staleness or rare duplicates are acceptable.

Result

#1 | Winner

Winning Votes: 2 / 3

Average Score: 89

Total Score: 90

Overall Comments

Answer A is a comprehensive, well-structured system design that covers all five required aspects with exceptional depth and precision. It opens with a back-of-the-envelope calculation that grounds the design in concrete numbers, then walks through each component with clear ASCII diagrams, specific technology choices with justifications, and detailed schema definitions using appropriate data types and indexing strategies. The trade-off section is particularly strong, offering four well-reasoned trade-offs with quantitative comparisons (e.g., polling at 3M users × every 2s = 1.5M req/s vs. WebSocket connections). The reliability section is thorough, covering Kafka configuration parameters (acks=all, min.insync.replicas=2), multi-AZ deployment, DLQ, circuit breakers, and disaster recovery with specific RTO/RPO targets. Minor weaknesses include slightly verbose formatting, and the schema could be more explicit about Cassandra indexing strategies.


Architecture Quality (Weight 30%): 90

Answer A provides a detailed, well-structured architecture with a clear ASCII diagram, specific technology choices (Kafka with partition strategy, Redis Pub/Sub for WebSocket routing, APNs/FCM), and precise component descriptions including stateless deployment, idempotency key handling, and connection registry design. Each component's role and interaction is clearly articulated with concrete implementation details.

Completeness (Weight 20%): 92

Answer A covers all five required aspects thoroughly: architecture with component descriptions, a detailed dual-database schema (PostgreSQL + Cassandra) with proper data types and TTL, scaling strategy with specific numbers (64 Kafka partitions, 30 WebSocket nodes), reliability with specific Kafka/Cassandra configuration parameters, and four well-developed trade-offs. The back-of-the-envelope section adds valuable context.

Trade-off Reasoning (Weight 20%): 90

Answer A's trade-off section is outstanding. Each trade-off includes the decision, the reasoning, quantitative comparisons (e.g., 3M users polling every 2s = 1.5M req/s), the impact on user experience, and mitigation strategies. The four trade-offs cover distinct dimensions: consistency vs. availability, push vs. pull, polyglot persistence, and delivery semantics.

Scalability & Reliability (Weight 20%): 90

Answer A provides specific scaling numbers (64 Kafka partitions, 25 NS instances at peak, 30 WebSocket gateway nodes for 3M concurrent connections), specific Kafka configuration (RF=3, min.insync.replicas=2, acks=all), Cassandra quorum settings, multi-AZ deployment with pod anti-affinity, circuit breakers, DLQ with exponential backoff, and disaster recovery with RTO < 1 hour and RPO < 5 minutes.

Clarity (Weight 10%): 85

Answer A is exceptionally well-organized with clear section headers, ASCII diagrams, and consistent formatting. The numbered sections, component labels, and summary at the end make it easy to navigate. The back-of-the-envelope section at the start sets clear context. Minor weakness: the length and density could be slightly reduced.

Judge Model: OpenAI GPT-5.2

Total Score: 85

Overall Comments

Very detailed and concrete end-to-end design with solid sizing math, clear event-driven architecture (Kafka + per-channel workers), and strong reliability mechanisms (replication, DLQ, retries, multi-AZ). The schema is reasonably aligned to access patterns (Cassandra per-user time-series) and includes useful operational elements (cache, websocket registry, monitoring). Weaknesses: some choices are a bit over-specified or slightly questionable (e.g., Redis Pub/Sub for cross-node delivery at very large scale, Cassandra counter table/unread counts complexity, some assumptions like 100k connections/node), and the design is heavier (polyglot persistence) than strictly necessary. Trade-offs are good but somewhat verbose and occasionally mix CAP framing a bit loosely for the specific operations.


Architecture Quality (Weight 30%): 82

Clear event-driven architecture with Kafka, per-channel workers, websocket gateway, caching, and distinct read/write paths; good ordering and decoupling. Slightly over-prescriptive and includes a few scale-sensitive choices (e.g., Redis Pub/Sub as suggested backbone, counters) that could be problematic without more nuance.

Completeness (Weight 20%): 92

Covers all requested sections thoroughly with concrete components, schema, scaling, reliability, monitoring, DR, and multiple trade-offs plus estimates.

Trade-off Reasoning (Weight 20%): 83

Multiple trade-offs are discussed with rationale (consistency, push vs pull, polyglot persistence, delivery semantics). Some framing is a bit generic and verbose, and a couple points (CAP mapping) are not tightly tied to specific operations.

Scalability & Reliability (Weight 20%): 84

Good multi-AZ, replication settings, buffering via Kafka, retries/DLQ, and monitoring. Some scalability claims are optimistic/hand-wavy (websocket node capacity), and unread counters/caches can be tricky without more detail; upstream event loss prevention (e.g., outbox) is not explicitly addressed.

Clarity (Weight 10%): 86

Highly structured with diagrams and labeled sections; very readable though long and occasionally overly detailed for “high-level.”

Total Score: 92

Overall Comments

Answer A provides an outstanding system design that is both technically sound and exceptionally clear. Its strengths lie in the detailed, practical approach, starting with back-of-the-envelope calculations to frame the problem, using a clear ASCII diagram to visualize the architecture, and providing specific, well-justified technology choices (e.g., PostgreSQL + Cassandra). The discussion on scaling, reliability, and trade-offs is deep and demonstrates a senior-level understanding of distributed systems.


Architecture Quality (Weight 30%): 90

The proposed architecture is excellent, featuring a clear event-driven flow, well-defined components, and a helpful ASCII diagram. The choice to use Kafka with separate, partitioned topics for each channel is a strong and scalable design pattern.

Completeness (Weight 20%): 95

This answer is extremely complete. It addresses all five required sections in great detail and includes a valuable 'back-of-the-envelope' calculation section, which effectively sets the context and constraints for the entire design.

Trade-off Reasoning (Weight 20%): 92

The trade-off analysis is exceptional, discussing four key decisions with deep insight. The reasoning is well-supported, for example, by quantifying the cost of polling vs. WebSockets and clearly explaining the user experience impact of choosing AP over CP.

Scalability & Reliability (Weight 20%): 90

The strategies for scaling and reliability are comprehensive and practical. The design specifies concrete details like replication factors and consistency levels, and provides quantitative estimates for scaling components, which adds significant credibility.

Clarity (Weight 10%): 95

The clarity is outstanding. The combination of a logical structure, initial calculations, a visual diagram, and a concise summary makes this complex design exceptionally easy to read and understand.

Answer B

High-level design

1. Core architecture

A good design is event-driven, channel-aware, and durable-first.

Main flow
- Product services such as Likes, Comments, Follows, Mentions, Messages publish notification-worthy events.
- API Gateway or internal Event Ingestion API receives direct requests if needed, but most traffic should come from internal event streams.
- Notification Orchestrator validates the event, deduplicates it, expands recipients, checks user preferences, creates a durable notification record, and emits channel delivery jobs.
- Message Queue / Event Bus buffers work for downstream channel workers.
- Channel Workers handle push, email, and in-app delivery independently.
- Delivery status service updates notification state and retries when necessary.
- Read APIs serve notification inbox and unread counts to clients.
- WebSocket / SSE Gateway can push in-app updates in real time to online users.

Key components
- API Gateway
  - Auth, rate limiting, routing, observability.
  - Used for client read APIs and admin APIs.
- Event Bus / Ingestion Stream
  - Kafka / Pulsar style durable log.
  - Producers: social graph, content, messaging, moderation, etc.
  - Topic partitioned by recipient_user_id to preserve per-user ordering when useful.
- Notification Orchestrator
  - Stateless service consuming events.
  - Responsibilities:
    - validate schema
    - idempotency check using event_id or dedupe key
    - fanout decision
    - preference lookup
    - template selection
    - priority classification
    - persistence of canonical notification record
    - enqueue per-channel delivery tasks
  - Writes first, then schedules delivery, so data is not lost (see the sketch at the end of this section).
- Preference Service
  - Returns user-level channel settings, quiet hours, device tokens, locale, email verification status.
  - Backed by strongly consistent user preference store plus cache.
- Notification Store
  - Canonical durable store for notification metadata and user inbox records.
  - Optimized for writes and recent reads.
- Delivery Queue
  - Separate queues/topics per channel: push, email, in-app.
  - Allows different retry and throughput policies.
- Push Worker
  - Integrates with APNS/FCM.
  - Handles token invalidation, platform-specific payloads, exponential backoff.
- Email Worker
  - Integrates with email provider.
  - Lower priority than push/in-app for most social events.
  - Tracks bounces, complaints, suppression list.
- In-App Delivery Worker
  - Writes to inbox store and, if user is online, pushes via WebSocket/SSE.
  - If offline, notification remains available through pull from inbox API.
- Real-Time Gateway
  - Maintains persistent client connections for active users.
  - Subscribes to in-app delivery events and sends under 2 seconds.
- Read API / Inbox Service
  - List notifications, mark read, unread count, pagination.
  - Reads from notification inbox store and cache.
- Retry / Dead Letter Processor
  - Retries transient failures.
  - DLQ for poison messages or permanent failures.
  - Supports replay.
- Observability stack
  - Metrics: enqueue-to-delivery latency, success rate by channel, retry counts, queue lag.
  - Logs and distributed tracing.
  - Alerting on SLA breaches.

Interaction sequence
- A comment service emits CommentCreated event.
- Event bus stores event durably.
- Notification Orchestrator consumes event, determines recipient user, checks preferences, creates notification row, emits jobs to:
  - in-app queue
  - push queue if enabled and urgent enough
  - email queue if configured
- In-app worker stores inbox item and pushes via WebSocket if user online.
- Push worker sends to FCM/APNS.
- Delivery receipts/status updates are written back.
- User can fetch full inbox via read API.

Throughput estimate
- 10M DAU × 20 notifications/day = 200M notifications/day.
- Average throughput = about 2,315 notifications/sec.
- Real systems have spikes, so design for at least 10x peak headroom: 20k to 30k notification creations/sec, with channel fanout making downstream delivery higher.
- This scale is very manageable with partitioned queues and horizontally scaled stateless workers.
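To illustrate the orchestrator's durable-first ordering noted above (persist the canonical record, then schedule delivery), here is a minimal Python sketch. The store, queues, and prefs objects are hypothetical interfaces standing in for the notification store, per-channel delivery queues, and preference service.

```python
# Sketch: durable-first orchestration (persist, then schedule delivery).
# store, queues, and prefs are hypothetical interfaces.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Notification:
    notification_id: str
    recipient_user_id: str
    type: str
    payload: dict
    channels: List[str] = field(default_factory=list)

def orchestrate(event: dict, store, queues, prefs) -> None:
    """Handle one event: dedupe, decide fanout, persist, then emit channel jobs."""
    # Idempotency check using the event's dedupe key (orchestrator duty above).
    if store.exists(dedupe_key=event["dedupe_key"]):
        return

    # Fanout decision based on preferences (preference lookup duty above).
    channels = [
        c for c in ("in_app", "push", "email")
        if prefs.channel_enabled(event["recipient_user_id"], c)
    ]
    note = Notification(
        notification_id=event["event_id"],
        recipient_user_id=event["recipient_user_id"],
        type=event["type"],
        payload=event.get("payload", {}),
        channels=channels,
    )

    store.save(note)          # 1) durable write first, so data is not lost
    for channel in channels:  # 2) only then schedule per-channel delivery
        queues.enqueue(channel, note)
```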
2. Database schema

Use a split model:
- transactional durable store for canonical notification records and preferences
- cache for hot reads
- optional search/index store for advanced inbox queries

Basic schema

users_notification_preferences
- user_id PK
- push_enabled boolean
- email_enabled boolean
- in_app_enabled boolean
- quiet_hours_start
- quiet_hours_end
- timezone
- locale
- email_address
- email_verified boolean
- push_tokens json / separate table
- notification_type_settings json or normalized child table
- updated_at

user_device_tokens
- token_id PK
- user_id
- platform
- device_token
- app_version
- last_seen_at
- is_active
- unique(device_token)
- index(user_id)

notifications
- notification_id PK
- recipient_user_id
- actor_user_id nullable
- type
- object_type
- object_id
- dedupe_key
- title
- body
- payload json
- priority
- created_at
- expire_at nullable
- aggregation_key nullable
- source_event_id unique
- indexes:
  - (recipient_user_id, created_at desc)
  - (recipient_user_id, notification_id)
  - unique(source_event_id) or unique(dedupe_key)

notification_deliveries
- delivery_id PK
- notification_id
- channel enum(push, email, in_app)
- status enum(pending, sent, delivered, failed, suppressed)
- provider_message_id nullable
- attempt_count
- last_attempt_at
- next_retry_at nullable
- failure_reason nullable
- delivered_at nullable
- indexes:
  - (notification_id)
  - (channel, status, next_retry_at)

notification_reads
- notification_id
- user_id
- read_at
- primary key(notification_id, user_id)
- index(user_id, read_at)

Optional optimization: user_inbox table
- user_id
- notification_id
- created_at
- read_state
- primary key(user_id, created_at, notification_id)

This can be the main table in a wide-column store for inbox retrieval.

Storage choices
- Preferences: relational DB or strongly consistent key-value store.
- Notifications/inbox: Cassandra/DynamoDB/Bigtable-style wide-column KV is attractive because the access pattern is mostly by user_id and recent time-ordered reads, with very high write volume.
- Delivery audit/status: relational or KV store depending on reporting needs.
- Cache: Redis for unread counts and recent inbox pages.
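The status enum on notification_deliveries suggests a small delivery state machine (the reliability section below notes that the delivery state machine avoids invalid transitions). Here is a sketch in Python; the exact transition table is an assumption, since the answer names the states but not the allowed moves.

```python
# Sketch: delivery state machine over the notification_deliveries.status
# enum above; the allowed-transition table is an assumption.
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    SENT = "sent"
    DELIVERED = "delivered"
    FAILED = "failed"
    SUPPRESSED = "suppressed"

ALLOWED = {
    Status.PENDING: {Status.SENT, Status.FAILED, Status.SUPPRESSED},
    Status.SENT: {Status.DELIVERED, Status.FAILED},
    Status.FAILED: {Status.PENDING},  # re-queued by the retry processor
    Status.DELIVERED: set(),          # terminal
    Status.SUPPRESSED: set(),         # terminal
}

def transition(current: Status, new: Status) -> Status:
    """Reject invalid transitions, e.g., DELIVERED -> PENDING."""
    if new not in ALLOWED[current]:
        raise ValueError(f"invalid transition: {current.value} -> {new.value}")
    return new
```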
3. Scaling strategy

Horizontal scaling
- All stateless services horizontally auto-scale: orchestrator, channel workers, read APIs, WebSocket gateways.
- Partition event topics by recipient_user_id to spread load evenly and preserve per-user ordering.
- Separate queues by channel and priority so email backlog does not affect push latency.

Data scaling
- Shard notification data by user_id.
- Use time bucketing if needed for very large users, for example partition key (user_id, month).
- TTL older low-value notifications after the retention policy if business allows, while archiving to cheap object storage.
- Cache unread counts and latest N notifications.

Traffic shaping
- Priority lanes:
  - high: direct mentions, messages, security alerts
  - medium: comments, likes
  - low: digestable activity or promotional items
- During spikes, protect high-priority notifications first.
- Batch low-priority email generation.

Fanout strategy
- Prefer fanout-on-write for per-user inbox and real-time channels because the latency target is under 2 seconds.
- For extremely large fanout events, like celebrity posts to millions, use hybrid handling:
  - create coarse event once
  - asynchronously fanout in batches
  - potentially degrade non-critical channels to digest or delayed delivery
- This avoids stampedes.

Geo scaling
- Multi-region active-active for ingestion and delivery APIs.
- Keep user affinity to home region when possible to reduce cross-region writes.
- Replicate critical metadata globally.
- Channel providers are called from the nearest region.

Capacity planning
- 200M/day canonical notifications.
- If each creates 2 to 3 channel attempts on average, downstream delivery records may be 400M to 600M/day.
- Use queue partitions sized for peak throughput, e.g. dozens to hundreds of partitions depending on the broker.
- WebSocket gateway sized by concurrent online users, not DAU.

4. Reliability and fault tolerance

Availability target: 99.9%
- Multi-AZ deployment for every tier.
- No single point of failure.
- Load balancers across instances.
- Managed broker and DB with replication.

No loss of notification data
- Persist the canonical notification before delivery attempts.
- Use a durable log/queue with replication factor >= 3.
- Idempotent consumers and producer retries.
- Outbox pattern for upstream producer services if they emit notification events from transactional operations (sketched after this section).
  - Example: a comment write and notification event are tied using DB transaction + outbox to avoid lost events.
- Replay capability from event log and DLQ.

Failure handling
- At-least-once delivery internally, with deduplication at orchestrator and channel layers.
- Exponential backoff retries for transient provider failures.
- DLQ for malformed or repeatedly failing tasks.
- Circuit breakers around APNS/FCM/email providers.
- Fallback behavior:
  - if push provider degraded, still store in-app notification
  - if email provider down, keep queued and retry later

Data consistency and integrity
- Unique source_event_id or dedupe_key prevents duplicate notification creation.
- Delivery state machine avoids invalid transitions.
- Strong consistency for preference updates if immediate effect is required, otherwise bounded staleness via cache invalidation.

Disaster recovery
- Cross-region replicated metadata and backups.
- Periodic snapshots plus WAL/binlog or stream retention.
- Defined RPO near zero using the replicated durable log.
- RTO minimized with infrastructure-as-code and warm standby/active-active design.

Operational reliability
- SLOs for enqueue-to-persist and persist-to-delivery latency.
- Queue lag alarms.
- Synthetic probes for push/email/in-app paths.
- Backpressure controls when downstream providers throttle.
- Rate limits per tenant/event type to contain abuse.
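The outbox pattern called out under "No loss of notification data" can be sketched as follows, using sqlite3 from the Python standard library purely for illustration; table and column names are hypothetical. The comment row and its event commit in one transaction, and a separate relay later publishes unpublished outbox rows to the event bus.

```python
# Sketch: outbox pattern tying a comment write to its notification event.
# Uses sqlite3 purely for illustration; table/column names are hypothetical.
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (id TEXT PRIMARY KEY, post_id TEXT,
                           author_id TEXT, body TEXT);
    CREATE TABLE outbox   (event_id TEXT PRIMARY KEY, topic TEXT,
                           payload TEXT, published INTEGER DEFAULT 0);
""")

def create_comment(post_id: str, author_id: str, body: str) -> None:
    """Write the comment and its CommentCreated event atomically."""
    comment_id = str(uuid.uuid4())
    event = {
        "event_id": str(uuid.uuid4()),
        "type": "CommentCreated",
        "comment_id": comment_id,
        "post_id": post_id,
        "actor_user_id": author_id,
    }
    with conn:  # one transaction: both rows commit, or neither does
        conn.execute(
            "INSERT INTO comments VALUES (?, ?, ?, ?)",
            (comment_id, post_id, author_id, body),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, topic, payload) VALUES (?, ?, ?)",
            (event["event_id"], "notification-events", json.dumps(event)),
        )

# A separate relay process polls `SELECT * FROM outbox WHERE published = 0`,
# publishes each row to the event bus (Kafka/Pulsar), and then marks it
# published; a crash between write and publish therefore loses no events.
```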
5. Key trade-offs

Trade-off 1: At-least-once delivery vs exactly-once delivery
- Chosen: at-least-once with idempotency.
- Why: exactly-once across brokers, databases, and external providers like APNS/FCM/email is expensive and often impractical.
- Impact: occasional duplicate attempts are possible, but dedupe keys and idempotent writes make user-visible duplicates rare.
- Benefit: much simpler and more reliable under failure.

Trade-off 2: Wide-column/KV store vs relational DB for notification inbox
- Chosen: wide-column or Dynamo-style store for inbox, relational for preferences and some metadata.
- Why: the notification workload is high-write, append-heavy, and mostly reads by user and recency.
- Impact: weaker ad hoc query capability and more careful data modeling.
- Benefit: far better horizontal scale and predictable performance.

Trade-off 3: Fanout-on-write vs fanout-on-read
- Chosen: mostly fanout-on-write.
- Why: latency requirement under 2 seconds and need for push/in-app immediacy.
- Impact: more write amplification, especially for high-fanout events.
- Mitigation: hybrid approach for mega-fanout cases.

Trade-off 4: Push real-time delivery vs pull-only inbox
- Chosen: hybrid.
- Why: push/WebSocket gives low latency for active users, pull API gives resilience for offline users.
- Impact: more system complexity because both connection state and inbox persistence are required.
- Benefit: best user experience and reliability.

Suggested final architecture summary
- Ingest events through a durable event bus.
- Use a stateless Notification Orchestrator to dedupe, check preferences, persist notification records, and publish channel-specific jobs.
- Store inbox data in a horizontally scalable user-partitioned store, with relational or strongly consistent KV for preferences.
- Deliver via dedicated push, email, and in-app workers.
- Use WebSocket/SSE for real-time in-app updates and normal read APIs for inbox retrieval.
- Ensure reliability through multi-AZ deployment, replicated queues, durable-first writes, idempotency, retries, DLQs, replay, and cross-region disaster recovery.

This design comfortably supports 10M DAU and 200M notifications/day, while meeting sub-2-second delivery for most real-time notifications and maintaining 99.9% availability with no data loss.

Result

#2

Winning Votes: 1 / 3

Average Score: 83

Total Score: 76

Overall Comments

Answer B is a solid, well-organized design that covers all five required aspects competently. It demonstrates good architectural thinking with the Notification Orchestrator pattern, outbox pattern mention, and hybrid fanout strategy for celebrity posts. The schema is reasonably detailed with a notification_deliveries table that Answer A lacks. However, it is less precise in several areas: technology choices are often left vague (e.g., "Kafka/Pulsar style," "relational DB or strongly consistent key-value store"), capacity numbers are less specific, and the trade-off analysis is shallower with less quantitative reasoning. The reliability section mentions good patterns but lacks the specific configuration parameters and concrete metrics that Answer A provides. The overall structure is readable but less visually organized than Answer A.


Architecture Quality (Weight 30%): 75

Answer B presents a solid architecture with good component identification including the Notification Orchestrator, Preference Service, and Delivery Status service. The outbox pattern mention is a valuable addition. However, technology choices are often vague (Kafka/Pulsar style, relational or KV store), and the interaction flow is described in prose rather than a diagram, making it harder to visualize the system.

Completeness (Weight 20%): 85

Answer B also covers all five aspects and adds a notification_deliveries table that Answer A lacks, which is a genuine completeness strength. However, the schema section is less precise (using 'json' types without specifics, vague storage choices), and the scaling section lacks the concrete numbers that would demonstrate full completeness. The trade-offs cover four areas but with less depth per trade-off.

Trade-off Reasoning (Weight 20%): 70

Answer B identifies four relevant trade-offs but the analysis is shallower. Each trade-off is described briefly with a 'why' and 'benefit' but lacks quantitative comparisons or detailed impact analysis. For example, the fanout trade-off mentions 'write amplification' but doesn't quantify it or compare alternatives numerically. The reasoning is sound but not as insightful.

Scalability & Reliability (Weight 20%): 75

Answer B covers reliability well conceptually: multi-AZ, durable-first writes, idempotency, outbox pattern, circuit breakers, DLQ, and cross-region DR. The outbox pattern is a valuable addition. However, it lacks specific configuration parameters, concrete capacity numbers, and defined RTO/RPO targets, making the reliability guarantees less verifiable.

Clarity (Weight 10%): 75

Answer B is well-organized with clear section headers and bullet points. The prose-based descriptions are readable, and the final summary is helpful. However, the lack of a visual diagram makes the architecture harder to grasp at a glance, and some sections feel like lists of concepts rather than integrated explanations.

Judge Model: OpenAI GPT-5.2

Total Score: 89

Overall Comments

Strong high-level design that covers all required areas with pragmatic industry patterns and fewer questionable specifics. It clearly separates canonical persistence from delivery attempts, includes delivery status modeling, explicitly calls out the outbox pattern for upstream event reliability, and addresses fanout edge cases (mega-fanout) and priority lanes for latency protection. The schema is coherent (notifications + deliveries + reads) and ties back to retry/DLQ behavior. Slightly less concrete in some capacity numbers and exact component sizing than A, but overall more balanced and technically robust for preventing data loss and handling pathological fanout scenarios.


Architecture Quality (Weight 30%): 88

Well-structured orchestration-based architecture with canonical persistence, channel job emission, explicit delivery status service, retry/DLQ/replay, and real-time gateway. Includes priority lanes and mega-fanout strategy, which strengthens the overall architecture for social-media workloads.

Completeness (Weight 20%): 90

Covers all requested sections well: architecture, schema, scaling, reliability/DR, observability, and several meaningful trade-offs. Slightly less concrete sizing and fewer low-level deployment specifics than A.

Trade-off Reasoning (Weight 20%): 89

Trade-offs are directly tied to real constraints: exactly-once vs at-least-once, data model choice, fanout strategy including mega-fanout mitigation, and push vs pull hybrid. Clear impacts and mitigations are stated.

Scalability & Reliability (Weight 20%): 91

Very strong reliability posture: durable-first persistence, replicated log, idempotency, retries/DLQ, replay, circuit breakers, and explicitly calls out outbox pattern to avoid lost events. Also addresses traffic shaping and priority to maintain latency under spikes.

Clarity (Weight 10%): 87

Clear organization with concise bullets and consistent terminology; slightly less visual structure than A but easier to scan and still comprehensive.

Total Score: 84

Overall Comments

Answer B presents a solid and complete system design that correctly identifies the key components and challenges. It covers all the required sections with technically correct information and introduces good concepts like priority lanes and geo-scaling. However, it is less effective in its presentation and depth compared to Answer A. The lack of a visual diagram makes the architecture harder to grasp, and the explanations, while correct, are generally less detailed and concrete.


Architecture Quality (Weight 30%): 80

The architecture is technically sound and follows industry best practices. The components are well-chosen, but the description is more abstract than Answer A's. The lack of a visual diagram makes it harder to understand the interactions between components at a glance.

Completeness (Weight 20%): 90

The answer is very complete, thoroughly addressing all five sections required by the prompt. It covers all the necessary ground for a comprehensive high-level design.

Trade-off Reasoning (Weight 20%): 85

The answer discusses four relevant and important trade-offs. The reasoning is correct and demonstrates a good understanding of the concepts, though the explanations are slightly less detailed and impactful than those in Answer A.

Scalability & Reliability (Weight 20%): 88

This answer provides a very strong and thorough plan for scalability and reliability. It introduces advanced concepts like priority lanes for traffic shaping and geo-scaling, which are highly relevant for a large-scale system.

Clarity (Weight 10%): 75

The response is well-organized and written in clear language. However, as a system design answer, the lack of any architectural diagram is a significant drawback, making it much less clear and immediately understandable than Answer A.

Comparison Summary

Final rank order is determined by judge-wise rank aggregation (average rank + Borda tie-break). Average score is shown for reference.

Judges: 3

Answer A: Winning Votes 2 / 3, Average Score 89

Answer B: Winning Votes 1 / 3, Average Score 83

Judging Results

Why This Side Won

Answer A is the clear winner due to its superior clarity, depth, and practical detail. The inclusion of initial performance estimates and a system architecture diagram makes the entire design much easier to comprehend. Furthermore, Answer A's technical descriptions, particularly the database schema design for Cassandra and the quantitative reasoning in the trade-offs section, are more specific and demonstrate a more thorough engineering thought process than Answer B.

Judge Model: OpenAI GPT-5.2

Why This Side Won

Answer B wins on weighted criteria because it is more technically sound and operationally practical in the most important areas: it explicitly addresses end-to-end event reliability with the outbox pattern, models delivery state and retries cleanly, and discusses fanout-on-write vs fanout-on-read including mega-fanout mitigations and priority lanes to protect sub-2s latency. While Answer A is very detailed and strong, some of its specific implementation choices are more debatable at scale and it lacks the same depth on upstream event integrity and extreme fanout handling, giving B the edge in architecture quality and reliability/scalability reasoning.

Why This Side Won

Answer A wins on the highest-weighted criteria. In architecture quality (30%), Answer A provides a more precise, technically detailed design with specific technology choices, configuration parameters, and a clear ASCII diagram showing component interactions. In trade-off reasoning (20%), Answer A's analysis is quantitatively stronger, with concrete numbers comparing alternatives. In scalability and reliability (20%), Answer A specifies exact Kafka settings, Cassandra replication parameters, WebSocket node calculations, and RTO/RPO targets that demonstrate deeper engineering rigor. Answer B is competent and covers all areas, but Answer A's greater technical specificity and depth across the most heavily weighted criteria make it the clear winner.
