Answer A: OpenAI GPT-5.2
High-level architecture (components and interactions)

1) Event producers (source services)
- Trip Service: emits ride lifecycle events (driver_assigned, ride_completed).
- Dispatch/Matching Service: emits driver_assigned and any reassignment events.
- ETA/Location Service: continuously computes ETA from driver GPS streams; emits driver_arriving_soon when an ETA threshold (e.g., <=2 minutes) is crossed, with hysteresis.
- Promotions Service: creates campaign events with geo-targeting rules and audience definitions.

2) Notification Ingestion + Event Bus
- All services publish domain events to a durable event bus.
- Events are standardized (user_id, ride_id, event_type, timestamp, payload, idempotency_key, priority, locale).

3) Notification Orchestrator (rules + routing)
- Consumes events from the bus.
- Applies business rules: who to notify (rider, driver), quiet hours, user opt-outs, rate limits, do-not-disturb, fallback channels.
- Enriches notifications (fetch driver name/vehicle, receipt link, ETA text) via cached reads.
- Produces “notification jobs” to channel-specific queues with priority (transactional > promotional).

4) User/Device & Preferences Service
- Stores device tokens (APNs/FCM), platform, app version, last-seen, language, and notification preferences.
- Exposes low-latency lookup (cache-first).

5) Delivery Workers (channel adapters)
- Push Gateway: sends to Apple APNs and Google FCM.
- SMS Gateway (optional fallback for critical messages): Twilio or a direct aggregator.
- In-app/WebSocket Gateway (optional): for users currently active in the app.

6) Delivery Tracking + Retry + DLQ
- Delivery attempts recorded (sent, accepted by provider, failed with reason).
- Automatic retries with exponential backoff for transient failures.
- Dead-letter queue for poison messages; alerting and replay tools.

7) Promotional Targeting Pipeline
- Geo audience builder: converts geographic areas (geohash/H3 cells) plus eligibility criteria into target user sets.
- Uses near-real-time location signals (last known location) and/or home/work region.
- Outputs batches of notification jobs into lower-priority queues with throttling.

8) Observability and Ops
- Metrics: end-to-end latency p50/p95/p99, queue lag, provider error rates, token invalidation rates.
- Tracing: correlate event_id → job_id → provider request_id.
- Admin console: campaign management, replay, suppression lists.

Key technology choices and justification

1) Message queuing / event streaming
- Apache Kafka (or managed equivalents like AWS MSK / Confluent Cloud) as the central event bus. Justification: high throughput during rush hours, partitioning for horizontal scale, a durable log for replay, consumer groups for independent scaling, and a good fit for at-least-once processing.
- Separate topics for:
  - ride-events (transactional)
  - eta-events
  - promo-events
  - notification-jobs-high (priority)
  - notification-jobs-low (promo)
  - delivery-results

2) Datastores
- Device tokens and preferences: DynamoDB (or Cassandra) keyed by user_id. Justification: predictable low-latency reads at massive scale, high availability, easy horizontal scaling.
- Delivery tracking / analytics:
  - Hot path: DynamoDB/Cassandra for recent state (last status per notification_id).
  - Long-term analytics: data lake (S3/GCS) + warehouse (Snowflake/BigQuery) fed by Kafka Connect.
- Campaign/audience metadata: Postgres (or Aurora) for relational management (campaigns, schedules, creatives).
- Caching: Redis (clustered) for device token lookups, user preference cache, and template fragments.

3) Push notification services
- APNs for iOS and FCM for Android. Justification: official, reliable, scalable push infrastructure; supports priority and collapse keys.
- Optional SMS provider as a fallback to meet reliability targets for critical transactional notifications.

4) Geo targeting
- H3 or Geohash indexing for geographic regions. Justification: efficient mapping from lat/lon to discrete cells; supports querying “users in these cells”.
- Stream processing: Kafka Streams / Flink for maintaining “users-in-cell” membership based on location updates.

Low latency (<2s) and high reliability (at-least-once)

1) Low latency strategy
- Prioritize transactional notifications:
  - Use dedicated high-priority topics/queues and worker pools.
  - Apply strict per-message SLAs: short batching windows (or none) for urgent events.
- Cache-first enrichment:
  - The orchestrator reads device tokens/preferences from Redis; fall back to DynamoDB on a cache miss.
  - Keep payloads minimal; include deep links to fetch details in-app.
- Minimize synchronous dependencies:
  - Producers publish events asynchronously.
  - The orchestrator avoids calling multiple microservices in-line; it uses precomputed data (e.g., driver info already in the event or obtainable from cache).
- Connection reuse and provider best practices:
  - Maintain persistent HTTP/2 connections to APNs; reuse FCM connections.
  - Use provider priority flags appropriately.
- Control “arriving soon” noise:
  - The ETA service emits only on threshold crossing, with a cooldown (e.g., don’t resend within N minutes), to reduce load and preserve latency for critical messages.

2) At-least-once delivery and correctness
- At-least-once from Kafka; consumers commit offsets only after processing.
- Idempotency:
  - Each notification job carries a deterministic idempotency_key (e.g., user_id + ride_id + event_type + version).
  - The orchestrator writes a “job created” record (or dedupe key) with a conditional put to prevent duplicate job creation on replays.
  - Delivery workers record attempts keyed by notification_id to avoid double-sending where possible.
- Provider-level dedupe:
  - Use APNs/FCM collapse keys for certain types (e.g., driver arriving soon) so the latest notification replaces prior ones.
- Retry policy:
  - Transient failures: retry with exponential backoff and jitter.
  - Permanent failures (invalid token): mark the token invalid and stop retrying.
  - DLQ for repeated failures; operator workflows for replay.

3) Reliability and availability
- Multi-AZ deployment for Kafka, Redis, DynamoDB (managed), and stateless services.
- Backpressure: if push providers degrade, queues absorb spikes; workers scale out but are capped to avoid provider rate limits.
- Exactly-once is not required; at-least-once plus idempotency is sufficient for user-facing notifications.

Scaling to handle peak loads

1) Throughput estimation (order of magnitude)
- 500k rides/day; each ride might generate 2–4 transactional notifications for the rider (assigned, arriving soon, completed/receipt), plus driver-side notifications.
- Rush-hour peaks can be 10–20x the average. Design for sustained bursts of several thousand notifications/sec.

2) Horizontal scaling approach
- Kafka partitioning:
  - Partition by user_id (or ride_id) to preserve ordering per user/ride for related notifications.
  - Scale partitions to match expected peak consumer parallelism.
- Stateless services:
  - The orchestrator and delivery workers are stateless and auto-scaled (Kubernetes HPA based on CPU and queue lag).
- Separate pools and isolation:
  - Separate topics/queues and worker deployments for transactional vs. promotional traffic.
  - Hard quotas so promotions never starve transactional delivery.

3) Promotional message scaling
- Precompute the audience:
  - A campaign expands to H3 cells; fetch eligible users via the “users-in-cell” store.
  - Fan out in batches with throttling; enqueue jobs to the low-priority queue.
- Rate limiting:
  - Global and per-region caps; time-slicing across cells.
  - Respect user-level frequency caps and opt-outs.

4) Data and cache scaling
- Redis cluster sized for high-QPS reads; use consistent hashing and replication.
- DynamoDB/Cassandra provisioned with enough read capacity; the cache protects it during spikes.
- Token invalidation handling to reduce repeated failed sends.
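The idempotency and retry mechanics described above can be sketched in a few lines. This is a minimal in-memory illustration, not the real storage layer: `JobStore` stands in for a conditional put against DynamoDB/Cassandra, and helper names like `make_idempotency_key` and `backoff_with_jitter` are illustrative, not an existing API.

```python
import hashlib
import random

def make_idempotency_key(user_id: str, ride_id: str, event_type: str, version: int) -> str:
    """Deterministic key: replays of the same event always map to the same key."""
    raw = f"{user_id}:{ride_id}:{event_type}:{version}"
    return hashlib.sha256(raw.encode()).hexdigest()

class JobStore:
    """In-memory stand-in for a conditional put (e.g., 'create only if key absent')."""
    def __init__(self) -> None:
        self._jobs: dict[str, dict] = {}

    def create_job_if_absent(self, key: str, job: dict) -> bool:
        if key in self._jobs:  # duplicate: the event was replayed
            return False
        self._jobs[key] = job
        return True

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: a random delay up to the capped exponential."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# A replayed event produces the same key, so only one notification job is created.
store = JobStore()
key = make_idempotency_key("u1", "r42", "driver_assigned", 1)
created_first = store.create_job_if_absent(key, {"channel": "push"})   # True
created_replay = store.create_job_if_absent(key, {"channel": "push"})  # False
```

The same pattern extends to delivery workers: record each attempt under the notification_id before calling the provider, so a crashed worker that replays its batch skips already-sent messages.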
Major trade-offs

1) At-least-once vs. exactly-once
- Trade-off: at-least-once can cause duplicates; exactly-once would add complexity and latency.
- Decision: accept at-least-once with strong idempotency keys and collapse keys to minimize user-visible duplicates.

2) Consistency vs. availability
- Device tokens/preferences may be slightly stale (eventual consistency) after a user changes settings.
- Decision: favor availability and low latency; use short-TTL caches and versioning to converge quickly.

3) Cost vs. performance
- Maintaining a real-time geo audience index and stream processor costs more than batch targeting.
- Decision: use near-real-time geo targeting only when needed; otherwise allow campaigns with slower batch expansion. Isolate promo workloads to cheaper capacity.

4) Payload enrichment approach
- Fully enriched messages require more synchronous reads and can increase latency.
- Decision: keep the push payload small and rely on deep links; enrich only when data is already in the event or cached.

5) Provider dependencies and fallback
- Adding SMS fallback increases cost and compliance scope.
- Decision: enable SMS fallback only for critical transactional notifications (e.g., driver assigned/arriving) and only when push repeatedly fails or the user has no valid token.

Summary

This design uses a durable event-streaming backbone (Kafka), a rules-based notification orchestrator, and scalable channel delivery workers to achieve sub-2-second transactional notification delivery with at-least-once reliability. It scales horizontally via partitioning and autoscaling, isolates promotional traffic to protect time-sensitive messages, and manages duplicates through idempotency and provider collapse keys while balancing cost and operational complexity.
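The partition-by-user_id ordering guarantee that the design relies on can be sketched as follows. This mirrors the general idea behind Kafka's keyed partitioning (a stable hash of the key modulo the partition count); the function name and the CRC32 hash here are illustrative choices, not Kafka's actual partitioner.

```python
import zlib

NUM_PARTITIONS = 12  # illustrative; sized for peak consumer parallelism

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of the key -> partition, so all of one user's events share a partition."""
    return zlib.crc32(user_id.encode()) % num_partitions

# Both of rider u1's events land on the same partition, so driver_assigned is
# consumed before driver_arriving_soon even as consumers scale out.
events = [("u1", "driver_assigned"), ("u2", "driver_assigned"), ("u1", "driver_arriving_soon")]
placements = [(partition_for(uid), evt) for uid, evt in events]
```

The trade-off is that one very hot key cannot be spread across partitions, which is acceptable here because no single rider generates meaningful per-key load.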
Result
Winning Votes
3 / 3
Average Score
Total Score
Overall Comments
Answer A presents an exceptionally thorough and well-structured system design that covers all required aspects with significant depth. It demonstrates expert-level understanding of real-time notification systems with detailed component descriptions, nuanced technology choices, and sophisticated strategies for latency, reliability, and scaling. The trade-off analysis is particularly strong, covering five distinct trade-offs with clear reasoning. The design includes advanced concepts like H3/geohash indexing, hysteresis for ETA threshold crossing, collapse keys for deduplication, and careful separation of transactional vs promotional workloads. The answer also addresses operational concerns like observability, admin tooling, and token invalidation handling.
Architecture Quality
Weight 30%
Answer A presents a comprehensive 8-component architecture with clear separation of concerns, including specialized components like the ETA service with hysteresis, promotional targeting pipeline with H3 cells, and a full observability layer. The event-driven design is well-articulated with explicit data flow between components.
Completeness
Weight 20%
Answer A addresses all required points thoroughly: architecture, technology choices, latency/reliability strategy, scaling, and trade-offs. It also goes beyond requirements with operational concerns like observability, admin console, token invalidation handling, and a detailed geo-targeting pipeline. The event schema standardization is a nice detail.
Trade-off Reasoning
Weight 20%
Answer A discusses five well-reasoned trade-offs covering at-least-once vs exactly-once, consistency vs availability, cost vs performance, payload enrichment approach, and provider dependencies/fallback. Each trade-off includes a clear decision and rationale. The payload enrichment trade-off and SMS fallback scope considerations show practical engineering judgment.
Scalability & Reliability
Weight 20%
Answer A provides realistic throughput estimates (several thousand notifications/sec sustained bursts during peaks) and detailed horizontal scaling strategies including Kafka partitioning by user_id, separate worker pools for transactional vs promotional, autoscaling based on CPU and queue lag, and rate limiting for promotions. The reliability strategy with idempotency keys, collapse keys, DLQ, and multi-AZ deployment is comprehensive.
Clarity
Weight 10%
Answer A is well-organized with clear numbered sections and subsections. The dense technical content is presented logically. However, the sheer volume of detail can make it slightly harder to follow compared to a more narrative approach. The summary at the end helps tie everything together.
Total Score
Overall Comments
Answer A provides an outstandingly detailed and professional system design. Its strength lies in the granular and realistic breakdown of the architecture into distinct, well-defined components like a separate Promotional Targeting Pipeline and an Observability stack. The technology choices are expertly justified, and the strategies for latency, reliability, and scalability are comprehensive and practical. The trade-off analysis is nuanced and covers multiple dimensions of the design.
Architecture Quality
Weight 30%
The architecture is exceptionally detailed and well thought out. It breaks the system into granular, realistic components like a dedicated Promotional Targeting Pipeline and an Observability/Ops section, which demonstrates a deep understanding of production systems. The interactions are clearly defined.
Completeness
Weight 20%
The answer is extremely complete, addressing every single point from the prompt with significant detail and depth. All required sections are present and thoroughly explained.
Trade-off Reasoning
Weight 20%
The trade-off analysis is excellent and nuanced. It covers a wide range of considerations, including at-least-once vs. exactly-once, consistency vs. availability, and more subtle points like payload enrichment strategies and the cost implications of SMS fallbacks. Each decision is clearly justified.
Scalability & Reliability
Weight 20%
The strategies for scalability and reliability are robust and well-explained. The design correctly uses Kafka partitioning, stateless auto-scaling services, and resource isolation. The reliability section thoroughly covers idempotency, retries, and DLQs.
Clarity
Weight 10%
The answer is perfectly clear, exceptionally well-structured, and uses precise technical language. The use of numbered lists and clear headings makes it very easy to read and understand the complex design.
Total Score
Overall Comments
Answer A presents a stronger and more production-ready design. It covers the full pipeline from event producers through orchestration, delivery, retries, DLQ, observability, and promotional geo-targeting. Technology choices are well matched to requirements, and the answer gives concrete mechanisms for latency control, idempotency, prioritization, backpressure, and workload isolation. Its trade-off discussion is practical and grounded. Minor weaknesses are that some implementation choices are broad rather than narrowed to a single stack, and it does not quantify capacity in great depth.
Architecture Quality
Weight 30%
The architecture is well decomposed into event producers, bus, orchestrator, user/device store, delivery workers, tracking, promo pipeline, and observability. It handles both transactional and promotional flows cleanly and includes practical concerns like priority queues, fallback channels, and threshold-based ETA emission.
Completeness
Weight 20%
It addresses all requested points thoroughly: architecture, tech choices, latency, reliability, scaling, and trade-offs. It also explicitly handles all notification types and adds useful details such as opt-outs, rate limits, DLQ replay, token invalidation, and geo audience building.
Trade-off Reasoning
Weight 20%
The trade-offs are concrete and directly tied to the design, especially around at-least-once versus exactly-once, availability versus consistency for preferences, cost of geo-indexing, enrichment latency, and SMS fallback scope. The reasoning is pragmatic and balanced.
Scalability & Reliability
Weight 20%
This is the strongest area of Answer A. It uses partitioned Kafka topics, autoscaled stateless workers, queue isolation, backpressure, retries with jitter, DLQ, idempotency keys, conditional dedupe writes, and multi-AZ deployment. The peak-load discussion is realistic and avoids starving critical traffic with promotions.
Clarity
Weight 10%
The answer is clear and logically structured with numbered sections and concise bullets. It is dense but still readable, though slightly more complex and less polished stylistically than Answer B.