Design a Real-Time Ride-Sharing Notification System

Compare model answers for this System Design benchmark and review scores, judging comments, and related examples.

X f L

Contents

Task Overview

Benchmark Genres

System Design

Task Creator Model The task creator is randomly selected from top task-generation models of supported providers.

Google Gemini 2.5 Pro

Answering Models In this benchmark, models from the same provider as the task creator are excluded from answering.

Answer A OpenAI GPT-5.2

Answer B Anthropic Claude Sonnet 4.6

Judge Models Judging uses exactly 3 judge models, excluding the answering models. At least 1 judge is selected from flagship models, lightweight models are not selected as judges, and the 3 judges come from 3 distinct providers.

OpenAI GPT-5.4 Anthropic Claude Opus 4.6 Google Gemini 2.5 Pro

Task Prompt

Show more ▼

You are tasked with designing the high-level architecture for a notification system for a popular ride-sharing application. The system must be able to handle 1 million daily active users and an average of 500,000 rides per day, with peaks during rush hours. The system needs to deliver the following types of notifications: 1. Driver has been assigned. 2. Driver is arriving soon (e.g., 2 minutes away). 3. Ride has been completed and receipt is available. 4. Promotional messages targeted to users in specific geographic areas. Your design proposal should address the following points: - A high-level architectural description of the components and their interactions. - Key technology choices (e.g., for message queuing, databases, push notification services) and justification for your choices. - A strategy for ensuring low latency (under 2 seconds delivery time) and high reliability (at-least-once delivery). - How the system would scale to handle peak loads. - A discussion of the major trade-offs you made in your design (e.g., cost vs. performance, consistency vs. availability).

Task Context

The goal is to create a robust and scalable system that reliably delivers time-sensitive information to both riders and drivers, which is critical for the user experience of a ride-sharing service.

Judging Policy

A high-quality answer will present a coherent and logical system design. The evaluation will focus on the practicality of the proposed architecture, the justification for technology choices, and the depth of the trade-off analysis. The design should realistically address the specified requirements for scale, latency, and reliability. The answer should clearly explain how different components work together and demonstrate an understanding of the challenges involved in building a real-time notification system.

Answers

Answer A: OpenAI GPT-5.2

High-level architecture (components and interactions)

Event producers (source services)

Trip Service: emits ride lifecycle events (driver_assigned, ride_completed).
Dispatch/Matching Service: emits driver_assigned and any reassignment events.
ETA/Location Service: continuously computes ETA from driver GPS streams; emits driver_arriving_soon when ETA threshold (e.g., <=2 minutes) is crossed with hysteresis.
Promotions Service: creates campaign events with geo-targeting rules and audience definitions.

Notification Ingestion + Event Bus

All services publish domain events to a durable event bus.
Ev...

Show Full Answer ▼

High-level architecture (components and interactions)

Event producers (source services)

Trip Service: emits ride lifecycle events (driver_assigned, ride_completed).
Dispatch/Matching Service: emits driver_assigned and any reassignment events.
ETA/Location Service: continuously computes ETA from driver GPS streams; emits driver_arriving_soon when ETA threshold (e.g., <=2 minutes) is crossed with hysteresis.
Promotions Service: creates campaign events with geo-targeting rules and audience definitions.

Notification Ingestion + Event Bus

All services publish domain events to a durable event bus.
Events are standardized (user_id, ride_id, event_type, timestamp, payload, idempotency_key, priority, locale).

Notification Orchestrator (rules + routing)

Consumes events from the bus.
Applies business rules (who to notify: rider, driver; quiet hours; user opt-outs; rate limits; do-not-disturb; fallback channels).
Enriches notifications (fetch driver name/vehicle, receipt link, ETA text) via cached reads.
Produces “notification jobs” to channel-specific queues with priority (transactional > promotional).

User/Device & Preferences Service

Stores device tokens (APNs/FCM), platform, app version, last-seen, language, and notification preferences.
Exposes low-latency lookup (cache-first).

Delivery Workers (channel adapters)

Push Gateway: sends to Apple APNs and Google FCM.
SMS Gateway (optional fallback for critical messages): Twilio or direct aggregator.
In-app/WebSocket Gateway (optional): for users currently active in the app.

Delivery Tracking + Retry + DLQ

Delivery attempts recorded (sent, accepted by provider, failed with reason).
Automatic retries with exponential backoff for transient failures.
Dead-letter queue for poison messages; alerting and replay tools.

Promotional Targeting Pipeline

Geo audience builder: converts geographic areas (geohash/H3 cells) + eligibility criteria into target user sets.
Uses near-real-time location signals (last known location) and/or home/work region.
Outputs batches of notification jobs into lower-priority queues with throttling.

Observability and Ops

Metrics: end-to-end latency p50/p95/p99, queue lag, provider error rates, token invalidation rates.
Tracing: correlate event_id → job_id → provider request_id.
Admin console: campaign management, replay, suppression lists.

Key technology choices and justification

Message queuing / event streaming

Apache Kafka (or managed equivalents like AWS MSK / Confluent Cloud) as the central event bus.
Justification: high throughput during rush hours, partitioning for horizontal scale, durable log for replay, consumer groups for independent scaling, good fit for at-least-once processing.
Separate topics for:
- ride-events (transactional)
- eta-events
- promo-events
- notification-jobs-high (priority)
- notification-jobs-low (promo)
- delivery-results

Datastores

Device tokens and preferences: DynamoDB (or Cassandra) keyed by user_id.
Justification: predictable low-latency reads at massive scale, high availability, easy horizontal scaling.
Delivery tracking / analytics:
- Hot path: DynamoDB/Cassandra for recent state (last status per notification_id).
- Long-term analytics: data lake (S3/GCS) + warehouse (Snowflake/BigQuery) fed by Kafka Connect.
Campaign/audience metadata: Postgres (or Aurora) for relational management (campaigns, schedules, creatives).
Caching: Redis (clustered) for device token lookups, user preference cache, and template fragments.

Push notification services

APNs for iOS and FCM for Android.
Justification: official, reliable, scalable push infrastructure; supports priority and collapse keys.
Optional SMS provider for fallback to meet reliability for critical transactional notifications.

Geo targeting

H3 or Geohash indexing for geographic regions.
Justification: efficient mapping from lat/lon to discrete cells; supports querying “users in these cells”.
Stream processing: Kafka Streams / Flink for maintaining “users-in-cell” membership based on location updates.

Low latency (<2s) and high reliability (at-least-once)

Low latency strategy

Prioritize transactional notifications:
- Use dedicated high-priority topics/queues and worker pools.
- Apply strict per-message SLAs: short batching windows (or none) for urgent events.
Cache-first enrichment:
- Orchestrator reads device tokens/preferences from Redis; fall back to DynamoDB on cache miss.
- Keep payload minimal; include deep links to fetch details in-app.
Minimize synchronous dependencies:
- Producers publish events asynchronously.
- Orchestrator avoids calling multiple microservices in-line; uses precomputed data (e.g., driver info already in event or obtainable from cache).
Connection reuse and provider best practices:
- Maintain persistent HTTP/2 connections to APNs; reuse FCM connections.
- Use provider priority flags appropriately.
Control “arriving soon” noise:
- ETA service emits only on threshold crossing with cooldown (e.g., don’t resend within N minutes) to reduce load and keep latency for critical messages.

At-least-once delivery and correctness

At-least-once from Kafka + consumer commits after processing.
Idempotency:
- Each notification job carries a deterministic idempotency_key (e.g., user_id + ride_id + event_type + version).
- Orchestrator writes a “job created” record (or dedupe key) with conditional put to prevent duplicate job creation on replays.
- Delivery workers record attempts keyed by notification_id to avoid double-sending when possible.
Provider-level dedupe:
- Use APNs/FCM collapse keys for certain types (e.g., driver arriving soon) to ensure the latest replaces prior.
Retry policy:
- Transient failures: retry with exponential backoff and jitter.
- Permanent failures (invalid token): mark token invalid, stop retrying.
- DLQ for repeated failures; operator workflows for replay.

Reliability and availability

Multi-AZ deployment for Kafka, Redis, DynamoDB (managed), and stateless services.
Backpressure:
- If push providers degrade, queues absorb spikes; workers scale but capped to avoid provider rate limits.
Exactly-once is not required; at-least-once plus idempotency is sufficient for user-facing notifications.

Scaling to handle peak loads

Throughput estimation (order-of-magnitude)

500k rides/day; each ride might generate 2–4 transactional notifications (assigned, arriving soon, completed/receipt) for rider; plus driver-side notifications.
Peaks during rush hour can be 10–20x average. Design for several thousand notifications/sec sustained bursts.

Horizontal scaling approach

Kafka partitioning:
- Partition by user_id (or ride_id) to preserve ordering per user/ride for related notifications.
- Scale partitions to match expected peak consumer parallelism.
Stateless services:
- Orchestrator and delivery workers are stateless and auto-scaled (Kubernetes HPA based on CPU + queue lag).
Separate pools and isolation:
- Separate topics/queues and worker deployments for transactional vs promotional.
- Hard quotas so promotions never starve transactional delivery.

Promotional message scaling

Precompute audience:
- Campaign expands to H3 cells; fetch eligible users via “users-in-cell” store.
- Fanout in batches with throttling; enqueue jobs to low-priority queue.
Rate limiting:
- Global and per-region caps; time-slicing across cells.
- Respect user-level frequency caps and opt-outs.

Data and cache scaling

Redis cluster sized for high QPS reads; use consistent hashing and replication.
DynamoDB/Cassandra with enough read capacity; cache protects it during spikes.
Token invalidation handling to reduce repeated failed sends.

Major trade-offs

At-least-once vs exactly-once

Trade-off: at-least-once can cause duplicates; exactly-once would add complexity and latency.
Decision: accept at-least-once with strong idempotency keys and collapse keys to minimize user-visible duplicates.

Consistency vs availability

Device tokens/preferences may be slightly stale (eventual consistency) if a user changes settings.
Decision: favor availability and low latency; use short TTL caches and versioning to converge quickly.

Cost vs performance

Maintaining a real-time geo audience index and stream processor costs more than batch targeting.
Decision: use near-real-time geo targeting only when needed; otherwise allow campaigns with slower batch expansion. Isolate promo workloads to cheaper capacity.

Payload enrichment approach

Fully enriched messages require more synchronous reads and can increase latency.
Decision: keep push payload small and rely on deep links; enrich only when data is already in-event or cached.

Provider dependencies and fallback

Adding SMS fallback increases cost and compliance scope.
Decision: enable SMS fallback only for critical transactional notifications (e.g., driver assigned/arriving) and only when push repeatedly fails or user has no valid token.

Summary
This design uses a durable event streaming backbone (Kafka), a rules-based notification orchestrator, and scalable channel delivery workers to achieve sub-2-second transactional notification delivery with at-least-once reliability. It scales horizontally via partitioning and autoscaling, isolates promotional traffic to protect time-sensitive messages, and manages duplicates through idempotency and provider collapse keys while balancing cost and operational complexity.

Result

#1 | Winner

Winning Votes

3 / 3

Average Score

Judge Models Anthropic Claude Opus 4.6

Total Score

Overall Comments

Answer A presents an exceptionally thorough and well-structured system design that covers all required aspects with significant depth. It demonstrates expert-level understanding of real-time notification systems with detailed component descriptions, nuanced technology choices, and sophisticated strategies for latency, reliability, and scaling. The trade-off analysis is particularly strong, covering five distinct trade-offs with clear reasoning. The design includes advanced concepts like H3/geohash indexing, hysteresis for ETA threshold crossing, collapse keys for deduplication, and careful separation of transactional vs promotional workloads. The answer also addresses operational concerns like observability, admin tooling, and token invalidation handling.

View Score Details ▼

Architecture Quality

Weight 30%

Answer A presents a comprehensive 8-component architecture with clear separation of concerns, including specialized components like the ETA service with hysteresis, promotional targeting pipeline with H3 cells, and a full observability layer. The event-driven design is well-articulated with explicit data flow between components.

Completeness

Weight 20%

Answer A addresses all required points thoroughly: architecture, technology choices, latency/reliability strategy, scaling, and trade-offs. It also goes beyond requirements with operational concerns like observability, admin console, token invalidation handling, and detailed geo-targeting pipeline. The event schema standardization is a nice detail.

Trade-off Reasoning

Weight 20%

Answer A discusses five well-reasoned trade-offs covering at-least-once vs exactly-once, consistency vs availability, cost vs performance, payload enrichment approach, and provider dependencies/fallback. Each trade-off includes a clear decision and rationale. The payload enrichment trade-off and SMS fallback scope considerations show practical engineering judgment.

Scalability & Reliability

Weight 20%

Answer A provides realistic throughput estimates (several thousand notifications/sec sustained bursts during peaks) and detailed horizontal scaling strategies including Kafka partitioning by user_id, separate worker pools for transactional vs promotional, autoscaling based on CPU and queue lag, and rate limiting for promotions. The reliability strategy with idempotency keys, collapse keys, DLQ, and multi-AZ deployment is comprehensive.

Clarity

Weight 10%

Answer A is well-organized with clear numbered sections and subsections. The dense technical content is presented logically. However, the sheer volume of detail can make it slightly harder to follow compared to a more narrative approach. The summary at the end helps tie everything together.

Judge Models Google Gemini 2.5 Pro

Total Score

Overall Comments

Answer A provides an outstandingly detailed and professional system design. Its strength lies in the granular and realistic breakdown of the architecture into distinct, well-defined components like a separate Promotional Targeting Pipeline and an Observability stack. The technology choices are expertly justified, and the strategies for latency, reliability, and scalability are comprehensive and practical. The trade-off analysis is nuanced and covers multiple dimensions of the design.

View Score Details ▼

Architecture Quality

Weight 30%

The architecture is exceptionally detailed and well-thought-out. It breaks the system into granular, realistic components like a dedicated Promotional Targeting Pipeline and an Observability/Ops section, which demonstrates a deep understanding of production systems. The interactions are clearly defined.

Completeness

Weight 20%

100

The answer is extremely complete, addressing every single point from the prompt with significant detail and depth. All required sections are present and thoroughly explained.

Trade-off Reasoning

Weight 20%

The trade-off analysis is excellent and nuanced. It covers a wide range of considerations, including at-least-once vs. exactly-once, consistency vs. availability, and more subtle points like payload enrichment strategies and the cost implications of SMS fallbacks. Each decision is clearly justified.

Scalability & Reliability

Weight 20%

The strategies for scalability and reliability are robust and well-explained. The design correctly uses Kafka partitioning, stateless auto-scaling services, and resource isolation. The reliability section thoroughly covers idempotency, retries, and DLQs.

Clarity

Weight 10%

100

The answer is perfectly clear, exceptionally well-structured, and uses precise technical language. The use of numbered lists and clear headings makes it very easy to read and understand the complex design.

Judge Models OpenAI GPT-5.4

Total Score

Overall Comments

Answer A presents a stronger and more production-ready design. It covers the full pipeline from event producers through orchestration, delivery, retries, DLQ, observability, and promotional geo-targeting. Technology choices are well matched to requirements, and the answer gives concrete mechanisms for latency control, idempotency, prioritization, backpressure, and workload isolation. Its trade-off discussion is practical and grounded. Minor weaknesses are that some implementation choices are broad rather than narrowed to a single stack, and it does not quantify capacity in great depth.

View Score Details ▼

Architecture Quality

Weight 30%

The architecture is well decomposed into event producers, bus, orchestrator, user/device store, delivery workers, tracking, promo pipeline, and observability. It handles both transactional and promotional flows cleanly and includes practical concerns like priority queues, fallback channels, and threshold-based ETA emission.

Completeness

Weight 20%

It addresses all requested points thoroughly: architecture, tech choices, latency, reliability, scaling, and trade-offs. It also explicitly handles all notification types and adds useful details such as opt-outs, rate limits, DLQ replay, token invalidation, and geo audience building.

Trade-off Reasoning

Weight 20%

The trade-offs are concrete and directly tied to the design, especially around at-least-once versus exactly-once, availability versus consistency for preferences, cost of geo-indexing, enrichment latency, and SMS fallback scope. The reasoning is pragmatic and balanced.

Scalability & Reliability

Weight 20%

This is the strongest area of Answer A. It uses partitioned Kafka topics, autoscaled stateless workers, queue isolation, backpressure, retries with jitter, DLQ, idempotency keys, conditional dedupe writes, and multi-AZ deployment. The peak-load discussion is realistic and avoids starving critical traffic with promotions.

Clarity

Weight 10%

The answer is clear and logically structured with numbered sections and concise bullets. It is dense but still readable, though slightly more complex and less polished stylistically than Answer B.

Answer B: Anthropic Claude Sonnet 4.6

Real-Time Ride-Sharing Notification System: High-Level Architecture Design

Overview

The notification system must reliably deliver time-sensitive messages to 1 million daily active users across 500,000 daily rides, with sub-2-second latency and at-least-once delivery guarantees. The design follows an event-driven, microservices architecture with clear separation between event production, routing, delivery, and tracking.

High-Level Architectural Components and Interactions

The system is composed of five major layers:

Event Producers: The core ride-matching and trip-management services emit domain event...

Show Full Answer ▼

Real-Time Ride-Sharing Notification System: High-Level Architecture Design

Overview

High-Level Architectural Components and Interactions

The system is composed of five major layers:

Event Producers: The core ride-matching and trip-management services emit domain events (driver assigned, driver arriving, ride completed, promo triggered) to a central message broker. These services are decoupled from the notification system entirely.

Message Broker (Event Bus): A distributed streaming platform ingests all events and fans them out to downstream consumers. This is the backbone of the asynchronous pipeline.

Notification Orchestrator Service: A stateless microservice that consumes events from the broker, enriches them with user preferences and device tokens (fetched from a User Profile Store), determines the correct notification channel (push, SMS, in-app), and publishes formatted notification jobs to channel-specific queues.

Channel Delivery Workers: Dedicated worker pools for each channel (APNs for iOS, FCM for Android, Twilio for SMS, WebSocket server for in-app). Each worker consumes from its queue, calls the appropriate third-party gateway, and records delivery status.

Delivery Tracking and Audit Store: A persistent log of every notification attempt, its status (sent, delivered, failed), and timestamps. This feeds dashboards and retry logic.

Flow summary: Ride service emits event → Message broker → Orchestrator enriches and routes → Channel queue → Delivery worker → Push/SMS gateway → Device. Delivery status is written back to the audit store asynchronously.

Key Technology Choices and Justifications

Message Broker: Apache Kafka. Kafka provides high-throughput, durable, ordered event streaming with configurable retention. Its consumer group model allows multiple orchestrator instances to process events in parallel. Kafka's replication factor (set to 3) ensures no event is lost if a broker node fails. For rush-hour spikes, Kafka absorbs bursts without back-pressure on upstream services.

Orchestrator and Workers: Kubernetes-deployed Go or Java microservices. Go offers low memory overhead and high concurrency via goroutines, ideal for I/O-bound notification dispatch. Kubernetes enables horizontal pod autoscaling based on Kafka consumer lag metrics.

User Profile and Device Token Store: Redis (primary, for hot data) backed by PostgreSQL (source of truth). Device tokens and user preferences are cached in Redis with a TTL. Cache-aside pattern ensures the orchestrator retrieves tokens in under 1 ms on cache hit, avoiding database bottlenecks at scale.

Push Notification Gateways: Firebase Cloud Messaging (FCM) for Android and Apple Push Notification service (APNs) for iOS. Both support batch sending APIs. For SMS fallback, Twilio provides reliable global delivery with delivery receipts.

In-App / Real-Time Channel: A WebSocket gateway (using Socket.io or a custom server backed by Redis Pub/Sub) maintains persistent connections with active app sessions. For driver-arriving and driver-assigned notifications, the in-app channel is attempted first because it is the lowest latency path.

Delivery Audit Store: Apache Cassandra or Amazon DynamoDB. Both are optimized for high write throughput and time-series access patterns. Each notification attempt is written as an immutable record keyed by (user_id, notification_id, timestamp).

Geographic Targeting for Promotions: A geospatial index using PostGIS or Redis with the GEO commands stores user last-known locations. A promotional campaign service queries this index to build target user lists, then publishes individual notification events to Kafka, keeping the delivery pipeline uniform.

Low Latency and High Reliability Strategy

Low Latency (under 2 seconds end-to-end):

The critical path for transactional notifications (driver assigned, driver arriving) is: event emission → Kafka ingestion (< 10 ms) → orchestrator processing including Redis token lookup (< 50 ms) → FCM/APNs API call (typically 100–400 ms) → device receipt. Total expected p99 latency is well under 2 seconds under normal load.

To minimize orchestrator processing time, token enrichment is done via a single Redis pipeline call rather than sequential lookups. The orchestrator is stateless and horizontally scaled, so no single instance is a bottleneck.

WebSocket delivery for in-app notifications bypasses push gateways entirely, achieving sub-100 ms delivery for users with active sessions.

High Reliability (at-least-once delivery):

Kafka consumer offsets are committed only after the orchestrator has successfully published the notification job to the channel queue. If the orchestrator crashes mid-processing, the event is re-consumed and reprocessed.

Channel delivery workers use an idempotency key (derived from notification_id) when calling FCM/APNs, preventing duplicate notifications even if a worker retries after a transient failure.

A dead-letter queue (DLQ) captures notifications that fail after a configurable number of retries (e.g., 3 attempts with exponential backoff). An alerting system monitors DLQ depth and triggers on-call escalation.

For SMS fallback: if push delivery is not acknowledged within 30 seconds (checked via FCM delivery receipts), the system automatically falls back to SMS for critical notifications (driver assigned, ride completed).

Scaling Strategy for Peak Loads

Rush-hour peaks are predictable (morning and evening commutes). The system uses two complementary scaling approaches:

Proactive (Scheduled) Scaling: Kubernetes Horizontal Pod Autoscaler (HPA) is pre-warmed before known peak windows using a cron-based scaling policy. Orchestrator and worker replica counts are increased 10 minutes before rush hour.

Reactive Scaling: Kafka consumer lag is exposed as a custom metric to the Kubernetes Metrics Server. If lag on the notification topic exceeds a threshold (e.g., 10,000 unprocessed messages), HPA adds orchestrator pods automatically. Worker pools scale similarly based on their queue depths.

Kafka partitioning: The notification topic is partitioned by user_id (or ride_id) to ensure ordered processing per user while allowing high parallelism. With 100 partitions, up to 100 orchestrator instances can process concurrently.

Promotion notifications, which are bulk and non-time-critical, are processed on a separate lower-priority Kafka topic with a dedicated consumer group. This prevents promotional bursts from competing with transactional notifications for processing resources.

Database scaling: Redis is deployed as a cluster with read replicas. PostgreSQL uses connection pooling (PgBouncer) and read replicas for profile lookups. Cassandra scales horizontally by adding nodes.

Estimated throughput: 500,000 rides/day with an average of 3 notifications per ride equals 1.5 million transactional notifications per day, or roughly 17 per second on average. During a 2-hour rush-hour peak handling 30% of daily rides, peak throughput reaches approximately 70 notifications per second. This is well within the capacity of a modest Kafka and worker cluster, leaving ample headroom.

Major Trade-offs

At-least-once vs. Exactly-once Delivery: Exactly-once delivery across distributed systems and third-party gateways (FCM, APNs) is practically impossible to guarantee end-to-end. The design accepts at-least-once semantics and mitigates duplicate delivery using idempotency keys. The trade-off is occasional duplicate notifications (e.g., two "driver assigned" messages) versus the complexity and latency cost of distributed transactions. For a ride-sharing app, a rare duplicate is far less harmful than a missed notification.

Cost vs. Performance: Maintaining a persistent WebSocket connection for every active user is expensive in terms of server resources. The design uses WebSockets only for users with active app sessions (typically during an active ride), falling back to push notifications otherwise. This reduces infrastructure cost while preserving low latency for the most critical moments.

Consistency vs. Availability: The Redis token cache may serve a stale device token if a user reinstalls the app and the cache has not yet been invalidated. The design accepts this eventual consistency window (typically seconds to minutes) in exchange for the availability and latency benefits of caching. A failed push due to a stale token triggers a cache refresh and retry, so the impact is a slightly delayed notification rather than a lost one.

Simplicity vs. Flexibility in Channel Routing: A more sophisticated routing engine could apply machine-learning models to choose the optimal channel per user. The current design uses a simple rule-based router (in-app first, then push, then SMS fallback) for predictability, lower operational complexity, and easier debugging. This can be evolved incrementally.

Promotion Isolation: Separating promotional traffic onto its own Kafka topic and worker pool adds operational complexity (more topics to monitor, more consumer groups to manage) but is a necessary trade-off to protect transactional notification SLAs from being degraded by bulk promotional sends.

Conclusion

This architecture delivers a robust, scalable notification system by combining Kafka's durable event streaming, Redis-backed low-latency enrichment, multi-channel delivery workers, and Kubernetes-based autoscaling. The design prioritizes sub-2-second delivery for transactional notifications and at-least-once reliability through idempotent retries and dead-letter queues, while isolating promotional traffic to protect core user experience. The trade-offs made favor operational simplicity and availability over theoretical perfection, which is appropriate for a high-velocity consumer product.

Result

Winning Votes

0 / 3

Average Score

Judge Models Anthropic Claude Opus 4.6

Total Score

Overall Comments

Answer B provides a solid and well-organized system design that covers all the required points competently. It includes good technology choices with reasonable justifications and a clear flow description. The latency analysis with specific timing estimates is a nice touch. However, the throughput estimation appears to significantly underestimate peak load (70 notifications per second seems low for a system serving 1M DAU with rush hour peaks), which undermines the scaling analysis. The trade-off discussion is good but slightly less nuanced than Answer A. The design is more conventional and lacks some of the advanced details present in Answer A, such as the geo-targeting pipeline details, priority queue separation mechanisms, and operational tooling considerations.

View Score Details ▼

Architecture Quality

Weight 30%

Answer B presents a clean 5-layer architecture that covers the essential components well. The flow summary is clear and the WebSocket gateway for in-app delivery is a good addition. However, it lacks the depth of specialized components like the geo-targeting pipeline and observability layer found in Answer A.

Completeness

Weight 20%

Answer B covers all required points adequately. It includes good details on the WebSocket channel and SMS fallback strategy. However, it misses some depth in areas like operational tooling, detailed geo-targeting implementation, and the event schema design. The promotional targeting approach using PostGIS/Redis GEO is simpler but less scalable than Answer A's approach.

Trade-off Reasoning

Weight 20%

Answer B discusses five trade-offs that are generally well-reasoned. The WebSocket cost trade-off and the simplicity vs flexibility in channel routing are good practical considerations. However, some trade-offs are less nuanced - for example, the consistency vs availability discussion could go deeper into the implications of stale tokens beyond just retry.

Scalability & Reliability

Weight 20%

Answer B's throughput estimation of 70 notifications per second at peak is questionable and likely underestimates real-world requirements - it doesn't account for driver-side notifications, promotional messages, or the fact that peaks can be much sharper than a uniform 2-hour window. The proactive/reactive scaling approach with pre-warming is a good practical detail. The reliability strategy with DLQ and SMS fallback is solid but less detailed than Answer A's approach.

Clarity

Weight 10%

Answer B is very well-written with a clear narrative flow and good use of section headers. The specific latency breakdown (10ms + 50ms + 100-400ms) makes the latency argument concrete and easy to follow. The conclusion effectively summarizes the design philosophy. The writing style is slightly more accessible than Answer A.

Judge Models Google Gemini 2.5 Pro

Total Score

Overall Comments

Answer B presents a very strong and well-structured system design. It clearly outlines a logical architecture and makes excellent technology choices. Its discussion on scalability is particularly noteworthy, introducing the concept of proactive (scheduled) scaling for predictable peaks, which is a sophisticated touch. The reliability strategy, including a specific SMS fallback mechanism, is also very practical. While excellent, the architectural breakdown is slightly less detailed than its competitor.

View Score Details ▼

Architecture Quality

Weight 30%

The architecture is very solid and follows a logical, layered approach. The components are well-defined and the flow is clear. However, it is less granular than Answer A; for instance, it combines several functions into broader services, making it slightly less detailed.

Completeness

Weight 20%

100

The answer is fully complete and addresses all aspects of the prompt thoroughly. Each required section is covered in detail, leaving no gaps in the design proposal.

Trade-off Reasoning

Weight 20%

The trade-off reasoning is very strong and covers the key decisions well. The inclusion of 'Simplicity vs. Flexibility' is a good point. The analysis is slightly less broad than Answer A's but still of very high quality.

Scalability & Reliability

Weight 20%

This is a standout section for this answer. The discussion of both proactive (scheduled) and reactive scaling is a sophisticated and highly practical approach. The throughput estimation is also more detailed. The reliability strategy, including a timed SMS fallback, is excellent.

Clarity

Weight 10%

100

The answer is extremely clear and well-organized. The use of headings and a concluding summary makes the design easy to follow. The writing is professional and concise.

Judge Models OpenAI GPT-5.4

Total Score

Overall Comments

Answer B is clear, organized, and reasonably practical, with a coherent event-driven architecture and decent coverage of autoscaling, queues, storage, and channel delivery. It explains latency and reliability in an accessible way and includes some useful operational ideas like scheduled scaling. However, several technical claims are weaker or less accurate, such as implied idempotency at APNs/FCM, optimistic assumptions around delivery receipts, and a notably low peak throughput estimate. Its geo-targeting and reliability details are less robust than Answer A's, and the design is somewhat shallower overall.

View Score Details ▼

Architecture Quality

Weight 30%

The architecture is coherent and sensibly layered, but it is more generic. It covers the main pipeline well, yet the design has less depth around enrichment boundaries, geo-targeting flow, and operational safeguards compared with Answer A.

Completeness

Weight 20%

It addresses all required sections and includes reasonable component choices, but some areas are thinner. Promotional geo-targeting is less developed, observability is barely discussed, and some edge-case handling like dedupe strategy and provider failure modes is not fully fleshed out.

Trade-off Reasoning

Weight 20%

The trade-off section is thoughtful and clearly written, covering several relevant dimensions such as delivery semantics, WebSocket cost, stale cache tolerance, routing complexity, and promo isolation. It is solid, though somewhat more standard and less tightly connected to implementation details than Answer A.

Scalability & Reliability

Weight 20%

It includes good basics like Kafka, autoscaling on lag, separate promo topics, retries, and DLQ. However, the reliability model is weakened by questionable claims about provider-level duplicate prevention and push acknowledgment behavior, and the peak throughput estimate appears too low for the stated scale, reducing confidence in capacity planning.

Clarity

Weight 10%

The answer is very readable, with clean sectioning, smooth prose, and an easy-to-follow flow. It communicates the architecture and rationale clearly, even if some technical details are lighter.

Comparison Summary

Final rank order is determined by judge-wise rank aggregation (average rank + Borda tie-break). Average score is shown for reference.

Judges: 3

Winner OpenAI GPT-5.2

Winning Votes

3 / 3

Average Score

View this answer

Anthropic Claude Sonnet 4.6

Winning Votes

0 / 3

Average Score

View this answer

View head-to-head record for this model pair

Judging Results

Judge Models OpenAI GPT-5.4

GPT-5.2 Winner 87

Claude Sonnet 4.6 74

Why This Side Won

Answer A wins because it is more complete and technically rigorous across the core system design dimensions. It better addresses the specific notification types, especially ETA-triggered and geo-targeted promotional notifications, and provides stronger low-latency and at-least-once delivery strategies through priority isolation, deduplication, conditional writes, retries, backpressure, and observability. Answer B is solid and well structured, but it is less detailed, contains some questionable assumptions about provider behavior, and underestimates peak load, making it less convincing as a high-quality benchmark answer.

Judge Models Google Gemini 2.5 Pro

GPT-5.2 Winner 96

Claude Sonnet 4.6 93

Why This Side Won

Both answers are of exceptional quality. Answer A is chosen as the winner due to its superior architectural depth and granularity. It breaks the system down into more specialized components (e.g., a dedicated Promotional Targeting Pipeline, Observability and Ops), which reflects a more mature and production-ready design. While Answer B has excellent points, particularly on proactive scaling, Answer A's comprehensive and detailed component-level thinking demonstrates a slightly deeper understanding of the complexities involved in building and maintaining such a system at scale.

Judge Models Anthropic Claude Opus 4.6

GPT-5.2 Winner 88

Claude Sonnet 4.6 73

Why This Side Won

Answer A wins because it provides greater depth and sophistication across nearly all evaluation criteria. Its architecture is more detailed with explicit component descriptions including the ETA service with hysteresis, the promotional targeting pipeline with H3 cells, and comprehensive observability. The scaling analysis is more realistic with better throughput estimates and more detailed horizontal scaling strategies. The trade-off analysis covers more ground with five well-reasoned trade-offs. While Answer B is competent and has some nice touches like specific latency breakdowns, its throughput estimation is questionable (70 notifications/sec at peak seems too low), and it lacks the depth of Answer A in areas like geo-targeting, operational tooling, and priority isolation mechanisms.

Design a Real-Time Ride-Sharing Notification System

Task Overview

Task Prompt

Answers

Answer A: OpenAI GPT-5.2

Answer B: Anthropic Claude Sonnet 4.6

Comparison Summary

Judging Results

Related Tasks

Persuasive Email for a Four-Day Work Week Pilot

Explaining the Maxwell's Demon Paradox

The Archivist's Final Log

Plan a Community Garden Launch Party

Explain Bayes' Theorem for a Beginner

Summarize the Impact of the Printing Press

Reimagining the Urban Community Space

The Overly Literal Genie

Related Links