
Design a Scalable Concert Ticket Reservation System

Compare model answers for this System Design benchmark and review scores, judging comments, and related examples.


Contents

Task Overview

Benchmark Genres

System Design

Task Creator Model

Answering Models

Judge Models

Task Prompt


Design a system for an online concert ticketing platform. Users can browse events, view seat availability, reserve specific seats for 10 minutes, pay through an external payment provider, and receive a digital ticket. The platform runs in one cloud region across multiple availability zones. Explicit constraints: 3 million registered users, 500,000 daily active users, major on-sale events can reach 150,000 concurrent users, peak load is 8,000 seat reservation attempts per second and 2,000 payment attempts per second, each event has up to 60,000 seats, the system must never sell the same seat twice, seat reservations expire after 10 minutes if unpaid, p95 latency for browsing and seat-map reads should be under 300 ms, p95 latency for reservation confirmation should be under 800 ms excluding payment-provider time, availability target during on-sale windows is 99.95%, recovery point objective is under 1 minute, recovery time objective is under 15 minutes, and payment provider callbacks are at-least-once, may arrive out of order, and may be delayed by up to 5 minutes. Provide a design plan. Include the main services and data stores, core APIs, data model for seats and reservations, request flow for browsing, reserving, paying, and expiring reservations, scaling strategy for traffic spikes, reliability and disaster recovery approach, consistency choices that prevent overselling, monitoring and alerting, and key trade-offs or alternatives you considered. State any reasonable assumptions you make.

Judging Policy


A good answer should present a coherent architecture that directly addresses the stated scale, latency, availability, and correctness constraints. It should explain how read-heavy seat maps are served efficiently, how reservation attempts are serialized or guarded so a seat cannot be sold twice, how reservation expiry and payment callbacks are handled safely with idempotency, and how the system behaves under flash-sale traffic. Strong answers discuss concrete components such as API gateways, reservation services, queues or waiting rooms, caches, relational or transactional storage, event streams, background workers, and observability. They should reason about consistency versus performance, failure modes, multi-AZ operation, backups or replication, rate limiting, graceful degradation, and recovery targets. Answers that only list generic cloud components without explaining request flows, correctness guarantees, scaling bottlenecks, or trade-offs should be considered weaker.

Answers


# Scalable Concert Ticket Reservation System: Design Plan

## 1. Assumptions

- Single cloud region (e.g., AWS us-east-1) with 3 AZs; managed services preferred.
- Catalog (events, venues, seat maps) is read-heavy and changes infrequently.
- Reservations are the hot path; payments are off-hot-path via async callbacks.
- "Never oversell" is a hard invariant; brief unavailability is preferred over double-selling.
- An external payment provider (e.g., Stripe/Adyen) handles PCI scope; we store only tokens.
- Average event size 30k seats; peak burst lasts minutes to ~1 hour.

## 2. High-Level Architecture

Clients (web/mobile) → CDN (CloudFront) → API Gateway / L7 Load Balancer → Edge auth (JWT) → Stateless microservices on Kubernetes (EKS) across 3 AZs.

Core services:

- **Identity Service**: signup, login, JWT issuance, MFA.
- **Catalog Service**: events, venues, seat-map metadata; read-optimized.
- **Inventory/Seat Service**: authoritative seat state, holds, reservations; the consistency anchor.
- **Reservation Service**: orchestrates hold → checkout → payment intent.
- **Payment Service**: integrates with provider, processes webhook callbacks idempotently.
- **Ticket Service**: issues signed digital tickets (JWT/QR) after payment success.
- **Notification Service**: email/push (SES/SNS).
- **Waiting Room / Virtual Queue Service**: throttles entry during on-sale spikes.
- **Expiration Worker**: releases unpaid holds after 10 minutes.
- **Admin/Onsale Service**: event configuration, seat-map upload, on-sale scheduling.

Cross-cutting: Kafka (MSK) for events, Redis (ElastiCache, cluster mode) for hot state and locks, PostgreSQL (Aurora Multi-AZ) for transactional data, DynamoDB for idempotency keys and ticket store, S3 for seat-map JSON/images, OpenSearch for event search.

## 3. Data Stores

- **Aurora PostgreSQL (Multi-AZ, 1 writer + 2 readers)**: events, users, reservations, payments, tickets (system of record). Continuous backup; PITR.
- **Redis Cluster (Multi-AZ, with replicas)**: per-seat hold state, per-event seat bitmap cache, rate limits, waiting room tokens. Used for fast CAS on holds.
- **DynamoDB**: payment idempotency keys, webhook dedupe, issued-ticket lookup (low-latency, multi-AZ by default).
- **Kafka (MSK)**: domain events (`SeatHeld`, `ReservationCreated`, `PaymentSucceeded`, `PaymentFailed`, `ReservationExpired`, `TicketIssued`). Replication factor 3 across AZs.
- **S3**: static seat-map artifacts, ticket PDFs.
- **CDN**: caches event listings, seat-map skeleton (not live availability).
- **OpenSearch**: event search/filtering.

## 4. Core Data Model

**events**(event_id PK, venue_id, name, onsale_at, status, version).

**seats**(seat_id PK, event_id, section, row, number, price_tier, status ENUM[available, held, reserved, sold], hold_id NULL, version BIGINT). Composite index (event_id, status). Row-level versioning for optimistic locking.

**reservations**(reservation_id PK, user_id, event_id, seat_ids[], state ENUM[pending_payment, confirmed, expired, cancelled], created_at, expires_at, payment_intent_id, idempotency_key UNIQUE).

**payments**(payment_id PK, reservation_id, provider_ref UNIQUE, status, amount, currency, received_at). Unique constraint on provider_ref for at-least-once dedupe.

**tickets**(ticket_id PK, reservation_id, seat_id, qr_payload, issued_at, signature).

**outbox**(id, aggregate, payload, published_at) for transactional outbox pattern → Kafka.

Redis keys:

- `event:{id}:seat:{seat_id}` → status + hold owner + TTL 600s.
- `event:{id}:availability` → bitmap/HLL for fast counts.
- `hold:{reservation_id}` → seat list, TTL 600s.

## 5. Core APIs (REST + idempotency headers)

- `GET /events?filters` → paginated list (CDN-cacheable, 30s TTL).
- `GET /events/{id}` → event details.
- `GET /events/{id}/seatmap` → static layout (long cache).
- `GET /events/{id}/availability` → coarse availability (sections); 1–5s cache.
- `GET /events/{id}/seats?section=A` → fine-grained seat status (short cache or live).
- `POST /reservations` (Idempotency-Key) → `{event_id, seat_ids[]}` → creates 10-min hold.
- `GET /reservations/{id}` → state, expires_at.
- `DELETE /reservations/{id}` → user cancels, releases seats.
- `POST /reservations/{id}/checkout` → creates payment intent at provider, returns client secret.
- `POST /webhooks/payments` → provider callback (signed, idempotent).
- `GET /tickets/{id}` → signed digital ticket.
- Admin: `POST /events`, `POST /events/{id}/seats:bulk`, `POST /events/{id}/onsale`.

## 6. Request Flows

### Browsing

Client → CDN (hit for catalog/seatmap) → on miss, API → Catalog Service → Aurora reader / OpenSearch. Availability counts served from Redis with 1–5s staleness; individual seat states pulled live for the section the user is viewing.

### Reservation (hold)

1. Client sends `POST /reservations` with Idempotency-Key and target seats.
2. API Gateway checks waiting-room token; rate-limits per user/IP.
3. Reservation Service validates event status and seat IDs.
4. **Acquire holds atomically** via Redis Lua script: for each seat key, `SETNX` with hold_id and TTL 600s; if any seat fails, roll back the successful ones and return 409 with conflicting seats.
5. Persist reservation row in Aurora in `pending_payment` state (single transaction with outbox event). Use optimistic locking on seats: `UPDATE seats SET status='held', hold_id=?, version=version+1 WHERE seat_id=? AND status='available'`. Aurora is the durable truth; Redis is the fast guard. Both must agree.
6. Return reservation with `expires_at`. p95 < 800 ms.

### Payment

1. Client calls `POST /reservations/{id}/checkout`; Payment Service creates a PaymentIntent at provider, stores `provider_ref` keyed by reservation_id (idempotent).
2. Client completes payment via provider SDK directly (we stay out of PCI scope).
3. Provider sends webhook → `POST /webhooks/payments`.
4. Webhook handler: verify signature → upsert into `payments` using `provider_ref` UNIQUE (dedupe). Use DynamoDB conditional put on event_id for extra idempotency.
5. On `succeeded`: transactional update: reservation→`confirmed`, seats→`sold`, write `tickets`, append outbox event `TicketIssued`. Out-of-order safety: handler compares event timestamps and ignores stale transitions (state machine: pending → confirmed/failed/expired, terminal states absorb late duplicates).
6. Ticket Service consumes `TicketIssued`, generates signed QR/PDF, stores in S3 + DynamoDB; Notification Service emails user.

### Expiration

- Primary: Redis TTL expires the `hold:*` key → keyspace notification triggers expiration worker; worker runs Aurora transaction releasing seats only if reservation still `pending_payment` (CAS on state).
- Backstop: scheduled job every 30s scans `reservations WHERE state='pending_payment' AND expires_at < now() - 30s` and releases.

Late webhook arriving after expiration: if seat already resold, mark payment as `refund_required` and trigger automatic refund; if seat still free, optionally re-confirm, but default policy is refund, because we cannot re-hold a possibly-resold seat. Payment provider's 5-min delay is within the 10-min hold window, so normal case has no conflict.

## 7. Preventing Overselling (Consistency)

- Seat is a single owned resource: every state transition uses **optimistic concurrency** in Aurora (`WHERE status=expected_status AND version=expected_version`).
- Redis SETNX provides fast first-line rejection at 8k RPS without hammering the DB; Aurora row-update is the second line and the legal truth.
- All payment-side writes are idempotent via `provider_ref` uniqueness + DynamoDB dedupe table.
- Outbox pattern ensures domain events are published exactly-once to Kafka relative to DB commits.
- Strong consistency within a single seat row; eventual consistency is acceptable only for aggregate availability counts shown in browse views.

## 8. Scaling Strategy for Spikes

- **Virtual waiting room**: when `concurrent_users > threshold`, new users get a queue token; only N tokens/sec are admitted to the reservation endpoints. Keeps the system at known capacity (e.g., admit 10k/sec to absorb 8k reservation attempts/sec).
- **Horizontal autoscaling** (HPA on EKS) on CPU and custom RPS metrics; pre-warm pods 15 minutes before announced on-sales.
- **Sharding hot events**: partition Redis keys by `event_id` so a single mega-event lands on a dedicated shard; can pre-provision shards for known on-sales.
- **Read scaling**: Aurora read replicas + Redis for availability + CDN for seat-map static data.
- **Backpressure**: API Gateway request quotas per user; 429 with Retry-After.
- **Async fan-out**: Kafka decouples ticket generation, email, analytics from the hot path.
- **Connection pooling**: RDS Proxy / PgBouncer to avoid Aurora connection storms.
- **Bot defense**: WAF, CAPTCHA on `POST /reservations` during on-sales, device fingerprinting.

## 9. Reliability & Disaster Recovery

- Multi-AZ for every stateful component (Aurora, Redis with replicas + automatic failover, MSK RF=3, DynamoDB).
- Aurora: continuous backup to S3, PITR to any second within retention → RPO ≤ 1 min met; failover ~30–60s → RTO ≤ 15 min met.
- Redis: Multi-AZ with automatic failover; data is reconstructable from Aurora (holds can be rebuilt on cold start from `reservations WHERE state='pending_payment'`).
- Kafka: tiered storage, RF=3, min ISR=2.
- DR runbook: cross-region Aurora Global Database replica (RPO ~ seconds) for region-wide failure recovery; documented promotion procedure.
- Chaos drills: AZ blackout, Redis primary kill, payment provider outage simulation quarterly.
- Health checks at ALB level; circuit breakers (Resilience4j-like) between services and toward payment provider.
- Graceful degradation: if Redis unavailable, fall back to DB-only path with stricter rate limit; if payment provider down, queue checkouts and notify user.

## 10. Monitoring & Alerting

- **Metrics (Prometheus + CloudWatch)**: RPS, p50/p95/p99 latency per endpoint, reservation success rate, hold-acquire conflict rate, payment webhook lag, expiration worker lag, Aurora replica lag, Redis CPU/memory/evictions, Kafka consumer lag.
- **SLOs**: 99.95% availability on-sale window; p95 browse < 300 ms; p95 reservation < 800 ms; error budget burn alerts.
- **Tracing**: OpenTelemetry end-to-end (client → API → service → DB).
- **Logging**: structured JSON to CloudWatch/Elastic; correlation IDs.
- **Business dashboards**: holds/sec, conversion (hold→paid), oversell counter (must be 0; page on any nonzero).
- **Alerts**: oversell=0 violation (P0), webhook backlog > 1 min, reservation p95 > 800 ms for 5 min, Aurora failover, Redis failover, payment success rate drop > 2σ.

## 11. Key Trade-offs and Alternatives

- **Redis as hold authority vs. DB-only**: pure DB approach is simpler and stronger but couldn't sustain 8k RPS on single-row hotspots; Redis-first absorbs the spike, DB row update guarantees correctness.
- **Pessimistic locks (SELECT FOR UPDATE)**: considered; rejected because lock contention on popular sections would tank throughput. Optimistic locking with retries scales better.
- **Reserved seating vs. general admission counter**: design above is per-seat. For GA, a single decrementing counter (`DECR` in Redis with floor 0, mirrored to DB) is used instead.
- **Single-region vs. multi-region active-active**: constraint says one region. Active-active would complicate "never oversell" (cross-region consensus needed). We use cross-region read replica for DR only.
- **Kafka vs. SQS**: Kafka chosen for ordered, replayable event log (helps with out-of-order payment processing and rebuilding state).
- **Synchronous payment vs. webhook-only**: webhook-only chosen to remain PCI-light and tolerate provider latency; webhook delays up to 5 min still fit within 10-min hold.
- **Waiting room vs. pure autoscaling**: autoscaling alone cannot protect downstream stateful stores; the queue gives deterministic load shape and a better UX than mass 503s.
- **Issuing tickets synchronously vs. async**: async via Kafka keeps payment callback handler fast and resilient; user sees ticket within seconds via push/refresh.

This design meets the stated SLOs, prevents overselling via layered optimistic concurrency, absorbs on-sale spikes via virtual queue + Redis-fronted holds, and satisfies the RPO/RTO targets through Multi-AZ managed stores with PITR and rehearsed failover.
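The hold step in the Reservation flow (all requested seats are held for 600 seconds, or none are) can be illustrated with a minimal sketch. It uses an in-memory dictionary as a stand-in for the Redis seat keys and is not a Redis client. The names `InMemoryHoldStore`, `try_hold`, and `release`, and the injectable clock, are assumptions made for this illustration and do not come from the answer. The sketch checks every seat before writing any, rather than setting keys and rolling back; the two are equivalent here only because the whole operation runs under one lock, much as a Redis script executes as a single atomic unit.

```python
import threading
import time
from typing import Callable, Dict, List, Tuple

HOLD_TTL_SECONDS = 600  # 10-minute hold window from the task prompt


class InMemoryHoldStore:
    """Hypothetical in-memory stand-in for the per-seat hold keys."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._lock = threading.Lock()  # models the atomicity of one script run
        # seat_id -> (hold_id, expiry timestamp on the injected clock)
        self._holds: Dict[str, Tuple[str, float]] = {}

    def _is_held(self, seat_id: str, now: float) -> bool:
        entry = self._holds.get(seat_id)
        return entry is not None and entry[1] > now

    def try_hold(self, hold_id: str, seat_ids: List[str]) -> List[str]:
        """Hold every seat or none.

        Returns the list of conflicting seats; an empty list means success.
        """
        with self._lock:
            now = self._clock()
            conflicts = [s for s in seat_ids if self._is_held(s, now)]
            if conflicts:
                return conflicts  # nothing was written, so nothing to roll back
            expires_at = now + HOLD_TTL_SECONDS
            for s in seat_ids:
                self._holds[s] = (hold_id, expires_at)
            return []

    def release(self, hold_id: str, seat_ids: List[str]) -> None:
        """Release only the seats that this hold still owns."""
        with self._lock:
            for s in seat_ids:
                entry = self._holds.get(s)
                if entry is not None and entry[0] == hold_id:
                    del self._holds[s]
```

A caller would map a non-empty return value to the 409 response with the conflicting seats. The ownership check in `release` matters because a hold may have expired and the seat may have been taken by another reservation before a late cancel arrives.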

Result

#1 | Winner

Winning Votes

3 / 3

Average Score

90
Judge Models OpenAI GPT-5.4

Total Score

88

Overall Comments

Answer A presents a concrete, end-to-end design that directly addresses the prompt's flash-sale constraints, correctness requirements, and operational targets. Its strongest aspects are the layered anti-oversell strategy, explicit request flows, waiting-room/load-shaping approach, idempotent payment callback handling, and detailed reliability/monitoring sections. It also discusses fallback behavior and trade-offs with reasonable specificity. Minor weaknesses are some added complexity in the Redis-plus-Aurora dual-write hold path and a few implementation choices that would need careful engineering to avoid drift, but overall it is a strong benchmark-quality system design answer.


Architecture Quality

Weight 30%
89

Strong architecture with well-chosen components and clear separation between catalog, inventory, reservation, payment, ticketing, waiting room, and expiry workers. The design ties read paths, write paths, eventing, and durable storage together coherently, and explicitly identifies the inventory service as the consistency anchor. The Redis plus Aurora layered hold approach is sophisticated and suitable for the problem, though it introduces coordination complexity.

Completeness

Weight 20%
90

Covers assumptions, services, data stores, APIs, data model, detailed browsing/reservation/payment/expiry flows, anti-oversell consistency, spike handling, DR, monitoring, and trade-offs. It also addresses out-of-order and delayed callbacks, backstop expiry scans, graceful degradation, and bot defense. Very few prompt areas are left untreated.

Trade-off Reasoning

Weight 20%
87

Provides several meaningful trade-offs: Redis-first versus DB-only, optimistic versus pessimistic locking, Kafka versus SQS, webhook-only payment, waiting room versus autoscaling, and single-region versus active-active. The reasoning is specific to the prompt's constraints and explains why correctness and load-shaping dominate design choices.

Scalability & Reliability

Weight 20%
88

Strong on both scale and resilience: explicit waiting room, rate limiting, pre-warming, shard-aware Redis strategy for hot events, CDN and cache layering for reads, async decoupling with Kafka, connection pooling, multi-AZ deployment, PITR, failover expectations, backstop rebuild from Aurora, and detailed alerts. It directly addresses the stated flash-sale and DR requirements.

Clarity

Weight 10%
84

Well structured and easy to follow, with distinct sections and step-by-step flows. The answer is dense but still readable. Some parts are slightly complex due to the layered consistency strategy, yet the organization keeps it understandable.

Total Score

89

Overall Comments

Answer A is a comprehensive, deeply technical design plan that directly addresses every constraint in the prompt. It provides a layered consistency model (Redis SETNX + Aurora optimistic locking), a concrete virtual waiting room strategy for flash-sale traffic, a detailed expiration flow with both a Redis TTL primary path and a DB-scan backstop, idempotent payment handling with outbox pattern, and specific SLO-tied alerting. The data model is precise, the request flows are step-by-step and mechanically sound, and trade-offs are argued with concrete reasoning rather than generic statements. Minor weaknesses: some sections are dense and could benefit from diagrams, and the cross-region DR section is brief, but overall this is a benchmark-quality response.


Architecture Quality

Weight 30%
90

Answer A presents a well-layered architecture with clear separation of concerns, a precise two-phase consistency model (Redis SETNX + Aurora optimistic locking), transactional outbox for reliable Kafka publishing, idempotent payment handling via provider_ref uniqueness and DynamoDB dedupe, and a virtual waiting room. Every component choice is justified and tied to a specific constraint. The data model is detailed and correct, including version columns, hold_id, and outbox table.

Completeness

Weight 20%
90

Answer A covers all required sections: services, data stores, APIs, data model, all four request flows (browse, reserve, pay, expire), scaling strategy, reliability/DR with RPO/RTO analysis, consistency guarantees, monitoring with specific SLO-tied alerts, and trade-offs. It also addresses bot defense, connection pooling, pre-warming, and late webhook handling after expiry.

Trade-off Reasoning

Weight 20%
85

Answer A provides concrete, well-argued trade-offs: Redis-first vs. DB-only (with RPS justification), pessimistic vs. optimistic locking (with contention reasoning), single-region vs. multi-region active-active (with oversell risk explanation), Kafka vs. SQS, waiting room vs. pure autoscaling, and async vs. sync ticket issuance. Each trade-off is tied to a specific constraint or failure mode.

Scalability & Reliability

Weight 20%
90

Answer A addresses the 150k concurrent user / 8k RPS spike with a virtual waiting room (admitting N tokens/sec), pre-warming pods 15 min before on-sales, event-sharded Redis, RDS Proxy for connection pooling, WAF/CAPTCHA for bot defense, and async Kafka fan-out. Reliability covers multi-AZ for all stores, Aurora PITR meeting RPO < 1 min, failover meeting RTO < 15 min, Redis reconstructability from DB, chaos drills, and circuit breakers. Cross-region Global Database is mentioned for region-wide DR.

Clarity

Weight 10%
85

Answer A is well-organized with numbered sections, clear headings, and step-by-step flows. The Redis key schema and data model are explicitly stated. The writing is precise and technical without being verbose. Minor issue: the density of some sections (especially consistency and scaling) could benefit from a summary diagram, but the prose is clear throughout.

Total Score

92

Overall Comments

Answer A provides an outstanding and highly detailed system design. Its core strength lies in the specific and robust mechanism for handling the high-concurrency seat reservation problem, using a combination of a Redis Lua script for a fast, atomic check and optimistic locking in the database for correctness. The scaling strategy is comprehensive and proactive, including practical details like a virtual waiting room, pre-warming instances, and sharding hot events in the cache. The design is complete, well-reasoned, and demonstrates a deep understanding of the trade-offs involved in building such a system.


Architecture Quality

Weight 30%
90

The architecture is excellent. The choice of using Redis with an atomic operation (SETNX/Lua) as a fast guard in front of a database with optimistic locking is a superior pattern for this high-contention problem. It effectively handles the 8k RPS load without overwhelming the database. The integration of a virtual waiting room is also a key strength.

Completeness

Weight 20%
95

The answer is exceptionally complete, addressing every single point in the prompt with significant detail. It includes specific implementation details like the transactional outbox pattern, use of DynamoDB for idempotency, and bot defense strategies, which go beyond the basic requirements.

Trade-off Reasoning

Weight 20%
90

The trade-off reasoning is excellent and demonstrates deep expertise. The discussion on Redis vs. DB-only, pessimistic vs. optimistic locking, and the choice of Kafka over SQS are all directly relevant and well-justified in the context of the system's specific constraints.

Scalability & Reliability

Weight 20%
95

This is a standout section. The scaling strategy is both comprehensive and proactive, mentioning specific techniques like pre-warming pods, sharding Redis by event ID for hot events, and using RDS Proxy. The reliability plan is also top-tier, including multi-AZ for all components, a cross-region DR plan, and a commitment to chaos drills.

Clarity

Weight 10%
90

The answer is exceptionally clear, well-structured, and easy to follow. The use of numbered sections, subheadings, and bold text effectively guides the reader through the complex design.


## System Design: Scalable Concert Ticket Reservation System

### 1. Main Services and Data Stores

**Services:**

* **API Gateway:** Entry point for all client requests, handles authentication, rate limiting, and request routing.
* **User Service:** Manages user registration, profiles, and authentication.
* **Event Service:** Manages event details, schedules, and static content.
* **Seat Service:** Manages seat inventory, availability, and seat map data for events.
* **Reservation Service:** Handles seat reservation logic, manages reservation lifecycles (pending, confirmed, expired).
* **Payment Service (Internal):** Orchestrates interactions with external payment providers, processes callbacks, and manages internal payment records.
* **Notification Service:** Sends digital tickets, confirmation emails, and other user notifications.
* **Queueing Service (Kafka):** Asynchronous communication between services, handles high-throughput events (e.g., payment callbacks, reservation expiry).
* **Cache Service (Redis):** High-speed data access for seat availability, event details, and distributed locks.
* **Background Worker Service:** Processes asynchronous tasks like reservation expiry, payment callback retries, and data cleanup.

**Data Stores:**

* **PostgreSQL (Relational Database):** Primary data store for users, events, seats, reservations, and payments. Chosen for strong ACID properties crucial for preventing overselling.
    * *Deployment:* Multi-AZ, primary-replica setup with synchronous replication for critical tables (seats, reservations) and asynchronous for others.
* **Redis (In-Memory Data Store):** Used for:
    * Caching seat availability and event details.
    * Distributed locks for seat reservation and confirmation.
    * Temporary reservation expiry timers (TTL keys).
* **Kafka (Distributed Streaming Platform):** For reliable, asynchronous message passing and event sourcing.
* **Object Storage (e.g., S3):** Stores static assets like event images, artist photos, and potentially digital ticket PDFs.

### 2. Core APIs

**User Service:**

* `POST /users/register`: Register a new user.
* `POST /users/login`: Authenticate user.
* `GET /users/{id}`: Retrieve user profile.

**Event Service:**

* `GET /events`: Browse all events (paginated).
* `GET /events/{id}`: Retrieve details for a specific event.
* `GET /events/{id}/seatmap`: Retrieve seat map and current availability for an event.

**Reservation Service:**

* `POST /events/{id}/reserve`: Reserve specific seats for an event. Input: `event_id`, `seat_ids`, `user_id`. Output: `reservation_id`, `expires_at`.
* `GET /reservations/{id}`: Retrieve reservation details.
* `DELETE /reservations/{id}`: Cancel a pending reservation.
* `POST /reservations/{id}/confirm`: Confirm a reservation after successful payment (internal API, triggered by Payment Service).

**Payment Service:**

* `POST /payments/initiate`: Initiate payment for a reservation. Input: `reservation_id`, `amount`, `user_id`. Output: Payment provider redirect URL.
* `POST /payments/callback`: Webhook endpoint for external payment provider callbacks (internal).

**Ticket Service (part of Notification/Reservation Service):**

* `GET /tickets/{id}`: Retrieve a digital ticket.

### 3. Data Model for Seats and Reservations

**Table: `users`**

* `user_id` (PK, UUID)
* `username` (VARCHAR, UNIQUE)
* `email` (VARCHAR, UNIQUE)
* `password_hash` (VARCHAR)
* `created_at` (TIMESTAMP)

**Table: `events`**

* `event_id` (PK, UUID)
* `name` (VARCHAR)
* `venue` (VARCHAR)
* `date_time` (TIMESTAMP)
* `total_seats` (INT)
* `image_url` (VARCHAR)

**Table: `seats`**

* `seat_id` (PK, UUID)
* `event_id` (FK to `events.event_id`)
* `section` (VARCHAR)
* `row_number` (VARCHAR)
* `seat_number` (VARCHAR)
* `price` (DECIMAL)
* `status` (ENUM: 'available', 'reserved', 'sold')
* `version` (INT, for optimistic locking)

**Table: `reservations`**

* `reservation_id` (PK, UUID)
* `user_id` (FK to `users.user_id`)
* `event_id` (FK to `events.event_id`)
* `status` (ENUM: 'pending', 'confirmed', 'expired', 'cancelled')
* `reserved_at` (TIMESTAMP)
* `expires_at` (TIMESTAMP)
* `payment_id` (FK to `payments.payment_id`, nullable)
* `total_amount` (DECIMAL)
* `created_at` (TIMESTAMP)
* `updated_at` (TIMESTAMP)

**Table: `reservation_seats` (Junction Table)**

* `reservation_id` (FK to `reservations.reservation_id`)
* `seat_id` (FK to `seats.seat_id`)
* (Composite PK: `reservation_id`, `seat_id`)

**Table: `payments`**

* `payment_id` (PK, UUID)
* `reservation_id` (FK to `reservations.reservation_id`)
* `user_id` (FK to `users.user_id`)
* `amount` (DECIMAL)
* `currency` (VARCHAR)
* `provider_transaction_id` (VARCHAR, UNIQUE)
* `status` (ENUM: 'initiated', 'pending', 'succeeded', 'failed', 'refunded')
* `created_at` (TIMESTAMP)
* `updated_at` (TIMESTAMP)

### 4. Request Flow

**A. Browsing Events & Viewing Seat Availability:**

1. **User Request:** User's browser/app requests `/events` or `/events/{id}/seatmap`.
2. **CDN:** Serves static assets (event images, JS/CSS) if cached.
3. **API Gateway:** Authenticates user, routes request to Event Service or Seat Service.
4. **Event Service:** Retrieves event details from PostgreSQL (or Redis cache).
5. **Seat Service:** Retrieves seat map and availability from Redis cache. If not in cache, queries PostgreSQL `seats` table, updates cache, and returns data.
6. **Response:** Data returned to user. (P95 latency < 300 ms).

**B. Reserving Seats:**

1. **User Request:** User selects seats and clicks 'Reserve', sending `POST /events/{id}/reserve` with `event_id`, `seat_ids`, `user_id`.
2. **API Gateway:** Routes to Reservation Service.
3. **Reservation Service:**
    * **Distributed Lock:** Acquires a distributed lock in Redis for each `seat_id` to prevent concurrent modifications. (e.g., `LOCK:seat:{seat_id}`).
    * **Seat Status Check:** Queries `seats` table (or Redis cache) to verify selected seats are 'available'.
    * **Transaction:** Starts a database transaction.
    * **Update Seats:** Updates `status` of selected seats to 'reserved' in `seats` table.
    * **Create Reservation:** Inserts a new record into `reservations` table with `status='pending'`, `reserved_at`, and `expires_at` (current time + 10 minutes).
    * **Link Seats:** Inserts records into `reservation_seats` for each reserved seat.
    * **Commit Transaction.**
    * **Set Redis TTL:** Sets a Redis key (e.g., `RESERVATION_EXPIRY:{reservation_id}`) with a 10-minute TTL. This acts as a fast-path expiry trigger.
    * **Release Locks:** Releases distributed locks for the seats.
4. **Response:** Returns `reservation_id` and `expires_at` to the user. (P95 latency < 800 ms excluding payment).

**C. Paying for Reservation:**

1. **User Action:** User initiates payment for a `pending` reservation.
2. **User Request:** Browser/app sends `POST /payments/initiate` with `reservation_id` to Payment Service.
3. **Payment Service (Internal):**
    * Creates a `payment` record with `status='initiated'`.
    * Calls external Payment Provider API (e.g., Stripe, PayPal) with reservation details and amount.
    * Redirects user to Payment Provider's page.
4. **External Payment Provider:** Processes payment.
5. **Payment Provider Callback:** Upon successful/failed payment, the external provider sends an asynchronous webhook `POST /payments/callback` to the Payment Service.
6. **Payment Service (Internal) - Callback Handler:**
    * **Idempotency Check:** Uses `provider_transaction_id` to ensure the callback is processed only once.
    * Updates `payment` record status (`succeeded` or `failed`).
    * Publishes a `PaymentConfirmed` or `PaymentFailed` event to Kafka, including `reservation_id` and `payment_id`.
7. **Reservation Service - Kafka Consumer:**
    * Consumes `PaymentConfirmed` event.
    * **Distributed Lock:** Acquires a lock for the `reservation_id`.
    * **Transaction:** Starts a database transaction.
    * **Update Reservation:** Updates `reservations` table: `status='confirmed'`, links `payment_id`.
    * **Update Seats:** Updates `status` of associated seats in `seats` table to 'sold'.
    * **Commit Transaction.**
    * **Remove Redis TTL:** Deletes the `RESERVATION_EXPIRY:{reservation_id}` key from Redis.
    * **Release Lock.**
    * Publishes a `TicketIssued` event to Kafka.
8. **Notification Service - Kafka Consumer:** Consumes `TicketIssued` event, generates digital ticket, and sends email/push notification to the user.

**D. Expiring Reservations:**

1. **Redis Keyspace Events / Background Worker:**
    * **Fast Path (Redis TTL):** When a `RESERVATION_EXPIRY:{reservation_id}` key expires in Redis, a keyspace event is triggered. A dedicated microservice or background worker consumes this event.
    * **Robust Path (DB Scan):** A Background Worker Service periodically (e.g., every 30 seconds) scans the `reservations` table for records where `status='pending'` and `expires_at < current_time`.
2. **Background Worker Logic:**
    * For each expired `pending` reservation:
        * **Distributed Lock:** Acquires a lock for the `reservation_id`.
        * **Check Status:** Verifies the reservation is still `pending` (to avoid expiring already paid reservations due to race conditions or delayed events).
        * **Transaction:** Starts a database transaction.
        * **Update Reservation:** Updates `reservations` table: `status='expired'`.
        * **Update Seats:** Updates `status` of associated seats in `seats` table back to 'available'.
        * **Commit Transaction.**
        * **Release Lock.**

### 5. Scaling Strategy for Traffic Spikes

* **Stateless Services:** All application services (User, Event, Seat, Reservation, Payment, Notification) are designed to be stateless, allowing for horizontal scaling.
* **Auto-Scaling Groups (ASG):** Services are deployed in ASGs that automatically add/remove instances based on metrics like CPU utilization, request queue depth, or custom metrics (e.g., active reservations).
* **Load Balancers:** Application Load Balancers (ALBs) distribute incoming traffic across healthy instances of each service across multiple Availability Zones.
* **CDN:** Content Delivery Network caches static assets (event images, JS, CSS) to offload traffic from backend services and reduce latency for users.
* **Caching (Redis):** Extensive use of Redis for caching frequently accessed data (seat availability, event details) significantly reduces database load, especially during read-heavy browsing and seat map views.
* **Queues (Kafka):** Kafka acts as a buffer, decoupling services and absorbing traffic spikes. Asynchronous operations (payment callbacks, notifications, reservation expiry processing) are pushed to Kafka, allowing services to process them at their own pace without blocking user requests.
* **Database Read Replicas:** PostgreSQL instances are configured with multiple read replicas. Read-heavy services (Event Service, Seat Service for seat maps) can direct their queries to these replicas, offloading the primary database.
* **Database Sharding (Potential Future):** If a single PostgreSQL instance becomes a bottleneck for writes, sharding by `event_id` or `user_id` could be considered. This would involve distributing data across multiple database instances.
* **Connection Pooling:** Efficient management of database connections to minimize overhead and maximize throughput. * **Rate Limiting:** Implemented at the API Gateway to protect backend services from abusive traffic or sudden, overwhelming spikes. ### 6. Reliability and Disaster Recovery Approach * **Multi-Availability Zone (Multi-AZ) Deployment:** All services and data stores are deployed across at least three Availability Zones within the cloud region. This provides resilience against AZ-level failures. * **Database Replication & Failover:** * **PostgreSQL:** Primary-replica setup. Critical data (seats, reservations) uses synchronous replication to ensure RPO < 1 minute. Automated failover mechanisms (e.g., Patroni, cloud-managed RDS failover) promote a replica to primary in case of primary failure, achieving RTO < 15 minutes. * **Redis:** Sentinel or Cluster mode for high availability, with data replicated across nodes in different AZs. * **Kafka Durability:** Kafka brokers are deployed across multiple AZs with replication factor > 1 (e.g., 3) to ensure message durability and availability even if an AZ or broker fails. * **Automated Backups:** Regular, automated backups of PostgreSQL to object storage with point-in-time recovery capabilities. Snapshots for Redis. * **Stateless Services:** Services are designed to be stateless, meaning any instance can be replaced or restarted without data loss, contributing to high availability. * **Circuit Breakers & Retries:** Implemented in service-to-service communication to prevent cascading failures. Services will gracefully degrade or retry failed requests. * **Graceful Degradation:** During extreme load, non-critical features (e.g., personalized recommendations) might be temporarily disabled to prioritize core functionality (reservation, payment). 
* **Recovery Point Objective (RPO) < 1 minute:** Achieved by synchronous database replication for critical data and Kafka's durable message logging with appropriate replication settings. * **Recovery Time Objective (RTO) < 15 minutes:** Achieved by automated database failover, auto-scaling of application services, and pre-warmed instances or rapid provisioning capabilities. ### 7. Consistency Choices that Prevent Overselling Preventing overselling is paramount and requires strong consistency for seat status updates. * **Distributed Locks (Redis):** When a user attempts to reserve seats, the Reservation Service acquires a distributed lock for each specific `seat_id` in Redis. This ensures that only one reservation attempt can modify the status of a given seat at any moment. The lock is held for the duration of the database transaction that updates the seat status and creates the reservation. * **ACID Transactions (PostgreSQL):** All updates to `seats` status and the creation/update of `reservations` records are performed within a single ACID transaction in PostgreSQL. This guarantees atomicity, consistency, isolation, and durability. If any part of the transaction fails (e.g., another user already reserved a seat), the entire transaction is rolled back, ensuring no partial updates and preventing overselling. * **Optimistic Concurrency Control (Version Numbers):** For the `seats` table, a `version` column can be added. When updating a seat's status, the `UPDATE` query includes a `WHERE version = <old_version>`. If the version doesn't match, it means another transaction modified the seat, and the current transaction can be retried or rejected. This provides an additional layer of protection against race conditions, especially if distributed locks fail or are not perfectly implemented. * **Idempotent Payment Processing:** Payment provider callbacks are `at-least-once` and may be out of order. 
The Payment Service processes these callbacks idempotently, using the `provider_transaction_id` to ensure that a successful payment is only applied once to a reservation, preventing double-confirmation or incorrect status updates. * **Reservation Expiry Logic:** The expiry process explicitly checks if a reservation is still `pending` before marking it `expired` and releasing seats. This prevents a race condition where a payment might arrive just as the expiry process is about to run. ### 8. Monitoring and Alerting * **Metrics Collection:** Use Prometheus/Grafana or cloud-native monitoring (e.g., CloudWatch, Stackdriver) to collect metrics from all services, databases, caches, and queues. Key metrics include: * **System Metrics:** CPU utilization, memory usage, disk I/O, network I/O. * **Application Metrics:** Request rates, error rates (5xx), latency (p95, p99) for all API endpoints, queue depths, active connections, garbage collection. * **Database Metrics:** Query latency, connection pool usage, transaction rates, replication lag, deadlocks. * **Cache Metrics:** Cache hit/miss ratio, memory usage. * **Kafka Metrics:** Producer/consumer lag, message rates, broker health. * **Centralized Logging:** Aggregate logs from all services into a centralized logging system (e.g., ELK stack, Splunk, Datadog). Use structured logging for easier parsing and analysis. * **Distributed Tracing:** Implement distributed tracing (e.g., OpenTelemetry, Jaeger) to visualize request flows across multiple services, identify bottlenecks, and debug complex interactions. * **Alerting:** Configure alerts for critical thresholds: * High error rates (e.g., 5% 5xx errors over 5 minutes). * High latency (e.g., p95 API latency > 800ms). * Service downtime or unhealthy instances. * Database replication lag exceeding thresholds. * Queue backlogs. * Failed payment callbacks or reservation expiry failures. * Resource exhaustion (CPU, memory, disk). 
* **Dashboards:** Create real-time dashboards for operational visibility, displaying key metrics and service health. ### 9. Key Trade-offs or Alternatives * **Database Choice:** * **Alternative:** NoSQL database (e.g., Cassandra, DynamoDB) for extreme write scalability. **Trade-off:** While NoSQL can handle massive scale, achieving strong consistency for complex transactional operations like seat reservations (which involve multiple updates and checks) is significantly more challenging and often requires complex application-level logic. PostgreSQL's ACID properties simplify preventing overselling, and its scalability can be managed through read replicas and sharding. * **Locking Mechanism:** * **Alternative:** Database-level row locks. **Trade-off:** Can lead to higher contention and potential deadlocks in high-concurrency scenarios, impacting performance. Redis distributed locks are generally faster and offload the locking mechanism from the database, but require careful implementation (e.g., handling lock expiry, ensuring atomicity of lock acquisition/release). * **Reservation Expiry:** * **Alternative:** Purely relying on Redis TTL keyspace events. **Trade-off:** While fast, Redis keyspace events are not guaranteed to be delivered (e.g., if Redis restarts or events are missed). The chosen approach combines Redis TTL for a fast path with a robust background worker scanning the database for eventual consistency, ensuring no reservation is permanently stuck in `pending`. * **Payment Callback Handling:** * **Alternative:** Synchronous callback from payment provider. **Trade-off:** This is rarely offered by external providers and would require the platform to be highly available and responsive to the payment provider, introducing tight coupling. The asynchronous, at-least-once approach via Kafka is standard and more resilient, but necessitates idempotent processing and handling of out-of-order messages. * **Microservices vs. 
Monolith:** * **Alternative:** Monolithic architecture. **Trade-off:** Simpler to develop initially, but harder to scale individual components, deploy independently, and manage for large teams. Microservices offer better scalability, fault isolation, and technology diversity but introduce operational complexity (distributed transactions, monitoring, deployment). ### 10. Assumptions * **User Authentication:** Users are authenticated before performing any actions that require identity (e.g., reserving seats, viewing tickets). * **PCI Compliance:** The external payment provider handles all PCI DSS compliance requirements; the internal Payment Service only orchestrates and stores minimal, non-sensitive payment data. * **Cloud Region Capacity:** The chosen cloud region has sufficient compute, network, and storage capacity to handle the specified peak loads. * **Network Latency:** Internal network latency within the cloud region and between Availability Zones is low enough to meet latency targets. * **Security:** Standard security practices (e.g., encryption in transit and at rest, input validation, access control, vulnerability scanning) are implemented across the system, though not detailed in this design. * **External Payment Provider Reliability:** The external payment provider is assumed to be highly available and reliable for processing payments and sending callbacks.
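
The version-checked `UPDATE` described under Optimistic Concurrency Control in section 7 can be illustrated with a small sketch. It is not the answer's actual implementation: it uses an in-memory SQLite database as a stand-in for PostgreSQL, a trimmed-down `seats` table, and a hypothetical helper named `reserve_seat`. The point it shows is that the conditional `WHERE` clause lets only one of two competing attempts change the row.

```python
import sqlite3


def reserve_seat(conn, seat_id, expected_version):
    """Try to move one seat from 'available' to 'reserved'.

    Returns True only if this caller's UPDATE matched the row, i.e. the
    seat was still available and nobody had bumped its version.
    """
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "UPDATE seats SET status = 'reserved', version = version + 1 "
            "WHERE seat_id = ? AND status = 'available' AND version = ?",
            (seat_id, expected_version),
        )
    return cur.rowcount == 1


# Stand-in schema (assumption): only the columns the check needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seats ("
    "seat_id TEXT PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO seats VALUES ('A-1', 'available', 0)")
conn.commit()

first = reserve_seat(conn, "A-1", 0)   # wins: row matched and was updated
second = reserve_seat(conn, "A-1", 0)  # loses: stale version, seat reserved
row = conn.execute(
    "SELECT status, version FROM seats WHERE seat_id = 'A-1'"
).fetchone()
```

In a real deployment the same statement would run inside the reservation transaction alongside the `reservations` and `reservation_seats` inserts, and a zero row count would trigger a rollback of the whole attempt.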

Result

#2

Winning Votes

0 / 3

Average Score

73
Judge Models OpenAI GPT-5.4

Total Score

71

Overall Comments

Answer B gives a solid high-level architecture with reasonable services, data stores, APIs, and lifecycle flows. It covers many required areas, including reservation expiry, payment callbacks, and observability. However, it stays more generic, is less explicit about how the design will hold up under the stated 150,000-concurrent-user flash-sale conditions, and does not fully resolve some critical correctness details such as out-of-order callback handling, seat-map read scaling, and the exact anti-oversell serialization strategy. It is acceptable but less rigorous and less tailored to the prompt's hardest constraints than Answer A.

Architecture Quality

Weight 30%
69

Reasonable service decomposition and appropriate use of PostgreSQL, Redis, and Kafka, but the design remains more generic. The locking and reservation architecture is described at a high level without the same depth on authoritative state, contention management, or read/write path separation for hot events. It is coherent but not as robustly engineered for this specific ticketing workload.

Completeness

Weight 20%
72

Covers most requested sections, including services, APIs, data model, request flow, scaling, reliability, consistency, monitoring, and trade-offs. However, some areas are thinner than requested, especially seat-map serving strategy under heavy read load, precise callback ordering behavior, explicit admission control for flash sales, and more concrete RPO/RTO operational steps.

Trade-off Reasoning

Weight 20%
71

Includes relevant trade-offs such as SQL versus NoSQL, Redis locks versus DB locks, TTL events versus DB scans, async callbacks, and microservices versus monolith. The trade-off section is sensible, but the reasoning is more standard and less tightly connected to the numeric workload and latency constraints in the prompt.

Scalability & Reliability

Weight 20%
68

Good baseline coverage of stateless scaling, autoscaling, read replicas, Kafka buffering, Multi-AZ, failover, backups, and monitoring. However, it is less convincing on sustaining 8,000 reservation attempts per second for hot events because it lacks a concrete load-shaping mechanism like a waiting room and relies on per-seat distributed locks without discussing hotspot behavior in enough detail. DR and availability discussion is present but more generic.
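
As a rough illustration of the kind of load-shaping mechanism this comment refers to, the sketch below models a waiting room as a FIFO queue that releases a fixed number of users per tick. The class name `WaitingRoom`, the `admit_per_tick` parameter, and the tick-based release are all assumptions made for the example; neither answer's actual waiting-room design is reproduced here.

```python
from collections import deque


class WaitingRoom:
    """FIFO admission queue: at most `admit_per_tick` users pass per tick."""

    def __init__(self, admit_per_tick):
        self.admit_per_tick = admit_per_tick
        self.queue = deque()

    def join(self, user_id):
        """Enqueue a user and return their 1-based position in line."""
        self.queue.append(user_id)
        return len(self.queue)

    def tick(self):
        """Release the next batch of users to the reservation path."""
        admitted = []
        while self.queue and len(admitted) < self.admit_per_tick:
            admitted.append(self.queue.popleft())
        return admitted


room = WaitingRoom(admit_per_tick=2)
positions = [room.join(f"u{i}") for i in range(5)]
batches = [room.tick(), room.tick(), room.tick(), room.tick()]
```

Capping the admission rate this way bounds the number of reservation attempts that reach the per-seat locks, at the cost of extra waiting time for users at the back of the queue.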

Clarity

Weight 10%
80

Clear and organized, with straightforward sectioning and readable prose. The high-level framing is easy to follow. It is slightly simpler and more concise than A, though sometimes that simplicity comes at the cost of precision.

Total Score

66

Overall Comments

Answer B is a solid, well-structured design that covers all required sections and demonstrates good understanding of the problem. It correctly identifies the key components (Redis distributed locks, ACID transactions, Kafka, multi-AZ), describes request flows clearly, and addresses idempotency and expiry. However, it is less precise in several critical areas: the distributed lock approach for seat reservation (acquiring per-seat Redis locks then doing a DB transaction) is described but the interaction between Redis and DB is not as rigorously specified as in A; the waiting room / virtual queue concept is absent; the scaling strategy for 8k RPS is less concrete; the outbox pattern is not mentioned; and trade-off reasoning is more generic. It is a competent answer but does not reach the depth or correctness rigor of Answer A.

Architecture Quality

Weight 30%
65

Answer B has a reasonable architecture with correct component choices (Redis, PostgreSQL, Kafka, multi-AZ). However, the consistency model is less rigorous: acquiring per-seat Redis locks then doing a DB transaction is described but the interaction and rollback semantics are underspecified. The outbox pattern is absent, meaning Kafka publishing could be inconsistent with DB commits. No virtual waiting room is described, which is a significant gap for flash-sale handling.
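
For readers unfamiliar with the outbox pattern mentioned here, a minimal sketch follows. It assumes an in-memory SQLite database in place of PostgreSQL, a hypothetical `outbox` table, and a `publish` callable standing in for a Kafka producer. The state change and the event row are written in one transaction, and a separate relay step publishes unpublished rows afterwards.

```python
import json
import sqlite3


def confirm_reservation(conn, reservation_id, payment_id):
    """Confirm a pending reservation and record the event atomically."""
    with conn:  # one transaction covers both writes
        cur = conn.execute(
            "UPDATE reservations SET status = 'confirmed', payment_id = ? "
            "WHERE reservation_id = ? AND status = 'pending'",
            (payment_id, reservation_id),
        )
        if cur.rowcount != 1:
            return False  # nothing changed, so no event is recorded
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("TicketIssued", json.dumps({"reservation_id": reservation_id})),
        )
    return True


def relay_outbox(conn, publish):
    """Publish unpublished outbox rows in order; returns how many were sent."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        with conn:  # mark as sent only after publish returned
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reservations ("
    "reservation_id TEXT PRIMARY KEY, status TEXT NOT NULL, payment_id TEXT)"
)
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "topic TEXT NOT NULL, payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO reservations VALUES ('r1', 'pending', NULL)")
conn.commit()

sent = []
confirmed = confirm_reservation(conn, "r1", "p1")
repeated = confirm_reservation(conn, "r1", "p1")  # already confirmed
first_relay = relay_outbox(conn, lambda topic, body: sent.append((topic, body)))
second_relay = relay_outbox(conn, lambda topic, body: sent.append((topic, body)))
```

If the relay crashes between publishing and marking a row as sent, the row is published again on the next run, so consumers still need to be idempotent.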

Completeness

Weight 20%
70

Answer B covers all required sections and is reasonably complete. It addresses browsing, reservation, payment, and expiry flows, plus scaling, reliability, consistency, monitoring, and trade-offs. However, it lacks a virtual waiting room, does not address bot defense, does not mention the outbox pattern, and the late-webhook-after-expiry scenario (refund logic) is not addressed. The cross-region DR strategy is also absent.
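
The late-webhook-after-expiry gap noted here can be made concrete with a small decision function. This is a sketch under assumed rules, not logic taken from either answer: the function name `handle_callback`, the action labels, and the policy of refunding a successful payment that arrives for an expired or already-paid reservation are all illustrative choices.

```python
def handle_callback(reservation, callback, seen):
    """Decide what to do with one provider callback.

    reservation: dict with mutable 'status' and 'payment_txn' keys.
    callback: dict with 'provider_transaction_id' and 'outcome'.
    seen: set of (transaction id, outcome) pairs already processed.
    """
    txn = callback["provider_transaction_id"]
    key = (txn, callback["outcome"])
    if key in seen:
        return "duplicate"  # at-least-once redelivery
    seen.add(key)

    if callback["outcome"] != "succeeded":
        # A failure, even one arriving late, never downgrades a reservation.
        return "ignore"

    if reservation["status"] == "pending":
        reservation["status"] = "confirmed"
        reservation["payment_txn"] = txn
        return "confirm"

    # Money was taken but the seats cannot be granted: the reservation has
    # expired or was cancelled, or a different transaction already paid for it.
    return "refund"


seen = set()
res = {"status": "pending", "payment_txn": None}
actions = [
    handle_callback(res, {"provider_transaction_id": "t1", "outcome": "succeeded"}, seen),
    handle_callback(res, {"provider_transaction_id": "t1", "outcome": "succeeded"}, seen),
    handle_callback(res, {"provider_transaction_id": "t1", "outcome": "failed"}, seen),
]

expired = {"status": "expired", "payment_txn": None}
late = handle_callback(
    expired, {"provider_transaction_id": "t2", "outcome": "succeeded"}, seen
)
```

In practice the `seen` set would be a unique constraint or a processed-callbacks table in the database, so the duplicate check survives restarts.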

Trade-off Reasoning

Weight 20%
60

Answer B discusses trade-offs for database choice, locking mechanism, expiry approach, payment callback handling, and microservices vs. monolith. These are relevant but somewhat generic. The reasoning is correct but lacks the quantitative grounding and constraint-specific justification seen in Answer A. For example, the locking trade-off discussion does not explain why Redis SETNX is preferred over SELECT FOR UPDATE at 8k RPS specifically.

Scalability & Reliability

Weight 20%
65

Answer B covers multi-AZ deployment, auto-scaling groups, read replicas, Kafka buffering, Redis caching, and circuit breakers. It correctly states RPO/RTO targets and how they are met. However, it lacks a virtual waiting room for flash-sale traffic shaping, does not mention pre-warming, does not address connection pool exhaustion specifically, and the bot defense is only mentioned as rate limiting. The reliability section is solid but less detailed than A.

Clarity

Weight 10%
75

Answer B is also well-organized with clear headings and readable prose. The request flows are presented as numbered steps which aids comprehension. The data model tables are clearly formatted. It is slightly less dense than A, which makes it easier to skim, but also means some important details are missing. Overall clarity is good but not exceptional.

Total Score

83

Overall Comments

Answer B presents a very strong and competent system design that covers all the required aspects of the prompt. The architecture is sound, using standard components like microservices, PostgreSQL, Redis, and Kafka effectively. The request flows and data models are well-defined. However, its proposed mechanism for preventing overselling—relying primarily on distributed locks held during a database transaction—is generally considered less performant and more complex to manage at the specified scale compared to the optimistic concurrency pattern used by Answer A. While a very good answer, it lacks some of the finer, more performance-oriented details found in the winning answer.

Architecture Quality

Weight 30%
80

The architecture is strong and uses appropriate technologies. However, the primary mechanism for preventing overselling relies on distributed locks held over a database transaction. This approach can become a bottleneck at the required scale due to lock contention and the duration of the lock hold, making it slightly less robust than Answer A's approach.

Completeness

Weight 20%
85

The answer is very complete and covers all the required sections of the prompt, including services, data models, flows, and non-functional requirements. The data model is particularly detailed. It is a thorough response, though it lacks a few of the finer-grained details present in Answer A, such as the outbox pattern.

Trade-off Reasoning

Weight 20%
80

The tradeoff reasoning is strong and covers relevant alternatives for key components like the database and locking mechanism. The explanations are logical and clear. However, the analysis is slightly less sharp than Answer A's, particularly regarding the performance implications of different locking strategies at extreme scale.

Scalability & Reliability

Weight 20%
85

The plan for scalability and reliability is very strong, correctly identifying key strategies like horizontal scaling, caching, read replicas, and multi-AZ deployments. The use of synchronous replication to meet the RPO is a good detail. The plan is solid but slightly more generic than Answer A's, which includes more specific, proactive measures.

Clarity

Weight 10%
90

The answer is very clear and well-organized. The structure is logical, and the explanations for each component and flow are easy to understand. It is a well-written and professional document.

Comparison Summary

Final rank order is determined by judge-wise rank aggregation (average rank + Borda tie-break). Average score is shown for reference.

Judges: 3

Winning Votes

3 / 3

Average Score

90

Winning Votes

0 / 3

Average Score

73

Judging Results

Why This Side Won

Answer A wins due to its superior architectural choices for handling the core high-concurrency challenge and its more detailed, proactive scaling strategy. The proposed reservation flow in Answer A, which uses a fast, atomic Redis operation followed by an optimistic database update, is a more scalable and robust pattern for this specific problem than Answer B's reliance on distributed locks. Furthermore, Answer A's discussion on scaling and reliability includes more specific and advanced techniques like pre-warming instances, sharding the cache for hot events, and chaos engineering, demonstrating a deeper level of expertise.

Why This Side Won

Answer A wins on every heavily weighted criterion. It provides a more rigorous and correct consistency model (Redis SETNX as fast guard + Aurora optimistic locking as durable truth, with explicit rollback logic), a virtual waiting room that is essential for handling 150k concurrent users at 8k RPS, the transactional outbox pattern for reliable event publishing, more detailed and mechanically correct request flows, and more concrete trade-off reasoning tied to the specific scale numbers. On scalability and reliability it also goes further with pre-warming, event-sharded Redis, RDS Proxy, chaos drills, and a cross-region DR note. The weighted scores across architecture quality (30%), scalability/reliability (20%), tradeoff reasoning (20%), and completeness (20%) all favor A, making it the clear winner.

Judge Models OpenAI GPT-5.4

Why This Side Won

Answer A wins because, based on the weighted criteria, it is materially stronger on the most important dimensions: architecture quality, completeness, trade-off reasoning, and scalability/reliability. It directly addresses the stated peak reservation load, waiting-room admission control, Redis-plus-durable-store reservation path, expiration backstops, idempotent webhook handling, DR targets, and operational monitoring with more concrete mechanisms. Answer B is coherent but more generic and less specific about bottlenecks and failure handling under flash-sale load, so its weighted overall result is lower.
