Answer A: Anthropic Claude Opus 4.7
# Scalable Concert Ticket Reservation System: Design Plan

## 1. Assumptions

- Single cloud region (e.g., AWS us-east-1) with 3 AZs; managed services preferred.
- Catalog (events, venues, seat maps) is read-heavy and changes infrequently.
- Reservations are the hot path; payments are off-hot-path via async callbacks.
- "Never oversell" is a hard invariant; brief unavailability is preferred over double-selling.
- An external payment provider (e.g., Stripe/Adyen) handles PCI scope; we store only tokens.
- Average event size 30k seats; peak burst lasts minutes to ~1 hour.

## 2. High-Level Architecture

Clients (web/mobile) → CDN (CloudFront) → API Gateway / L7 Load Balancer → Edge auth (JWT) → Stateless microservices on Kubernetes (EKS) across 3 AZs.

Core services:

- **Identity Service**: signup, login, JWT issuance, MFA.
- **Catalog Service**: events, venues, seat-map metadata; read-optimized.
- **Inventory/Seat Service**: authoritative seat state, holds, reservations; the consistency anchor.
- **Reservation Service**: orchestrates hold → checkout → payment intent.
- **Payment Service**: integrates with provider, processes webhook callbacks idempotently.
- **Ticket Service**: issues signed digital tickets (JWT/QR) after payment success.
- **Notification Service**: email/push (SES/SNS).
- **Waiting Room / Virtual Queue Service**: throttles entry during on-sale spikes.
- **Expiration Worker**: releases unpaid holds after 10 minutes.
- **Admin/Onsale Service**: event configuration, seat-map upload, on-sale scheduling.

Cross-cutting: Kafka (MSK) for events, Redis (ElastiCache, cluster mode) for hot state and locks, PostgreSQL (Aurora Multi-AZ) for transactional data, DynamoDB for idempotency keys and ticket store, S3 for seat-map JSON/images, OpenSearch for event search.

## 3. Data Stores

- **Aurora PostgreSQL (Multi-AZ, 1 writer + 2 readers)**: events, users, reservations, payments, tickets (system of record). Continuous backup; PITR.
- **Redis Cluster (Multi-AZ, with replicas)**: per-seat hold state, per-event seat bitmap cache, rate limits, waiting room tokens. Used for fast CAS on holds.
- **DynamoDB**: payment idempotency keys, webhook dedupe, issued-ticket lookup (low-latency, multi-AZ by default).
- **Kafka (MSK)**: domain events (`SeatHeld`, `ReservationCreated`, `PaymentSucceeded`, `PaymentFailed`, `ReservationExpired`, `TicketIssued`). Replication factor 3 across AZs.
- **S3**: static seat-map artifacts, ticket PDFs.
- **CDN**: caches event listings, seat-map skeleton (not live availability).
- **OpenSearch**: event search/filtering.

## 4. Core Data Model

- **events**(event_id PK, venue_id, name, onsale_at, status, version).
- **seats**(seat_id PK, event_id, section, row, number, price_tier, status ENUM[available, held, reserved, sold], hold_id NULL, version BIGINT). Composite index (event_id, status). Row-level versioning for optimistic locking.
- **reservations**(reservation_id PK, user_id, event_id, seat_ids[], state ENUM[pending_payment, confirmed, expired, cancelled], created_at, expires_at, payment_intent_id, idempotency_key UNIQUE).
- **payments**(payment_id PK, reservation_id, provider_ref UNIQUE, status, amount, currency, received_at). Unique constraint on provider_ref for at-least-once dedupe.
- **tickets**(ticket_id PK, reservation_id, seat_id, qr_payload, issued_at, signature).
- **outbox**(id, aggregate, payload, published_at) for transactional outbox pattern → Kafka.

Redis keys:

- `event:{id}:seat:{seat_id}` → status + hold owner + TTL 600s.
- `event:{id}:availability` → bitmap/HLL for fast counts.
- `hold:{reservation_id}` → seat list, TTL 600s.

## 5. Core APIs (REST + idempotency headers)

- `GET /events?filters` → paginated list (CDN-cacheable, 30s TTL).
- `GET /events/{id}` → event details.
- `GET /events/{id}/seatmap` → static layout (long cache).
- `GET /events/{id}/availability` → coarse availability (sections); 1–5s cache.
- `GET /events/{id}/seats?section=A` → fine-grained seat status (short cache or live).
- `POST /reservations` (Idempotency-Key) → `{event_id, seat_ids[]}` → creates 10-min hold.
- `GET /reservations/{id}` → state, expires_at.
- `DELETE /reservations/{id}` → user cancels, releases seats.
- `POST /reservations/{id}/checkout` → creates payment intent at provider, returns client secret.
- `POST /webhooks/payments` → provider callback (signed, idempotent).
- `GET /tickets/{id}` → signed digital ticket.
- Admin: `POST /events`, `POST /events/{id}/seats:bulk`, `POST /events/{id}/onsale`.

## 6. Request Flows

### Browsing

Client → CDN (hit for catalog/seatmap) → on miss, API → Catalog Service → Aurora reader / OpenSearch. Availability counts served from Redis with 1–5s staleness; individual seat states pulled live for the section the user is viewing.

### Reservation (hold)

1. Client sends `POST /reservations` with Idempotency-Key and target seats.
2. API Gateway checks waiting-room token; rate-limits per user/IP.
3. Reservation Service validates event status and seat IDs.
4. **Acquire holds atomically** via Redis Lua script: for each seat key, `SETNX` with hold_id and TTL 600s; if any seat fails, roll back the successful ones and return 409 with conflicting seats.
5. Persist reservation row in Aurora in `pending_payment` state (single transaction with outbox event). Use optimistic locking on seats: `UPDATE seats SET status='held', hold_id=?, version=version+1 WHERE seat_id=? AND status='available'`. Aurora is the durable truth; Redis is the fast guard. Both must agree.
6. Return reservation with `expires_at`. p95 < 800 ms.

### Payment

1. Client calls `POST /reservations/{id}/checkout`; Payment Service creates a PaymentIntent at provider, stores `provider_ref` keyed by reservation_id (idempotent).
2. Client completes payment via provider SDK directly (we stay out of PCI scope).
3. Provider sends webhook → `POST /webhooks/payments`.
4. Webhook handler: verify signature → upsert into `payments` using `provider_ref` UNIQUE (dedupe). Use DynamoDB conditional put on event_id for extra idempotency.
5. On `succeeded`: transactional update (reservation→`confirmed`, seats→`sold`, write `tickets`, append outbox event `TicketIssued`). Out-of-order safety: handler compares event timestamps and ignores stale transitions (state machine: pending → confirmed/failed/expired, terminal states absorb late duplicates).
6. Ticket Service consumes `TicketIssued`, generates signed QR/PDF, stores in S3 + DynamoDB; Notification Service emails user.

### Expiration

- Primary: Redis TTL expires the `hold:*` key → keyspace notification triggers expiration worker; worker runs Aurora transaction releasing seats only if reservation still `pending_payment` (CAS on state).
- Backstop: scheduled job every 30s scans `reservations WHERE state='pending_payment' AND expires_at < now() - 30s` and releases.

Late webhook arriving after expiration: if seat already resold, mark payment as `refund_required` and trigger automatic refund; if seat still free, optionally re-confirm, but default policy is refund, because we cannot re-hold a possibly-resold seat. Payment provider's 5-min delay is within the 10-min hold window, so normal case has no conflict.

## 7. Preventing Overselling (Consistency)

- Seat is a single owned resource: every state transition uses **optimistic concurrency** in Aurora (`WHERE status=expected_status AND version=expected_version`).
- Redis SETNX provides fast first-line rejection at 8k RPS without hammering the DB; Aurora row-update is the second line and the legal truth.
- All payment-side writes are idempotent via `provider_ref` uniqueness + DynamoDB dedupe table.
- Outbox pattern ensures domain events are published exactly-once to Kafka relative to DB commits.
- Strong consistency within a single seat row; eventual consistency is acceptable only for aggregate availability counts shown in browse views.

## 8. Scaling Strategy for Spikes

- **Virtual waiting room**: when `concurrent_users > threshold`, new users get a queue token; only N tokens/sec are admitted to the reservation endpoints. Keeps the system at known capacity (e.g., admit 10k/sec to absorb 8k reservation attempts/sec).
- **Horizontal autoscaling** (HPA on EKS) on CPU and custom RPS metrics; pre-warm pods 15 minutes before announced on-sales.
- **Sharding hot events**: partition Redis keys by `event_id` so a single mega-event lands on a dedicated shard; can pre-provision shards for known on-sales.
- **Read scaling**: Aurora read replicas + Redis for availability + CDN for seat-map static data.
- **Backpressure**: API Gateway request quotas per user; 429 with Retry-After.
- **Async fan-out**: Kafka decouples ticket generation, email, analytics from the hot path.
- **Connection pooling**: RDS Proxy / PgBouncer to avoid Aurora connection storms.
- **Bot defense**: WAF, CAPTCHA on `POST /reservations` during on-sales, device fingerprinting.

## 9. Reliability & Disaster Recovery

- Multi-AZ for every stateful component (Aurora, Redis with replicas + automatic failover, MSK RF=3, DynamoDB).
- Aurora: continuous backup to S3, PITR to any second within retention → RPO ≤ 1 min met; failover ~30–60s → RTO ≤ 15 min met.
- Redis: Multi-AZ with automatic failover; data is reconstructable from Aurora (holds can be rebuilt on cold start from `reservations WHERE state='pending_payment'`).
- Kafka: tiered storage, RF=3, min ISR=2.
- DR runbook: cross-region Aurora Global Database replica (RPO ~ seconds) for region-wide failure recovery; documented promotion procedure.
- Chaos drills: AZ blackout, Redis primary kill, payment provider outage simulation quarterly.
- Health checks at ALB level; circuit breakers (Resilience4j-like) between services and toward payment provider.
- Graceful degradation: if Redis unavailable, fall back to DB-only path with stricter rate limit; if payment provider down, queue checkouts and notify user.

## 10. Monitoring & Alerting

- **Metrics (Prometheus + CloudWatch)**: RPS, p50/p95/p99 latency per endpoint, reservation success rate, hold-acquire conflict rate, payment webhook lag, expiration worker lag, Aurora replica lag, Redis CPU/memory/evictions, Kafka consumer lag.
- **SLOs**: 99.95% availability on-sale window; p95 browse < 300 ms; p95 reservation < 800 ms; error budget burn alerts.
- **Tracing**: OpenTelemetry end-to-end (client → API → service → DB).
- **Logging**: structured JSON to CloudWatch/Elastic; correlation IDs.
- **Business dashboards**: holds/sec, conversion (hold→paid), oversell counter (must be 0; page on any nonzero).
- **Alerts**: oversell=0 violation (P0), webhook backlog > 1 min, reservation p95 > 800 ms for 5 min, Aurora failover, Redis failover, payment success rate drop > 2σ.

## 11. Key Trade-offs and Alternatives

- **Redis as hold authority vs. DB-only**: pure DB approach is simpler and stronger but couldn't sustain 8k RPS on single-row hotspots; Redis-first absorbs the spike, DB row update guarantees correctness.
- **Pessimistic locks (SELECT FOR UPDATE)**: considered; rejected because lock contention on popular sections would tank throughput. Optimistic locking with retries scales better.
- **Reserved seating vs. general admission counter**: design above is per-seat. For GA, a single decrementing counter (`DECR` in Redis with floor 0, mirrored to DB) is used instead.
- **Single-region vs. multi-region active-active**: constraint says one region. Active-active would complicate "never oversell" (cross-region consensus needed). We use cross-region read replica for DR only.
- **Kafka vs. SQS**: Kafka chosen for ordered, replayable event log (helps with out-of-order payment processing and rebuilding state).
- **Synchronous payment vs. webhook-only**: webhook-only chosen to remain PCI-light and tolerate provider latency; webhook delays up to 5 min still fit within 10-min hold.
- **Waiting room vs. pure autoscaling**: autoscaling alone cannot protect downstream stateful stores; the queue gives deterministic load shape and a better UX than mass 503s.
- **Issuing tickets synchronously vs. async**: async via Kafka keeps payment callback handler fast and resilient; user sees ticket within seconds via push/refresh.

This design meets the stated SLOs, prevents overselling via layered optimistic concurrency, absorbs on-sale spikes via virtual queue + Redis-fronted holds, and satisfies the RPO/RTO targets through Multi-AZ managed stores with PITR and rehearsed failover.
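The all-or-nothing hold from the Reservation flow in section 6, and the compare-and-set confirmation that backs the "never oversell" invariant in section 7, can be sketched in a few lines of Python. This is a minimal in-memory sketch, not the answer's implementation: `SeatStore`, `Seat`, `hold`, and `confirm` are hypothetical names, a single dictionary stands in for both the Redis guard and the Aurora seat rows, and the code checks every seat before writing instead of setting keys and rolling back. The two approaches are equivalent only if the whole function runs without interleaving, as a Redis Lua script does; the Python version is not thread-safe as written.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Optional

HOLD_TTL_SECONDS = 600  # 10-minute hold window from the design


@dataclass
class Seat:
    status: str = "available"       # available | held | sold
    hold_id: Optional[str] = None
    expires_at: float = 0.0
    version: int = 0                # bumped on every state transition


class SeatStore:
    """Hypothetical in-memory stand-in for the Redis guard plus Aurora seat rows."""

    def __init__(self, seat_ids):
        self.seats = {sid: Seat() for sid in seat_ids}

    def _release_if_expired(self, seat, now):
        # Mirrors the expiration worker: an unpaid hold past its TTL is freed.
        if seat.status == "held" and seat.expires_at <= now:
            seat.status, seat.hold_id = "available", None
            seat.version += 1

    def hold(self, seat_ids, now=None):
        """All-or-nothing hold: returns (hold_id, []) or (None, conflicting seats)."""
        now = time.time() if now is None else now
        for sid in seat_ids:
            self._release_if_expired(self.seats[sid], now)
        conflicts = [sid for sid in seat_ids if self.seats[sid].status != "available"]
        if conflicts:
            return None, conflicts  # the design returns HTTP 409 here
        hold_id = str(uuid.uuid4())
        for sid in seat_ids:
            seat = self.seats[sid]
            seat.status, seat.hold_id = "held", hold_id
            seat.expires_at = now + HOLD_TTL_SECONDS
            seat.version += 1
        return hold_id, []

    def confirm(self, hold_id, seat_ids, now=None):
        """Compare-and-set held -> sold; fails if any hold expired or changed owner."""
        now = time.time() if now is None else now
        for sid in seat_ids:
            seat = self.seats[sid]
            if seat.status != "held" or seat.hold_id != hold_id or seat.expires_at <= now:
                return False  # late payment: the design routes this to a refund
        for sid in seat_ids:
            seat = self.seats[sid]
            seat.status = "sold"
            seat.version += 1
        return True
```

For example, after `store = SeatStore(["A1", "A2", "A3"])` and a successful `store.hold(["A1", "A2"], now=0)`, a second `store.hold(["A2", "A3"], now=1)` reports `A2` as a conflict and leaves `A3` available, which is the property that keeps a partially failed request from stranding seats.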
Result
Winning Votes
3 / 3
Overall Comments
Answer A presents a concrete, end-to-end design that directly addresses the prompt's flash-sale constraints, correctness requirements, and operational targets. Its strongest aspects are the layered anti-oversell strategy, explicit request flows, waiting-room/load-shaping approach, idempotent payment callback handling, and detailed reliability/monitoring sections. It also discusses fallback behavior and trade-offs with reasonable specificity. Minor weaknesses are some added complexity in the Redis-plus-Aurora dual-write hold path and a few implementation choices that would need careful engineering to avoid drift, but overall it is a strong benchmark-quality system design answer.
Architecture Quality
Weight 30%
Strong architecture with well-chosen components and clear separation between catalog, inventory, reservation, payment, ticketing, waiting room, and expiry workers. The design ties read paths, write paths, eventing, and durable storage together coherently, and explicitly identifies the inventory service as the consistency anchor. The Redis plus Aurora layered hold approach is sophisticated and suitable for the problem, though it introduces coordination complexity.
Completeness
Weight 20%
Covers assumptions, services, data stores, APIs, data model, detailed browsing/reservation/payment/expiry flows, anti-oversell consistency, spike handling, DR, monitoring, and trade-offs. It also addresses out-of-order and delayed callbacks, backstop expiry scans, graceful degradation, and bot defense. Very few prompt areas are left untreated.
Trade-off Reasoning
Weight 20%
Provides several meaningful trade-offs: Redis-first versus DB-only, optimistic versus pessimistic locking, Kafka versus SQS, webhook-only payment, waiting room versus autoscaling, and single-region versus active-active. The reasoning is specific to the prompt's constraints and explains why correctness and load-shaping dominate design choices.
Scalability & Reliability
Weight 20%
Strong on both scale and resilience: explicit waiting room, rate limiting, pre-warming, shard-aware Redis strategy for hot events, CDN and cache layering for reads, async decoupling with Kafka, connection pooling, multi-AZ deployment, PITR, failover expectations, backstop rebuild from Aurora, and detailed alerts. It directly addresses the stated flash-sale and DR requirements.
Clarity
Weight 10%
Well structured and easy to follow, with distinct sections and step-by-step flows. The answer is dense but still readable. Some parts are slightly complex due to the layered consistency strategy, yet the organization keeps it understandable.
Overall Comments
Answer A is a comprehensive, deeply technical design plan that directly addresses every constraint in the prompt. It provides a layered consistency model (Redis SETNX + Aurora optimistic locking), a concrete virtual waiting room strategy for flash-sale traffic, a detailed expiration flow with both a Redis TTL primary path and a DB-scan backstop, idempotent payment handling with outbox pattern, and specific SLO-tied alerting. The data model is precise, the request flows are step-by-step and mechanically sound, and trade-offs are argued with concrete reasoning rather than generic statements. Minor weaknesses: some sections are dense and could benefit from diagrams, and the cross-region DR section is brief, but overall this is a benchmark-quality response.
Architecture Quality
Weight 30%
Answer A presents a well-layered architecture with clear separation of concerns, a precise two-phase consistency model (Redis SETNX + Aurora optimistic locking), transactional outbox for reliable Kafka publishing, idempotent payment handling via provider_ref uniqueness and DynamoDB dedupe, and a virtual waiting room. Every component choice is justified and tied to a specific constraint. The data model is detailed and correct, including version columns, hold_id, and outbox table.
Completeness
Weight 20%
Answer A covers all required sections: services, data stores, APIs, data model, all four request flows (browse, reserve, pay, expire), scaling strategy, reliability/DR with RPO/RTO analysis, consistency guarantees, monitoring with specific SLO-tied alerts, and trade-offs. It also addresses bot defense, connection pooling, pre-warming, and late webhook handling after expiry.
Trade-off Reasoning
Weight 20%
Answer A provides concrete, well-argued trade-offs: Redis-first vs. DB-only (with RPS justification), pessimistic vs. optimistic locking (with contention reasoning), single-region vs. multi-region active-active (with oversell risk explanation), Kafka vs. SQS, waiting room vs. pure autoscaling, and async vs. sync ticket issuance. Each trade-off is tied to a specific constraint or failure mode.
Scalability & Reliability
Weight 20%
Answer A addresses the 150k concurrent user / 8k RPS spike with a virtual waiting room (admitting N tokens/sec), pre-warming pods 15 min before on-sales, event-sharded Redis, RDS Proxy for connection pooling, WAF/CAPTCHA for bot defense, and async Kafka fan-out. Reliability covers multi-AZ for all stores, Aurora PITR meeting RPO < 1 min, failover meeting RTO < 15 min, Redis reconstructability from DB, chaos drills, and circuit breakers. Cross-region Global Database is mentioned for region-wide DR.
Clarity
Weight 10%
Answer A is well-organized with numbered sections, clear headings, and step-by-step flows. The Redis key schema and data model are explicitly stated. The writing is precise and technical without being verbose. Minor issue: the density of some sections (especially consistency and scaling) could benefit from a summary diagram, but the prose is clear throughout.
Overall Comments
Answer A provides an outstanding and highly detailed system design. Its core strength lies in the specific and robust mechanism for handling the high-concurrency seat reservation problem, using a combination of a Redis Lua script for a fast, atomic check and optimistic locking in the database for correctness. The scaling strategy is comprehensive and proactive, including practical details like a virtual waiting room, pre-warming instances, and sharding hot events in the cache. The design is complete, well-reasoned, and demonstrates a deep understanding of the trade-offs involved in building such a system.
Architecture Quality
Weight 30%
The architecture is excellent. The choice of using Redis with an atomic operation (SETNX/Lua) as a fast guard in front of a database with optimistic locking is a superior pattern for this high-contention problem. It effectively handles the 8k RPS load without overwhelming the database. The integration of a virtual waiting room is also a key strength.
Completeness
Weight 20%
The answer is exceptionally complete, addressing every single point in the prompt with significant detail. It includes specific implementation details like the transactional outbox pattern, use of DynamoDB for idempotency, and bot defense strategies, which go beyond the basic requirements.
Trade-off Reasoning
Weight 20%
The trade-off reasoning is excellent and demonstrates deep expertise. The discussion on Redis vs. DB-only, pessimistic vs. optimistic locking, and the choice of Kafka over SQS are all directly relevant and well-justified in the context of the system's specific constraints.
Scalability & Reliability
Weight 20%
This is a standout section. The scaling strategy is both comprehensive and proactive, mentioning specific techniques like pre-warming pods, sharding Redis by event ID for hot events, and using RDS Proxy. The reliability plan is also top-tier, including multi-AZ for all components, a cross-region DR plan, and a commitment to chaos drills.
Clarity
Weight 10%
The answer is exceptionally clear, well-structured, and easy to follow. The use of numbered sections, subheadings, and bold text effectively guides the reader through the complex design.