Answer A: OpenAI GPT-5.4
A. Short URL Generation Strategy

1. Requirements-driven choice
- New writes: 100M/month ≈ 3.3M/day ≈ 38.6 writes/sec average.
- Reads at 100:1 ≈ 10B/month ≈ 3,860 reads/sec average, with peaks likely 10–50x higher.
- We need compact, globally unique IDs with very fast lookup.

2. Code generation approach
I would use a deterministic ID-generation service plus Base62 encoding. Flow:
- A globally unique numeric ID is generated.
- That integer is encoded into Base62 using the characters [0-9a-zA-Z].
- Example: 125 -> "21" in Base62.
Why this approach:
- No collision risk if IDs are unique.
- Very fast to generate.
- Easy to operate compared with random-code collision checking.
- Predictable code-length growth.

3. ID generator design
Preferred option:
- Use a Snowflake-style 64-bit ID generator or a centralized range allocator.
- ID fields can include timestamp + machine ID + sequence, or we can allocate monotonically increasing ranges to app servers.
Two good implementations:
- Snowflake-like IDs: decentralized, high throughput, low coordination.
- Range allocation from a metadata store: each write node leases a block of IDs, e.g. 1M IDs at a time.
I prefer range allocation for simplicity because write throughput is modest. Each write server can generate IDs locally from its leased range, avoiding a hot central dependency.

4. Encoding scheme and expected length
Base62 capacity:
- 62^6 ≈ 56.8B
- 62^7 ≈ 3.52T
Total links over 5 years:
- 100M/month * 60 months = 6B links
So 6 Base62 chars are not enough for long-term headroom if IDs are densely used globally, but 7 chars are more than enough. Expected length:
- Start at 6 chars early in the lifecycle if desired.
- In practice, standardize on 7 chars to keep the implementation simple and avoid changing user-visible code lengths.

5. Collision handling
With deterministic unique IDs:
- No collisions at the code level.
- A database uniqueness constraint on short_code provides a final safety net.
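A minimal sketch of the Base62 encoding described in step 2 (the alphabet and the 125 -> "21" example come from the text; the function names and the zero-padding convention for fixed 7-char codes are mine):

```python
# Base62 alphabet in the order [0-9a-zA-Z], as described above.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62

def encode_base62(n: int, min_len: int = 1) -> str:
    """Encode a non-negative integer ID as a Base62 short code."""
    if n < 0:
        raise ValueError("IDs must be non-negative")
    digits = []
    while n:
        n, r = divmod(n, BASE)
        digits.append(ALPHABET[r])
    code = "".join(reversed(digits)) or ALPHABET[0]
    return code.rjust(min_len, ALPHABET[0])  # pad to a fixed length if desired

def decode_base62(code: str) -> int:
    """Invert encode_base62."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n

print(encode_base62(125))             # "21", matching the example above
print(encode_base62(125, min_len=7))  # "0000021" as a fixed 7-char code
print(decode_base62("21"))            # 125
```

Because encoding is a pure function of the unique ID, no collision check is needed on this path; 62**7 = 3,521,614,606,208 confirms the ~3.52T capacity quoted above.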
- If a very rare duplicate is observed due to a software bug or ID misconfiguration, regenerate from a fresh ID and alert.

6. Exhaustion strategy
- 7-char Base62 gives 3.52 trillion combinations, far beyond the 6B needed in 5 years.
- If future scale grows substantially, support 8-char codes seamlessly.
- The redirect service should treat code length as variable, so there is no migration issue.

7. Optional custom aliases
If the product supports user-defined aliases:
- Store them in the same mapping table with uniqueness enforced.
- Reserve blocked words and abusive terms.
- Apply stricter rate limits because custom aliases are a contention hotspot.

B. Data Storage

1. Primary database choice
Use a distributed, highly available key-value store for the URL mapping, such as DynamoDB, Bigtable, or Cassandra. Why:
- The primary access pattern is a simple key lookup by short_code.
- We need horizontal scaling, high availability, and multi-replica reads.
- Writes are modest; reads are dominant.
- Key-value access is ideal for sub-50ms redirect latency.
Conceptually, a DynamoDB-like or Cassandra-like store:
- Partition by hash of short_code.
- Replicate across availability zones.
- Tune for low-latency point reads.

2. Secondary stores
- Relational DB for control-plane metadata: billing, users, domains, policies.
- Object storage + data lake for analytics/click logs.
- Optional search/index store if admins need to query by creator, date, tags, etc.

3. Schema design
Primary mapping table:
- short_code (PK)
- destination_url
- created_at
- expires_at (nullable)
- owner_id (nullable)
- status (active, disabled, deleted)
- redirect_type (301/302)
- checksum or normalized_url hash (optional)
- metadata pointer (optional)
Optional dedup table if the product wants the same long URL to return the same short URL per tenant:
- dedup_key = hash(tenant_id + canonicalized_url)
- short_code
This is optional; many URL shorteners do not globally deduplicate because semantics differ by user, campaign, or analytics needs.

4. Storage estimate over 5 years
Total links:
- 6B records
Per-record rough estimate:
- short_code: ~8 bytes raw equivalent
- destination_url: assume 200 bytes on average
- timestamps/status/flags: ~24 bytes
- replication/versioning/index overhead: substantial in real systems
- storage-engine overhead: assume an effective total of ~350–500 bytes per record in the primary DB
Using 400 bytes/record:
- 6B * 400 bytes = 2.4 TB of raw logical data
With replication factor 3:
- 7.2 TB
Add indexes, compaction overhead, tombstones, metadata, and operational headroom:
- Plan for 10–15 TB of primary storage over 5 years
Analytics/click logs are much larger than the mappings. At 100 reads per write:
- 600B redirects over 5 years
If each click-log event averaged even 100 bytes compressed, that is ~60 TB compressed, and likely much more with richer fields. Therefore logs should go to cheap object storage, not the serving database.

5. Partitioning/sharding strategy
Primary table sharding:
- Partition key: short_code or hash(short_code).
- Distribution is uniform as long as partition placement does not follow raw sequential ID order.
Important note:
- Pure sequential IDs can create hot partitions in databases where key locality matters.
- To avoid this, either:
  1. Hash the short_code for partition placement, or
  2. Use IDs with enough entropy in the lower bits, or
  3. Prefix with shard bits before Base62 encoding.
I would explicitly hash-partition on short_code to ensure even distribution.

C. Read Path Architecture

1. Read path overview
Request flow:
- User hits https://sho.rt/abc1234
- DNS routes to the nearest edge/CDN
- CDN forwards to the regional redirect service if not a cache hit
- Redirect service checks its in-memory/Redis cache
- On cache miss, read from the distributed KV store
- Return HTTP 301/302 with a Location header
- Asynchronously emit a click event to the analytics pipeline

2. Meeting <50ms p95 latency
The redirect path must be extremely lightweight.
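The layered lookup in the read-path flow above can be sketched as follows. The cache and store clients here are in-memory stand-ins (a real deployment would use a Redis client and a KV-store SDK with TTLs); the class and function names are mine:

```python
from collections import OrderedDict

class LruCache:
    """Tiny per-node LRU cache for hot short_code -> destination_url mappings."""
    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self._data = OrderedDict()  # key -> value, ordered by recency of use

    def get(self, key: str):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

def resolve(code: str, local: LruCache, redis_like: dict, kv_store: dict):
    """Local cache -> shared cache -> primary KV store, read-through on miss."""
    url = local.get(code)
    if url is not None:
        return url
    url = redis_like.get(code)
    if url is None:
        url = kv_store.get(code)  # authoritative source of truth
        if url is None:
            return None           # unknown code; caller returns 404
        redis_like[code] = url    # populate the shared cache on the way back
    local.put(code, url)          # populate the per-node cache
    return url
```

In the real system each dict lookup is a network call bounded by a TTL, and a full miss would also be a candidate for the negative caching discussed later.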
Latency budget example:
- Edge/CDN routing: small, geographically local
- App processing: 1–3ms
- Redis/in-memory cache hit: 1–5ms
- DB miss path within region: 5–15ms typical
- Total p95 under 50ms is achievable with regional serving and aggressive caching

3. Caching layers
Layer 1: CDN/edge caching
- Cache redirect responses for hot links at the edge.
- Very effective because many popular short links are repeatedly accessed.
- Use cache-control headers with a TTL, e.g. minutes to hours depending on mutability.
- If links must be disabled quickly, choose a shorter TTL or purge support.
Layer 2: Application local cache
- Each redirect node keeps an LRU cache of hot mappings.
- Ultra-low latency; avoids the network hop to Redis.
- Good for the most frequently requested codes.
Layer 3: Distributed cache (Redis/Memcached)
- Shared cache for the regional fleet.
- Stores short_code -> destination_url, status, redirect type.
- TTL can be long for immutable links; shorter for mutable/admin-controlled links.

4. Replication strategy for reads
- Multi-AZ replicas inside each region.
- Serve reads from local-region storage replicas.
- Active-active across multiple regions for global traffic.
- Use geo-DNS or Anycast to route to the nearest healthy region.

5. Cache population strategy
- Read-through on miss: the app fetches from the DB and populates Redis and the local cache.
- Negative caching with a short TTL for nonexistent/disabled codes, to absorb abuse and typo storms.
- Prewarm the cache for trending links if known from analytics.

6. Redirect semantics
- 302 by default if the destination may change or if analytics/tracking policies require flexibility.
- 301 for permanent links where clients and CDNs may cache aggressively.
- The product can expose both options.

7. Abuse handling on the read path
- Rate limit by IP/ASN/token for suspicious high-rate requests.
- WAF and bot filters at the CDN.
- Protect the origin from brute-force short-code scanning.

D. Write Path Architecture

1. Write path overview
Request flow:
- Client sends the long URL and optional metadata/custom alias.
- API gateway authenticates and rate limits.
- URL validation and normalization happen in the stateless app tier.
- App obtains/generates a short code.
- Persist the mapping to the primary KV store.
- Populate caches asynchronously, or synchronously for immediate availability.
- Emit a creation event to a queue for analytics, abuse scanning, and downstream indexing.

2. Capacity
100M/month is not high for modern systems:
- Average ~39 writes/sec
- Even a 100x burst is only a few thousand writes/sec
So the main goals are reliability, idempotency, and operational simplicity rather than extreme write throughput.

3. Validation steps
- Validate URL syntax.
- Enforce allowed protocols, usually http/https only.
- Optionally run safe-browsing or malware scans asynchronously; if risk is found, disable the link.
- Canonicalize the URL for optional dedup logic.

4. Queuing and asynchronous work
Use a durable queue or log such as Kafka/SQS/PubSub for side effects:
- Analytics event for new link creation
- Abuse/scam/phishing detection
- Cache warmup
- Search indexing
- Notification/audit pipeline
The critical path should include only what is necessary to create the mapping and return the shortened URL.

5. Consistency model
For the create response, provide read-your-writes consistency for the new short code: once the API returns success, the redirect should work immediately. How:
- Persist the mapping to a quorum/leader in the primary DB before acknowledging.
- Optionally write through to Redis after the DB commit.
- The redirect path falls back to the DB on cache miss, so cache propagation lag does not break correctness.

6. Idempotency
Support idempotency keys so API clients do not create duplicate links during retries.
- Store request_id -> short_code for a bounded TTL in a fast store.
- Especially useful for mobile/network-retry scenarios.

7. Rate limiting
- Per-user/API-key quotas to control abuse.
- Stronger limits on custom-alias requests.
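The create flow with idempotency keys described in this section can be sketched with in-memory stand-ins. The store names, the `to_base62` helper, and the starting ID are mine; real deployments would back the idempotency store with a TTL'd fast store and the mapping table with the primary KV database:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n: int, width: int = 7) -> str:
    """Render a numeric ID as a fixed-width Base62 code."""
    s = ""
    while n:
        n, r = divmod(n, 62)
        s = ALPHABET[r] + s
    return (s or "0").rjust(width, "0")

idempotency_store = {}  # request_id -> short_code; bounded TTL in a real fast store
mapping_table = {}      # short_code -> record; stand-in for the primary KV store
_next_id = 1000         # stand-in for an ID drawn from a leased range

def create_link(long_url: str, request_id: str) -> str:
    """Create a mapping; a retried request_id returns the original short code."""
    global _next_id
    if not long_url.startswith(("http://", "https://")):
        raise ValueError("only http/https URLs are allowed")
    if request_id in idempotency_store:   # retry: do not mint a second code
        return idempotency_store[request_id]
    code = to_base62(_next_id)
    _next_id += 1
    if code in mapping_table:             # uniqueness constraint as a safety net
        raise RuntimeError("duplicate short code: ID misconfiguration")
    mapping_table[code] = {"destination_url": long_url, "status": "active"}
    idempotency_store[request_id] = code  # record only after the durable write
    return code
```

The idempotency record is written only after the mapping persists, so a crash between the two steps causes at worst a retried write of the same request, never a dangling short code.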
- Global safeguards to prevent attack-induced write amplification.

E. Reliability and Fault Tolerance

1. Availability target: 99.9%
99.9% uptime allows about 43.8 minutes of downtime per month. This is achievable with multi-AZ deployment and regional failover.

2. Failure handling
Node failure:
- Stateless app nodes behind load balancers; auto-replace unhealthy instances.
- Local caches are disposable.
- Redis deployed as a highly available cluster or in sentinel mode.
AZ failure:
- App tier deployed across at least 3 AZs.
- Primary DB replicated across AZs.
- Load balancer removes the unhealthy AZ.
Regional/DC outage:
- Active-active read serving across multiple regions.
- Data replicated cross-region asynchronously or near-synchronously, depending on DB capabilities.
- DNS/traffic manager fails users over to a healthy region.

3. Data durability
- Primary DB replication factor 3 across AZs.
- Periodic snapshots/backups to object storage.
- WAL/commit-log archival if supported.
- Cross-region backup copies for disaster recovery.

4. Backup and recovery strategy
- Daily full snapshots plus incremental backups.
- Point-in-time recovery for accidental deletion/corruption.
- Regular restore drills into staging to verify RTO/RPO.
- Backup retention aligned with the 5-year accessibility requirement and compliance needs.
Reasonable objectives:
- RPO: minutes to 1 hour, depending on the cross-region replication model
- RTO: under 1 hour for regional failover; longer for a full storage rebuild, but that should be rare

5. Cache invalidation strategy
If links are mostly immutable, caching is simple. For mutable links (disable, change destination, expiration):
- Update the DB first.
- Publish an invalidation event.
- Evict the Redis key and local cache entries.
- Purge the CDN cache if edge caching is used for that code.
- Use a bounded TTL so stale cache entries self-heal even if an invalidation is missed.

6. Protection against data corruption and abuse
- DB uniqueness constraints on short_code.
- Checksums/version fields for records.
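The invalidation order in the cache invalidation strategy above (DB first, then each cache layer) can be sketched with in-memory stand-ins; the variable names are illustrative, and in practice step 3 would be a pub/sub fan-out and step 4 a CDN purge API call:

```python
# In-memory stand-ins for the storage and cache layers.
db = {"abc1234": {"destination_url": "https://old.example", "status": "active"}}
redis_cache = {"abc1234": dict(db["abc1234"])}
local_caches = [{"abc1234": dict(db["abc1234"])}]  # one dict per redirect node
cdn_purged = []  # records which codes were purged at the edge

def update_destination(code: str, new_url: str) -> None:
    """DB first, then invalidate each cache layer, in the order listed above."""
    if code not in db:
        raise KeyError(code)
    db[code]["destination_url"] = new_url  # 1. update the source of truth
    redis_cache.pop(code, None)            # 2. evict the shared cache
    for cache in local_caches:             # 3. evict per-node caches
        cache.pop(code, None)              #    (a pub/sub event in practice)
    cdn_purged.append(code)                # 4. purge edge copies
```

If any eviction step fails, the bounded TTLs on each layer remain the safety net: stale entries expire on their own.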
- Audit logs for admin actions.
- Soft delete or a disabled state instead of hard delete where possible.
- Malware/phishing scanning and takedown tooling.

F. Key Trade-offs

1. Deterministic ID generation vs random short codes
Options:
- Deterministic sequential/range-based IDs: simple, collision-free, compact, fast.
- Random codes: less predictable/enumerable, but they require collision checks and add complexity.
Choice:
- Deterministic IDs encoded in Base62.
Why:
- Simpler, faster, and operationally safer at this scale.
- Collision-free without retry loops.
- A better fit because throughput is modest and the main challenge is read latency.
Cost:
- Codes are more enumerable/predictable.
Mitigation:
- Rate limiting, bot detection, and optional longer or non-sequential aliases for sensitive links.

2. Stronger consistency on create vs eventual consistency everywhere
Options:
- Eventual consistency could reduce write latency and simplify multi-region writes.
- Strong acknowledgment after a durable write ensures a returned short URL works immediately.
Choice:
- Strong consistency for the single-link create path within a home region; eventual consistency for secondary caches and cross-region propagation.
Why:
- Better user experience: once the API returns, the redirect should succeed.
- Write volume is low enough that we do not need to relax consistency on the critical write.

3. Aggressive edge caching vs fast invalidation
Options:
- Long CDN TTLs maximize read performance and reduce origin cost.
- Short TTLs make disable/update operations propagate faster.
Choice:
- Moderate TTLs with explicit purge for mutable/admin-managed links; longer TTLs for immutable links.
Why:
- Most shortened URLs are effectively immutable, so caching gives major latency and cost benefits.
- But abuse takedowns require reasonably fast invalidation.

4. Simplicity vs analytics richness on the redirect path
Options:
- Synchronous click logging gives immediate analytics guarantees but adds latency and failure coupling.
- Asynchronous event emission keeps redirects fast, but analytics may lag or lose small amounts of data during failures.
Choice:
- Asynchronous analytics.
Why:
- The product's hard SLA is redirect latency and availability, not zero-loss real-time analytics.
- The redirect path should never block on downstream logging systems.

Summary architecture
- Short-code generation: range-leased or Snowflake-style unique 64-bit IDs encoded in Base62, with standard 7-char codes.
- Primary store: distributed KV database, hash-partitioned by short_code, RF=3, multi-AZ.
- Read path: CDN + local cache + Redis + KV-store fallback; multi-region active-active reads.
- Write path: stateless API + durable DB write + async queue for side effects; idempotency and rate limits.
- Reliability: multi-AZ, cross-region replication, backups, cache-invalidation events, DNS failover.
- This design comfortably supports 100M new URLs/month, a 100:1 read ratio, 5-year retention, 99.9% uptime, and <50ms p95 redirects.
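The range-leasing scheme preferred in section A.3 can be sketched as follows. The class names are mine, the central counter stands in for a transactional metadata store, and the 1M block size matches the example in the text:

```python
import threading

class RangeAllocator:
    """Leases blocks of sequential IDs; stand-in for a transactional metadata store."""
    def __init__(self, block_size: int = 1_000_000):
        self.block_size = block_size
        self._next_block_start = 0
        self._lock = threading.Lock()  # the store's transaction in real life

    def lease_block(self) -> range:
        """One coordinated round trip per block, not per ID."""
        with self._lock:
            start = self._next_block_start
            self._next_block_start += self.block_size
        return range(start, start + self.block_size)

class WriteNode:
    """App server that mints IDs locally from its leased range."""
    def __init__(self, allocator: RangeAllocator):
        self._allocator = allocator
        self._ids = iter(())  # empty: first call leases a block

    def next_id(self) -> int:
        try:
            return next(self._ids)
        except StopIteration:  # range exhausted: lease a fresh block
            self._ids = iter(self._allocator.lease_block())
            return next(self._ids)
```

Because blocks never overlap, every node mints globally unique IDs with no per-write coordination, which is why the design above calls this the simpler fit for ~39 writes/sec.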
Result
Winning Votes: 3 / 3
Overall Comments
Comprehensive, well-structured design that addresses all required sections with solid quantitative reasoning (write/read QPS, 5-year volume, Base62 keyspace) and practical mechanisms (range-leased IDs, multi-layer caching with CDN + local + Redis, negative caching, multi-region routing, idempotency, queue-based side effects, cache invalidation/purge strategy). Storage sizing is reasonably conservative and acknowledges real overheads and log volume separation. Trade-offs are concrete and tied to constraints (latency vs analytics coupling, CDN TTL vs invalidation, consistency choices).
Architecture Quality
Weight 30%: End-to-end architecture is cohesive: deterministic ID generation with range leasing, KV primary store, multi-layer caching (CDN/local/Redis), async analytics, and explicit partitioning/hotspot mitigation. Addresses latency and operational concerns well.
Completeness
Weight 20%: Fully covers A–F with calculations, schema, storage estimate, read/write paths, reliability, and multiple trade-offs; includes extras like negative caching, abuse controls, and custom aliases.
Trade-off Reasoning
Weight 20%: Trade-offs are specific and justified (deterministic IDs vs randomness/enumeration, strong create consistency vs eventual propagation, CDN TTL vs takedown speed, async analytics vs redirect latency).
Scalability & Reliability
Weight 20%: Clear multi-AZ + multi-region approach, failover routing, RF=3, backups/PITR, cache-failure behavior, and mechanisms to keep redirect latency low at scale.
Clarity
Weight 10%: Well organized with concrete bullets, flows, and back-of-the-envelope math; easy to follow despite being detailed.
Overall Comments
Answer A provides an outstanding and comprehensive system design. It is well-structured, detailed, and demonstrates a deep understanding of the practical challenges involved. The quantitative estimates for storage are realistic and well-justified. The architecture for both read and write paths is robust, practical, and perfectly suited to the scale of the problem. The discussion of trade-offs is particularly strong, identifying four distinct and relevant points with clear reasoning. The inclusion of operational details like cache invalidation strategies and abuse handling further elevates the quality of the design.
Architecture Quality
Weight 30%: The architecture is exceptionally well-designed. The multi-layered caching on the read path (CDN, local, distributed) is excellent. The write path is simple, robust, and provides strong consistency for the user-facing critical path while correctly offloading side effects to an async queue. This is a practical and highly effective design.
Completeness
Weight 20%: The answer is extremely complete, addressing all sections of the prompt in great detail. It goes beyond the minimum requirements by discussing optional features like custom aliases, providing a latency budget, and detailing abuse handling on both read and write paths.
Trade-off Reasoning
Weight 20%: The trade-off analysis is outstanding. The answer identifies four distinct and highly relevant trade-offs, exceeding the prompt's requirement for two. Each choice is explained with clear, compelling reasoning that reflects a deep understanding of system design principles.
Scalability & Reliability
Weight 20%: The design is highly scalable and reliable. It correctly employs standard patterns like multi-AZ/multi-region deployments, distributed databases with replication, and robust backup strategies. The reliability discussion is thorough, covering failure scenarios from node to region level.
Clarity
Weight 10%: The answer is presented as a very clear and well-structured design document. The sections are logical, and the explanations are easy to follow. The quantitative reasoning is laid out step by step, making it easy to verify.
Overall Comments
Answer A is a comprehensive, well-structured design document that thoroughly addresses all six required sections. It demonstrates strong quantitative reasoning throughout, with detailed capacity calculations, latency budget breakdowns, and storage estimates. The architecture is well-justified with clear explanations for each choice. The read path architecture is particularly strong, with a multi-layer caching strategy (CDN, local cache, Redis, DB fallback) and detailed latency budget analysis. The write path correctly identifies the modest write throughput and focuses on reliability and idempotency. The trade-off section is exceptional, identifying four genuine trade-offs with nuanced reasoning for each choice, including mitigations for the downsides. The answer also covers important practical concerns like abuse handling, negative caching, custom aliases, and analytics logging separation. Minor weaknesses include some verbosity and the storage estimate using 200 bytes for URLs (slightly high but reasonable as a conservative estimate).
Architecture Quality
Weight 30%: Answer A presents a well-designed multi-layer architecture with CDN, local cache, Redis, and DB fallback for reads, with a clear latency budget showing how 50ms p95 is achievable. The write path correctly ensures a durable DB write before acknowledging to the client. The range-based ID allocation is well-justified for the modest write throughput. The separation of analytics to async pipelines and object storage is practical and well-reasoned.
Completeness
Weight 20%: Answer A thoroughly covers all six sections with additional practical considerations: custom aliases, abuse handling on the read path, negative caching, redirect semantics (301 vs 302), idempotency keys, soft deletes, and analytics storage separation. The storage estimate includes analytics/click-log considerations. The schema includes useful fields like redirect_type and status.
Trade-off Reasoning
Weight 20%: Answer A identifies four genuine, well-reasoned trade-offs: deterministic vs random IDs, strong vs eventual consistency on create, aggressive edge caching vs fast invalidation, and synchronous vs asynchronous analytics. Each trade-off includes clear options, the chosen direction, reasoning, and acknowledgment of costs with mitigations. The caching vs invalidation trade-off is particularly nuanced, proposing different strategies for immutable vs mutable links.
Scalability & Reliability
Weight 20%: Answer A provides detailed reliability analysis including specific RPO/RTO targets, multi-AZ and cross-region deployment, cache-invalidation event propagation, DNS failover, and restore drills. The discussion of different failure modes (node, AZ, regional) is systematic. The cache-invalidation strategy with bounded TTLs as a safety net is practical. The 99.9% uptime calculation (43.8 min/month) grounds the discussion in reality.
Clarity
Weight 10%: Answer A is well-organized with clear section headers and numbered subsections. The writing is direct and technical. The latency budget breakdown is particularly clear, and the summary architecture section at the end provides a good recap. However, the answer is quite long and some sections could be more concise. The numbered-list format within sections aids readability.