Answer A: Google Gemini 2.5 Pro
### 1. High-Level Architecture

The system is designed as a globally distributed, multi-region service to ensure low latency and high availability. Each major region (e.g., US-East, EU-West, AP-Southeast) hosts a complete Point of Presence (PoP).

**Components:**

1. **Global Load Balancer (GLB):** Uses latency-based DNS routing to direct users to the nearest healthy regional PoP.
2. **Regional PoP Stack:**
   * **API Gateway:** Terminates TLS, authenticates requests, and routes them to the appropriate backend service.
   * **Redirect Service:** A lightweight, highly scaled service that handles `GET /{short_alias}` requests. It is optimized for speed.
   * **Write Service:** Handles link creation (`POST /api/v1/links`), including custom alias validation and unique ID generation.
   * **Analytics Service:** Provides access to link statistics via `GET /api/v1/links/{short_alias}/stats`.
3. **Data Infrastructure:**
   * **Global Database:** A strongly consistent, distributed database for the core link mapping.
   * **Regional Cache:** An in-memory cache in each region for hot links.
   * **Analytics Pipeline:** A message queue for event ingestion, a stream processor for aggregation, and a NoSQL database for storing and serving analytics data.

**Data Flow:**

* **Link Creation:** A user's `POST` request is routed to the nearest Write Service. The service generates or validates an alias and writes the `short_alias -> long_URL` mapping to the Global Database. The write is synchronously replicated across all regions within 2 seconds.
* **Redirection:** A `GET` request is routed to the nearest Redirect Service. It first checks the Regional Cache. On a cache hit, it immediately issues a 301 redirect and asynchronously sends a click event to the analytics queue. On a miss, it queries the Global Database's local replica, populates the cache, then performs the redirect and event logging.
* **Analytics:** Click events from all regions are published to a central message queue.
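The fire-and-forget click logging on the redirect path can be sketched as follows. This is a minimal in-process sketch: a bounded `queue.Queue` stands in for the Kafka/Kinesis producer, and `publish_click_event` / `handle_redirect` are illustrative names, not part of the design itself.

```python
import json
import queue
import time

# Stand-in for the analytics message queue (Kafka / Kinesis in the real design).
click_events: "queue.Queue[str]" = queue.Queue(maxsize=10_000)

def publish_click_event(short_alias: str, country: str) -> None:
    """Best-effort, non-blocking publish of a click event.

    A failed publish must never fail the redirect (the design's
    graceful-degradation requirement), so errors are swallowed.
    """
    event = {"alias": short_alias, "country": country, "ts": time.time()}
    try:
        click_events.put_nowait(json.dumps(event))
    except queue.Full:
        pass  # drop the event rather than delay the user

def handle_redirect(short_alias: str, long_url: str, country: str) -> tuple[int, str]:
    """Publish the click event without blocking, then return the redirect."""
    publish_click_event(short_alias, country)
    return 301, long_url  # HTTP status and Location header value

status, location = handle_redirect("abc1234", "https://example.com/page", "US")
```

The key design property shown here is that event publication sits outside the request's success path: the redirect is returned whether or not the event lands.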
A stream processor consumes these events, aggregates them into time-windowed statistics (total clicks, 24h clicks, country counts), and writes the results to the Analytics Database.

### 2. Storage Choices

* **Link Mappings (Primary Storage):**
  * **Technology:** A globally distributed SQL database like Google Cloud Spanner or CockroachDB.
  * **Justification:** This choice is driven by the need for strong consistency on redirects and the <2 second global write propagation requirement. These databases provide synchronous replication and low-latency regional reads, justifying their higher cost for the core business function. The `short_alias` serves as the primary key for fast lookups.
* **Cached Hot Links:**
  * **Technology:** A regional in-memory distributed cache like Redis Cluster.
  * **Justification:** With 8 billion daily reads and heavy traffic skew, a cache is essential for meeting the <80ms P95 latency target and protecting the database. Each region maintains its own cache using a cache-aside pattern.
* **Analytics Data:**
  * **Ingestion Queue:** Apache Kafka or AWS Kinesis. This decouples the high-throughput redirect path from analytics processing and provides a durable buffer for click events.
  * **Serving Database:** A wide-column NoSQL store like Apache Cassandra or ScyllaDB. It is optimized for the high-write, time-series aggregation workload of analytics and is more cost-effective at scale for these query patterns than a relational database.

### 3. ID Generation and Alias Strategy

We will use a 7-character alias from the alphabet `[a-zA-Z0-9]`, providing `62^7` (over 3.5 trillion) unique combinations.

* **ID Generation:** A background service pre-generates a large pool of unique random IDs and stores them in a dedicated queue (e.g., a Redis list). The Write Service simply fetches a pre-generated ID from this pool when a new link is created. This approach is fast, scalable, and avoids on-the-fly collision checks during a user request.
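The pre-generation scheme just described can be sketched as follows. This is a single-process sketch under simplifying assumptions: a `deque` stands in for the dedicated Redis list, a local `set` stands in for the database's uniqueness guarantee, and multi-region pool coordination is out of scope.

```python
import secrets
from collections import deque

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
ALIAS_LEN = 7  # 62**7 ≈ 3.5 trillion possible aliases

id_pool: deque = deque()  # stands in for the dedicated Redis list
issued: set = set()       # stands in for a global uniqueness check

def refill_pool(target_size: int = 1000) -> None:
    """Background job: top up the pool with fresh, never-issued random aliases."""
    while len(id_pool) < target_size:
        alias = "".join(secrets.choice(ALPHABET) for _ in range(ALIAS_LEN))
        if alias not in issued:  # de-duplicate before the alias is handed out
            issued.add(alias)
            id_pool.append(alias)

def next_alias() -> str:
    """Write path: constant-time pop of a pre-generated ID, no collision checks."""
    if not id_pool:
        refill_pool()
    return id_pool.popleft()
```

In production the refill would run out of band, so the write path reduces to a single `RPOP`-style fetch; the synchronous fallback here only covers the empty-pool edge case.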
* **Custom Alias Handling:** When a user submits a custom alias, the Write Service attempts to `INSERT` it into the Global Database. The `short_alias` column is the primary key with a unique constraint. This atomically handles the check-and-set operation; if the alias is taken, the database rejects the write, and the service returns a `409 Conflict` error to the user.

### 4. API Design

* **Create Link:** `POST /api/v1/links`
  * **Body:** `{ "long_url": "...", "custom_alias": "..." (optional), "expires_at": "..." (optional) }`
  * **Response:** `201 Created` with `{ "short_url": "...", "long_url": "..." }`
* **Redirect:** `GET /{short_alias}`
  * **Response:** `301 Moved Permanently` with `Location: {long_url}` header for permanent links. `404 Not Found` if the alias doesn't exist or is expired.
* **Get Analytics:** `GET /api/v1/links/{short_alias}/stats`
  * **Response:** `200 OK` with `{ "total_clicks": ..., "clicks_last_24h": ..., "top_countries": [...] }`

### 5. Scaling Approach

* **Hot Keys:** A multi-layer caching strategy is employed. For extremely viral links, the `301` redirect response can be cached at the CDN layer. The primary defense is the regional Redis cache, which will absorb the majority of the skewed traffic.
* **Partitioning:** The Global Database (Spanner/CockroachDB) automatically handles partitioning of link data by its primary key (`short_alias`). The Analytics Database (Cassandra) will be partitioned by `short_alias` to co-locate all stats for a given link, enabling efficient queries.
* **Multi-Region Traffic:** The GLB ensures users are routed to their nearest PoP for low latency. The architecture is active-active, as writes can be handled by any region and are globally replicated by the database. This design maximizes resource utilization and availability.

### 6. Reliability Strategy

* **Failover:** The GLB performs health checks and will automatically route traffic away from any unhealthy region.
Within a region, load balancers will route around failed service instances.

* **Data Replication:** The Global Database is configured to replicate data across multiple availability zones and regions, providing resilience against both instance and regional failures. The Cassandra cluster for analytics will also be deployed across multiple regions.
* **Backup:** Automated, periodic snapshots of both the primary and analytics databases are taken and stored in durable object storage (e.g., AWS S3) for disaster recovery.
* **Graceful Degradation:** The system is designed so that a failure in the non-critical analytics pipeline will not impact the core redirect functionality. The Redirect Service will continue to serve requests even if it cannot publish events to the message queue.

### 7. Trade-offs and Alternatives Rejected

* **Alternative 1: Hash-based ID Generation (Rejected).**
  * **Concept:** Generate a short ID by hashing the long URL. If a collision occurs, add a salt and re-hash.
  * **Reason for Rejection:** This adds unpredictable latency and complexity to the write path. At 120 million writes per day, collision probability becomes significant, requiring multiple database lookups to find a unique ID. The chosen pre-generation strategy provides constant-time performance for ID acquisition during a user request.
* **Alternative 2: Single-Master Relational Database (Rejected).**
  * **Concept:** Use a traditional database like PostgreSQL in a single primary region with read replicas in other regions.
  * **Reason for Rejection:** This design cannot meet the strict `<2 second` global write propagation requirement. Replication lag to distant read replicas would frequently exceed this threshold, meaning a newly created link would not be immediately available to all global users. The higher cost of a globally distributed database is justified to meet this core functional requirement.
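The collision risk cited in rejecting Alternative 1 can be quantified with the standard birthday-bound estimate, P ≈ 1 − exp(−n(n−1)/2N). This back-of-the-envelope sketch assumes hash outputs behave like uniformly random 7-character aliases:

```python
import math

N = 62 ** 7       # keyspace: 3,521,614,606,208 possible 7-char aliases
n = 120_000_000   # one day of link creations

# Birthday approximation: P(at least one collision) ≈ 1 - exp(-n(n-1) / 2N).
expected_pairs = n * (n - 1) / (2 * N)  # expected number of colliding pairs
p_collision = 1 - math.exp(-expected_pairs)

print(f"expected colliding pairs per day ≈ {expected_pairs:,.0f}")
print(f"P(at least one collision) ≈ {p_collision:.4f}")
```

On the order of two thousand colliding pairs would be expected every single day, so the salt-and-re-hash path would be exercised continuously rather than rarely, which is consistent with the rejection rationale above.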
Result
Winning Votes: 0 / 3
**Overall Comments:** Presents a coherent multi-region architecture with a clear separation between redirect and analytics paths, good core storage choices (global strongly consistent DB for mappings, regional Redis cache, async analytics pipeline), and includes required APIs and two rejected alternatives. However, several areas are under-specified or somewhat hand-wavy for this scale: ID pre-generation pool details and safety (de-duplication, multi-region coordination, refill behavior) are not addressed; analytics design is generic and doesn't clearly explain how to compute last-24h and top countries efficiently; expiration semantics and cache/CDN interaction aren't deeply covered; multi-region replication and consistency/budget justification is relatively shallow beyond "use Spanner/Cockroach." Reliability/degradation is reasonable but misses concrete RTO/RPO and operational details for the global DB and analytics backpressure/data loss policy.
**Architecture Quality (Weight 30%):** Sound regional PoP model with separate redirect/write/analytics services and async eventing, but some key components are described at a generic level ("Global Database" with synchronous replication everywhere) without detailing practical global write/read behavior and edge caching/TTL interaction.
**Completeness (Weight 20%):** Covers all requested sections, but analytics queries/aggregation approach is not fully specified (especially last-24h/top-5 countries), expiration handling is thin, and the ID pre-generation mechanism lacks operational details (coordination, collisions, refill, multi-region).
**Trade-off Reasoning (Weight 20%):** Includes two rejected alternatives with reasonable rationale, but budget/consistency trade-offs are mostly asserted (pay for a global strong DB) with limited discussion of what can be eventual or how to minimize cost beyond caching.
**Scalability & Reliability (Weight 20%):** Regional caching and async analytics support read-heavy skew; reliability includes failover and degradation, but lacks deeper handling of hot-key saturation beyond CDN/Redis and gives limited detail on multi-region failure behavior and replication modes/cost.
**Clarity (Weight 10%):** Organized and readable with clear headings and flows; some statements are oversimplified (synchronous replication within 2 seconds globally), which reduces precision.
**Overall Comments:** Answer A provides a solid, well-structured system design that addresses all key requirements. It clearly separates the redirect and analytics paths, makes appropriate storage choices, and outlines a reasonable ID generation and scaling strategy. The trade-offs discussed are relevant and well-justified. However, it lacks some of the depth and detail found in Answer B, particularly in areas like multi-layered caching for hot keys, more robust ID generation, and a comprehensive budget justification.
**Architecture Quality (Weight 30%):** The architecture is well-defined with clear separation of concerns (Redirect, Write, Analytics services) and appropriate use of regional PoPs and a global database. The data flow is logical and addresses the core requirements.
**Completeness (Weight 20%):** Answer A covers all the requested sections, providing a complete high-level design. However, some sections, like ID generation and API design, could benefit from more detail and additional considerations.
**Trade-off Reasoning (Weight 20%):** The answer identifies two relevant alternatives and provides clear justifications for their rejection, primarily focusing on latency and consistency requirements. The reasoning is sound but limited in scope.
**Scalability & Reliability (Weight 20%):** The answer outlines a good strategy for scaling hot keys (CDN, regional cache) and multi-region traffic. Reliability aspects like failover, data replication, and graceful degradation are mentioned, providing a solid foundation.
**Clarity (Weight 10%):** The answer is well-organized with clear headings and bullet points, making it easy to read and understand. The language is precise and avoids ambiguity.
**Overall Comments:** Answer A presents a coherent and well-structured system design that covers all major components. It correctly separates the redirect path from the analytics pipeline, chooses appropriate technologies (Spanner/CockroachDB, Redis, Kafka, Cassandra), and provides a clean API design. The ID generation strategy using pre-generated pools is reasonable. However, the answer lacks depth in several areas: it does not discuss the hot key problem beyond mentioning CDN and Redis caching, provides no capacity estimates, does not address the 301 vs 302 trade-off (defaulting to 301, which would break analytics and link expiration), lacks detail on cache invalidation strategies, does not mention in-process caching, provides only two rejected alternatives, and the reliability section is relatively thin without specific failure scenarios or RTO/RPO details. The budget justification is absent. The design is correct but reads more like a summary than a detailed implementation plan.
**Architecture Quality (Weight 30%):** Answer A presents a clean, coherent architecture with proper separation of concerns between redirect, write, and analytics paths. The choice of Spanner/CockroachDB for the link store and Kafka for analytics ingestion is sound. However, it lacks a multi-tier caching strategy (no in-process cache), and the use of 301 redirects by default is a significant design flaw that would break analytics tracking and link expiration. The architecture is correct at a high level but misses important nuances.
**Completeness (Weight 20%):** Answer A covers all seven required sections but with limited depth. It lacks capacity estimates, does not discuss rate limiting, omits the 301 vs 302 consideration, provides no schema details, has no budget justification, and only presents two rejected alternatives. The API design is minimal, without error codes for rate limiting or authorization. Cache invalidation strategy is not discussed. The reliability section lacks specific timing for failover or degradation details per component.
**Trade-off Reasoning (Weight 20%):** Answer A presents only two rejected alternatives: hash-based ID generation and single-master relational database. Both are reasonable, but the reasoning is somewhat surface-level. The answer does not discuss the critical 301 vs 302 trade-off, does not compare analytics storage options, and does not address the trade-off between synchronous and asynchronous analytics. The budget justification for where to spend on consistency vs where to save is entirely absent despite being explicitly required by the constraints.
**Scalability & Reliability (Weight 20%):** Answer A mentions CDN caching and regional Redis for hot keys but lacks specificity. There are no capacity estimates, no discussion of auto-scaling, and the hot key mitigation is limited to two layers. The reliability section covers failover, replication, backup, and graceful degradation at a high level but without specific timing, RTO/RPO targets, or per-component failure analysis. The 99.99% availability target is not explicitly addressed in terms of how the architecture achieves it.
**Clarity (Weight 10%):** Answer A is well-organized with clear section headers and concise writing. The data flow descriptions are easy to follow. The use of bold text and bullet points aids readability. However, the brevity sometimes comes at the cost of important details. The writing is clean and professional.