Answer A: Anthropic Claude Opus 4.8
# Real-Time Collaborative Whiteboard: System Design

## 1. High-Level Architecture

Components and request/data flow:

```
                    ┌──────────────┐  Web/Mobile/Desktop
                    │   Clients    │  (Canvas rendering, local CRDT replica,
     Clients ─────► │ (100/board)  │   WebSocket client, offline buffer)
                    └──────┬───────┘
                           │ HTTPS (REST) + WSS (WebSocket)
                    ┌──────▼───────┐
                    │     CDN      │  (static assets, exported images)
                    └──────┬───────┘
                ┌──────────▼───────────┐
                │ Global Load Balancer │  (L7, TLS termination,
                │    + API Gateway     │   auth, rate limiting)
                └───┬──────────────┬───┘
       REST traffic │              │ WS upgrade (sticky by sessionId)
       ┌────────────▼─────┐  ┌─────▼──────────────┐
       │ Stateless App    │  │ Realtime Collab    │
       │ Services         │  │ Servers (WS)       │
       │ (auth, board     │  │ - hold in-memory   │
       │  CRUD, sharing,  │  │   board state      │
       │  exports)        │  │ - merge ops (CRDT) │
       └─┬────────────┬───┘  │ - broadcast deltas │
         │            │      └─────┬───────────┬──┘
    ┌────▼─────┐ ┌────▼─────┐ ┌────▼────┐ ┌────▼───────┐
    │ Metadata │ │ Object/  │ │ Redis   │ │ Session    │
    │ DB       │ │ Blob     │ │ Pub/Sub │ │ Routing    │
    │(Postgres)│ │ Store(S3)│ │+Presence│ │(Consistent │
    └──────────┘ └──────────┘ └─────────┘ │ hashing)   │
                             │            └────────────┘
          ┌──────────────────▼────┐   ┌─────────────────────────────┐
          │ Document/Op Store     │   │ Async Workers (Kafka queue) │
          │ (DynamoDB/Cassandra:  │◄──│ - snapshotting              │
          │  ops log + snapshots) │   │ - thumbnail/export gen      │
          └───────────────────────┘   │ - analytics                 │
                                      └─────────────────────────────┘
```

**Interaction summary:** Clients authenticate via the API Gateway (REST), then open a persistent WebSocket to a Realtime Collab server. The gateway uses consistent hashing on `sessionId` so that all participants of one board land on the same server (or a small replica set), keeping the authoritative live state in one place. App Services handle non-real-time CRUD (creating boards, sharing, listing, exports). Redis Pub/Sub bridges Realtime servers so that if participants are split across instances, ops still propagate.
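
The sticky routing described in the interaction summary could be sketched roughly as follows. This is a minimal illustration of consistent hashing on `sessionId`, not the gateway's actual implementation: the `HashRing` class, the `vnodes` parameter, and the server names `rt-1` to `rt-3` are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit integer via SHA-256 (stable across processes)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring mapping session IDs to realtime servers."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes  # virtual nodes per server, to even out the load
        self._ring = []        # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `vnodes` points for this server on the ring."""
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        """Drop all of a server's points; only its sessions get remapped."""
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def route(self, session_id):
        """Return the server owning the first ring point at or after the key."""
        if not self._ring:
            raise LookupError("no realtime servers registered")
        idx = bisect.bisect_left(self._ring, (_hash(session_id), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


ring = HashRing(["rt-1", "rt-2", "rt-3"])
owner = ring.route("session-42")
# Every participant of the same session resolves to the same server.
assert ring.route("session-42") == owner
```

Under this scheme, removing a server only remaps the sessions that server owned, which is the property that makes it attractive for a stateful WebSocket tier.
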
Async workers periodically persist snapshots and the op log to durable storage.

## 2. Real-Time Communication

- **Protocol:** WebSocket (WSS) for full-duplex, low-latency bidirectional messaging. Falls back to HTTP long-polling via a library like Socket.IO for restrictive networks. WebRTC data channels are considered for cursor/presence peer-to-peer, but a server-relayed model is chosen for simplicity and reliability.
- **Message model:** Clients send small **operations/deltas** (e.g., `{type:'stroke_add', objId, points, color}`, `{type:'obj_move', objId, dx, dy}`) rather than full board state. The server validates, assigns a sequence/version, merges, and broadcasts the delta to all other session members.
- **Fan-out:** Each Realtime server keeps the connection set per board in memory and broadcasts deltas directly. For boards whose members span multiple servers, the originating server publishes the op to a Redis Pub/Sub channel keyed by `sessionId`; subscribed servers re-broadcast to their local connections.
- **Presence & cursors:** High-frequency, low-value data (live cursor positions, selections) is throttled (~30–60ms) and sent best-effort, never persisted.
- **Latency target (<200ms):** Achieved via regional Realtime clusters, sticky routing (no cross-region hops), tiny binary/compact JSON payloads, and optimistic local rendering (client applies its own op immediately, then reconciles).

## 3. Data Model

**Board metadata (Postgres: relational, transactional):**

- `boards(board_id, owner_id, title, created_at, updated_at, latest_snapshot_id)`
- `users(user_id, name, email, ...)`
- `board_permissions(board_id, user_id, role[owner|editor|viewer])`
- `sessions(session_id, board_id, started_at, active_user_count)`

**Board content (DynamoDB/Cassandra: high write throughput, append-friendly):**

- **Op log:** partition key `board_id`, sort key `version` (monotonic). Each row is one operation `{op_type, object_id, payload, user_id, timestamp}`.
- **Snapshots:** periodic materialized full-state blobs `{board_id, snapshot_version, state_json/binary}` stored in object storage (S3) with a pointer row. Loading a board = latest snapshot + replay of ops since that snapshot version.

**Object structure within a board:**

```
WhiteboardObject {
  id,
  type: "stroke" | "text" | "sticky",
  layer/zIndex,
  geometry: { x, y, width, height, rotation },
  props: {                       // type-specific
    stroke: { points:[...], color, thickness },
    text:   { content, font, color },
    sticky: { content, bgColor }
  },
  createdBy, lastModified, version
}
```

**Conflict resolution:** Use a **CRDT** (e.g., a list/map CRDT like those in Yjs/Automerge) or OT for the object set, so concurrent edits (two users moving/editing different or same objects) converge deterministically without a central lock. Each object carries a logical clock for last-writer-wins on conflicting property updates.

**Large/binary assets** (uploaded images, exported PNG/PDF) live in S3-style blob storage, referenced by URL in the object.

## 4. Scalability and Reliability Strategy

**Scaling to 10k sessions / 1M users:**

- **Stateless app tier:** Horizontally autoscaled behind the load balancer; trivial to add nodes.
- **Realtime tier:** Sharded by `sessionId` via consistent hashing. With 100 users max per board, 10k sessions = up to ~1M concurrent WS connections. A single tuned node handles ~10–50k connections; fan-out is sized so each board's members concentrate on one node. Autoscale on connection count and CPU.
- **Redis:** Clustered, used for Pub/Sub fan-out and presence; partitioned by session.
- **Databases:** Postgres with read replicas for metadata; DynamoDB/Cassandra scale horizontally for the high-volume op log. Object storage scales effectively infinitely.
- **Snapshotting:** Async workers compact the op log into snapshots on a cadence (e.g., every N ops or T seconds) to bound replay cost and storage.

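
The realtime-tier sizing in the list above can be checked with a few lines of arithmetic. The sketch below only restates the figures given there (10k sessions, 100 users per board, roughly 10–50k connections per node); the function names are illustrative, and real capacity planning would add headroom for failover and uneven load.

```python
SESSIONS = 10_000            # concurrent sessions, from the stated target
MAX_USERS_PER_BOARD = 100    # per-board participant cap


def peak_connections(sessions=SESSIONS, users_per_board=MAX_USERS_PER_BOARD):
    """Worst case: every session is full, one WebSocket per user."""
    return sessions * users_per_board


def nodes_needed(connections, per_node_capacity):
    """Ceiling division: minimum realtime nodes at a given per-node capacity."""
    return -(-connections // per_node_capacity)


conns = peak_connections()
print(conns)                          # 1000000
print(nodes_needed(conns, 50_000))    # 20 nodes at the optimistic end
print(nodes_needed(conns, 10_000))    # 100 nodes at the conservative end
```

So the stated per-node range implies a realtime fleet of roughly 20 to 100 nodes at full load, before any redundancy.
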

**High availability (99.9%):**

- Multi-AZ deployment for every tier; multi-region for disaster recovery.
- Realtime servers run with a warm standby/replica per board; on node failure, clients auto-reconnect and the new node rehydrates state from the latest snapshot + op log (last few seconds of in-memory ops may be re-sent by clients from their local buffer).
- Health checks + automated failover at the load balancer.
- Backpressure and rate limiting to protect against op storms.
- Durable op log gives a recovery point: no committed work is lost even on server crash.

## 5. Trade-offs

**Chosen trade-off: Latency/availability over strong consistency (AP over CP), using CRDTs.** I prioritize sub-200ms responsiveness and offline tolerance by letting clients apply edits optimistically and resolving conflicts with CRDTs/last-writer-wins, accepting **eventual consistency**. The benefit: the UI feels instant, users can edit during brief network partitions, and the system stays available during failovers. The cost: in rare concurrent-edit cases the converged result may not match any single user's intuitive expectation (e.g., simultaneous edits to the same text), and CRDT metadata adds memory/storage overhead. For a whiteboard this is the right call: creative collaboration values fluidity and availability far more than the strict serialized consistency a transactional/locking (CP) approach would impose, which would add round-trip latency and reduce availability during partitions.

**Secondary trade-off:** keeping authoritative live state in-memory on a single Realtime node per board (fast broadcasts, simple merge) versus a fully stateless tier (more resilient but higher latency). Mitigated with snapshots + op log so any node can rebuild state on failover.
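
As a rough illustration of the "snapshot + op log" rebuild mentioned in the secondary trade-off, the sketch below replays ops newer than the snapshot version onto a copy of the snapshot state. The record shapes follow the data model above (`snapshot_version`, `version`, `op_type`, `object_id`, `payload`), but the `apply_op` and `rehydrate` helpers, the in-memory dict layout, and the handling of only two op types are assumptions made for the example.

```python
from copy import deepcopy


def apply_op(state, op):
    """Apply one logged operation to the in-memory board state (a dict by object id)."""
    kind, obj_id, payload = op["op_type"], op["object_id"], op["payload"]
    if kind == "stroke_add":
        state[obj_id] = {"type": "stroke", "x": 0, "y": 0, **payload}
    elif kind == "obj_move":
        obj = state.get(obj_id)
        if obj is not None:  # ignore moves that target an unknown object
            obj["x"] += payload["dx"]
            obj["y"] += payload["dy"]
    else:
        raise ValueError(f"unsupported op_type: {kind}")


def rehydrate(snapshot, op_log):
    """Rebuild live state: start from the snapshot, replay only newer ops in order."""
    state = deepcopy(snapshot["state"])  # never mutate the stored snapshot
    version = snapshot["snapshot_version"]
    for op in sorted(op_log, key=lambda o: o["version"]):
        if op["version"] <= version:
            continue  # already folded into the snapshot
        apply_op(state, op)
        version = op["version"]
    return state, version


snapshot = {
    "snapshot_version": 2,
    "state": {"s1": {"type": "stroke", "x": 0, "y": 0, "points": [[0, 0], [1, 1]], "color": "red"}},
}
op_log = [
    {"version": 2, "op_type": "obj_move", "object_id": "s1", "payload": {"dx": 100, "dy": 100}},
    {"version": 3, "op_type": "obj_move", "object_id": "s1", "payload": {"dx": 5, "dy": -2}},
    {"version": 4, "op_type": "stroke_add", "object_id": "s2", "payload": {"points": [[2, 2]], "color": "blue"}},
]
state, version = rehydrate(snapshot, op_log)
print(version, state["s1"]["x"], state["s1"]["y"], sorted(state))  # 4 5 -2 ['s1', 's2']
```

Skipping ops at or below the snapshot version is what makes replay safe to repeat, which matters if a replacement node restarts partway through recovery.
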
Result
Winning Votes
3 / 3
Average Score
Total Score
Overall Comments
Answer A provides a highly detailed and well-structured system design. Its text-based architecture diagram is exceptionally clear and comprehensive, immediately conveying the system's complexity and component interactions. The discussion of real-time communication, data modeling, and scalability is robust, with specific technology choices and thoughtful explanations. The trade-off analysis is particularly strong, demonstrating a deep understanding of the implications of design decisions, especially regarding consistency and latency with CRDTs.
Architecture Quality
Weight 30%: The text-based architecture diagram is exceptionally clear, comprehensive, and immediately conveys the system's structure and component interactions. Specific technologies are well-integrated and their roles are clearly defined.
Completeness
Weight 20%: Answer A covers all core requirements and constraints comprehensively, providing detailed explanations for each section of the prompt. The data model section is particularly well-structured.
Trade-off Reasoning
Weight 20%: Answer A provides an excellent and well-justified major trade-off (Latency/availability over strong consistency using CRDTs), explicitly framing it as AP over CP. The inclusion of a secondary trade-off further demonstrates a deep understanding of design implications.
Scalability & Reliability
Weight 20%: Answer A presents a robust strategy for both scalability and reliability, detailing horizontal scaling, sharding, multi-AZ/multi-region deployments, warm standbys, and durable op logs. It's very comprehensive.
Clarity
Weight 10%: The answer is exceptionally clear, well-structured with distinct headings, and easy to follow. The text diagram significantly enhances the clarity of the architecture.
Total Score
Overall Comments
Answer A is a highly detailed, well-structured system design that covers all required sections with depth and precision. It includes an ASCII architecture diagram, clearly explains component interactions, justifies technology choices (WebSockets, CRDTs, DynamoDB/Cassandra), provides a concrete data model with schema examples, and discusses both primary and secondary trade-offs. The CRDT discussion is particularly strong, showing deep understanding of distributed systems. The latency strategy is concrete and multi-layered. Minor weakness: the diagram is somewhat complex and could be clearer, but overall this is a strong, benchmark-quality response.
Architecture Quality
Weight 30%: A provides a detailed ASCII diagram with explicit component roles, consistent hashing for session routing, Redis Pub/Sub for cross-node fan-out, and clear separation of stateless app tier from stateful real-time tier. Component interactions are well-explained with specific technology choices justified. Minor complexity in the diagram but overall excellent.
Completeness
Weight 20%: A covers all five required sections thoroughly: architecture with diagram, real-time communication with protocol justification and fallback, data model with schema details and CRDT mention, scalability with concrete numbers, and two trade-offs. Large/binary asset handling is also addressed. Very complete.
Trade-off Reasoning
Weight 20%: A's trade-off discussion is insightful and specific: AP vs CP framing, CRDT metadata overhead, the implication for user experience, and a secondary trade-off about stateful vs stateless real-time tier. Demonstrates genuine understanding of distributed systems implications.
Scalability & Reliability
Weight 20%: A provides concrete scaling math (10k sessions, 1M WS connections, 10–50k connections per node), multi-AZ + multi-region strategy, snapshotting cadence details, backpressure mechanisms, and a clear failover rehydration path. Very thorough.
Clarity
Weight 10%: A is well-organized with clear section headers, a detailed diagram, and code-style schema examples. The ASCII diagram is somewhat dense but readable. The writing is precise and technical without being verbose.
Total Score
Overall Comments
Answer A provides a highly coherent and practical architecture with clear separation between REST services, real-time WebSocket collaboration servers, persistence, metadata storage, pub/sub, and async workers. It gives a strong data model, explicitly addresses worst-case connection scale, explains snapshot plus operation-log persistence, and offers a thoughtful consistency-versus-latency trade-off. Its main weakness is some ambiguity around exactly when operations become durably committed versus asynchronously persisted, but overall it is very complete and implementation-oriented.
Architecture Quality
Weight 30%: The architecture is well-structured and practical, with clients, CDN, global load balancer/API gateway, stateless app services, stateful real-time servers, Redis pub/sub, metadata DB, object storage, op store, and async workers. The interaction flow is clear and maps well to the whiteboard requirements. Minor ambiguity remains around whether the realtime path synchronously appends to a durable log before acknowledgement.
Completeness
Weight 20%: It covers all requested areas: high-level architecture, WebSocket real-time communication, data model for boards and objects, persistence through snapshots and op logs, scalability, reliability, and trade-offs. It also includes presence, cursors, assets, permissions, and conflict resolution, making it very complete.
Trade-off Reasoning
Weight 20%: The trade-off discussion is strong, focusing on latency and availability over strict consistency, with CRDTs and optimistic rendering. It clearly explains benefits and costs, including user-visible conflict outcomes and metadata overhead. The secondary trade-off around in-memory board ownership is also useful.
Scalability & Reliability
Weight 20%: It directly addresses scaling to 10,000 sessions and up to 1,000,000 concurrent WebSocket connections, using horizontal scaling, session sharding, clustered Redis, scalable op storage, snapshots, multi-AZ deployment, failover, backpressure, and client reconnect. The main gap is that the durability path for operations could be specified more rigorously to avoid loss during realtime server crashes.
Clarity
Weight 10%: The answer is very clear, with a readable diagram, well-labeled sections, concrete examples, and concise explanations of each subsystem. The terminology is mostly consistent, though the mix of CRDT, server sequencing, and last-writer-wins could be clarified further.