Implement a Versioned Key-Value Store with Historical Queries

Compare model answers for this Coding benchmark and review scores, judging comments, and related examples.

X f L

Contents

Task Overview

Benchmark Genres

Coding

Task Creator Model The task creator is randomly selected from top task-generation models of supported providers.

OpenAI GPT-5.4

Answering Models In this benchmark, models from the same provider as the task creator are excluded from answering.

Answer A Google Gemini 2.5 Pro

Answer B Anthropic Claude Sonnet 4.6

Judge Models Judging uses exactly 3 judge models, excluding the answering models. At least 1 judge is selected from flagship models, lightweight models are not selected as judges, and the 3 judges come from 3 distinct providers.

OpenAI GPT-5.4 Anthropic Claude Opus 4.6 Google Gemini 2.5 Flash

Task Prompt

Show more ▼

Write code that implements an in-memory versioned key-value store supporting historical reads. The store begins empty and processes a sequence of commands. Each successful mutating command creates exactly one new global version number, starting from 1. Read-only commands must not create a version. Keys and values are case-sensitive strings without spaces. Versions are positive integers. Commands: SET key value Create or overwrite key with value. DELETE key Remove key if it exists. GET key Return the current value for key, or NULL if the key does not exist. GET_VERSION key version Return the value associated with key immediately after the specified global version was created, or NULL if the key did not exist at that version. If version is greater than the latest existing version, treat it as invalid and return INVALID_VERSION. HISTORY key Return all historical states for the key in increasing version order, including deletions, formatted as version:value pairs separated by commas. Use NULL for deleted or absent-after-mutation states. If the key has never been affected by any mutating command, return EMPTY. Input format: The first line contains an integer N, the number of commands. The next N lines each contain one command. Output format: For every GET, GET_VERSION, and HISTORY command, print one line with the result. Behavior details and edge cases: - Every SET always creates a new version, even if the value is unchanged. - Every DELETE always creates a new version, even if the key does not exist. - Versions are global across all keys, not per key. - HISTORY for a key should include only versions where that key was directly affected by SET or DELETE. - If a key was deleted and later set again, both events must appear in HISTORY. - Efficiency matters: assume up to 200000 commands, with many historical queries. Your solution should read from standard input and write to standard output. Include the full working program in one file. You may use any mainstream programming language, but the code should be complete and executable as written.

Task Context

This task is designed to test careful handling of version semantics, edge cases, and scalable data structure choices rather than only basic parsing.

Judging Policy

A good answer provides a complete executable program that correctly follows the global version rules, distinguishes read-only from mutating commands, handles invalid versions and absent keys exactly as specified, and scales to large inputs without naive full-state copying per version. Strong answers also keep the implementation clear and robust, with correct parsing and efficient historical lookup.

Answers

Answer A: Google Gemini 2.5 Pro

import sys
import bisect

def solve():
"""
Implements a versioned key-value store that reads commands from stdin
and writes results to stdout.
"""
try:
n_str = sys.stdin.readline()
if not n_str:
return
n = int(n_str.strip())
except (ValueError, IndexError):
return

# The main data structure.
# A dictionary mapping each key to a list of its historical states.
# Each state is a tuple: (version, value).
# The list for each key is kept sorted by version.
# A value of None represents a deletion.
store = {}
global_version =...

Show Full Answer ▼

import sys
import bisect

# The main data structure.
# A dictionary mapping each key to a list of its historical states.
# Each state is a tuple: (version, value).
# The list for each key is kept sorted by version.
# A value of None represents a deletion.
store = {}
global_version = 0

for _ in range(n):
    line = sys.stdin.readline()
    if not line:
        continue
    
    parts = line.strip().split()
    command = parts[0]

    if command == "SET":
        # Creates or overwrites a key with a value, creating a new version.
        key, value = parts[1], parts[2]
        global_version += 1
        if key not in store:
            store[key] = []
        store[key].append((global_version, value))

    elif command == "DELETE":
        # Removes a key, creating a new version.
        # This applies even if the key doesn't exist.
        key = parts[1]
        global_version += 1
        if key not in store:
            store[key] = []
        store[key].append((global_version, None))

    elif command == "GET":
        # Returns the current value of a key.
        key = parts[1]
        if key not in store or not store[key]:
            print("NULL")
        else:
            # The last entry in the history list is the current state.
            _version, value = store[key][-1]
            print(value if value is not None else "NULL")

    elif command == "GET_VERSION":
        # Returns the value of a key at a specific historical version.
        key, version_str = parts[1], parts[2]
        version = int(version_str)

        if version <= 0 or version > global_version:
            print("INVALID_VERSION")
        elif key not in store or not store[key]:
            print("NULL")
        else:
            history = store[key]
            # Create a list of just the version numbers for efficient searching.
            versions = [v for v, _val in history]
            
            # Find the insertion point for the requested version.
            # bisect_right finds the index where the version would be inserted
            # to maintain order. This means all elements to the left of this
            # index have versions less than or equal to the requested version.
            idx = bisect.bisect_right(versions, version)
            
            if idx == 0:
                # No historical version is <= the requested version.
                print("NULL")
            else:
                # The state we want is at the previous index (idx - 1).
                _v, value = history[idx - 1]
                print(value if value is not None else "NULL")

    elif command == "HISTORY":
        # Returns all historical states for a key.
        key = parts[1]
        if key not in store or not store[key]:
            print("EMPTY")
        else:
            history = store[key]
            result_parts = []
            for v, val in history:
                str_val = val if val is not None else "NULL"
                result_parts.append(f"{v}:{str_val}")
            print(",".join(result_parts))

if name == "main":
solve()

Result

Winning Votes

1 / 3

Average Score

Judge Models Anthropic Claude Opus 4.6

Total Score

Overall Comments

Answer A provides a correct and complete implementation of the versioned key-value store. It handles all commands properly, uses bisect for efficient historical lookups, and includes clear comments and docstrings. However, it has a performance concern: in GET_VERSION, it creates a new list of version numbers on every query call (`versions = [v for v, _val in history]`), which is O(k) per query where k is the number of history entries for that key. For 200,000 commands with many historical queries, this could be a significant performance bottleneck. It also uses readline() in a loop which is slower than bulk reading for large inputs. The code also doesn't handle version <= 0 check correctly for GET_VERSION - the spec says versions are positive integers, and version 0 or negative should arguably be INVALID_VERSION, which Answer A does handle. Overall it's a solid solution with a notable efficiency issue.

View Score Details ▼

Correctness

Weight 35%

Answer A correctly handles all commands: SET, DELETE, GET, GET_VERSION, HISTORY. It properly uses bisect_right for historical lookups, handles NULL for deleted keys, returns EMPTY for keys never affected, and checks for invalid versions including version <= 0. All edge cases mentioned in the spec appear to be handled correctly.

Completeness

Weight 20%

Answer A implements all five commands as specified, handles all edge cases mentioned in the prompt including unchanged SET values creating new versions, DELETE on non-existent keys creating versions, and proper HISTORY formatting. The solution is complete and executable.

Code Quality

Weight 20%

Answer A has good code quality with clear variable names, a docstring, and helpful inline comments explaining the logic. The code is well-structured and easy to follow. The use of bisect is clean, though the temporary list creation in GET_VERSION is an unnecessary pattern.

Practical Value

Weight 15%

Answer A works correctly but has a significant performance issue: GET_VERSION creates a new list of versions on every call, making it O(k) per query even before the binary search. Combined with line-by-line reading and individual print calls, this solution may struggle with the stated 200,000 command limit with many historical queries.

Instruction Following

Weight 10%

Answer A follows all instructions: reads from stdin, writes to stdout, is a complete executable program in one file, handles all specified commands and edge cases, and uses an efficient data structure approach (per-key history lists with binary search).

Judge Models Google Gemini 2.5 Flash

Total Score

Overall Comments

Answer A provides a robust and well-structured solution. It correctly implements all commands, handles edge cases like non-existent keys and invalid versions (including non-positive versions for GET_VERSION), and uses efficient data structures and algorithms (like `bisect` for historical lookups). The input parsing is standard and robust for line-by-line processing. The code is clear, well-commented, and Pythonic.

View Score Details ▼

Correctness

Weight 35%

The solution correctly implements all commands and handles specified edge cases. It also correctly interprets 'Versions are positive integers' by treating non-positive versions in GET_VERSION as INVALID_VERSION, which is a robust interpretation of the prompt's implicit requirements.

Completeness

Weight 20%

The program is fully complete, executable, and covers all required functionalities and input/output formats. The input reading is robust for line-by-line processing.

Code Quality

Weight 20%

The code is well-structured, uses clear variable names, and includes helpful comments. It leverages the `bisect` module for efficient and Pythonic binary search, contributing to cleaner and more maintainable code.

Practical Value

Weight 15%

The solution uses efficient data structures (dictionary of lists of tuples) and algorithms (binary search) for historical queries, making it scalable. Its robust input handling and comprehensive error checks enhance its practical utility.

Instruction Following

Weight 10%

The solution meticulously follows all explicit instructions, including versioning rules, output formats, and efficiency considerations. It also correctly interprets implicit requirements, such as handling non-positive versions as invalid.

Judge Models OpenAI GPT-5.4

Total Score

Overall Comments

Answer A is a complete executable program and captures the core versioned-history model correctly for SET, DELETE, GET, GET_VERSION, and HISTORY. Its semantics are mostly aligned with the prompt, including global versioning and recording deletions. However, its GET_VERSION implementation rebuilds a list of version numbers on every query, which adds unnecessary overhead and hurts scalability for the stated large-input setting. It also treats non-positive versions as invalid, which is reasonable but not explicitly required beyond versions being positive integers. Overall, it is correct in typical cases but less efficient than expected for heavy historical querying.

View Score Details ▼

Correctness

Weight 35%

The logic for SET, DELETE, GET, GET_VERSION, and HISTORY is largely correct, including global versions and deletion markers. Historical queries return the latest state at or before the requested version, and invalid future versions are handled. The main correctness caveat is not functional in normal cases but tied to edge handling assumptions and query implementation style.

Completeness

Weight 20%

The answer provides a full working one-file program, reads stdin, writes stdout, and covers all required commands and outputs. It also includes the required historical behavior and direct-affect-only history. It is complete for the task, with only minor edge-case ambiguity around invalid non-positive versions.

Code Quality

Weight 20%

The code is readable and reasonably organized, with clear comments and straightforward control flow. However, constructing a fresh versions list inside every GET_VERSION call is an avoidable design weakness, and line-by-line printing is less optimized for heavy output. Quality is solid but not especially strong for a performance-sensitive benchmark.

Practical Value

Weight 15%

The program will work for many ordinary inputs, but repeated GET_VERSION calls on large histories will incur extra overhead from rebuilding version arrays each time. That makes it less suitable for the high-volume workload described in the prompt. It is practical as a baseline but not ideal for the benchmark constraints.

Instruction Following

Weight 10%

The answer follows the instruction to provide complete executable code and adheres closely to the specified command semantics and output format. It respects global versioning and mutation-only version creation. Minor edge handling around invalid versions does not materially detract from compliance.

Answer B: Anthropic Claude Sonnet 4.6

import sys
from bisect import bisect_right

def main():
input_data = sys.stdin.buffer.read().split()
idx = 0
n = int(input_data[idx]); idx += 1

# For each key, store a list of (version, value_or_None) sorted by version
history = {}  # key -> list of (version, value)
current_version = 0
output = []

for _ in range(n):
    cmd = input_data[idx].decode(); idx += 1
    
    if cmd == 'SET':
        key = input_data[idx].decode(); idx += 1
        value = input_data[idx].decode(); idx += 1
        current_version += 1
        if key not in history:...

Show Full Answer ▼

import sys
from bisect import bisect_right

def main():
input_data = sys.stdin.buffer.read().split()
idx = 0
n = int(input_data[idx]); idx += 1

# For each key, store a list of (version, value_or_None) sorted by version
history = {}  # key -> list of (version, value)
current_version = 0
output = []

for _ in range(n):
    cmd = input_data[idx].decode(); idx += 1
    
    if cmd == 'SET':
        key = input_data[idx].decode(); idx += 1
        value = input_data[idx].decode(); idx += 1
        current_version += 1
        if key not in history:
            history[key] = []
        history[key].append((current_version, value))
    
    elif cmd == 'DELETE':
        key = input_data[idx].decode(); idx += 1
        current_version += 1
        if key not in history:
            history[key] = []
        history[key].append((current_version, None))
    
    elif cmd == 'GET':
        key = input_data[idx].decode(); idx += 1
        if key not in history or not history[key]:
            output.append('NULL')
        else:
            val = history[key][-1][1]
            output.append(val if val is not None else 'NULL')
    
    elif cmd == 'GET_VERSION':
        key = input_data[idx].decode(); idx += 1
        version = int(input_data[idx]); idx += 1
        if version > current_version:
            output.append('INVALID_VERSION')
        else:
            if key not in history or not history[key]:
                output.append('NULL')
            else:
                h = history[key]
                # Find the last entry with version <= requested version
                # Binary search on versions
                lo, hi = 0, len(h) - 1
                res_idx = -1
                while lo <= hi:
                    mid = (lo + hi) // 2
                    if h[mid][0] <= version:
                        res_idx = mid
                        lo = mid + 1
                    else:
                        hi = mid - 1
                if res_idx == -1:
                    output.append('NULL')
                else:
                    val = h[res_idx][1]
                    output.append(val if val is not None else 'NULL')
    
    elif cmd == 'HISTORY':
        key = input_data[idx].decode(); idx += 1
        if key not in history or not history[key]:
            output.append('EMPTY')
        else:
            parts = []
            for ver, val in history[key]:
                parts.append(f"{ver}:{val if val is not None else 'NULL'}")
            output.append(','.join(parts))

print('\n'.join(output))

main()

Result

#1 | Winner

Winning Votes

2 / 3

Average Score

Judge Models Anthropic Claude Opus 4.6

Total Score

Overall Comments

Answer B provides a correct and complete implementation with better performance characteristics. It reads all input at once using sys.stdin.buffer.read() which is significantly faster for large inputs. It implements binary search manually instead of using bisect, avoiding the need to create a temporary list of versions on each GET_VERSION call - it searches directly on the history list. It collects all output and prints at once, which is also more efficient. However, it doesn't check for version <= 0 in GET_VERSION (though the spec says versions are positive integers, so this edge case may not appear in practice). The code is slightly less readable due to the manual index tracking for parsing, but this is a common pattern for competitive programming and performance-critical code.

View Score Details ▼

Correctness

Weight 35%

Answer B correctly handles all commands with proper logic. The manual binary search correctly finds the last entry with version <= requested version. It handles NULL for deletions, EMPTY for unaffected keys, and INVALID_VERSION for versions beyond current. Minor concern: it doesn't explicitly check for version <= 0, but this is unlikely to appear in valid test cases given the spec says versions are positive integers.

Completeness

Weight 20%

Answer B implements all five commands as specified and handles the key edge cases. It's a complete, executable program that covers the full specification. The only minor gap is the lack of explicit handling for version <= 0 in GET_VERSION.

Code Quality

Weight 20%

Answer B uses a less readable but common competitive programming style with manual index tracking for parsing. The manual binary search is slightly harder to read than using bisect. However, the code is still reasonably clear and well-organized. Lacks comments and documentation compared to Answer A.

Practical Value

Weight 15%

Answer B is well-optimized for the stated scale: bulk input reading via buffer, direct binary search on the history list without temporary allocations, and batched output. These optimizations make it practical for the 200,000 command constraint and represent good engineering for performance-sensitive scenarios.

Instruction Following

Weight 10%

Answer B follows all instructions: reads from stdin, writes to stdout, is a complete executable program in one file, handles all specified commands and edge cases, and uses efficient data structures. The efficiency note in the prompt is well-addressed by the optimized I/O and search approach.

Judge Models Google Gemini 2.5 Flash

Total Score

Overall Comments

Answer B provides a mostly correct solution with efficient data structures. However, it has a minor correctness issue by not explicitly handling non-positive versions for GET_VERSION as invalid. The input parsing strategy (reading all input at once) is less robust for very large inputs compared to line-by-line processing, and the manual implementation of binary search is less idiomatic than using the `bisect` module. The code also lacks comments, which reduces its readability.

View Score Details ▼

Correctness

Weight 35%

The solution is mostly correct but fails to explicitly handle non-positive versions for GET_VERSION. While the prompt states 'Versions are positive integers', it doesn't explicitly define the behavior for non-positive versions. However, treating them as INVALID_VERSION (as Answer A does) is the more robust and expected behavior for invalid input.

Completeness

Weight 20%

The program is fully complete and executable. However, its input reading strategy (reading all input at once) might be less memory-efficient and robust for extremely large inputs compared to line-by-line processing, although it might be acceptable for the specified command count.

Code Quality

Weight 20%

The code lacks comments, making it harder to understand quickly. It implements a manual binary search instead of using the `bisect` module, which is less idiomatic Python and potentially more error-prone. The input parsing mechanism is also a bit less clean.

Practical Value

Weight 15%

The solution uses efficient data structures and algorithms for scalability. However, the less robust input handling and the missing edge case check for GET_VERSION slightly diminish its practical value compared to Answer A.

Instruction Following

Weight 10%

The solution follows most instructions regarding command implementation and output format. However, it misses the implicit instruction to treat non-positive versions as invalid for GET_VERSION, which is implied by 'Versions are positive integers'.

Judge Models OpenAI GPT-5.4

Total Score

Overall Comments

Answer B is a complete executable program that correctly implements the required global version semantics, historical lookups, deletion tracking, and output formatting. It is notably stronger on scalability by using buffered input, accumulated output, and binary search directly on each key's history without recreating auxiliary arrays per query. The structure is concise and practical for up to large command counts. A small issue is that it only flags versions greater than the latest as invalid and does not explicitly reject non-positive versions, but this is a minor edge-case gap relative to an otherwise strong solution.

View Score Details ▼

Correctness

Weight 35%

The implementation correctly models global version increments on mutations only, tracks per-key history, and returns proper current and historical values including deletions. HISTORY formatting also follows the prompt. The main minor gap is that non-positive versions are not explicitly treated as invalid, though future versions are handled correctly.

Completeness

Weight 20%

The answer is a full executable solution in one file with all required commands implemented and output behavior covered. It handles the major edge cases from the prompt, including deletes of missing keys and future invalid versions. Only the non-positive version edge case is not explicitly addressed.

Code Quality

Weight 20%

The code is concise, organized, and uses efficient buffered input and batched output. Its manual binary search works directly on the stored history and avoids extra per-query allocations. There is a small unused import choice because bisect_right is imported but not used, but overall the implementation quality is stronger and more performance-aware.

Practical Value

Weight 15%

The solution is well suited to the stated scale, using tokenized buffered input, output aggregation, and logarithmic historical lookup without unnecessary copying. These choices make it much more practical for up to hundreds of thousands of commands with many historical queries.

Instruction Following

Weight 10%

The answer follows the requested format and provides a complete executable program that closely matches the specified behavior. It respects global version rules and direct-affect history semantics. The only slight compliance issue is not explicitly marking non-positive versions as invalid.

Comparison Summary

Final rank order is determined by judge-wise rank aggregation (average rank + Borda tie-break). Average score is shown for reference.

Judges: 3

Google Gemini 2.5 Pro

Winning Votes

1 / 3

Average Score

View this answer

Winner Anthropic Claude Sonnet 4.6

Winning Votes

2 / 3

Average Score

View this answer

View head-to-head record for this model pair

Judging Results

Judge Models OpenAI GPT-5.4

Gemini 2.5 Pro 75

Claude Sonnet 4.6 Winner 84

Why This Side Won

Answer B wins because it matches the required behavior while being more scalable and implementation-efficient for the benchmark's large-input constraints. Both answers are broadly correct, but B avoids rebuilding per-query version arrays, uses faster I/O patterns, and is therefore more practical for many historical queries. A is acceptable, yet its GET_VERSION approach is less efficient and weaker for the stated performance expectations.

Judge Models Google Gemini 2.5 Flash

Gemini 2.5 Pro Winner 90

Claude Sonnet 4.6 77

Why This Side Won

Answer A is superior due to its complete correctness, particularly in handling the edge case of non-positive versions for GET_VERSION, which Answer B misses. Answer A also demonstrates better code quality by utilizing Python's standard library `bisect` module for efficient searching and providing clear comments. Its input handling method is also generally more robust for large inputs.

Judge Models Anthropic Claude Opus 4.6

Gemini 2.5 Pro 79

Claude Sonnet 4.6 Winner 81

Why This Side Won

Answer B wins primarily on practical value and efficiency. Both solutions are functionally correct and handle the core requirements properly. However, Answer B is significantly more efficient: (1) it reads all input at once via buffer read, (2) it performs binary search directly on the history list without creating temporary lists, (3) it batches all output into a single print call. These optimizations matter given the constraint of up to 200,000 commands with many historical queries. Answer A's GET_VERSION creates a new list comprehension on every call, which is O(k) overhead per query. While Answer A has slightly better code readability and documentation, the performance difference is meaningful for the stated scale requirements.

Implement a Versioned Key-Value Store with Historical Queries

Task Overview

Task Prompt

Answers

Answer A: Google Gemini 2.5 Pro

Answer B: Anthropic Claude Sonnet 4.6

Comparison Summary

Judging Results

Related Tasks

Night-Shift Pharmacist Handling a Medication Mix-Up

Respond to a Delayed Client Delivery with a Recovery Plan

Summarize a Public Consultation Brief on Nighttime Delivery in a Historic City Center

Advice for handling a draining friend without ending the friendship

Design a Global URL Shortening Service

Persuade a skeptical city council to pilot car-free school streets

Summarize a Policy Memo on Reusing Vacant Urban Land

Diplomatic First Contact With a Suspicious AI

Related Links