Markdown Subset to HTML Converter

Compare model answers for this Coding benchmark and review scores, judging comments, and related examples.

X f L

Contents

Task Overview

Benchmark Genres

Coding

Task Creator Model The task creator is randomly selected from top task-generation models of supported providers.

Google Gemini 2.5 Pro

Answering Models In this benchmark, models from the same provider as the task creator are excluded from answering.

Answer A Anthropic Claude Opus 4.7

Answer B OpenAI GPT-5.4

Judge Models Judging uses exactly 3 judge models, excluding the answering models. At least 1 judge is selected from flagship models, lightweight models are not selected as judges, and the 3 judges come from 3 distinct providers.

OpenAI GPT-5.2 Anthropic Claude Sonnet 4.6 Google Gemini 2.5 Pro

Task Prompt

Show more ▼

Write a Python function `markdown_to_html(markdown_text: str) -> str` that converts a string containing a specific subset of Markdown into its corresponding HTML representation. The function must support the following features: **Block Elements:** 1. **Headers:** Lines starting with `# ` to `###### ` should be converted to `<h1>` to `<h6>` tags. 2. **Unordered Lists:** Lines starting with `- ` should be converted to `<ul>` and `<li>` tags. Nested lists, indented by two spaces per level, must be supported. A list is terminated by a blank line or a different block element. 3. **Code Blocks:** Content enclosed between lines of triple backticks (```) should be converted to `<pre><code>...</code></pre>`. The language specifier on the opening backticks (e.g., ```python) should be ignored. No other Markdown processing should occur inside a code block. 4. **Paragraphs:** Any other text should be wrapped in `` tags. Consecutive lines of text belong to the same paragraph. Paragraphs are separated by one or more blank lines. **Inline Elements:** 1. **Bold & Italic:** `***text***` should be converted to `text`. 2. **Bold:** `**text**` should be converted to `text`. 3. **Italic:** `*text*` should be converted to `text`. **Rules and Constraints:** - Inline elements can be nested within headers and list items. - The parser should be robust to malformed or tricky inputs, such as unclosed inline tags. For example, `*italic` should be rendered as `*italic`. - The order of precedence for inline elements is `***`, then `**`, then `*`. - Assume input is a single multi-line string. - Do not implement support for any other Markdown features like links, images, blockquotes, or ordered lists. - The output HTML does not need to be a full document (no `<html>` or `<body>` tags are required). **Example Input:** ```markdown # Header 1 This is a paragraph with **bold** and *italic* text. This is the same paragraph. - List item one - List item two with ***bold and italic*** - Nested list item - Back to the first level ```python def hello(): print("Hello, World!") ``` ```

Task Context

This task tests the ability to write a parser for a simple, well-defined language. It requires careful handling of different states (e.g., inside a list, inside a code block), parsing both block-level and inline-level elements, and managing nested structures. A successful solution will demonstrate good algorithm design, attention to edge cases, and robust code structure.

Judging Policy

Show more ▼

A high-quality answer will correctly implement all specified features. The solution should be robust and handle edge cases gracefully, such as nested lists, nested inline formatting, and malformed syntax. The code should be well-structured, readable, and efficient. - **Correctness:** The generated HTML must accurately reflect the Markdown input according to all rules. Pay close attention to the correct nesting of lists and inline styles. - **Robustness:** The code should not crash on tricky or malformed inputs. For example, an unclosed `*` should be treated as a literal character. - **Code Quality:** The solution should be well-organized and easy to understand. Using regular expressions is a valid approach, but a state-machine or other parsing techniques are also acceptable. The choice of algorithm and its implementation quality will be evaluated. - **Completeness:** All specified features (headers, paragraphs, lists with nesting, code blocks, and all three inline styles) must be implemented.

Answers

Answer A: Anthropic Claude Opus 4.7

Show Full Answer ▼

import re import html def markdown_to_html(markdown_text: str) -> str: """Convert a subset of Markdown into HTML. Supported features: - Headers (# .. ######) - Unordered lists with 2-space nesting - Fenced code blocks (```) - Paragraphs - Inline: ***bold+italic***, **bold**, *italic* """ def escape(text: str) -> str: # Escape HTML-sensitive characters. return html.escape(text, quote=False) def process_inline(text: str) -> str: """Process inline markdown. Works on already-escaped text is fine since the markers *, ** and *** don't contain escaped characters.""" escaped = escape(text) # Order of precedence: ***, then **, then *. # Use non-greedy matching and require non-whitespace boundaries loosely. # ***text*** escaped = re.sub( r'\*\*\*(.+?)\*\*\*', r'\1', escaped, ) # **text** escaped = re.sub( r'\*\*(.+?)\*\*', r'\1', escaped, ) # *text* escaped = re.sub( r'\*(.+?)\*', r'\1', escaped, ) return escaped lines = markdown_text.split('\n') output = [] # State trackers in_code_block = False code_buffer = [] paragraph_buffer = [] # List stack: list of indent levels (0, 1, 2, ...) currently open list_stack = [] def flush_paragraph(): if paragraph_buffer: content = ' '.join(paragraph_buffer) output.append(f'{process_inline(content)}') paragraph_buffer.clear() def close_lists_to(level): """Close lists until only `level` lists remain open (level = number of open <ul>).""" while len(list_stack) > level: output.append('</li>') output.append('</ul>') list_stack.pop() if list_stack: # After closing a nested list, the parent <li> remains open; # we close it when the next sibling starts or when fully closing. pass def close_all_lists(): while list_stack: output.append('</li>') output.append('</ul>') list_stack.pop() i = 0 while i < len(lines): line = lines[i] # Code block handling if in_code_block: if line.strip().startswith('```'): # End code block code_content = '\n'.join(code_buffer) output.append(f'<pre><code>{escape(code_content)}</code></pre>') code_buffer = [] in_code_block = False else: code_buffer.append(line) i += 1 continue # Check for code block start stripped = line.strip() if stripped.startswith('```'): flush_paragraph() close_all_lists() in_code_block = True i += 1 continue # Blank line if stripped == '': flush_paragraph() close_all_lists() i += 1 continue # Header? header_match = re.match(r'^(#{1,6}) +(.*)$', line) if header_match: flush_paragraph() close_all_lists() level = len(header_match.group(1)) content = header_match.group(2).rstrip() output.append(f'<h{level}>{process_inline(content)}</h{level}>') i += 1 continue # List item? # Match leading spaces (must be even for nesting) followed by "- " list_match = re.match(r'^( *)- +(.*)$', line) if list_match: flush_paragraph() indent_spaces = len(list_match.group(1)) # Only treat as a valid nesting level if indent is a multiple of 2 indent_level = indent_spaces // 2 content = list_match.group(2) # Adjust list stack if not list_stack: # Starting fresh: first list must be at level 0 indent_level = 0 output.append('<ul>') list_stack.append(0) output.append(f'<li>{process_inline(content)}') else: current_depth = len(list_stack) - 1 # top index if indent_level > current_depth: # Open a new nested list (only go one level deeper at a time) indent_level = current_depth + 1 output.append('<ul>') list_stack.append(indent_level) output.append(f'<li>{process_inline(content)}') elif indent_level == current_depth: # Sibling at same level: close previous <li> output.append('</li>') output.append(f'<li>{process_inline(content)}') else: # Dedent: close deeper lists while len(list_stack) - 1 > indent_level: output.append('</li>') output.append('</ul>') list_stack.pop() # Also close the parent <li> that contained this nested list output.append('</li>') # Now at same level: add sibling output.append(f'<li>{process_inline(content)}') i += 1 continue # Otherwise, it's a paragraph line if list_stack: close_all_lists() paragraph_buffer.append(line.strip()) i += 1 # End of input: flush everything if in_code_block: # Unclosed code block: emit whatever we have code_content = '\n'.join(code_buffer) output.append(f'<pre><code>{escape(code_content)}</code></pre>') flush_paragraph() close_all_lists() return '\n'.join(output) if __name__ == '__main__': sample = '''# Header 1 This is a paragraph with **bold** and *italic* text. This is the same paragraph. - List item one - List item two with ***bold and italic*** - Nested list item - Back to the first level ```python def hello(): print("Hello, World!") ``` ''' print(markdown_to_html(sample))

Result

Winning Votes

0 / 3

Average Score

Judge Models Anthropic Claude Sonnet 4.6

Total Score

Overall Comments

Answer A implements all required features and is well-structured with clear comments. The inline parser uses regex substitutions in order of precedence, which works for most cases but has a subtle flaw: it doesn't escape HTML characters before applying inline patterns, and the escape is applied first which could interfere with patterns. More critically, the list closing logic has a bug when dedenting: it emits an extra `</li>` for the parent item before re-opening at the same level, which can produce malformed HTML. The code block detection uses `line.strip().startswith('```')` which is slightly more lenient than needed. Overall it's a solid attempt with some edge case issues in list handling.

View Score Details ▼

Correctness

Weight 35%

Answer A correctly handles headers, paragraphs, code blocks, and basic inline formatting. However, the list dedenting logic has a bug: when closing nested lists, it emits an extra `</li>` for the parent item which can produce malformed HTML. The inline regex approach can also match across unintended boundaries for unclosed markers like `*italic` potentially matching with a later `*` in the text.

Completeness

Weight 20%

All required features are implemented: headers (h1-h6), unordered lists with nesting, code blocks with language specifier ignored, paragraphs, and all three inline styles (bold+italic, bold, italic). The main sample from the task is handled.

Code Quality

Weight 20%

Code is well-organized with clear helper functions and good comments. The docstring is helpful. However, the list management logic is somewhat complex and has subtle bugs. The regex approach for inline parsing is simpler but less robust. The `close_lists_to` function is defined but not used in the main loop (only `close_all_lists` is used).

Practical Value

Weight 15%

The solution works for common cases and includes a runnable `__main__` block with the sample input, which is helpful for testing. However, the list nesting bugs and potential inline regex issues reduce practical reliability for edge cases.

Instruction Following

Weight 10%

Follows all specified instructions: correct function signature, supports all required features, handles the specified inline precedence order, doesn't implement unsupported features, and handles unclosed tags (though imperfectly with regex). HTML escaping is applied.

Judge Models OpenAI GPT-5.2

Total Score

Overall Comments

Implements the required block types and inline formatting with a simple state machine and regex-based inline substitutions, and correctly escapes HTML including inside code blocks. However, list generation has multiple structural issues: it emits </li> and </ul> sequences that can lead to invalid nesting, mishandles dedents (extra </li>), and can open nested <ul> without being inside a parent <li>. Inline handling via regex is not robust for tricky/malformed cases (e.g., overlapping markers like **bold *italic***), and it does not support recursive nesting of inline elements as required/expected.

View Score Details ▼

Correctness

Weight 35%

Correct for basic headers/paragraphs/code blocks and simple emphasis, but list nesting/closing is frequently invalid (extra </li>, nested <ul> placement issues) and regex emphasis can mis-handle overlapping/nested cases (e.g., **bold *italic***).

Completeness

Weight 20%

Covers all requested feature categories, but list nesting support is unreliable and inline nesting is not truly supported beyond simple patterns, reducing effective completeness.

Code Quality

Weight 20%

Readable and modular, but list-stack logic is convoluted with unused/ineffective helper (close_lists_to) and fragile state transitions; heavy regex substitutions reduce clarity/robustness.

Practical Value

Weight 15%

Useful for simple inputs, but invalid list HTML and emphasis edge cases limit real-world usability for the stated subset.

Instruction Following

Weight 10%

Implements requested subset and ignores code-block language specifier; however, robustness requirement for tricky inline inputs and correct list nesting/termination is not well met.

Judge Models Google Gemini 2.5 Pro

Total Score

Overall Comments

Answer A presents a well-structured, state-machine-based solution that correctly handles most block-level elements like headers, paragraphs, and code blocks. However, its approach to parsing inline elements using sequential regular expression substitutions is fundamentally flawed. This method fails to correctly handle nested inline styles (e.g., bold containing italic), which is a significant correctness issue. The list-handling logic is also complex and less robust than the alternative.

View Score Details ▼

Correctness

Weight 35%

The solution fails to correctly parse nested inline styles. For example, `**bold *italic* bold**` is incorrectly rendered because the sequential `re.sub` calls operate on the intermediate string containing newly added HTML tags. This is a significant correctness flaw in the parsing logic.

Completeness

Weight 20%

The solution attempts to implement all required features. However, since the implementation of nested inline styles is incorrect, it cannot be considered fully complete in its functionality.

Code Quality

Weight 20%

The code is generally well-structured with helper functions and state variables. However, the choice of a sequential regex substitution for parsing is a poor algorithmic choice for this problem, demonstrating a lack of understanding of its pitfalls. The list-handling logic is also quite convoluted.

Practical Value

Weight 15%

The parser's incorrect handling of a common use case (nested inline styles) severely limits its practical value. It would not be reliable for converting real-world markdown documents.

Instruction Following

Weight 10%

100

The solution adheres to all instructions in the prompt, including the function signature, the specific subset of Markdown to be supported, and the output format.

Answer B: OpenAI GPT-5.4

Show Full Answer ▼

def markdown_to_html(markdown_text: str) -> str: import re from html import escape lines = markdown_text.splitlines() out = [] paragraph_lines = [] list_stack = [] i = 0 in_code_block = False code_lines = [] def parse_inline(text: str) -> str: parts = [] i = 0 n = len(text) while i < n: if text.startswith('***', i): j = text.find('***', i + 3) if j != -1: inner = parse_inline(text[i + 3:j]) parts.append(f'{inner}') i = j + 3 continue if text.startswith('**', i): j = text.find('**', i + 2) if j != -1: inner = parse_inline(text[i + 2:j]) parts.append(f'{inner}') i = j + 2 continue if text.startswith('*', i): j = text.find('*', i + 1) if j != -1: inner = parse_inline(text[i + 1:j]) parts.append(f'{inner}') i = j + 1 continue parts.append(escape(text[i])) i += 1 return ''.join(parts) def flush_paragraph(): nonlocal paragraph_lines if paragraph_lines: text = ' '.join(line.strip() for line in paragraph_lines) out.append(f'{parse_inline(text)}') paragraph_lines = [] def close_lists_to(level: int): while len(list_stack) > level: out.append('</li>') out.append('</ul>') list_stack.pop() def handle_list_item(indent_spaces: int, content: str): level = indent_spaces // 2 if indent_spaces % 2 != 0: return False if level > len(list_stack): if level != len(list_stack) + 1: return False out.append('<ul>') list_stack.append(True) elif level < len(list_stack): close_lists_to(level) out.append('</li>') else: if list_stack: out.append('</li>') else: out.append('<ul>') list_stack.append(True) while len(list_stack) < level + 1: out.append('<ul>') list_stack.append(True) out.append(f'<li>{parse_inline(content)}') return True while i < len(lines): line = lines[i] if in_code_block: if line.startswith('```'): out.append(f'<pre><code>{escape("\n".join(code_lines))}</code></pre>') code_lines = [] in_code_block = False else: code_lines.append(line) i += 1 continue if line.startswith('```'): flush_paragraph() close_lists_to(0) in_code_block = True code_lines = [] i += 1 continue if not line.strip(): flush_paragraph() close_lists_to(0) i += 1 continue header_match = re.match(r'^(#{1,6})\s+(.*)$', line) if header_match: flush_paragraph() close_lists_to(0) level = len(header_match.group(1)) content = parse_inline(header_match.group(2).strip()) out.append(f'<h{level}>{content}</h{level}>') i += 1 continue list_match = re.match(r'^( *)(- )(.+)$', line) if list_match: flush_paragraph() indent = len(list_match.group(1)) content = list_match.group(3) if handle_list_item(indent, content): i += 1 continue if list_stack: close_lists_to(0) paragraph_lines.append(line) i += 1 if in_code_block: out.append(f'<pre><code>{escape("\n".join(code_lines))}</code></pre>') flush_paragraph() close_lists_to(0) return '\n'.join(out)

Result

#1 | Winner

Winning Votes

3 / 3

Average Score

Judge Models Anthropic Claude Sonnet 4.6

Total Score

Overall Comments

Answer B implements a character-by-character inline parser that correctly handles nested inline elements and gracefully falls back to literal characters for unclosed markers. This is a more robust approach than regex for inline parsing. The list handling logic is cleaner and more correct. The code block detection uses `line.startswith('```')` which is correct per spec. The `handle_list_item` function has a logic issue in the `else` branch where it only appends `<ul>` when `list_stack` is empty, but the while loop after handles the case where level > len(list_stack), making it slightly redundant but functional. The inline parser correctly escapes individual characters that aren't part of markdown patterns, providing proper HTML escaping. Overall this is a more robust and correct implementation.

View Score Details ▼

Correctness

Weight 35%

Answer B correctly handles all block elements. The character-by-character inline parser correctly handles unclosed markers by falling back to literal characters. List handling is generally correct though the `handle_list_item` else branch has a minor logic issue. Code block detection is correct. Overall more correct than A on the key robustness requirements.

Completeness

Weight 20%

All required features are implemented: headers, unordered lists with nesting, code blocks, paragraphs, and all three inline styles. The implementation covers all specified features from the task prompt.

Code Quality

Weight 20%

Code is well-organized with clear separation of concerns. The character-by-character inline parser is more sophisticated and correct. The `handle_list_item` function encapsulates list logic well. Minor issue: the else branch in `handle_list_item` has redundant logic. Overall cleaner and more maintainable.

Practical Value

Weight 15%

The solution is practical and handles the common cases well. The inline parser's robustness for unclosed markers makes it more reliable in real-world usage. No `__main__` block but the function is self-contained and importable.

Instruction Following

Weight 10%

Follows all specified instructions: correct function signature, supports all required features, handles the specified inline precedence order, doesn't implement unsupported features, and explicitly handles unclosed inline tags by treating them as literal characters. HTML escaping is applied per-character.

Judge Models OpenAI GPT-5.2

Total Score

Overall Comments

Implements a clearer state machine for code blocks, headers, lists, and paragraphs, and uses a recursive inline parser that respects precedence (***) then (**) then (*) while remaining robust on unclosed markers (left as literal). HTML escaping is handled safely at character level for normal text and via html.escape for code blocks. List handling is closer to correct than A but still imperfect: indentation leaps and some close/open sequences can yield questionable HTML, and it does not clearly ensure nested <ul> appears inside a parent <li> in all cases. Overall it is more robust and faithful to the spec than A.

View Score Details ▼

Correctness

Weight 35%

Correct for headers/paragraphs/code blocks; inline emphasis is parsed with proper precedence and recursion and is robust to unclosed markers. List nesting is still not fully correct HTML in all cases (nested <ul> not always under an open <li>, edge cases with level changes), but generally closer to spec than A.

Completeness

Weight 20%

Covers all specified blocks and all three inline styles with nesting support via recursion; remaining gap is mainly list HTML-structure edge cases.

Code Quality

Weight 20%

Well-structured with helper functions and a deterministic inline parser; list handling is a bit complex but more coherent than A, and escaping is handled consistently.

Practical Value

Weight 15%

More usable in practice thanks to robust inline handling and safer escaping; remaining list-nesting imperfections still limit reliability but less so than A.

Instruction Following

Weight 10%

Follows the subset requirements closely, including precedence and malformed inline handling, and correctly disables inline parsing inside code blocks; minor deviations are mostly in strict list HTML nesting details.

Judge Models Google Gemini 2.5 Pro

Total Score

Overall Comments

Answer B is an excellent and robust solution. Its key strength lies in the `parse_inline` function, which uses a recursive descent approach to correctly handle inline styles, including complex nesting and precedence rules. This is a much more appropriate and powerful technique than the regex substitution used by Answer A. The block-level parsing is also very solid, with more robust checks for list indentation. The only minor stylistic issue is placing imports inside the function, but this does not affect the superior correctness and quality of the implementation.

View Score Details ▼

Correctness

Weight 35%

The solution is highly correct. The recursive `parse_inline` function correctly handles precedence and nesting of all inline styles, which is a difficult part of the task. The block-level parsing, including list nesting, is also implemented correctly and robustly.

Completeness

Weight 20%

100

All specified features, including headers, paragraphs, code blocks, nested lists, and all three inline style variations with correct precedence and nesting, are fully and correctly implemented.

Code Quality

Weight 20%

The code is well-organized and demonstrates a strong understanding of parsing techniques. The use of a recursive function for inline parsing is an excellent and appropriate algorithmic choice. The code is readable and robust. A minor deduction is made for the unconventional placement of imports inside the function.

Practical Value

Weight 15%

This solution is robust and correct enough to be used for its specified subset of Markdown. Its solid algorithmic foundation means it could be reliably extended to support more features, giving it high practical value.

Instruction Following

Weight 10%

100

The solution perfectly follows all instructions given in the prompt, implementing the required function with the correct signature and supporting the exact feature set requested.

Comparison Summary

Final rank order is determined by judge-wise rank aggregation (average rank + Borda tie-break). Average score is shown for reference.

Judges: 3

Anthropic Claude Opus 4.7

Winning Votes

0 / 3

Average Score

View this answer

Winner OpenAI GPT-5.4

Winning Votes

3 / 3

Average Score

View this answer

View head-to-head record for this model pair

Judging Results

Judge Models Google Gemini 2.5 Pro

Claude Opus 4.7 63

GPT-5.4 Winner 91

Why This Side Won

Answer B is the clear winner due to its superior correctness and robustness, particularly in its handling of inline markdown elements. Answer A uses a sequential regex substitution method that fails on nested inline styles, a critical flaw for a parser. In contrast, Answer B implements a proper recursive parser for inline elements, which correctly handles nesting and precedence as required. Additionally, Answer B's list-handling logic is more robust, with explicit checks for valid indentation, making it a higher-quality and more reliable solution overall.

Judge Models OpenAI GPT-5.2

Claude Opus 4.7 59

GPT-5.4 Winner 71

Why This Side Won

Answer B wins because it provides a more correct and robust inline parser (proper precedence, recursion, and graceful handling of unclosed markers) and generally cleaner block parsing. While both have list-nesting HTML-structure problems, A’s list closing/opening logic is more error-prone and its regex inline processing is less robust for tricky nested/overlapping emphasis patterns, making B better on the most heavily weighted criterion (correctness) and overall.

Judge Models Anthropic Claude Sonnet 4.6

Claude Opus 4.7 70

GPT-5.4 Winner 75

Why This Side Won

Answer B wins primarily on correctness and robustness. Its character-by-character inline parser correctly handles unclosed markers (treating them as literal text) and supports nested inline elements, which is a key requirement. Answer A's regex-based inline parser can fail on unclosed markers by matching across unintended boundaries. Answer B also has cleaner list handling logic and proper per-character HTML escaping in the inline parser. While both answers have some edge cases in list management, Answer B's overall approach is more robust and correct for the specified requirements, particularly for the inline parsing robustness criterion which is explicitly called out in the task.

Markdown Subset to HTML Converter

Task Overview

Task Prompt

Answers

Answer A: Anthropic Claude Opus 4.7

Answer B: OpenAI GPT-5.4

Comparison Summary

Judging Results

Related Tasks

Noir Detective's Advice on Being Followed

Analyze Why a Product Is Not a Polynomial

Respond to a Friend Overwhelmed by Caregiving and Work

Feeling Lonely After a Move

Summarize a City Council Hearing on a Heat Resilience Plan

Persuade a Skeptical City Council to Pilot Car-Free School Streets

Neighborhood Cleanup Day Action Plan

Review of a Fantastical Product

Related Links