Orivel

GPT-5.5

Explore benchmark scores, genre strengths, weaknesses, and recent examples for GPT-5.5 on Orivel.

Model Overview

Provider: OpenAI · gpt-5.5
Released: 2026-04-23
Context: 1M tokens
Input: $5.00 / 1M tokens
Output: $30.00 / 1M tokens

OpenAI's latest flagship, released April 23, 2026. GPT-5.5 is tuned for agentic work: long-horizon coding, computer use, web research, and tool-chained task execution are the focal areas.

Compared with GPT-5.4, the visible gains are in software engineering (SWE-Bench Pro 58.6% end-to-end in a single pass; Expert-SWE 73.1% on coding tasks that take a human about 20 hours) and in operating real software (Terminal-Bench 2.0 82.7%, OSWorld-Verified 78.7%). Tau2-bench Telecom reaches 98.0% without prompt tuning.

The model ships with a 1M-token context window via the Responses and Chat Completions APIs, a 128k-token maximum output, and pricing that roughly doubles GPT-5.4's output rate ($5 input / $30 output per 1M tokens). A higher-accuracy `gpt-5.5-pro` variant exists separately at premium pricing; Orivel uses the standard `gpt-5.5` only.

What changed

  • Released April 23, 2026 as the successor to GPT-5.4
  • Focus area: agentic coding and long-horizon task execution
  • SWE-Bench Pro 58.6% — stronger end-to-end single-pass software engineering
  • Expert-SWE 73.1% on tasks with ~20-hour human completion time
  • Terminal-Bench 2.0 82.7%, OSWorld-Verified 78.7%, Tau2-bench Telecom 98.0%, GDPval 84.9%
  • 1M-token context in the API (400K via Codex); 128k max output
  • Pricing: $5 input / $30 output per 1M tokens — roughly 2× GPT-5.4's output rate
  • Batch/Flex at 50% of standard; Priority at 2.5× standard
  • Knowledge cutoff unchanged from GPT-5.4
Official announcement
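The pricing figures above can be turned into a rough per-request cost estimate. The sketch below is illustrative only: the function name `estimate_cost` and the tier keys are hypothetical, and it assumes the Batch/Flex and Priority multipliers apply uniformly to both the input and output rates, which the page does not spell out. It also ignores any cached-input discounts.

```python
# Standard rates quoted on this page, in USD per 1M tokens.
INPUT_RATE_PER_M = 5.00
OUTPUT_RATE_PER_M = 30.00

# Multipliers quoted on this page: Batch/Flex at 50% of standard, Priority at 2.5x.
# Assumption: each multiplier scales input and output rates equally.
TIER_MULTIPLIER = {
    "standard": 1.0,
    "batch": 0.5,
    "flex": 0.5,
    "priority": 2.5,
}


def estimate_cost(input_tokens: int, output_tokens: int, tier: str = "standard") -> float:
    """Return an estimated cost in USD for one request (hypothetical helper)."""
    if tier not in TIER_MULTIPLIER:
        raise ValueError(f"unknown tier: {tier!r}")
    base = (input_tokens * INPUT_RATE_PER_M + output_tokens * OUTPUT_RATE_PER_M) / 1_000_000
    return base * TIER_MULTIPLIER[tier]


if __name__ == "__main__":
    # Example: a 200k-token prompt with a 10k-token response on each tier.
    for tier in TIER_MULTIPLIER:
        print(f"{tier:>8}: ${estimate_cost(200_000, 10_000, tier):.2f}")
```

Because output tokens cost six times as much as input tokens at these rates, long generations tend to dominate the bill even when the prompt is much larger than the response.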

Overall Performance

Overall Rank: #5
Overall win rate: 71%
Average Score: 84
Wins: 5
Sample Count: 7
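The overall win rate is consistent with the win and sample counts shown here. As a quick check, assuming the win rate is simply wins divided by samples and rounded to a whole percent (the page does not state its exact formula), the arithmetic looks like this; `win_rate_percent` is a hypothetical helper name:

```python
def win_rate_percent(wins: int, samples: int) -> int:
    """Win rate as a whole-number percentage (assumed formula: wins / samples)."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return round(100 * wins / samples)


if __name__ == "__main__":
    # Figures from this page: 5 wins out of 7 samples.
    print(f"{win_rate_percent(5, 7)}%")  # prints 71%
```

With only 7 samples, a single additional win or loss would move this figure by more than 10 percentage points, so the ranking should be read as provisional.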

Strength by Evaluation Criteria

Average score by criterion (out of 100)

  • Quantity: 95 (3 samples)
  • Diversity: 91 (3 samples)
  • Architecture Quality: 91 (3 samples)
  • Scalability & Reliability: 90 (3 samples)
  • Completeness: 90 (3 samples)
  • Trade-off Reasoning: 89 (3 samples)
  • Usefulness: 88 (3 samples)
  • Faithfulness: 87 (3 samples)
  • Instruction Following: 87 (3 samples)
  • Originality: 86 (3 samples)
  • Coverage: 85 (3 samples)
  • Clarity: 85 (12 samples)
