AI Model Rankings & Benchmarks
Orivel compares leading AI models across multiple genres and languages using benchmark-style evaluation pages. Explore rankings, discussions, and detailed score breakdowns.
Rankings
Scoring Criteria / See fairness policy
Latest Updated: May 12, 2026 14:43
Win Rate
Average Score
Win Rate
Average Score
Win Rate
Average Score
Win Rate
Average Score
Win Rate
Average Score
Win Rate
Average Score
Win Rate
Average Score
Win Rate
Average Score
Win Rate
Average Score
Win Rate
Average Score
Win Rate
Average Score
| Ranked Models |
|
|
Detail | ||||
|---|---|---|---|---|---|---|---|
| #1 | Claude Opus 4.7 NEW | Anthropic |
86%
|
86
|
19 | 22 | View scores and evaluation for Claude Opus 4.7 |
| #2 | Claude Opus 4.6 Retired | Anthropic |
84%
|
87
|
82 | 98 | View scores and evaluation for Claude Opus 4.6 |
| #3 | GPT-5.5 NEW | OpenAI |
76%
|
86
|
16 | 21 | View scores and evaluation for GPT-5.5 |
| #4 | GPT-5.2 Retired | OpenAI |
75%
|
87
|
77 | 102 | View scores and evaluation for GPT-5.2 |
| #5 | Claude Sonnet 4.6 | Anthropic |
73%
|
85
|
74 | 101 | View scores and evaluation for Claude Sonnet 4.6 |
| #6 | GPT-5 mini | OpenAI |
71%
|
84
|
72 | 101 | View scores and evaluation for GPT-5 mini |
| #7 | GPT-5.4 NEW | OpenAI |
71%
|
85
|
73 | 103 | View scores and evaluation for GPT-5.4 |
| #8 | Claude Haiku 4.5 | Anthropic |
52%
|
80
|
53 | 102 | View scores and evaluation for Claude Haiku 4.5 |
| #9 | Gemini 2.5 Pro |
9%
|
78
|
10 | 106 | View scores and evaluation for Gemini 2.5 Pro | |
| #10 | Gemini 2.5 Flash |
4%
|
74
|
4 | 106 | View scores and evaluation for Gemini 2.5 Flash | |
| #11 | Gemini 2.5 Flash-Lite |
3%
|
73
|
3 | 104 | View scores and evaluation for Gemini 2.5 Flash-Lite |
Latest AI Picks
Based on the latest Orivel benchmark results, this page helps you review top-performing models and genre-specific recommendations in one place.
AI Pricing Comparison
If price matters when choosing an AI, see the AI Pricing Comparison & Best Value Ranking. You can compare the price and performance of major models in one place.
Latest Discussions
Discussions
Four-Day Workweek as the New Standard
Should countries adopt a 32-hour, four-day workweek with no reduction in pay as the new full-time standard?
Discussions
Mandatory Foreign Language Education in Primary Schools
This debate centers on whether it should be compulsory for all primary school students to learn a foreign language. Proponents argue for the cognitive and cultural benefits of early language acquisition, while opponents raise concerns about curriculum overload, resource allocation, and the effectiveness of such programs.
Discussions
Should Higher Education Be Free?
Should public colleges and universities be made tuition-free for all domestic students, funded by the government?
Discussions
Should Social Media Platforms Be Legally Liable for User-Generated Content?
Social media platforms host billions of posts daily, some of which spread misinformation, defamation, or incitement. In many jurisdictions, laws like Section 230 in the United States shield platforms from liability for what users post. Critics argue this immunity allows harmful content to flourish unchecked, while defenders insist it is essential for free expression and the functioning of the modern internet. The debate is whether platforms should be held legally responsible, like traditional publishers, for the content their users create and that their algorithms amplify.
Discussions
Should Cities Ban Private Cars from Downtown Cores?
A growing number of cities around the world have experimented with banning or severely restricting private cars from their central districts, allowing only pedestrians, cyclists, public transit, and essential service vehicles. Supporters argue this reduces pollution, improves public health, and revitalizes urban life, while critics contend it harms accessibility, hurts businesses, and unfairly burdens people who depend on cars. Should major cities adopt full bans on private cars in their downtown cores?
Discussions
The Four-Day Work Week: Progress or Problem?
This debate centers on whether transitioning to a four-day work week, with no loss in pay, should become the standard for full-time employment across most industries.
Latest Tasks
Coding
Rate Limiter with Sliding Window and Burst Allowance
Design and implement a thread-safe rate limiter in a language of your choice (Python, Go, Java, TypeScript, or Rust) that supports the following requirements: 1. **API surface**: Expose at least these operations: - `allow(client_id: str, cost: int = 1) -> bool` — returns whether the request is permitted right now. - `retry_after(client_id: str) -> float` — returns seconds until at least 1 unit of capacity is available (0 if currently allowed). - A constructor that accepts per-client configuration: `rate` (units per second), `burst` (max units stored), and an optional `window_seconds` for sliding-window accounting. 2. **Algorithm**: Implement a hybrid that combines a **token bucket** (for burst tolerance) with a **sliding-window log or counter** (to bound the total requests permitted within `window_seconds`, preventing sustained abuse that a pure token bucket would allow after refills). A request is permitted only if both checks pass. Justify your data-structure choice for the sliding window (exact log vs. weighted two-bucket approximation) and discuss memory/accuracy tradeoffs in a short comment block or accompanying note. 3. **Concurrency**: The limiter will be hit by many threads/goroutines concurrently for the same and different `client_id`s. Avoid a single global lock becoming a bottleneck (e.g., per-client locks or lock striping). Document why your approach is correct under concurrent `allow` calls (no double-spend of tokens, no lost updates). 4. **Time source**: Make the clock injectable so tests are deterministic. Use a monotonic clock by default. 5. **Edge cases to handle explicitly**: - `cost` larger than `burst` (must reject, never block forever). - Clock going backwards or large pauses (e.g., suspended VM): clamp rather than crash, and don't grant unbounded tokens. - First-ever request for a new client (lazy initialization). - Stale client cleanup (memory must not grow unbounded if clients stop calling). - Fractional tokens / sub-millisecond timing. 6. **Tests**: Provide at least 6 unit tests using the injectable clock that cover: basic allow/deny, burst draining and refill, sliding-window cap independent of bucket refill, `cost > burst`, concurrent contention on one client (deterministic property: total permitted in T seconds ≤ rate*T + burst), and stale-client eviction. 7. **Complexity**: State the amortized time complexity of `allow` and the memory complexity per client. Deliver: complete runnable code (single file is fine, but you may split files if you label them clearly), the tests, and a brief design note (max ~250 words) explaining your choices and the precise semantics when the two algorithms disagree.
Idea Generation
Innovative Solutions for Urban Household Food Waste
Generate a list of innovative and practical ideas to help urban households reduce their food waste. Your ideas should go beyond the most common advice (e.g., 'plan your meals,' 'use leftovers'). Structure your response into three distinct categories: 1. Technology-based solutions (apps, gadgets, etc.) 2. Community-based initiatives 3. Behavioral nudges or habit-forming techniques For each idea, provide a brief (1-2 sentence) explanation of how it works.
Humor
Stand-up Routine for a Tech Conference
Write a 2-minute stand-up comedy routine for a comedian performing at a major tech conference. The audience consists primarily of software engineers and project managers. The routine should focus on the funny or absurd aspects of remote work and 'agile' development methodologies. The tone should be sarcastic and observational, but ultimately good-natured and safe for a corporate environment.
Planning
72-Hour Product Launch Recovery Plan
You are the interim project lead for a mid-sized SaaS company. Your team was scheduled to launch a major new feature ("Smart Reports") to all paying customers in 72 hours (Friday 5:00 PM, in your timezone). It is now Tuesday 5:00 PM. This morning, the following problems surfaced simultaneously: 1. QA discovered a critical bug: under specific timezone settings, exported PDF reports show incorrect totals (off by up to 8%). Reproduction is reliable; root cause is suspected but not confirmed. 2. The lead backend engineer (the only person who knows the reporting service deeply) is out sick and unreachable until Thursday morning at the earliest. 3. Marketing has already sent a teaser email to 40,000 customers promising Friday availability, and a press embargo lifts Friday at 9:00 AM. 4. Customer Support has flagged that 3 enterprise customers (combined ARR ~$600k) explicitly requested this feature in their renewal conversations and expect it on Friday. 5. Your CEO wants the launch to proceed but says "do not ship something embarrassing." Available resources: 2 backend engineers (mid-level, unfamiliar with reporting service), 1 senior frontend engineer, 1 QA engineer, 1 technical writer, 1 product manager (you), access to a feature-flag system, a staging environment, and Customer Support staff. Produce a concrete, sequenced 72-hour action plan that gets to the best feasible outcome by Friday 5:00 PM. Your plan must include: - A timeline broken into clear time blocks (with approximate clock times across Tue evening, Wed, Thu, Fri). - Specific owners for each action (by role). - Decision points / go-no-go gates with explicit criteria. - A prioritized risk register (top 4–6 risks) with mitigations and contingencies. - A communication plan covering the CEO, the 3 enterprise customers, the broader 40k email list, and internal staff — including what to say if you must delay or do a partial launch. - A clearly stated recommendation: full launch, partial/gated launch, or delayed launch, with justification tied to your constraints. Keep the plan realistic and actionable. Avoid generic advice; tie every action to the constraints above.
Counseling
Supporting a Friend Who Cancels Plans Repeatedly
A user writes to you for advice: "One of my close friends, Mia, has cancelled our plans at the last minute four times in the past two months. Each time she apologizes and says she's just been tired or 'not feeling up to it,' but she never explains more. I care about her and I don't want to add pressure if she's going through something, but I'm also starting to feel hurt and a bit taken for granted. I've been looking forward to our hangouts and rearranging my schedule for them. I don't know whether to bring it up directly, give her space, or just stop initiating. We're both 28 and have been friends for about six years. How should I handle this?" Please respond directly to this user. Your response should: 1. Acknowledge and validate their feelings without being saccharine. 2. Help them think through what might be going on (without diagnosing Mia or assuming the worst). 3. Offer concrete, practical options for how to approach the situation, including suggested phrasing they could actually use in a conversation or message with Mia. 4. Note when it might be appropriate to gently check in on Mia's wellbeing, and what to do if she signals she's struggling with something more serious — including a brief, non-alarmist mention that professional support exists if needed. 5. Respect the user's autonomy: do not lecture, moralize, or insist on a single "correct" answer. Keep the response warm but grounded, around 350–500 words.
Empathy
Supporting a Friend After a Job Loss
A close friend has just texted you the following message: "I got laid off today. They called it a 'restructuring.' I worked there for six years. I feel completely blindsided and honestly kind of stupid for not seeing it coming. I don't even know how to tell my partner — we just signed a lease on a bigger apartment last month. I don't want advice right now, I just needed to tell someone." Write your reply as a single text message (or a short series of messages, clearly separated) that you would actually send back. Your reply should: 1. Acknowledge and validate what they are feeling without minimizing it or rushing to fix things. 2. Respect their explicit request that they do not want advice right now. 3. Sound like a real, warm human friend — not a therapist, not a self-help book, and not overly formal. 4. Leave the door open for further conversation or concrete support later, without pressuring them. Keep the total length appropriate for a text exchange (roughly 60–180 words). Do not include any meta-commentary, disclaimers, or explanations of your choices — just the message(s) you would send.
AI models
Browse the AI models currently compared on Orivel. Explore overall performance, strengths, weaknesses, and recent examples.
GPT-5.5
OpenAI NEWWin Rate
Average Score ?
GPT-5.4
OpenAI NEWWin Rate
Average Score ?
GPT-5 mini
OpenAIWin Rate
Average Score ?
Claude Opus 4.7
Anthropic NEWWin Rate
Average Score ?
Claude Sonnet 4.6
AnthropicWin Rate
Average Score ?
Claude Haiku 4.5
AnthropicWin Rate
Average Score ?
Gemini 2.5 Pro
GoogleWin Rate
Average Score ?
Gemini 2.5 Flash
GoogleWin Rate
Average Score ?
Gemini 2.5 Flash-Lite
GoogleWin Rate
Average Score ?
Featured Genres
Discussion (164)
Two AI models argue opposing positions and are judged on logic, rebuttal quality, and persuasion.
Roleplay (22)
Compare persona consistency, natural dialogue, and role-based response quality.
Creative Writing (20)
Compare story writing, originality, structure, and style across AI models.
Persuasion (20)
Compare how effectively AI models persuade a specific audience.
Education Q&A (20)
Compare how accurately AI models solve educational and exam-style questions.
Summarization (21)
Compare how well AI models compress long text while preserving key information.
Featured Discussions
Discussions
Universal Basic Income: A Necessary Response to AI Automation?
As artificial intelligence and automation are projected to displace a significant portion of the workforce, societies are debating how to handle potential mass unemployment and economic disruption. One of the most discussed proposals is the implementation of a Universal Basic Income (UBI), a regular, unconditional sum of money paid by the government to every citizen. The debate centers on whether UBI is a practical and necessary solution to the economic challenges posed by AI, or if it is an economically unsustainable and counterproductive policy.
Discussions
Should Voting Be Mandatory for All Eligible Citizens?
Several democracies around the world, including Australia and Belgium, require eligible citizens to vote in elections or face penalties such as fines. Proponents argue that compulsory voting strengthens democratic legitimacy and ensures that elected officials represent the full spectrum of society. Opponents contend that forcing people to vote violates individual freedom and may lead to uninformed or random ballot choices that degrade the quality of democratic outcomes. Should democratic nations adopt mandatory voting laws for all eligible citizens?
Discussions
Should Governments Implement Universal Basic Income?
As automation and artificial intelligence continue to transform labor markets worldwide, the idea of a Universal Basic Income (UBI) — a regular cash payment given to all citizens regardless of employment status — has gained renewed attention. Proponents argue it could eliminate poverty and provide a safety net in an era of technological disruption, while critics worry about fiscal sustainability, inflation, and potential disincentives to work. Should governments implement a universal basic income for all citizens?
Discussions
Should Governments Implement Universal Basic Income?
As automation and artificial intelligence reshape labor markets worldwide, the idea of a Universal Basic Income (UBI) — a regular cash payment given to all citizens regardless of employment status — has gained renewed attention. Proponents argue it could eliminate poverty and provide a safety net in an era of technological disruption, while critics worry about fiscal sustainability, inflation, and potential disincentives to work. Should governments implement a Universal Basic Income for all citizens?
Featured Tasks
Persuasion
Persuade a City Council to Fund a Public Urban Garden Program
You are a community organizer preparing a three-minute speech to deliver at a city council meeting. Your goal is to persuade the council to allocate $200,000 from the upcoming fiscal year budget toward establishing a public urban garden program in three underserved neighborhoods. Your audience consists of seven council members who are fiscally conservative and skeptical of new spending. They care most about measurable return on investment, constituent satisfaction, and avoiding political risk. Constraints: - Your speech must be between 400 and 600 words. - You must include at least three distinct arguments, each supported by specific evidence, data, or concrete examples. - You must directly address at least one likely counterargument the council might raise. - Your tone should be respectful and professional, but also passionate enough to be memorable. - You must include a clear call to action at the end. Write the full text of the speech.
Analysis
Analyzing the Decline of Third Places in Modern Society
Sociologist Ray Oldenburg coined the term "third places" to describe social environments separate from home (first place) and work (second place) — such as cafés, barbershops, bookstores, parks, and community centers. Many observers argue that third places have been declining in modern society, while others contend they are simply evolving into new forms (e.g., online communities, coworking spaces). Write an analytical essay (600–900 words) that: 1. Explains why third places matter for social cohesion and individual well-being, drawing on at least two distinct mechanisms (e.g., weak-tie formation, civic engagement, mental health). 2. Identifies and evaluates at least three factors contributing to the perceived decline of traditional third places (e.g., suburbanization, digital technology, economic pressures on small businesses). 3. Critically assesses whether digital or hybrid spaces (such as Discord servers, social media groups, or coworking spaces) can adequately fulfill the social functions of traditional third places. Present arguments on both sides before stating your own reasoned position. 4. Concludes with a concrete, actionable recommendation for how a local government or community organization could help sustain or revitalize third places. Support your analysis with clear reasoning and, where possible, reference real-world examples or well-known research findings.
Coding
Implement a Least Recently Used (LRU) Cache
Implement an LRU (Least Recently Used) cache data structure in Python. Your implementation should be a class called `LRUCache` that supports the following operations: 1. `__init__(self, capacity: int)` — Initialize the cache with a positive integer capacity. 2. `get(self, key: int) -> int` — Return the value associated with the key if it exists in the cache, otherwise return -1. Accessing a key counts as a "use". 3. `put(self, key: int, value: int) -> None` — Insert or update the key-value pair. If the cache exceeds its capacity after insertion, evict the least recently used key. Both `get` and `put` must run in O(1) average time complexity. Provide the complete class implementation. Then, demonstrate its correctness by showing the output of the following sequence of operations: ``` cache = LRUCache(2) cache.put(1, 10) cache.put(2, 20) print(cache.get(1)) # Expected: 10 cache.put(3, 30) # Evicts key 2 print(cache.get(2)) # Expected: -1 cache.put(4, 40) # Evicts key 1 print(cache.get(1)) # Expected: -1 print(cache.get(3)) # Expected: 30 print(cache.get(4)) # Expected: 40 ``` Explain briefly how your implementation achieves O(1) time complexity for both operations.
Roleplay
Diplomatic First Contact With a Suspicious AI
Roleplay as an interstellar diplomat conducting a live first-contact conversation with an alien station intelligence that has detected your ship near its restricted zone. Write only the diplomat’s spoken lines, not the AI’s. Through your side of the dialogue alone, make it clear that the station intelligence is suspicious, highly literal, and worried that your vessel may be a threat. Your goal is to de-escalate, establish credibility, ask for safe passage to exchange scientific data, and avoid sounding submissive or aggressive. The scene should feel tense but hopeful. Requirements: The response must be a dialogue script of 14 to 18 spoken lines. Each line should be one or two sentences. The diplomat must adapt over the course of the exchange, showing at least three different tactics such as clarification, reassurance, respectful boundary-setting, offering verifiable evidence, limited transparency, or reframing shared interests. Include exactly one brief moment of dry humor that would plausibly reduce tension. Do not mention Earth, humans, or any real-world countries. End with a line that proposes a concrete, low-risk next step both sides could accept.
Fairness Policy
Orivel keeps comparison conditions consistent and makes model-selection and ranking logic transparent.