Analysis
Explore how AI models perform in Analysis. Compare rankings, scoring criteria, and recent benchmark examples.
Genre overview
Compare depth, reasoning quality, and clarity in analytical responses.
In this genre, the main abilities being tested are Depth, Correctness, and Reasoning Quality.
Unlike explanation, this genre rewards evidence reading and justified conclusions more than audience-friendly teaching style.
A high score here does not guarantee concise writing, strong humor, or practical execution details.
Strong models here are useful for option review, evidence comparison, decision support, and risk assessment.
This genre alone cannot tell you whether the model can implement code well, write polished business documents, or produce many creative ideas.
Top Models in This Genre
This ranking covers this genre only. Models are ordered by win rate, with average score separating models that share the same win rate.
Last updated: Mar 23, 2026 09:38
| Rank | Model | Provider | Win Rate | Average Score | Wins | Total |
|---|---|---|---|---|---|---|
| #1 | GPT-5.4 | OpenAI | 100% | 90 | 3 | 3 |
| #2 | GPT-5.2 | OpenAI | 100% | 87 | 4 | 4 |
| #3 | Claude Sonnet 4.6 | Anthropic | 75% | 85 | 3 | 4 |
| #4 | GPT-5 mini | OpenAI | 75% | 83 | 3 | 4 |
| #5 | Claude Opus 4.6 | Anthropic | 67% | 87 | 2 | 3 |
| #6 | Claude Haiku 4.5 | Anthropic | 50% | 83 | 2 | 4 |
| #7 | Gemini 2.5 Flash-Lite | Google | 0% | 77 | 0 | 4 |
| #8 | Gemini 2.5 Flash | Google | 0% | 76 | 0 | 5 |
| #9 | Gemini 2.5 Pro | Google | 0% | 73 | 0 | 3 |
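As a consistency check on the table, the sketch below recomputes each win rate from the two count columns and tests which sort order reproduces the listed ranking. Treating the counts as wins and total comparisons is an assumption, as are the names `ROWS`, `recomputed_rates`, and `matches_order`; the page does not describe how it computes either figure.

```python
# Rows transcribed from the table: (model, win rate %, average score, wins, total).
# Reading the two count columns as "wins" and "total" is an assumption.
ROWS = [
    ("GPT-5.4", 100, 90, 3, 3),
    ("GPT-5.2", 100, 87, 4, 4),
    ("Claude Sonnet 4.6", 75, 85, 3, 4),
    ("GPT-5 mini", 75, 83, 3, 4),
    ("Claude Opus 4.6", 67, 87, 2, 3),
    ("Claude Haiku 4.5", 50, 83, 2, 4),
    ("Gemini 2.5 Flash-Lite", 0, 77, 0, 4),
    ("Gemini 2.5 Flash", 0, 76, 0, 5),
    ("Gemini 2.5 Pro", 0, 73, 0, 3),
]


def recomputed_rates(rows):
    """Win rate per row as a whole-number percentage of wins over total."""
    return [round(100 * wins / total) for _, _, _, wins, total in rows]


def by_win_rate_then_score(row):
    """Sort key: higher win rate first, then higher average score."""
    _, _, avg, wins, total = row
    return (-(wins / total), -avg)


def by_score_only(row):
    """Sort key: higher average score first."""
    return -row[2]


def matches_order(rows, key):
    """True if sorting by `key` leaves the rows in their listed order."""
    return sorted(rows, key=key) == list(rows)


if __name__ == "__main__":
    print(recomputed_rates(ROWS) == [r[1] for r in ROWS])
    print(matches_order(ROWS, by_win_rate_then_score))
    print(matches_order(ROWS, by_score_only))
```

With the rows as listed, the recomputed rates match the published percentages, and the listed order matches a sort by win rate with average score as the tiebreaker, not a sort by average score alone (Claude Opus 4.6 has an average of 87 but sits at #5).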
What Is Evaluated in Analysis
Scoring criteria and weights used for this genre ranking.
Depth
25.0%
This criterion is included to check Depth in the answer. It carries heavier weight because this part strongly shapes the overall result in this genre.
Correctness
25.0%
This criterion is included to check Correctness in the answer. Like Depth, it carries the heaviest weight because it strongly shapes the overall result in this genre.
Reasoning Quality
20.0%
This criterion is included to check Reasoning Quality in the answer. It has meaningful weight because it affects quality in a visible way, even if it is not the only thing that matters.
Structure
15.0%
This criterion is included to check Structure in the answer. It is weighted more lightly because it supports the main goal rather than defining the genre by itself.
Clarity
15.0%
This criterion is included to check Clarity in the answer. It is weighted more lightly because it supports the main goal rather than defining the genre by itself.
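To make the weights concrete, here is a minimal sketch of how five criterion scores could combine into one overall score. It assumes each criterion is scored on a 0 to 100 scale and that the overall score is a plain weighted average; the page lists the weights but does not state the formula. `WEIGHTS_PERCENT`, `weighted_score`, and the example scores are hypothetical.

```python
# Weights as listed for this genre, in percent. They sum to 100.
WEIGHTS_PERCENT = {
    "Depth": 25,
    "Correctness": 25,
    "Reasoning Quality": 20,
    "Structure": 15,
    "Clarity": 15,
}


def weighted_score(scores):
    """Weighted average of per-criterion scores (assumed to be on a 0-100 scale)."""
    missing = set(WEIGHTS_PERCENT) - set(scores)
    if missing:
        raise ValueError(f"missing criterion scores: {sorted(missing)}")
    total = sum(WEIGHTS_PERCENT[name] * scores[name] for name in WEIGHTS_PERCENT)
    return total / sum(WEIGHTS_PERCENT.values())


if __name__ == "__main__":
    # Hypothetical scores for a single answer, not taken from the benchmark.
    example = {
        "Depth": 90,
        "Correctness": 80,
        "Reasoning Quality": 85,
        "Structure": 70,
        "Clarity": 75,
    }
    print(weighted_score(example))  # 81.25
```

Under this assumption, a 10-point gain in Depth or Correctness moves the overall score by 2.5 points, while the same gain in Structure or Clarity moves it by 1.5.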
Recent tasks
Analysis
Analysis of a Four-Day Work Week Policy for a City
The city of Rivertown, a mid-sized municipality with approximately 2,000 city employees, is considering a proposal to switch to a four-day work week. Under this proposal, employees would work four 10-hour days instead of five 8-hour days, with no reduction in their weekly pay or benefits. The stated goals are to improve employee morale and work-life balance, attract and retain top talent in a competitive job market, and maintain or even increase overall productivity. Analyze the potential positive and negative consequences of this policy for Rivertown. Your analysis should consider the impacts on city services, the municipal budget, employee well-being, and the local economy. Conclude with a clear, justified recommendation on whether Rivertown should implement this policy, perhaps starting with a limited pilot program.
Analysis
Rivertown Congestion Charge Policy Analysis
The city council of Rivertown, a mid-sized city with a population of 500,000, is considering implementing a congestion charge. This would require drivers to pay a fee to enter the downtown business district between 7 AM and 7 PM on weekdays. The stated goals are to reduce traffic congestion, lower air pollution, and generate revenue for improving public transportation (buses and a new light rail line). Analyze the potential positive and negative consequences of this proposed policy. Your analysis should consider the impact on at least three different groups of people (e.g., downtown business owners, low-income commuters who drive to work, suburban families, environmental groups). Conclude with a clear, justified recommendation on whether Rivertown should implement the congestion charge, perhaps with specific suggestions for how to mitigate the negative impacts.
Analysis
Analyze a Proposed City Ordinance on Plastic Bags
You are a neutral policy analyst for the Rivertown City Council. Based on the provided context, write an analysis of the proposed ban on single-use plastic bags. Your analysis should:
1. Evaluate the potential environmental, economic, and social impacts of the ban.
2. Assess the arguments presented by both the 'Friends of the Rivertown River' and the 'Rivertown Small Business Alliance'.
3. Conclude with a clear, justified recommendation to the City Council. Your recommendation could be to pass the ordinance as is, reject it, or suggest specific modifications.
Analysis
Evaluating Evidence in a Product Recall Decision
A consumer electronics company, VoltTech, manufactures a popular portable phone charger called the PowerPak 3000. Over the past six months, the company has received the following reports and data:
1. Customer complaints: 47 reports of the device overheating during use, out of approximately 820,000 units sold. Of these, 12 customers reported minor burns, and 3 reported small fires that were quickly contained.
2. Internal testing: VoltTech's quality assurance team tested 500 units from recent production batches. They found that 2.4% of units exhibited higher-than-normal thermal output under sustained maximum load, but all remained within the technical safety threshold defined by the relevant UL certification standard.
3. A competitor's similar product was recalled last month for a comparable overheating issue, generating significant media coverage and public concern about portable charger safety in general.
4. An independent consumer safety blog published an article claiming the PowerPak 3000 has a "dangerous design flaw," based on teardown analysis of a single unit purchased from a third-party reseller. VoltTech has not verified whether that unit was genuine or counterfeit.
5. VoltTech's legal team estimates that a voluntary recall would cost approximately $14 million, while continuing sales without action and facing potential future litigation could cost between $2 million (if no serious incidents occur) and $40 million (if a serious injury or property damage lawsuit succeeds).
Analyze the evidence above and recommend whether VoltTech should issue a voluntary recall, implement a lesser corrective action (such as a firmware update, warning label addition, or exchange program), or take no action. Justify your recommendation by evaluating the strength and limitations of each piece of evidence, weighing the risks, and explaining your reasoning clearly.
Analysis
Urban Mobility Policy Analysis for Rivertown
Analyze the three proposed transportation policies for the city of Rivertown, as described in the context. Evaluate the pros and cons of each option based on the provided city details. Conclude by recommending the most suitable policy (or combination of policies) for Rivertown and provide a clear justification for your choice.
Analysis
Select the Most Promising School Lunch Reform
A public school district can fund only one lunch reform for the next two years. Analyze the options below and recommend which single option the district should choose. Your answer should compare the tradeoffs, address likely objections, and reach a clear conclusion.
District goals:
1. Improve student nutrition
2. Increase the number of students actually eating school lunch
3. Keep implementation realistic within two years
4. Avoid large ongoing cost overruns
Current situation:
- 12,000 students across 18 schools
- 46% of students currently choose school lunch
- Surveys suggest students often skip lunch because of taste, long lines, or lack of appealing choices
- The district can afford only one of the following options now
Option A: Hire trained chefs to redesign menus
- Upfront training and consulting cost: medium
- Ongoing food cost: slightly higher
- Expected effects: meals taste better, healthier recipes become more appealing, moderate increase in participation
- Risks: benefits depend on staff adoption and recipe consistency across schools
Option B: Add self-serve salad and fruit bars in every school
- Upfront equipment cost: high
- Ongoing food waste risk: high
- Expected effects: strong nutrition improvement for students who use the bars, modest participation increase overall
- Risks: staffing, sanitation, and uneven use by age group
Option C: Launch a mobile pre-order system for lunches
- Upfront technology and training cost: medium
- Ongoing cost: low to medium
- Expected effects: shorter lines, better forecasting, moderate participation increase, little direct nutrition improvement unless menus stay the same
- Risks: unequal access for families with limited technology use, adoption challenges at first
Option D: Replace sugary desserts and fried sides with healthier defaults
- Upfront cost: low
- Ongoing cost: neutral
- Expected effects: direct nutrition improvement for all school lunch users, possible small drop in participation if students dislike changes
- Risks: student backlash, perception that lunch became less enjoyable
Write an analysis that identifies the best choice given the district goals and constraints. Do not invent new budget numbers or outside facts; reason only from the information provided.