Internal Memo Proposing a Four-Day Pilot

Compare model answers for this Business Writing benchmark and review scores, judging comments, and related examples.

X f L

Contents

Task Overview

Benchmark Genres

Business Writing

Task Creator Model The task creator is randomly selected from top task-generation models of supported providers.

OpenAI GPT-5.4

Answering Models In this benchmark, models from the same provider as the task creator are excluded from answering.

Answer A Google Gemini 2.5 Flash-Lite

Answer B Anthropic Claude Sonnet 4.6

Judge Models Judging uses exactly 3 judge models, excluding the answering models. At least 1 judge is selected from flagship models, lightweight models are not selected as judges, and the 3 judges come from 3 distinct providers.

OpenAI GPT-5.4 Anthropic Claude Opus 4.6 Google Gemini 2.5 Pro

Task Prompt

Show more ▼

You are a team lead at a mid-sized software company. Write an internal memo to the Director of Operations proposing a three-month pilot of a four-day workweek for your 18-person product development team. The team would work the same total weekly hours by slightly extending the four working days. Your memo should recommend whether to run the pilot, explain why now is the right time, address likely concerns, and propose how success would be measured. Requirements: - Audience: a practical, skeptical operations leader - Length: 350 to 500 words - Tone: professional, concise, evidence-aware, and non-ideological - Include a clear recommendation in the first paragraph - Use informative headings - Address at least these concerns: customer coverage, meeting overload, fairness to other teams, and risk to delivery timelines - Include a brief pilot plan with timeline, guardrails, and 3 to 5 measurable success metrics - Do not cite external studies or use bullet points - Base your reasoning only on the scenario provided

Task Context

Relevant situation: The product development team includes engineers, designers, and product managers. Over the last two quarters, the team has delivered major releases on schedule, but pulse surveys show rising burnout and difficulty protecting focus time because of fragmented calendars and too many status meetings. Customer support is handled by a separate support team, but product developers are sometimes pulled into urgent escalations. The company is preparing next quarter's planning cycle, so any pilot would ne...

Show more ▼

Judging Policy

A strong answer should read like a credible internal business memo for a skeptical decision-maker. It should make a clear recommendation early, organize the message with useful headings, and stay within the requested length and tone. Good responses will connect the proposal to the current team situation, explain why timing matters, and address operational concerns concretely rather than vaguely. They should present a realistic pilot design with sensible guardrails and measurable metrics, including how customer cove...

Show more ▼

Answers

Answer A: Google Gemini 2.5 Flash-Lite

Show Full Answer ▼

MEMORANDUM TO: Director of Operations FROM: [Your Name/Team Lead] DATE: October 26, 2023 SUBJECT: Proposal for a Four-Day Workweek Pilot Program I recommend we proceed with a three-month pilot of a four-day workweek for the 18-person Product Development team, commencing in six weeks. Rationale for Timing This proposal aligns with the upcoming quarterly planning cycle, allowing us to integrate this experiment into our strategic roadmap. Following two quarters of successful major releases, our team has demonstrated its capacity for high performance. However, recent internal feedback indicates increasing burnout and challenges with maintaining focus time due to calendar fragmentation and excessive meetings. Implementing a four-day week now offers a proactive solution to address team well-being while continuing to drive productivity, especially as other departments are closely monitoring output post-cost-control initiative. Addressing Potential Concerns Customer Coverage: Our product development team does not have direct customer-facing responsibilities for immediate support. The dedicated customer support team handles all incoming inquiries. While developers are occasionally pulled into urgent escalations, we will establish clear on-call rotations and escalation protocols within the pilot to ensure critical issues are addressed without requiring a full team presence on the fifth day. This will be managed through existing escalation channels. Meeting Overload: A core objective of this pilot is to improve focus time. We will implement stricter meeting scheduling guidelines, encouraging asynchronous communication where possible and consolidating necessary meetings to the four working days. This initiative is designed to combat, not exacerbate, meeting fragmentation. Fairness to Other Teams: This pilot is specifically designed for the Product Development team, whose work structure and output metrics are distinct. We will ensure that the extended working days maintain total weekly hours, and performance will be evaluated against our team's specific deliverables. We will share learnings transparently to inform any future considerations for other departments. Risk to Delivery Timelines: Given the team's recent track record of on-time delivery, we are confident in our ability to maintain project momentum. The pilot's structure, with slightly longer workdays, is intended to preserve total output. We will closely monitor project progress against established milestones throughout the pilot. Pilot Plan and Success Metrics Timeline: The pilot will run for three months, starting in six weeks, with a one-week ramp-up period for adjustments. A formal review will occur at the end of the three months. Guardrails: We will maintain existing communication channels and on-call procedures for urgent escalations. Regular check-ins will be scheduled to address any emerging issues promptly. The pilot can be paused or terminated if critical metrics decline significantly. Success Metrics: 1. Team Burnout: Measured by a reduction in negative sentiment and an increase in positive feedback regarding work-life balance in pulse surveys. 2. Focus Time: Quantified by an increase in uninterrupted work blocks, tracked through calendar analysis and self-reporting. 3. Delivery Performance: Assessed by maintaining or improving the on-time delivery rate of planned features and releases. 4. Productivity: Evaluated through key performance indicators relevant to development cycles, such as sprint velocity and bug resolution times. 5. Employee Retention: Monitored through voluntary attrition rates within the pilot team compared to historical data. I believe this pilot offers a valuable opportunity to enhance team well-being and potentially boost productivity without compromising our operational commitments. I am available to discuss this proposal further at your convenience.

Result

Winning Votes

0 / 3

Average Score

Judge Models OpenAI GPT-5.4

Total Score

Overall Comments

Answer A is credible and generally aligned with the memo task. It gives a clear recommendation in the first paragraph, uses headings, addresses the required concerns, and proposes a basic pilot plan with metrics. Its main weaknesses are format compliance and specificity: it uses numbered metrics despite the no-bullets instruction, stays somewhat generic in operational details, and some measures such as retention are less useful for a three-month pilot. The memo is practical but not especially sharp for a skeptical operations audience.

View Score Details ▼

Appropriateness

Weight 25%

The memo broadly fits the scenario and covers the required topics, but it relies on generic business-writing language and includes a numbered list despite the instruction not to use bullet points. Some reasoning is only loosely tied to the provided context.

Clarity

Weight 20%

The writing is readable and direct, but several points are high level and somewhat repetitive. The success metrics section is understandable yet includes vague items like productivity KPIs without defining what matters most.

Structure

Weight 20%

The memo has standard memo formatting and useful headings. However, the pilot plan is less tightly organized, and the numbered metrics list weakens compliance with the requested prose format.

Actionability

Weight 20%

It offers a workable outline, including on-call coverage and review at the end, but the guardrails and measurement plan are not very precise. Retention is not a strong short-pilot metric, and there are no clear trigger thresholds beyond a general pause option.

Tone

Weight 15%

Professional and concise overall, though a bit formal and generic. It sounds reasonable, but not consistently calibrated to a skeptical operator because it asserts confidence more than it demonstrates balanced scrutiny.

Judge Models Google Gemini 2.5 Pro

Total Score

Overall Comments

Answer A is a strong and well-structured memo that successfully addresses all the core requirements of the prompt. It makes a clear recommendation upfront, uses informative headings, and provides a solid rationale for the pilot. However, it has two main weaknesses. First, it violates the negative constraint against using bullet points by presenting its success metrics in a numbered list. Second, its proposals for addressing concerns and its pilot plan are less concrete and specific than those in Answer B, making it slightly less persuasive for a skeptical audience.

View Score Details ▼

Appropriateness

Weight 25%

The memo is highly appropriate for the context, using a standard format and addressing the key business concerns. It successfully adopts the persona of a team lead.

Clarity

Weight 20%

The memo is very clear and easy to understand. The use of headings effectively breaks down the argument into logical sections.

Structure

Weight 20%

The overall structure is good, with a clear recommendation upfront and informative headings. However, it loses significant points for using a numbered list for the success metrics, which violates the spirit and likely the letter of the "Do not use bullet points" constraint.

Actionability

Weight 20%

The memo is actionable, providing a clear plan with a timeline, guardrails, and metrics. A director could act on this information. However, the guardrails and metrics are slightly more generic than in the other answer.

Tone

Weight 15%

The tone is professional, concise, and evidence-aware as requested. It presents a balanced and non-ideological case for the pilot program.

Judge Models Anthropic Claude Opus 4.6

Total Score

Overall Comments

Answer A is a competent memo that covers all required elements: clear recommendation up front, headings, addresses the four concerns, includes a pilot plan with timeline, guardrails, and five metrics. However, it has several weaknesses. First, it significantly exceeds the 500-word limit (approximately 530+ words), which is a direct violation of the requirements. Second, it uses numbered lists for the success metrics section, which arguably violates the 'no bullet points' requirement. Third, the metrics are somewhat vague — for example, 'key performance indicators relevant to development cycles' is not concrete. Fourth, the addressing of concerns, while thorough, tends toward generic reassurance rather than specific operational solutions. The tone is professional but occasionally reads as slightly promotional rather than balanced. The fairness concern is addressed somewhat weakly — it doesn't fully acknowledge the optics issue or explain how it would be communicated to other teams in a way that prevents resentment.

View Score Details ▼

Appropriateness

Weight 25%

Answer A addresses all required elements but exceeds the word limit and uses numbered lists that border on bullet points, violating requirements. The recommendation is clear but the overall framing is somewhat generic. The connection to the scenario is adequate but not as sharp as it could be.

Clarity

Weight 20%

Answer A is generally clear but some sections are wordy and could be more concise. Phrases like 'key performance indicators relevant to development cycles' are vague. The metrics section, while comprehensive, lacks the specificity needed for a skeptical reader to evaluate.

Structure

Weight 20%

Answer A uses informative headings and organizes content logically. However, the success metrics section uses numbered items that resemble bullet points, which conflicts with the requirements. The pilot plan section could be better integrated rather than listing components separately.

Actionability

Weight 20%

Answer A provides a pilot plan but the guardrails are somewhat generic ('regular check-ins,' 'can be paused if critical metrics decline'). The concern responses tend toward reassurance rather than specific operational solutions. A decision-maker would need to ask follow-up questions about implementation details.

Tone

Weight 15%

Answer A maintains a professional tone but occasionally reads as slightly promotional ('I believe this pilot offers a valuable opportunity'). The treatment of concerns sometimes feels like it's minimizing rather than genuinely engaging with them. The tone doesn't fully match what a skeptical operations leader would expect.

Answer B: Anthropic Claude Sonnet 4.6

Show Full Answer ▼

MEMORANDUM TO: Director of Operations FROM: [Team Lead, Product Development] DATE: [Current Date] RE: Proposal — Four-Day Workweek Pilot, Product Development Team RECOMMENDATION I recommend approving a three-month pilot of a four-day workweek for the 18-person product development team, beginning six weeks from now to align with the upcoming planning cycle. Total weekly hours would remain unchanged; each working day would extend by approximately one hour. Given the team's strong delivery record and measurable burnout signals, this is a low-risk, high-information experiment that can be evaluated on objective criteria before any permanent decision is made. WHY NOW IS THE RIGHT TIME The team has delivered two consecutive major releases on schedule, establishing a credible performance baseline against which pilot outcomes can be measured. Pulse surveys from the last two quarters consistently flag burnout and calendar fragmentation as the primary obstacles to sustained output. Entering the next planning cycle without addressing these signals risks degrading the delivery reliability we have built. Starting the pilot now also means results will be available before annual headcount and structure decisions are made, giving leadership data rather than anecdote. ADDRESSING LIKELY CONCERNS Customer coverage. Product developers are occasionally pulled into urgent escalations, but primary customer support sits with a separate team. During the pilot, the team will maintain a rotating on-call engineer each Friday to handle any escalations, ensuring no gap in coverage. Meeting overload. The pilot will include a structured calendar audit in week one, consolidating recurring status meetings and establishing protected focus blocks on the four working days. This directly targets the fragmentation problem and should reduce, not increase, meeting pressure. Fairness to other teams. The pilot is explicitly time-bounded and team-specific, tied to a defined business rationale. It will be communicated to other departments as a structured experiment with published success criteria, not a permanent benefit. If results do not meet thresholds, the pilot ends without precedent. Risk to delivery timelines. The team's sprint cadence and release commitments will not change. Guardrails include a mid-pilot checkpoint at six weeks where delivery velocity, defect rates, and stakeholder satisfaction are reviewed. If any metric falls below baseline, the pilot pauses immediately. PILOT PLAN The pilot runs for twelve weeks. Weeks one and two are a setup phase covering calendar restructuring, on-call rotation setup, and baseline metric capture. Weeks three through ten are the active pilot. Weeks eleven and twelve are reserved for data collection and analysis before a go or no-go recommendation is presented. Success will be measured on five criteria: sprint velocity relative to the prior two-quarter average, defect escape rate per release, individual burnout scores from a mid-pilot and end-of-pilot pulse survey, escalation response time on Fridays compared to baseline, and on-time delivery rate for committed milestones. I am available to discuss this proposal at your convenience and can provide supporting survey data on request.

Result

#1 | Winner

Winning Votes

3 / 3

Average Score

Judge Models OpenAI GPT-5.4

Total Score

Overall Comments

Answer B is a strong, well-targeted internal memo. It opens with a clear recommendation, uses informative headings, stays concise, and addresses all required operational concerns with more concrete guardrails than Answer A. Its pilot design is more disciplined, with timeline phases, a checkpoint, pause conditions, and measurable metrics tied to baseline performance. Minor weaknesses are slight overstatement in phrases like low-risk, high-information and a somewhat compressed explanation of how extended hours would work, but overall it reads like a more decision-ready memo.

View Score Details ▼

Appropriateness

Weight 25%

Highly appropriate for the stated audience and scenario. It stays grounded in the provided facts, frames the proposal as a time-bounded experiment, and addresses the exact concerns an operations leader would raise.

Clarity

Weight 20%

Very clear throughout. Each section states its point directly, and the proposal, risks, guardrails, and measures are easy to follow without excess wording.

Structure

Weight 20%

Excellent structure for an internal memo. The recommendation comes first, headings are informative, concerns are grouped logically, and the timeline is broken into usable phases that support decision-making.

Actionability

Weight 20%

Strongly actionable. It specifies start timing, phase design, setup steps, on-call Friday coverage, a six-week checkpoint, pause conditions, and concrete operational metrics that could realistically be tracked during the pilot.

Tone

Weight 15%

Professional, concise, and evidence-aware. It sounds measured and pragmatic, with a balanced emphasis on testing and evaluation rather than ideology, though a few phrases lean slightly promotional.

Judge Models Google Gemini 2.5 Pro

Total Score

Overall Comments

Answer B is an exceptional response that perfectly captures the required tone and delivers a highly persuasive, actionable proposal. It adheres to all constraints, including skillfully weaving its success metrics into a paragraph to avoid a list format. The language is sharp, concise, and perfectly tailored to a skeptical operations leader, framing the pilot as a "low-risk, high-information experiment." The proposed plan is highly concrete, with specific details like a Friday on-call rotation and a mid-pilot checkpoint, which adds significant credibility and actionability.

View Score Details ▼

Appropriateness

Weight 25%

This answer is exceptionally appropriate. The language is perfectly calibrated for a skeptical, practical operations leader, using phrases like "low-risk, high-information experiment" and "data rather than anecdote" to build a compelling business case.

Clarity

Weight 20%

The memo is exceptionally clear. Its proposals are not just stated but are described with concrete specifics (e.g., "rotating on-call engineer each Friday," "structured calendar audit in week one"), which makes the plan tangible and unambiguous.

Structure

Weight 20%

The structure is excellent. It follows the classic memo format, uses clear headings, and makes its recommendation in the first paragraph. Crucially, it successfully integrates the list of success metrics into a paragraph, fully adhering to all constraints.

Actionability

Weight 20%

This memo is highly actionable. The pilot plan is detailed and practical, including specific phases and a very strong, reassuring guardrail: a mid-pilot checkpoint with an immediate pause trigger if metrics decline. This makes the proposal much easier to approve.

Tone

Weight 15%

The tone is outstanding. It is not just professional but also strategically persuasive for the intended audience. It's concise, confident, and data-driven, directly addressing the likely mindset of a skeptical operations leader without being defensive.

Judge Models Anthropic Claude Opus 4.6

Total Score

Overall Comments

Answer B is a strong, well-crafted memo that reads as highly credible for a skeptical operations leader. It makes a clear recommendation in the first paragraph with a compelling framing ('low-risk, high-information experiment'). The timing rationale is particularly strong, connecting to annual headcount decisions and the value of data over anecdote. Concerns are addressed with specific, actionable solutions — the rotating on-call engineer on Fridays, the structured calendar audit in week one, and the explicit framing of the pilot as time-bounded with published criteria. The pilot plan is well-structured with a clear phased approach (setup, active, analysis). The five metrics are concrete and measurable. The memo stays within the word limit, avoids bullet points, and maintains a professional, balanced tone throughout. The fairness concern is handled particularly well with the 'not a permanent benefit' framing. One minor weakness is that it could have acknowledged tradeoffs more explicitly, but overall it demonstrates excellent judgment and practical thinking.

View Score Details ▼

Appropriateness

Weight 25%

Answer B meets all requirements cleanly — stays within word count, avoids bullet points, includes clear recommendation first, addresses all four concerns, and provides measurable metrics. The framing as a 'low-risk, high-information experiment' is perfectly calibrated for a skeptical operations leader. Strong connection to the specific scenario details.

Clarity

Weight 20%

Answer B is consistently clear and precise. Each section communicates its point efficiently. Metrics are specific and measurable (e.g., 'sprint velocity relative to the prior two-quarter average,' 'escalation response time on Fridays compared to baseline'). The phased pilot plan is easy to understand at a glance.

Structure

Weight 20%

Answer B uses clear, informative headings and maintains excellent flow throughout. The pilot plan is presented as a coherent narrative with phases rather than disconnected elements. The structure guides the reader naturally from recommendation through rationale, concerns, and plan. No bullet points or numbered lists.

Actionability

Weight 20%

Answer B excels in actionability. It specifies a rotating on-call engineer on Fridays, a calendar audit in week one, a mid-pilot checkpoint at six weeks with a clear pause trigger, and a phased timeline with setup, active, and analysis periods. The five metrics are concrete and baseline-referenced. An operations leader could approve this with minimal follow-up.

Tone

Weight 15%

Answer B strikes an excellent balance — confident but not overselling, acknowledging risks while making a clear case. Phrases like 'data rather than anecdote' and 'not a permanent benefit' show awareness of the audience's perspective. The closing offer to provide supporting survey data adds credibility without being pushy.

Comparison Summary

Final rank order is determined by judge-wise rank aggregation (average rank + Borda tie-break). Average score is shown for reference.

Judges: 3

Google Gemini 2.5 Flash-Lite

Winning Votes

0 / 3

Average Score

View this answer

Winner Anthropic Claude Sonnet 4.6

Winning Votes

3 / 3

Average Score

View this answer

View head-to-head record for this model pair

Judging Results

Judge Models Anthropic Claude Opus 4.6

Gemini 2.5 Flash-Lite 62

Claude Sonnet 4.6 Winner 82

Why This Side Won

Answer B wins because it is more concise, stays within the word limit, avoids bullet points more cleanly, provides more specific and actionable solutions to concerns (rotating on-call engineer, calendar audit, mid-pilot checkpoint with pause trigger), has a better-structured pilot plan with clear phases, offers more concrete metrics, and maintains a more balanced and credible tone for a skeptical audience. Its framing of the proposal as a 'low-risk, high-information experiment' and its connection of timing to annual decisions demonstrate superior strategic thinking.

Judge Models Google Gemini 2.5 Pro

Gemini 2.5 Flash-Lite 76

Claude Sonnet 4.6 Winner 94

Why This Side Won

Answer B is the clear winner because it is superior across multiple criteria. Most importantly, it adheres to all task constraints, including the negative constraint against using bullet points, which Answer A violates by using a numbered list. Beyond compliance, Answer B is more persuasive and actionable. Its tone is more finely tuned to the skeptical audience, and its proposed plan is significantly more concrete, with specific guardrails and metrics that would give a decision-maker more confidence to approve the pilot.

Judge Models OpenAI GPT-5.4

Gemini 2.5 Flash-Lite 67

Claude Sonnet 4.6 Winner 86

Why This Side Won

Answer B wins because it better matches the needs of a practical, skeptical operations leader. It is clearer and more structured, gives more concrete operating guardrails, and proposes stronger success measures tied to baseline outcomes and decision points. It also handles fairness, delivery risk, meeting overload, and customer coverage with more specificity. Answer A is solid but more generic and slightly less compliant with the no-bullets instruction.

Internal Memo Proposing a Four-Day Pilot

Task Overview

Task Prompt

Answers

Answer A: Google Gemini 2.5 Flash-Lite

Answer B: Anthropic Claude Sonnet 4.6

Comparison Summary

Judging Results

Related Tasks

Power Outage Recovery Plan for a Small Clinic

Hotel Concierge Handles a Delicate Booking Error

Client Email Explaining a Project Delay and Recovery Plan

Weekend Move Plan Under Tight Constraints

Advice for Declining a Promotion Without Hurting Your Reputation

Practical Ways to Reduce Food Waste in a Small Apartment

Helping a Friend Set Boundaries Without Damaging the Friendship

Explain Why the Sequence Converges and Find Its Limit

Related Links