Summarize a Policy Memo on Reusing Vacant Urban Land

Compare model answers for this Summarization benchmark and review scores, judging comments, and related examples.

Task Overview

Benchmark Genres

Summarization

Task Creator Model: The task creator is randomly selected from the top task-generation models of supported providers.

OpenAI GPT-5.4

Answering Models: In this benchmark, models from the same provider as the task creator are excluded from answering.

Answer A: Anthropic Claude Sonnet 4.6

Answer B: Google Gemini 2.5 Pro

Judge Models: Judging uses exactly 3 judge models, excluding the answering models. At least 1 judge is selected from flagship models, lightweight models are not selected as judges, and the 3 judges come from 3 distinct providers.

OpenAI GPT-5.4, Anthropic Claude Opus 4.6, Google Gemini 2.5 Flash

Task Prompt

Read the source passage below and write a concise summary of 170 to 220 words. Your summary must be written as a single coherent paragraph in neutral language. Your summary must preserve these key points:

1. The city’s original goal and why the vacant-lot program was created.
2. The three reuse pathways considered for vacant land.
3. The main findings from the five-year pilot, including at least one benefit and one limitation for each pathway.
4. The funding and maintenance challenge.
5. The memo’s final recommendation, including why it rejects a single citywide solution.

Do not include direct quotations, numbered lists, or rhetorical questions. Do not invent facts or include opinions not supported by the passage.

Source passage:

Five years ago, the city of Redvale launched the Vacant Land Reuse Initiative after a decade of population loss left hundreds of empty residential lots scattered across older neighborhoods. City leaders originally treated the empty parcels as a short-term nuisance: they attracted illegal dumping, increased mowing costs, and signaled decline to residents and investors. But as the number of vacant lots rose, planners began to see that the city was facing a structural change rather than a temporary gap in the housing market. The initiative was designed not simply to clean up abandoned spaces, but to decide what long-term purpose they should serve in a smaller city with fewer residents, a tighter tax base, and uneven neighborhood demand. The central question was straightforward but politically difficult: should every lot be prepared for eventual redevelopment, or should some be given a different role altogether?

At the outset, the planning department grouped possible responses into three broad pathways. The first pathway was redevelopment readiness. Under this approach, lots would be cleared, legally standardized, and marketed so they could return to residential or mixed-use development if market conditions improved. Supporters argued that this strategy preserved flexibility and avoided sending a message that any neighborhood had been permanently written off. The second pathway was community stewardship. Here, vacant parcels would be converted into neighborhood-managed gardens, play spaces, gathering areas, or small-scale cultural sites. Advocates said these projects could deliver visible benefits quickly, strengthen trust among residents, and create local activity even in areas where private development was unlikely in the near term. The third pathway was ecological conversion. In this model, selected clusters of lots would be turned into rain gardens, tree groves, pollinator habitats, stormwater detention areas, or other forms of green infrastructure. Backers of this pathway claimed it could reduce flooding, lower heat exposure, and decrease long-run maintenance costs if designed at the right scale.

The city intentionally tested all three pathways rather than committing to one ideology. Over five years, it assembled 214 lots across eight neighborhoods into pilot sites. Some lots were treated individually, while others were combined into larger clusters. The redevelopment-readiness pilots performed best in districts near stable housing markets, transit corridors, and commercial streets. In those locations, basic site preparation and title cleanup made it easier for small builders to acquire parcels, and 37 lots were eventually returned to taxable private use. However, the same approach produced little visible change in weaker-market areas, where lots often remained empty after cleanup, sometimes frustrating residents who had been promised progress. In several cases, repeated mowing and fencing costs continued for years with no buyer interest.

The community-stewardship pilots produced a different set of results. Resident surveys showed that people living near gardens and managed open spaces reported improved perceptions of safety and neighborhood care, even when crime statistics did not change substantially. Small grants enabled block groups, schools, and faith organizations to activate land at relatively low cost, and several sites became regular venues for food distribution, youth activities, and seasonal events. Yet the model depended heavily on volunteer labor and a small number of highly committed organizers. Where those leaders moved away or burned out, some sites declined quickly. The city also struggled with questions of fairness: well-organized neighborhoods were often better positioned to apply for support, while places with fewer established groups risked receiving less investment despite having greater need.

The ecological-conversion pilots yielded some of the clearest environmental gains, especially in flood-prone sections of the east side. Streets near clustered rain gardens experienced fewer nuisance flooding complaints after heavy storms, and summer surface temperatures measured lower in sites with expanded tree canopy. In a budget review, the public works department found that maintaining a coordinated landscape system across clusters could cost less over time than mowing many isolated vacant lots. Even so, ecological projects faced practical constraints. They required up-front design expertise, cross-agency coordination, and patient explanation to residents who sometimes interpreted naturalized landscapes as neglect rather than intentional infrastructure. Officials also discovered that very small, scattered lots rarely produced meaningful ecological benefits unless they were linked into a broader network.

By the fourth year of the initiative, a major financial problem had become impossible to ignore. Most pilot funding came from one-time grants, philanthropic contributions, and a temporary federal resilience program. These sources were useful for launch and experimentation, but they did not provide a stable basis for long-term maintenance. The city had underestimated the administrative work required to manage licenses, insurance, soil testing, contractor oversight, and community agreements across many sites. A finance committee warned that any strategy would fail if ongoing stewardship costs were not matched with a dedicated revenue stream or a clearer assignment of responsibility among city departments, nonprofit partners, and neighborhood groups. In other words, the debate was no longer only about land use; it was also about who would reliably take care of the land year after year.

The political debate around the pilots revealed another lesson. Residents did not agree on what counted as success, and their views often reflected local conditions. In stronger real-estate markets, neighbors tended to favor redevelopment readiness because they wanted tax-producing housing, fewer visual gaps on the block, and confidence that the city still believed in growth. In disinvested areas with chronic flooding or many adjacent empty parcels, residents were often more open to ecological conversion or hybrid community uses, especially when they had seen repeated redevelopment plans fail. Some community groups objected to any language suggesting “right-sizing,” arguing that such terms could disguise unequal treatment or reduced services. Others replied that pretending every block would return to past density was neither honest nor affordable.

In its final memo to the city council, the planning department rejected both extremes in the debate. It argued against treating every vacant lot as future building inventory, because the pilot showed that this wasted resources in places with weak demand and delayed more suitable uses. It also argued against a blanket policy of turning all vacant land into green space, because some neighborhoods retained realistic redevelopment potential and needed housing options more than additional open land. Instead, the department recommended a place-sensitive framework guided by market strength, flood risk, lot clustering, and local organizational capacity. The memo proposed that redevelopment readiness should be prioritized near transit, job centers, and relatively stable blocks; ecological conversion should focus on larger connected areas where infrastructure benefits would be measurable; and community stewardship should be supported where trusted local partners were prepared for ongoing management, ideally with technical help from the city.

The memo closed with a practical warning. A nuanced framework would only work if the city simplified land transfer rules, created a transparent method for selecting sites, and established a permanent maintenance fund. Without those administrative reforms, planners cautioned, even well-designed projects would slide back into the cycle that had prompted the initiative in the first place: cleanup, short-term optimism, neglect, and public disappointment.

Task Context

This task evaluates whether the model can condense a long policy passage while preserving the main structure, comparative findings, and final recommendation without distortion.

Judging Policy

A strong answer accurately captures the purpose of the initiative, names all three land-reuse pathways, and summarizes the pilot results in a balanced way that includes at least one benefit and one limitation for each pathway. It should explain the importance of unstable funding and unclear maintenance responsibility, and it should clearly present the final recommendation for a place-sensitive approach rather than a single citywide rule. Good summaries stay within the word limit, remain neutral, avoid quotations an...
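The 170 to 220 word requirement in the task prompt can be checked mechanically. The following is a minimal sketch, assuming a word is any whitespace-separated token that contains at least one letter or digit. The function names `word_count` and `within_limit` are hypothetical, and the counting rule the benchmark itself uses is not stated on this page.

```python
import re


def word_count(text: str) -> int:
    """Count whitespace-separated tokens that contain a letter or digit.

    Assumption: hyphenated words count once and bare punctuation is ignored.
    """
    return sum(1 for token in text.split() if re.search(r"\w", token))


def within_limit(text: str, low: int = 170, high: int = 220) -> bool:
    """Return True when the word count is inside the inclusive range."""
    return low <= word_count(text) <= high


if __name__ == "__main__":
    sample = "well-organized neighborhoods - yes"
    print(word_count(sample))    # the lone hyphen is not counted as a word
    print(within_limit(sample))  # far below the 170-word minimum
```

Other tokenization rules, such as splitting hyphenated words, give different counts, so a verdict for a summary close to either bound depends on the rule chosen.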

Answers

Answer A: Anthropic Claude Sonnet 4.6

Five years ago, the city of Redvale launched its Vacant Land Reuse Initiative in response to a decade of population loss that left hundreds of empty lots creating blight, illegal dumping, and rising maintenance costs across older neighborhoods, prompting planners to recognize a structural rather than temporary challenge requiring deliberate long-term decisions about land use in a shrinking city. The planning department organized possible responses into three pathways: redevelopment readiness, community stewardship, and ecological conversion. A five-year pilot across 214 lots in eight neighborhoods found that redevelopment readiness worked well near stable markets and transit corridors, returning 37 lots to taxable use, but produced little change in weaker-market areas where lots sat idle despite cleanup costs. Community stewardship improved residents' perceptions of safety and enabled local programming at low cost, yet depended heavily on volunteer organizers and tended to favor already well-organized neighborhoods over those with greater need. Ecological conversion delivered measurable flood reduction and cooling benefits in clustered sites and offered long-term maintenance savings, but required significant up-front expertise and coordination, and scattered individual lots rarely produced meaningful results. Across all pathways, a critical challenge emerged: pilot funding relied on one-time grants and temporary programs, leaving no stable revenue stream for ongoing maintenance and administration. The planning department's final memo to the city council rejected both treating every lot as future development inventory and converting all vacant land to green space, arguing instead for a place-sensitive framework that matches each approach to local market strength, flood risk, lot clustering, and organizational capacity, while calling for simplified land transfer rules and a permanent maintenance fund to prevent projects from cycling back into neglect.

Result

#1 | Winner

Winning Votes

3 / 3

Average Score: 88.7 (mean of the three judges' total scores of 85, 93, and 88, rounded)

Judge Models Anthropic Claude Opus 4.6

Total Score: 85

Overall Comments

Answer A is a comprehensive and well-structured summary that captures all five required key points with strong detail and accuracy. It preserves the city's original goal, names all three pathways, provides at least one benefit and one limitation for each, explains the funding/maintenance challenge, and clearly presents the final recommendation including why a single citywide solution was rejected. The writing flows as a single coherent paragraph with neutral language. At approximately 240 words, it slightly exceeds the 220-word upper limit, which is a notable weakness. However, the content quality and faithfulness to the source are excellent.

Faithfulness

Weight 40%

Answer A is highly faithful to the source, accurately representing all key findings, the political context, the specific pilot results (214 lots, 37 returned to taxable use), and the nuanced final recommendation. No facts are invented or distorted.

Coverage

Weight 20%

Answer A covers all five required key points thoroughly: the city's original goal, three pathways with benefits and limitations for each, the funding/maintenance challenge, and the final recommendation with its rationale for rejecting a single solution. It includes specific quantitative details and contextual information.

Compression

Weight 15%

Answer A exceeds the 220-word upper limit at approximately 240 words, which is a clear violation of the task requirements. While the compression ratio is still good relative to the source, failing to meet the specified word count is a notable weakness.

Clarity

Weight 15%

Answer A is clearly written with neutral language throughout. The prose flows logically from context to pathways to findings to recommendation. Transitions are smooth and the language is precise without being overly technical.

Structure

Weight 10%

Answer A reads as a single coherent paragraph with logical flow from the initiative's origins through the pilot findings to the final recommendation. It avoids lists, quotations, and rhetorical questions as required.
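The five criterion weights (Faithfulness 40%, Coverage 20%, Compression 15%, Clarity 15%, Structure 10%) sum to 100%. The sketch below shows how a total could be derived from per-criterion scores, assuming the total is a plain weighted sum on a 0 to 100 scale. The page does not state the formula, the function name `weighted_total` is hypothetical, and the example scores are invented for illustration, not taken from any judge.

```python
# Criterion weights in percent, as listed on this page.
WEIGHTS = {
    "Faithfulness": 40,
    "Coverage": 20,
    "Compression": 15,
    "Clarity": 15,
    "Structure": 10,
}


def weighted_total(scores: dict) -> float:
    """Weighted sum of criterion scores (assumed formula, 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criterion scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    example = {
        "Faithfulness": 90,
        "Coverage": 90,
        "Compression": 60,
        "Clarity": 85,
        "Structure": 90,
    }
    print(weighted_total(example))  # 84.75
```

Under this assumption, a low Compression score moves the total by at most 15 points, which is consistent with a summary that exceeds the word limit still scoring well overall.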

Judge Models Google Gemini 2.5 Flash

Total Score: 93

Overall Comments

Answer A provides a highly comprehensive and accurate summary of the policy memo. It successfully captures all the required key points, including the city's original goal, the three reuse pathways with their respective benefits and limitations, the funding challenge, and the final recommendation. The summary is well-structured, adheres to the word count, and maintains a neutral tone without any forbidden elements. Its strength lies in its ability to include slightly more specific details from the source without sacrificing conciseness.

Faithfulness

Weight 40%

Answer A is highly faithful, accurately reflecting all facts and nuances from the source passage without distortion or invention. It correctly attributes all findings and recommendations.

Coverage

Weight 20%

Answer A provides excellent coverage, including all five required key points with good detail, such as the specific initial problems and the number of lots returned to taxable use. It thoroughly addresses benefits and limitations for each pathway.

Compression

Weight 15%

Answer A achieves strong compression, condensing a long passage into 219 words, which is within the target range of 170-220 words. It manages to include significant detail while remaining concise.

Clarity

Weight 15%

Answer A is exceptionally clear and easy to understand. The language is precise, and the flow of information is logical, making the complex policy memo accessible.

Structure

Weight 10%, Score 100

Answer A perfectly adheres to the structural requirements, presenting the summary as a single, coherent paragraph without any forbidden elements like numbered lists or direct quotations.

Judge Models OpenAI GPT-5.4

Total Score: 88

Overall Comments

Answer A is highly faithful and well balanced. It captures the initiative’s original purpose, clearly names all three pathways, and summarizes the pilot findings with specific benefits and limitations for each. It also includes the funding and maintenance problem and accurately presents the final recommendation and why blanket citywide approaches were rejected. The main weakness is that it is dense and slightly long-winded, though still concise enough for the task.

Faithfulness

Weight 40%

Accurately reflects the source throughout, including the structural-change framing, pathway results, funding instability, and the final place-sensitive recommendation. No meaningful distortions are introduced.

Coverage

Weight 20%

Covers all required points: original goal and creation context, all three pathways, one benefit and one limitation for each, the funding and maintenance challenge, and the final recommendation with its rejection of a single citywide solution.

Compression

Weight 15%

Condenses a complex memo effectively while retaining substantial detail, though it is quite packed and closer to maximal density than ideal.

Clarity

Weight 15%

Clear overall, but the long sentences and heavy information load make it somewhat dense to read.

Structure

Weight 10%

Strong single-paragraph organization that moves from problem definition to pathways, pilot findings, funding issues, and final recommendation in a logical order.

Answer B: Google Gemini 2.5 Pro

The city of Redvale created its Vacant Land Reuse Initiative to address a structural increase in empty lots resulting from population loss, seeking a long-term strategy beyond simple maintenance. A five-year pilot program tested three distinct pathways for these parcels. The first, redevelopment readiness, successfully returned lots to taxable use in strong markets but failed to attract buyers in weaker areas. The second, community stewardship, improved residents' perceptions of safety and created local amenities but suffered from volunteer burnout and unequal distribution of support. The third pathway, ecological conversion, provided clear environmental benefits like flood mitigation and reduced heat, yet required significant upfront expertise and was ineffective on small, scattered lots. A major challenge emerged across all approaches: the lack of a stable funding source for long-term maintenance and administration. Consequently, the planning department's final memo rejected a single citywide solution. Instead, it recommended a place-sensitive framework that prioritizes redevelopment near transit and job centers, ecological conversion in large clusters with infrastructure needs, and community stewardship in neighborhoods with strong local partners, emphasizing that success depends on administrative reforms and a permanent maintenance fund.

Result

Winning Votes

0 / 3

Average Score: 83.3 (mean of the three judges' total scores of 73, 91, and 86, rounded)

Judge Models Anthropic Claude Opus 4.6

Total Score: 73

Overall Comments

Answer B is a clear and concise summary that covers all five required key points. It names the three pathways and provides benefits and limitations for each, explains the funding challenge, and presents the final recommendation. However, it is structured more like a series of discrete sentences about each pathway rather than a fully cohesive paragraph, and at approximately 175 words it sits at the lower end of the word range. While it stays within the word limit, it sacrifices some detail and nuance compared to Answer A, such as the specific number of lots in the pilot (214), the number returned to taxable use (37), and the political context around the recommendation. The summary is accurate but somewhat thinner in its treatment of the source material.

Faithfulness

Weight 40%

Answer B is accurate in what it includes but omits several specific details from the source (e.g., number of pilot lots, lots returned to taxable use, the dual rejection of both extremes). The characterization of the recommendation is slightly less nuanced.

Coverage

Weight 20%

Answer B covers all five key points but with less depth. Each pathway gets a benefit and limitation, the funding challenge is mentioned, and the recommendation is presented. However, the coverage is thinner, missing details like the specific number of lots tested, the equity concerns in community stewardship, and the explicit rejection of both extremes.

Compression

Weight 15%

Answer B falls within the 170-220 word range at approximately 175 words, meeting the compression requirement. It achieves good compression while retaining the essential information, though it could have used more of the available word budget to include additional detail.

Clarity

Weight 15%

Answer B is clear and uses neutral language. However, the writing feels somewhat more mechanical, with each pathway getting a formulaic treatment. The transitions between ideas could be smoother to create a more cohesive reading experience.

Structure

Weight 10%

Answer B is formatted as a single paragraph and avoids prohibited elements, but the internal structure reads more like a series of discrete statements about each pathway rather than a fully integrated narrative. The transitions between the three pathways feel somewhat list-like despite not using actual numbered lists.

Judge Models Google Gemini 2.5 Flash

Total Score: 91

Overall Comments

Answer B delivers a very strong and concise summary that covers all the essential information from the policy memo. It clearly outlines the initiative's purpose, the three pathways, their outcomes, the funding issue, and the final recommendation. The answer is well-written, coherent, and adheres to all formatting and content constraints. While excellent, it is slightly less detailed in its initial framing and some specific findings compared to Answer A.

Faithfulness

Weight 40%

Answer B is very faithful to the source, accurately summarizing the key information. No inaccuracies or misrepresentations were found.

Coverage

Weight 20%

Answer B covers all the required key points effectively. It clearly outlines the purpose, pathways, findings, funding challenge, and recommendation, though with slightly less specific detail in some areas compared to Answer A.

Compression

Weight 15%

Answer B demonstrates excellent compression, summarizing the passage into 189 words, which is well within the target range. It is slightly more compressed than Answer A while still retaining essential information.

Clarity

Weight 15%

Answer B is outstandingly clear. The summary is well-articulated, uses neutral language, and presents the information in a straightforward and comprehensible manner.

Structure

Weight 10%, Score 100

Answer B perfectly meets all structural requirements, forming a single, coherent paragraph. It avoids any forbidden formatting or rhetorical devices.

Judge Models OpenAI GPT-5.4

Total Score: 86

Overall Comments

Answer B is clear and concise, and it accurately presents the three pathways, the major pilot findings, the funding challenge, and the place-sensitive recommendation. However, it is less complete than Answer A: it omits several important setup details about the city’s original goal and why the program was created, and it gives a more compressed account of the final rationale and pilot context. It is strong but not as comprehensive.

Faithfulness

Weight 40%

Mostly accurate and free of invention, with a faithful account of the three pathways and recommendation. It is slightly less precise about the initiative’s original framing and some nuances of the final rationale.

Coverage

Weight 20%

Covers the core pathways, pilot findings, funding issue, and recommendation, but gives a thinner account of why the program was created and only a more abbreviated explanation of why a one-size-fits-all approach was rejected.

Compression

Weight 15%

Very efficient condensation with strong economy of language. It captures the main points in a tight form without obvious redundancy.

Clarity

Weight 15%

Very readable and straightforward. The progression of ideas is easy to follow and the wording stays neutral and controlled.

Structure

Weight 10%

Well organized as a single coherent paragraph, but the structure is slightly more skeletal and less reflective of the source memo’s comparative buildup.

Comparison Summary

Final rank order is determined by judge-wise rank aggregation (average rank + Borda tie-break). Average score is shown for reference.
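The aggregation rule above can be sketched as follows. This is an illustration under stated assumptions, not the site's implementation: the page does not define how ties are handled or which Borda variant is used. Here each judge's totals are converted to competition ranks (tied models share the best rank), Borda points are the number of models a judge scored strictly lower, and every judge is assumed to have scored every model. The function name `aggregate` is hypothetical; the scores are the per-judge totals listed under Judging Results.

```python
def aggregate(judge_scores: dict) -> list:
    """Order models best-first by average rank, then by Borda points.

    judge_scores maps judge name -> {model name: total score}.
    Assumption: every judge scored every model.
    """
    models = sorted({m for scores in judge_scores.values() for m in scores})
    rank_sum = {m: 0 for m in models}
    borda = {m: 0 for m in models}
    for scores in judge_scores.values():
        for m in models:
            higher = sum(1 for o in models if scores[o] > scores[m])
            lower = sum(1 for o in models if scores[o] < scores[m])
            rank_sum[m] += 1 + higher  # competition rank within this judge
            borda[m] += lower          # models strictly beaten
    n_judges = len(judge_scores)
    # Lower average rank wins; higher Borda breaks ties; name keeps it stable.
    return sorted(models, key=lambda m: (rank_sum[m] / n_judges, -borda[m], m))


if __name__ == "__main__":
    judges = {
        "OpenAI GPT-5.4": {"Claude Sonnet 4.6": 88, "Gemini 2.5 Pro": 86},
        "Google Gemini 2.5 Flash": {"Claude Sonnet 4.6": 93, "Gemini 2.5 Pro": 91},
        "Anthropic Claude Opus 4.6": {"Claude Sonnet 4.6": 85, "Gemini 2.5 Pro": 73},
    }
    print(aggregate(judges))  # ['Claude Sonnet 4.6', 'Gemini 2.5 Pro']
```

When no judge gives two models the same score, the Borda total carries the same information as the average rank, so the tie-break only matters when a judge's scores contain ties. With all three judges ranking Answer A first, the tie-break is never reached here.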

Judges: 3

Winner: Anthropic Claude Sonnet 4.6

Winning Votes

3 / 3

Average Score: 88.7 (mean of the three judges' total scores of 88, 93, and 85, rounded)

Google Gemini 2.5 Pro

Winning Votes

0 / 3

Average Score: 83.3 (mean of the three judges' total scores of 86, 91, and 73, rounded)

Judging Results

Judge Models OpenAI GPT-5.4

Claude Sonnet 4.6: 88 (Winner)

Gemini 2.5 Pro: 86

Why This Side Won

Answer A wins because it preserves more of the memo’s structure and key content without adding distortion. Compared with Answer B, it better explains the original motivation for the initiative, includes fuller comparative findings for all three reuse pathways, and more explicitly states why the memo rejects both universal redevelopment and universal greening. Answer B is accurate and readable, but it leaves out more contextual detail required by the prompt.

Judge Models Google Gemini 2.5 Flash

Claude Sonnet 4.6: 93 (Winner)

Gemini 2.5 Pro: 91

Why This Side Won

Answer A is chosen as the winner because it provides a slightly more detailed and comprehensive summary while still adhering perfectly to the word count and all other constraints. It includes specific elements like the initial problems (blight, dumping, mowing costs) and the number of lots returned (37), which enrich the summary without making it verbose. Answer B is also excellent but is marginally less detailed in its coverage.

Judge Models Anthropic Claude Opus 4.6

Claude Sonnet 4.6: 85 (Winner)

Gemini 2.5 Pro: 73

Why This Side Won

Answer A wins because it provides superior coverage and faithfulness to the source passage, including specific details like the 214 lots, 37 returned to taxable use, and the nuanced political context. It also better explains why the memo rejected both extremes, not just a single citywide solution. While Answer A slightly exceeds the word limit (approximately 240 words vs. the 220-word cap), its content quality, coherence, and completeness significantly outweigh this formatting issue. Answer B, while competent and within the word limit, is thinner in detail and reads more like a series of topic sentences than a fully integrated paragraph.
