When Three AIs Fixed a README - The Unanimous Verdict Nobody Expected

Image courtesy of Google Gemini

This is the third article in my AI Code-Off experiments series. You can read the first article and second article for context.

Line 184. That’s where the Quick Start section lived in our project README.

Think about that for a moment. A new developer lands on your project, excited to try it out, and has to scroll through 183 lines of architectural philosophy, memory bank explanations, and universal design principles before learning the single most important thing: how to actually use it.

The 1,232-line README for an AI coding rules system was well-written and thorough. But it had a problem: it was organized for people who already understood the project, not for newcomers trying to get started.

I’d recently run an experiment where three AI models tackled a massive rule file consolidation. That competition revealed fascinating differences in how AIs approach complex problems. But code refactoring and documentation are different beasts. Documentation requires understanding human psychology—what frustrates users, what builds trust, what gets people productive fast.

So I gave the same three AIs—Claude 4.5 Sonnet, Gemini 2.5 Pro, and GPT-5—a different challenge: fix this README. Make it serve new users without losing depth for advanced ones. The experiment followed the same four-stage methodology: independent analysis, comparative evaluation, synthesis of best ideas, and final judging.

I expected disagreement. Documentation feels subjective—some readers want comprehensive detail, others want minimalist quick-starts. Different audiences, different preferences. Unlike code refactoring where “correct” has clearer boundaries, README improvements seemed ripe for philosophical splits on which approach was “best.”

I was half right.

Three Philosophies Collide

The three AIs did bring radically different philosophies about what makes documentation good. That part I expected.

Claude 4.5 Sonnet: The Completionist
Claude’s first analysis was 808 lines long. Yes, you read that correctly: 808 lines to analyze a README. It identified 15 specific issues, each with exact line numbers, before/after examples, standards citations, time estimates, and validation checklists. It read like a professional technical writer’s deliverable, complete with a three-phase implementation plan spanning 6-9 hours.

If you asked Claude to fix a typo, it would probably give you a 50-page style guide.

Gemini 2.5 Pro: The Minimalist
Gemini’s analysis was 36 lines. For the same README that inspired Claude’s 808-line opus, Gemini wrote one-and-a-half pages. But here’s the thing—those 36 lines cut straight to the problem: “Quick Start at line 184” wasn’t a documentation issue, it was a respect for the user’s time issue. Five recommendations. No fluff. Done.

If Gemini were a surgeon, it would make one precise incision instead of ten thorough ones.

GPT-5: The Fixer
GPT-5 came in at 121 lines, right in the middle. But its focus was different. While Claude catalogued everything and Gemini identified the core problem, GPT-5 was busy catching things the others missed: “Hey, the template filename is wrong; it’s EXAMPLE_PROMPT.md, not UNIVERSAL_PROMPT.md.” It found typos (“affective” vs. “effective”), broken links, and inconsistent terminology. Then it added practical touches like a “Quick Start TL;DR” and a mini-FAQ.

GPT-5 is that coworker who fixes your grammar and your logic and makes it more readable.

The fascinating part? These weren’t just different writing styles. They represented fundamentally different beliefs about what makes documentation good.

The Consensus Nobody Saw Coming

Here’s where their different philosophies should have produced different diagnoses. The completionist should have found dozens of issues. The minimalist should have focused on one or two critical flaws. The fixer should have caught surface-level problems.

Instead, something unexpected happened: they agreed on everything that mattered.

All three identified the same five critical problems:

1. The Line 184 Problem (all three flagged this first)
The Quick Start section was buried after 183 lines of philosophy and architecture. Gemini called it “disrespecting the user’s time.” Claude cited a violation of project standard 801-project-readme-rules.md, lines 74-89: “Provide immediate value within first 30 seconds.” GPT-5 simply noted: “New users will abandon before reaching line 184.”

2. The Navigation Disaster
A 1,200+ line README with no table of contents. Claude noted this violated the >500 line rule. Gemini said it more bluntly: “How is anyone supposed to find anything?” GPT-5 proposed the actual TOC structure.

3. The Broken Entry Point
The installation URL pointed to an internal GitLab instance that external users couldn’t access. All three caught this, but only GPT-5 suggested the dual-URL solution for internal/external users.

4. The Prerequisites Confusion
The README mixed up what you needed to generate rules (Python, uv, Task) versus use them (nothing—they’re just Markdown files). Claude provided detailed before/after examples. Gemini identified the conceptual problem. GPT-5 wrote the fix.

5. The Verification Gap
After setup, users had no way to confirm it worked. Claude proposed a comprehensive validation checklist. Gemini noted it as essential for confidence. GPT-5 provided the specific bash commands.

What differed wasn’t what they found, but how they wanted to fix it. Claude specified exact line numbers and validation steps. Gemini outlined strategic principles. GPT-5 provided copy-paste solutions.

The diagnosis was unanimous. The treatment plans were philosophical opposites.
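The first two problems are mechanical enough to check automatically. Here is a minimal sketch, assuming a Markdown README whose headings start with `#` and whose onboarding section is titled "Quick Start". The function name `audit_readme` and the 30-line cutoff are my own illustrative choices; the 500-line threshold mirrors the rule Claude cited. This is not the project's actual tooling.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def audit_readme(text, max_quick_start_line=30, toc_threshold=500):
    """Return a list of human-readable warnings for a Markdown README."""
    lines = text.splitlines()
    warnings = []
    quick_start_line = None
    has_toc = False
    in_fence = False

    for number, line in enumerate(lines, start=1):
        # Skip fenced code so '# comment' lines are not mistaken for headings.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        title = match.group(2).lower()
        if quick_start_line is None and "quick start" in title:
            quick_start_line = number
        if "table of contents" in title or title == "contents":
            has_toc = True

    if quick_start_line is None:
        warnings.append("no Quick Start heading found")
    elif quick_start_line > max_quick_start_line:
        warnings.append(f"Quick Start starts at line {quick_start_line}")
    if len(lines) > toc_threshold and not has_toc:
        warnings.append(f"{len(lines)} lines and no table of contents")
    return warnings
```

Calling `audit_readme(open("README.md").read())` returns an empty list when neither check trips, which makes it easy to wire into a pre-commit hook or CI step if you want the "line 184" problem to stay fixed.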

The Scoring Game: When AIs Judge Themselves

The second phase introduced an interesting dynamic: each AI had to evaluate all three improvement plans—including their own—and pick a winner. I gave them a constraint (score out of 100 points) but complete freedom in how to judge. Each AI could define its own evaluation categories and decide how many points each category was worth.

Claude might value “comprehensiveness” at 20 points while GPT-5 valued “actionable specificity” at 30 points. Gemini could create entirely different criteria focused on strategic clarity. The 100-point total was fixed, but everything else was up to them.

This is where things got psychologically interesting.

The Tough Self-Critic
Gemini gave itself 70 out of 100. Think about that. It looked at its own work and essentially said, “Yeah, I identified the problems clearly, but I didn’t provide enough implementation detail. This is a C+.” Then it turned around and gave Claude a 98/100, basically admitting: “That 808-line monster I silently judged? It’s actually exactly what an implementer needs.”

The self-awareness was striking. Gemini’s evaluation noted that while it and GPT-5 produced “excellent recommendations,” Claude produced a complete “project plan,” recognizing a fundamental difference in deliverable types.

The Rigorous Scorekeeper
Claude built a seven-criteria rubric (Comprehensiveness, Actionability, Prioritization, Standards Alignment, Specificity, Organization, Implementation Guidance) and scored itself 99/100. It praised its own “production-ready specifications” while docking itself one point for being potentially overwhelming. It gave GPT-5 a respectable 82/100 for accuracy focus, but Gemini only scored 47/100—essentially saying “too strategic, not actionable enough.”

Claude wasn’t being arrogant; it was being consistent with its values. If comprehensiveness wins, the comprehensive plan should win.

The Practical Evaluator
GPT-5 scored Claude at 93/100 (notably not perfect), gave itself 84/100, and gave Gemini 54/100. Its criteria emphasized “actionable implementation readiness” and “new-user onboarding focus.” It acknowledged Claude’s plan as having “best overall implementation readiness” despite being “long and dense,” appreciating that “it minimizes ambiguity and risk.”

The Verdict
Despite different scoring scales and criteria, the ranking order was identical (scores below are listed by judge, in the order Claude, Gemini, GPT-5):

Claude (99, 98, 93)
GPT-5 (82, 87, 84)
Gemini (47, 70, 54)
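If you want to check the "identical ranking" claim yourself, a few lines of Python will do it. This is only a sketch: the mapping of each score to a judge is my reading of the evaluations described above (for example, the 87 that Gemini gave GPT-5 is inferred by elimination), and the names `SCORES` and `ranking` are illustrative.

```python
# Round-two scores: judge -> {plan author: score out of 100}.
# The judge assignment is inferred from the text, so treat it as an assumption.
SCORES = {
    "Claude": {"Claude": 99, "GPT-5": 82, "Gemini": 47},
    "Gemini": {"Claude": 98, "GPT-5": 87, "Gemini": 70},
    "GPT-5": {"Claude": 93, "GPT-5": 84, "Gemini": 54},
}


def ranking(scores_by_plan):
    """Order plan authors from highest to lowest score."""
    return sorted(scores_by_plan, key=scores_by_plan.get, reverse=True)


rankings = {judge: ranking(given) for judge, given in SCORES.items()}
# One distinct ordering across all judges means the verdict was unanimous.
unanimous = len({tuple(order) for order in rankings.values()}) == 1

for judge, order in rankings.items():
    print(f"{judge:>6} ranks: {' > '.join(order)}")
print("Same order from every judge:", unanimous)
```

Comparing orderings rather than raw numbers is the point here: each judge used its own rubric, so the scores are not on a shared scale, but the rankings are still comparable.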

This wasn’t groupthink. Each AI created its own rubric, valued different things, yet reached the same conclusion: for this specific documentation overhaul task, the comprehensive approach beat both the strategic view and the practical fixes.

The Synthesis Challenge: Combining Without Compromising

Round three threw a curveball: instead of defending their own approaches, each AI had to synthesize all three plans into one optimal solution, cherry-picking the best ideas from their competitors.

The interesting question: would synthesis mean compromise—finding a middle ground between the extremes? Or would it mean integration—keeping the best parts even if that made things bigger?

The Answer: Integration (Which Made Everything Grow)
If synthesis meant averaging, you’d expect something around 322 lines (808 + 121 + 36 = 965, divided by 3). Instead:

Claude: 808 → 1,080 lines (+272 lines, +34%)
GPT-5: 121 → 304 lines (+183 lines, +151%)
Gemini: 36 → 167 lines (+131 lines, +364%)
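The arithmetic behind those figures is easy to reproduce. A small sketch using only the line counts quoted above (the variable names are mine):

```python
# Line counts of each model's initial analysis and of its synthesis.
initial = {"Claude": 808, "GPT-5": 121, "Gemini": 36}
synthesis = {"Claude": 1080, "GPT-5": 304, "Gemini": 167}

# What a pure "averaging" synthesis would have produced.
average_initial = sum(initial.values()) / len(initial)
print(f"Averaging the initial analyses: {average_initial:.0f} lines")

# Actual growth per model: (lines added, percent increase).
growth = {}
for model, before in initial.items():
    after = synthesis[model]
    added = after - before
    growth[model] = (added, 100 * added / before)
    print(f"{model}: {before} -> {after} (+{added} lines, +{growth[model][1]:.0f}%)")
```

Every synthesis, including Gemini's, ends up well above its own starting point, which is what you would expect from integration rather than averaging.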

All three grew. Gemini’s synthesis was more than four and a half times as long as its original. They chose integration over compromise, adding the best from each approach rather than finding a middle ground.

How They Explained Their Choices
I asked each AI to explain what it took from the others and why. Their attributions revealed how they viewed each other’s contributions:

Claude’s view (70% self, 25% GPT-5, 5% Gemini):

Kept its comprehensive coverage, phased plan, standards alignment as the foundation
Added GPT-5’s tactical fixes: filename corrections, Task-free alternative, TL;DR, FAQ
Borrowed Gemini’s executive summary format and clear problem statements

GPT-5’s view (more balanced distribution):

Adopted Claude’s structural elements: ToC, verification steps, troubleshooting, phased rollout
Kept its practical focus: accuracy fixes (typos, filenames), Task-free path, TL;DR, FAQ
Integrated Gemini’s strategic guidance: Quick Start first, consolidate sections, move philosophy later

Gemini’s view (tactical vs. strategic split):

Adopted Claude’s strategic framework wholesale: executive summary, phased plan, success metrics, deep restructuring
Integrated GPT-5’s tactical precision: typo corrections, filename consistency, onboarding enhancements (TL;DR, FAQ, Task-free path)
Retained its core focus: five principles for Quick Start, ToC, and consolidation

The Pattern
Each AI credited others for filling its gaps while keeping its core identity intact. Claude stayed comprehensive but added practical fixes. GPT-5 borrowed structure but maintained actionability. Gemini adopted frameworks while preserving strategic clarity.

None of them created Franken-documents by copy-pasting. They integrated—weaving the best ideas into coherent wholes that preserved their distinct philosophies while genuinely adopting their competitors’ strengths.

The Final Judgment: Unanimous Again

After each AI synthesized the best of all three approaches, they had to evaluate the final plans and declare an ultimate winner.

Based on my previous experiment with code refactoring, I knew unanimous verdicts were possible. But documentation felt different—more subjective, more dependent on audience preferences. Would the pattern hold?

It did.

Complete unanimity.

All three AIs—with different scoring systems, different criteria, different philosophies—selected the same winner:

Claude 4.5 Sonnet’s synthesized plan.
Let that sink in. The minimalist (Gemini) voted for the maximalist. The practical fixer (GPT-5) voted for the comprehensive blueprint. Even Claude itself had to justify why its approach deserved top marks.

The scores tell the story:

Claude’s evaluation: Claude 100/100, GPT-5 88/100, Gemini 68/100
GPT-5’s evaluation: Claude 93/100, GPT-5 88/100, Gemini 68/100
Gemini’s evaluation: Claude 100/100, GPT-5 83/100, Gemini 82/100

Notice something? Even when the models scored themselves, Claude remained #1 in every single ranking. GPT-5 scored Claude at 93 (not perfect), but still ranked it first. Gemini gave Claude a perfect 100 while rating its own work at 82.

Why Unanimity Matters
This wasn’t a popularity contest or philosophical bias. Each AI articulated why Claude’s plan won, and their reasoning converged on the same insight:

Gemini put it most elegantly: “GPT-5 and Gemini produced excellent recommendations. Claude produced a complete project plan.” That distinction—between suggestions and specifications—was everything. Claude didn’t say “move Quick Start up.” It said “Cut lines 184-207, insert after new ToC at line 31, update all section anchors, validate with these specific commands.”

GPT-5 acknowledged the practical reality: “Despite being longer, it minimizes ambiguity and risk for editors implementing changes.” In documentation overhaul, ambiguity is expensive. If a technical writer has to interpret your recommendations, you’ve created more work, not less.

Even Claude defended its approach carefully: the length wasn’t bloat, it was “specifications that eliminate guesswork” and provide “execution confidence.” When you’re overhauling 1,200+ lines of user-facing documentation, confidence matters.

Plot Twist: The Fourth Judge Disagrees (Sort Of)

Just when we thought the story was over, I brought in Claude 4.1 Opus—a different model from the same family—to act as an independent validator. Think of it as bringing in an external auditor after the committee reaches consensus.

Opus’s 365-line meta-analysis was… complicated.

On one hand, it confirmed the winner: “Claude 4.5 Sonnet’s synthesized plan rightfully wins the synthesis quality evaluation.” It praised the transparency of Claude’s decision matrix, the completeness of integration, and the transformation from recommendations into actionable project plans.

On the other hand, Opus threw some cold water on the celebration:

“GPT-5’s 93/100 is more realistic than Claude’s self-awarded 100/100. Perfection implies no room for improvement.”

Ouch. Opus essentially said: “Yes, Claude won. But Claude, giving yourself a perfect score? That’s a bit much.”

More importantly, Opus challenged the entire premise of declaring a single winner. It introduced what it called “Contextual Excellence”—the idea that different approaches serve different needs:

Enterprise documentation overhaul? Use Claude’s comprehensive blueprint.
Quick fixes for open source? GPT-5’s focused approach wins.
Team planning meeting? Gemini’s strategic summary is perfect.
Solo developer on tight timeline? GPT-5 again.

Opus proposed an “80/20 Implementation Strategy”:

Hour 1: GPT-5’s quick fixes (typos, URLs, accuracy)
Hour 2: Claude’s Quick Start reorganization and verification
Hour 3: Gemini’s structural consolidation

Result: 80% of improvement value in just 3 hours instead of Claude’s proposed 6-9.

The philosophical point Opus made was profound: “The best approach isn’t choosing one winner but understanding how to leverage each AI’s strengths for specific needs.” It’s not about which AI is best—it’s about which approach fits your situation.

Five Surprises About AI Documentation Work

Running this experiment surfaced insights I didn’t expect—some confirming hunches, others completely contradicting my assumptions.

1. The Longest Answer Won (and Deserved To)
In an age of TL;DR and executive summaries, the 1,080-line comprehensive plan beat the 167-line strategic view and the 304-line balanced approach. Why? Because when you’re handing off work to someone else, “figure out the details” isn’t helpful; it’s passing the buck. Claude’s plan eliminated all ambiguity. You could hand it to a technical writer on Friday and have complete documentation by Monday.

The lesson: for implementation documents, completeness isn’t verbosity—it’s respect for your implementer’s time.

2. Self-Awareness Varied Wildly
Gemini gave itself 70/100 in the first evaluation and 82/100 in the final one. Claude gave itself 99-100/100. GPT-5 gave itself 84-88/100.

What does this tell us? Either Gemini has imposter syndrome, or Claude has an ego problem, or—more likely—different AIs have different calibrations for self-evaluation. Gemini’s tough self-critique actually made its recommendations more credible. Claude’s high self-score came with detailed justification. The variation matters because it affects how you interpret AI confidence levels.

3. Synthesis Isn’t Averaging
When the AIs combined all three approaches, I expected some kind of compromise: a middle-ground solution that averaged the extremes. Instead, synthesis meant addition. Claude’s 808 lines became 1,080. Gemini’s 36 lines became 167. GPT-5’s 121 became 304.

Good synthesis doesn’t find the midpoint between approaches. It identifies the valuable pieces from each and integrates them, even if that makes the result larger. The key word is “integration”—GPT-5’s Task-free path didn’t get appended to Claude’s plan, it got woven throughout.

4. Documentation Has No Perfect Score
Opus’s meta-analysis included a zinger: perfect scores imply no room for improvement. Even the unanimous winner probably isn’t 100/100 because documentation always has tradeoffs:

More detail → better implementation, harder to scan
More brevity → faster reading, more interpretation required
More structure → easier navigation, more overhead

A “perfect” README for new users might frustrate advanced users. A “perfect” technical spec might bore executive readers. The context determines optimal, not some abstract standard.

5. The Real Win Was Having All Three
Here’s what I actually used after this experiment: GPT-5’s quick fixes first (30 minutes), then Claude’s structural reorganization (2 hours), referencing Gemini’s strategy when explaining changes to stakeholders (15 minutes).

The unanimous verdict told me which approach to prioritize. But having three complementary perspectives meant I could tailor the solution to my specific constraints: limited time, non-technical stakeholders, need for measurable improvement.

By the Numbers: The Scale of Analysis

If you’re wondering about the sheer scope of this experiment, here’s the reality:

Total Lines of AI Output: 4,776
That’s what three AIs wrote to fix one 1,232-line README. To put that in perspective, they wrote nearly four times the content they were analyzing. It’s like hiring three editors to fix a 50-page report and receiving 200 pages of editorial notes.

Breaking it down by AI reveals their distinct personalities in raw numbers:

Claude: The Maximalist

Initial analysis: 808 lines
Comparison analysis: 895 lines
Final synthesis: 1,080 lines
Final evaluation: 1,064 lines
Total: 3,847 lines (81% of all output)

Gemini: The Minimalist

Initial analysis: 36 lines
Comparison analysis: 95 lines
Final synthesis: 167 lines
Final evaluation: 93 lines
Total: 391 lines (8% of all output)

GPT-5: The Pragmatist

Initial analysis: 121 lines
Comparison analysis: 41 lines
Final synthesis: 304 lines
Final evaluation: 72 lines
Total: 538 lines (11% of all output)

Claude wrote more in just its initial analysis (808 lines) than Gemini wrote across all four rounds (391 lines). Yet somehow, both reached the same conclusions about what needed fixing.
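For anyone who wants to re-derive the totals and shares, here is a short sketch built from the per-round line counts listed above (the names `LINES`, `totals`, and `shares` are illustrative):

```python
# Lines written per round, in order:
# initial analysis, comparison analysis, final synthesis, final evaluation.
LINES = {
    "Claude": (808, 895, 1080, 1064),
    "Gemini": (36, 95, 167, 93),
    "GPT-5": (121, 41, 304, 72),
}
README_LINES = 1232  # length of the README under review

totals = {model: sum(counts) for model, counts in LINES.items()}
grand_total = sum(totals.values())
shares = {model: 100 * total / grand_total for model, total in totals.items()}

for model, total in totals.items():
    print(f"{model}: {total} lines ({shares[model]:.1f}% of all output)")
print(f"All models: {grand_total} lines, "
      f"{grand_total / README_LINES:.2f}x the README's length")
```

Printing the shares to one decimal place avoids the small rounding drift you get when three whole-number percentages have to add up to 100.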

Performance: Speed vs. Thoroughness
While I didn’t time each step precisely, the performance pattern was consistent across all four rounds:

Gemini finished first - Usually by a significant margin
GPT-5 finished second - Moderate completion time
Claude finished last - Sometimes taking notably longer

This aligns intuitively with their output volumes. Gemini’s 36-line analysis appeared almost immediately. GPT-5’s 121-line response took a bit longer. Claude’s 808-line forensic report required real wait time. When you’re generating 10x the content, you’re going to take 10x the time.

The tradeoff is obvious: faster completion vs. comprehensive detail. Gemini gave you an answer in seconds. Claude gave you a specification in minutes. Which matters more depends on whether you’re in a meeting or executing a project.

Scoring Convergence
Despite the different approaches and speeds, the final scores showed clear agreement:

Claude received an average of 98/100 across all evaluations
GPT-5 received an average of 86/100
Gemini received an average of 73/100

The ranking order was identical in every single evaluation. No ties, no splits, no philosophical disagreements about the final order. Three different judges, three different scoring systems, same verdict.
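A quick way to recompute the per-plan averages, or to try a different aggregation, is sketched below. It uses the final-round scores listed in "The Final Judgment" section; averaging only that round (rather than both scoring rounds) is my assumption about what "final scores" means here.

```python
from statistics import mean

# Final-round scores: judge -> {plan author: score out of 100}.
FINAL_SCORES = {
    "Claude": {"Claude": 100, "GPT-5": 88, "Gemini": 68},
    "GPT-5": {"Claude": 93, "GPT-5": 88, "Gemini": 68},
    "Gemini": {"Claude": 100, "GPT-5": 83, "Gemini": 82},
}

plans = ["Claude", "GPT-5", "Gemini"]
# Average each plan's score across the three judges.
averages = {
    plan: mean(given[plan] for given in FINAL_SCORES.values())
    for plan in plans
}
final_order = sorted(plans, key=averages.get, reverse=True)

for plan in final_order:
    print(f"{plan}: {averages[plan]:.1f}/100 on average")
```

Swapping `mean` for `statistics.median` is a one-word change if you would rather dampen the effect of a single generous or harsh judge.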

Additional Observations: The Experimental Setup

All evaluations were conducted in Cursor with carefully controlled conditions. Each AI received identical prompts—the only difference was the output filename (readme-improvement-claude-4-5-sonnet.md vs. readme-improvement-gpt-5.md, etc.). The initial context was the same for all three: the README file, the project’s documentation standards (801-project-readme-rules.md), and supporting files.

This standardization was important to ensure fair comparison. Any differences in output would reflect the AI’s approach, not variations in what it was asked to do or what information it had access to.

A Curious Behavioral Difference
My Cursor rules enforce a PLAN/ACT contract to prevent AIs from making changes without confirmation. When I ask a question, the AI should present a plan and wait for me to type “ACT” before executing. For code-related work, all three models consistently respect this pattern.

But for this documentation analysis task, something unexpected happened:

GPT-5 consistently stopped and asked me to type “ACT” before saving its analysis to the file (exactly as the rules specify)
Claude and Gemini both wrote their analysis files immediately without prompting

This wasn’t a one-time quirk. The pattern repeated across multiple evaluation sessions, despite identical setup and context. GPT-5 treated “write your analysis to a file” as an action requiring confirmation. Claude and Gemini apparently viewed documentation analysis as different from code changes, not requiring the same approval step.

I don’t have a great explanation for this. The Cursor rules were identical. The prompts were identical. The context was identical. Yet GPT-5 interpreted the workflow contract differently than the other two, specifically for documentation tasks. It suggests that different models have different mental models about what constitutes a “code change” versus other types of work.

Interesting? Yes. Problematic? Not really—I was supervising all of them anyway. But it does reveal that even with explicit rules, AI models interpret boundaries differently based on task context.

Which Approach Should You Actually Use?

The unanimous verdict tells you which plan is most complete, but Opus’s challenge—“different contexts need different approaches”—is the more practical advice. Here’s how to choose:

If you have 9 hours and perfectionist tendencies: Use Claude’s full plan
You’ll get a completely overhauled README with every issue addressed, validation for every change, and professional-grade documentation. Hand this to your technical writer and walk away with confidence. But be honest: do you really have 9 hours? And will perfect documentation make your project 10x more successful than good-enough documentation? Sometimes yes. Usually no.

If you have 90 minutes and angry users: Use GPT-5’s quick wins
Fix the URL that points nowhere (5 min). Move Quick Start to line 30 (15 min). Add a TL;DR at the top (20 min). Fix the five typos that make you look careless (10 min). Add verification commands so users know if it worked (30 min). You’ve just solved 80% of user frustration in the time it takes to watch a movie.

If you’re in a planning meeting right now: Open Gemini’s analysis
Your team doesn’t need 1,080 lines of specifications. They need five bullets: Quick Start is buried, no table of contents, URLs are broken, prerequisites are confusing, no verification. That’s it. Assign owners, set deadlines, move on. Gemini’s 36-line analysis fits on one slide.

If you’re wise (and have 3 hours): Do the Opus hybrid

Hour 1: GPT-5’s accuracy fixes (high impact, low effort)
Hour 2: Claude’s Quick Start reorganization + ToC (highest user impact)
Hour 3: Gemini’s philosophy consolidation (reduces future confusion)

You’ve addressed the critical issues, improved user experience, and set yourself up for easier maintenance. The remaining items from Claude’s plan? Put them in a backlog labeled “Polish” and tackle them when you’re bored.

The honest answer: Most projects will use GPT-5’s approach, claim they’ll do Claude’s plan “eventually,” and never touch it again. That’s okay. As attributed to Voltaire, “Perfect is the enemy of good.” But if you’re building something that thousands of developers will use, or if documentation quality is a competitive advantage, or if you’re just tired of the same onboarding questions—then Claude’s comprehensive plan is worth the investment.

The Irony of It All

There’s something deeply ironic about this entire experiment.

I started with a 1,232-line README that buried its Quick Start at line 184. The problem? Too much information before the useful stuff.

The solution? I had three AIs generate 4,776 lines of analysis about how to make documentation more concise and user-friendly.

To fix documentation that overwhelmed users, I created documentation about fixing documentation that would overwhelm anyone trying to fix documentation. It’s documentation turtles all the way down.

But here’s what actually matters: the unanimous verdict wasn’t really about Claude being “best.” It was about three different AI perspectives converging on the same fundamental insight—documentation serves people, not perfection.

Claude won because it made a technical writer’s job easier (complete specifications). Gemini’s harsh self-critique (70/100) showed more wisdom than Claude’s confident 99/100. GPT-5 caught the filename errors that would’ve broken trust. Opus challenged the whole winner-takes-all framing.

They were all right.

The real lesson isn’t “use Claude for documentation” or even “combine all three approaches.” It’s simpler: if your Quick Start is at line 184, you’ve already failed. Everything else—the table of contents, the typos, the philosophical architecture discussions—is secondary to respecting your user’s time.

Want to know the truth? After running this entire experiment, analyzing 4,776 lines of AI output, and validating with a fourth judge, I fixed my README in 2 hours using GPT-5’s approach. Quick Start moved to line 12. Added a TL;DR. Fixed the broken URLs. Done.

Perfect? No. Better? Absolutely. Users could actually use the thing.

Sometimes the best documentation isn’t the most comprehensive or strategic or meticulously validated. Sometimes it’s just the one that gets out of your way and lets people get their work done.

Line 184. Never again.
