When Three AIs Tried to Fix 1,717 Lines of Code

Image courtesy of Google Gemini

This blog is a continuation of my AI Code-Off experiments series. You can read the first article here: I Pitted Gemini, Claude, and GPT in a 4-Stage AI ‘Code-Off.’

1,717 lines. That’s how long my Streamlit UI rule, intended as a guide for LLMs and Agents, had grown.

This wasn’t documentation for humans—it was a rule file for AI coding assistants in Cursor. When you ask an AI to build a Streamlit dashboard, it consults files like this to understand best practices and requirements. But somewhere in those 1,717 lines, the guidance had become contradictory. Plotly or PyDeck for visualizations? The rule said both, with no clear preference. Duplicate sections appeared throughout. Critical production guidance (performance, accessibility, deployment) was completely missing.

The AI assistants reading this rule were getting mixed signals. When your instruction manual contradicts itself, even the smartest AI can’t help you consistently. The fix seemed obvious: consolidate it down to 600-800 focused lines. Remove conflicts, eliminate duplication, add the missing production guidance. Simple, right? But at 1,717 lines, this wasn’t a quick manual edit. I needed expert help to refine the rule—someone who could analyze the entire thing systematically, identify conflicts, suggest what to keep and what to cut, and do it all without missing critical details.

In my experience, LLMs excel at exactly this type of work: analyzing large documents, identifying patterns and contradictions, proposing consolidations. They’re tireless, thorough, and unbiased about which library “should” win. But instead of just picking one model and calling it done, I had a better idea: what if I used three different LLMs and compared their recommendations? Would they agree on what needed fixing? Would their solutions differ? Could I synthesize the best parts from each?

So I staged an experiment: I asked Claude 4.5 Sonnet, Gemini 2.5 Pro, and GPT-5 to each analyze the same rule, propose fixes, and then judge each other’s work.

Three Philosophies Emerge

I ran the same experiment I’d recently done with a Python script analysis: give three AIs—Claude 4.5 Sonnet, Gemini 2.5 Pro, and GPT-5—the same complex problem and watch how they solve it differently. Four stages: analyze what’s broken, create a plan, synthesize all approaches, then evaluate and pick a winner.

For a 1,717-line refactoring challenge, their distinct personalities became immediately obvious.

Claude 4.5 Sonnet: The Specification Machine

If you asked Claude for directions to a coffee shop, it would provide turn-by-turn navigation with alternate routes, traffic patterns, historical accident data, and a contingency plan if your car breaks down. For the Streamlit rule, it produced a 735-line analysis.

Yes, you read that correctly. A 735-line analysis of a 1,717-line document. That’s 43% of the original length—just to explain what was wrong.

But here’s the thing: those 735 lines were surgical. Every conflict identified with exact line numbers. Every duplication mapped. A priority matrix (P1-P4) for fixes. Validation checklists. Even rollback procedures if something went wrong. Claude left absolutely nothing to interpretation.

Gemini 2.5 Pro: The Strategic Visionary

Gemini delivered 101 lines. For the same 1,717-line beast that inspired Claude’s 735-line opus, Gemini wrote less than two pages.

But those 101 lines cut straight to the philosophical core: this wasn’t a documentation problem, it was an architecture problem. The Plotly vs. PyDeck conflict wasn’t a technical choice—it was an unresolved decision creating cognitive load downstream. The solution wasn’t editing, it was rethinking the structure.

If Gemini were a surgeon, it would identify the tumor without describing every cell.

GPT-5: The Action-Oriented Pragmatist

GPT-5 came in at 202 lines—right between the maximalist and the minimalist. But its focus was different.

While Claude was documenting every problem and Gemini was redesigning the architecture, GPT-5 was creating a “Snapshot of Edit Actions (ready to apply)”: literally copy-pasteable fixes. Need to remove PyDeck references? Here are the exact lines to delete. Want to consolidate analytics sections? Here’s the merged text, ready to paste.

GPT-5 is the coworker who doesn’t just identify problems—they fix them while you’re still reading the analysis.

The Consensus: Fix the Conflicting Guidance

Here’s where something interesting happened. Despite radically different approaches—Claude’s forensic detail, Gemini’s strategic vision, GPT-5’s actionable fixes—they all reached the same four core conclusions:

1. Resolve the Plotly/PyDeck Conflict: All three identified the conflicting visualization guidance as a critical problem. The rule offered both Plotly and PyDeck without clear direction, creating decision paralysis. Their recommendation: make it consistent. Since I preferred Plotly, they showed how to remove all PyDeck references and standardize on Plotly throughout. Claude provided exact line numbers for every PyDeck deletion. Gemini explained the strategic rationale (conflicting options confuse AI assistants). GPT-5 provided the replacement code snippets.

2. Consolidate Analytics Cross-References: The document had analytics guidance scattered across multiple sections. All three wanted it in one place. Claude specified which sections to merge and how. Gemini explained why (single source of truth). GPT-5 wrote the consolidated text.

3. Split the Overloaded UI Section: The UI guidance section was trying to do too much. All three recommended splitting it into Layout, Components, and State. Different implementations, same diagnosis.

4. Add Production Guidance: Performance, accessibility, UI testing, and deployment strategies (Streamlit in Snowflake vs. Snowpark Container Services) were all missing. All three flagged this as critical. The 600-800 line target couldn’t be just cuts; it needed strategic additions too.
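
The line-level audit described in the first item (finding every PyDeck mention so it can be removed) can be approximated in a few lines of Python. This is a minimal sketch, not the method any of the three models used: the `find_mentions` helper and the sample rule text are hypothetical, and it only performs a case-insensitive keyword search.

```python
import re

def find_mentions(rule_text: str, keywords: list[str]) -> dict[str, list[int]]:
    """Return the 1-based line numbers where each keyword appears (case-insensitive)."""
    hits: dict[str, list[int]] = {kw: [] for kw in keywords}
    for lineno, line in enumerate(rule_text.splitlines(), start=1):
        for kw in keywords:
            if re.search(re.escape(kw), line, flags=re.IGNORECASE):
                hits[kw].append(lineno)
    return hits

# Hypothetical excerpt standing in for the real rule file.
sample_rule = """\
## Visualization
Use Plotly for interactive charts.
For maps, prefer PyDeck layers.
## Dashboards
Render charts with st.plotly_chart.
Use st.pydeck_chart for geospatial views.
"""

mentions = find_mentions(sample_rule, ["plotly", "pydeck"])
for library, lines in mentions.items():
    print(f"{library}: lines {lines}")
```

A rule that mentions both libraries in its guidance sections is a candidate for the kind of conflict described here; deciding which mentions are real recommendations still takes a human (or an LLM) reading the surrounding text.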

The diagnosis was unanimous. The treatment plans were philosophical opposites.

From Diagnosis to Surgery: The Planning Phase

In round two, I asked each AI to synthesize all three analyses and create an actionable plan. The results were perfectly in character.

Claude’s 819-Line Blueprint: Claude produced an 819-line implementation specification. Yes, an 819-line plan to fix a 1,717-line document. If you’re doing the math, that’s 48% of the original length just for the plan. But it was essentially a complete implementation guide. Specific line numbers for every edit. Full code examples for new sections. A detailed priority matrix. Post-resolution validation checklist. You could hand this to a developer on Friday and have the refactored rule by Monday, no questions asked.

Gemini’s 79-Line Executive Summary: Gemini distilled everything into 79 lines. It focused on principles over prescriptions: workflow-aligned structure, single source of truth for analytics, Plotly as the standard. No line numbers. No code examples. Just the strategic “why.” It was a plan designed to align stakeholders, not execute changes. 740 fewer lines than Claude, same core message.

GPT-5’s 80-Line Action Plan: GPT-5 matched Gemini’s length (80 vs 79 lines) but with a completely different focus. It had an authoritative “Decisions” section (Plotly-only, period) and a compact edit plan with the critical details. In 80 lines, it conveyed what Claude took 819 lines to explain. Not because it was smarter, but because it assumed you didn’t need your hand held. Here’s what to do. Do it.

The Convergence: Despite the 10x difference in length (79 vs 819 lines), all three plans converged on the same targets: 50-60% line reduction, Plotly-only, consolidated analytics, restructured UI section, new production guidance.

The mission was crystal clear. The execution strategies were completely different.

The Synthesis Challenge: Integration vs. Compromise

Round three threw a curveball. Instead of defending their approaches, each AI had to synthesize all three plans into one optimal solution, cherry-picking the best ideas from competitors.

Would they compromise—find a middle ground? Or integrate—keep the best parts even if that made things bigger?

The Answer: Integration (Which Made Everything Grow)

Claude: 819 → 1,413 lines (+594 lines, +73%)
GPT-5: 80 → 213 lines (+133 lines, +166%)
Gemini: 79 → 189 lines (+110 lines, +139%)

All three grew. Gemini’s synthesis was about 2.4x its original length, and GPT-5’s about 2.7x. They chose integration over compromise, adding the best from each approach rather than averaging.
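
The growth figures above are simple arithmetic on the plan and synthesis line counts. A small Python sketch for recomputing them follows; the `growth` helper and the dictionary names are illustrative, while the numbers are the ones reported in this article.

```python
# Plan (round two) and synthesis (round three) line counts as reported in this article.
plan_lines = {"Claude": 819, "GPT-5": 80, "Gemini": 79}
synthesis_lines = {"Claude": 1413, "GPT-5": 213, "Gemini": 189}

def growth(before: int, after: int) -> tuple[int, float]:
    """Return (absolute increase, percentage increase) going from before to after."""
    delta = after - before
    return delta, 100 * delta / before

for model, before in plan_lines.items():
    after = synthesis_lines[model]
    delta, pct = growth(before, after)
    print(f"{model}: {before} -> {after} (+{delta} lines, +{pct:.0f}%, {after / before:.1f}x)")
```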

How They Explained Their Choices: I asked each AI to explain what it incorporated from the others and why. Their attributions revealed how they viewed each other’s contributions:

Claude’s view (most detailed attribution):

From itself: Specific line numbers, comprehensive detail, validation checklists, precise edits
From Gemini: Strategic vision (50-60% reduction), “single source of truth” principle, workflow-aligned structure
From GPT-5: Balanced detail level, actionable snapshots, copy-paste ready examples

Claude even created an explicit matrix showing the percentage breakdown and rationale for each element included.

Gemini’s view (strategic framing):

From Claude: Complete implementation specifications, phased approach, detailed validation frameworks
From GPT-5: Actionable quick-reference elements, practical examples
From itself: High-level strategic principles, workflow alignment, concise communication

GPT-5’s view (balanced integration):

From Claude: Comprehensive coverage, line-scoped precision, thorough validation
From Gemini: Strategic clarity, architectural thinking, workflow-first approach
From itself: Actionable brevity, copy-paste solutions, practical focus

The Pattern: Each AI credited others for filling its gaps while keeping its core identity. Claude stayed comprehensive but added practical examples. GPT-5 borrowed structure but maintained actionability. Gemini adopted detail while preserving strategic clarity.

The Unanimous Verdict (Again)

After synthesis, each AI evaluated all three final plans and declared a winner.

Based on my Python script experiment, I knew unanimous verdicts were possible. But code refactoring felt different than documentation—more technical, clearer “right answers.” Would the pattern hold?

It did.

Complete unanimity. All three AIs—with different scoring systems, different criteria, different philosophies—selected the same winner:

Claude 4.5 Sonnet’s 1,413-line synthesized plan. The scores told a consistent story:

Claude scored itself 44/50, GPT-5 at 35/50, Gemini at 30/50
GPT-5 scored Claude highest (though it didn’t use the same numerical scale)
Gemini called Claude’s plan “the only one that is complete”

Why Claude Won: Gemini’s evaluation included this memorable quote: “The Claude plan is the map, the vehicle, the fuel, and the turn-by-turn directions.”

GPT-5 acknowledged: “For a major refactor, Claude’s plan provided the necessary ‘safety rails.’”

Even Claude justified its own length carefully: the detail wasn’t bloat, it was “specifications that minimize ambiguity and inspire execution confidence.”

The unanimous insight: for high-stakes refactoring, complete specifications beat strategic visions or quick action plans. When you’re touching 1,717 lines of production code, ambiguity is expensive.

Five Insights From Watching AIs Refactor Code

Running this experiment surfaced patterns I didn’t expect—some confirming hunches, others revealing new dynamics.

1. Technical Consensus Despite Philosophical Differences: All three agreed on the big technical decisions: Plotly-only (kill PyDeck completely), consolidate analytics, split UI section, add production guidance. Even with radically different problem-solving styles, they converged on the same architecture. This suggests that certain refactoring decisions have objective correctness: if you analyze the problem thoroughly enough, the right answer emerges regardless of your approach.

2. Specification Beats Strategy for Implementation: Claude’s 1,413-line plan won because it eliminated interpretation. Gemini’s 189-line strategy was brilliant but required someone to figure out how. GPT-5’s 213-line action plan was immediately usable but incomplete for major refactoring. The lesson: “figure out the details” isn’t a specification; it’s homework. For complex changes, thoroughness is respect for the implementer’s time.

3. Synthesis Means Addition, Not Averaging: I expected the synthesized plans to find middle ground (an average of (819 + 80 + 79) / 3 = 326 lines). Instead, they grew by 73%, 166%, and 139% respectively. Good synthesis isn’t compromise; it’s recognizing that multiple perspectives each contribute unique value. GPT-5’s Task-free alternative wasn’t appended to Claude’s plan; it was woven throughout. Gemini’s strategic framing didn’t replace Claude’s details; it provided context for them.

4. The 7.6x Output Variance: Claude produced 3,230 lines across all four rounds. Gemini produced 422. That’s a 7.6x difference in verbosity to reach the same conclusions. More isn’t always better, but for this refactoring task, the unanimous verdict suggested that Claude’s exhaustiveness was a feature, not a bug. The roughly 2,800 extra lines provided confidence that nothing would be missed.

5. Speed vs. Thoroughness Tradeoff: While I didn’t track exact timing, the pattern was consistent:

Gemini finished first (often by 2-3 minutes for complex analyses)
GPT-5 came in second
Claude took longest (sometimes 5+ minutes for its forensic deep-dives)

When you’re generating 735 lines vs. 101 lines, you’re going to take longer. The question isn’t which is “better”—it’s whether you need an answer in 30 seconds or a specification you can execute with confidence.

The Output Volume: How Much Analysis to Fix 1,717 Lines?

Here’s what it took to analyze and refactor the rule file:

Total AI Output: 4,198 lines. Three AIs wrote 4,198 lines to refactor a 1,717-line document. They produced more than twice the content they were analyzing. It’s like asking three mechanics to tune up your car and receiving a complete rebuild manual.

The Breakdown

Claude: The Maximalist

Round 1 (Analysis): 735 lines
Round 2 (Plan): 819 lines
Round 3 (Synthesis): 1,413 lines
Round 4 (Evaluation): 263 lines
Total: 3,230 lines (77% of all output)

Gemini: The Minimalist

Round 1 (Analysis): 101 lines
Round 2 (Plan): 79 lines
Round 3 (Synthesis): 189 lines
Round 4 (Evaluation): 53 lines
Total: 422 lines (10% of all output)

GPT-5: The Pragmatist

Round 1 (Analysis): 202 lines
Round 2 (Plan): 80 lines
Round 3 (Synthesis): 213 lines
Round 4 (Evaluation): 51 lines
Total: 546 lines (13% of all output)

Claude wrote more in Round 2 alone (819 lines) than Gemini wrote across all four rounds (422 lines). Yet both agreed on the winning plan.
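
For readers who want to recompute the breakdown, here is a minimal Python sketch that sums the per-round line counts listed above and derives each model’s share of the combined output. The per-round numbers come straight from the breakdown; the variable names are illustrative only.

```python
# Per-round line counts from the breakdown: analysis, plan, synthesis, evaluation.
rounds = {
    "Claude": [735, 819, 1413, 263],
    "Gemini": [101, 79, 189, 53],
    "GPT-5": [202, 80, 213, 51],
}

# Total output per model, then the combined total across all three.
totals = {model: sum(counts) for model, counts in rounds.items()}
grand_total = sum(totals.values())

for model, total in totals.items():
    share = 100 * total / grand_total
    print(f"{model}: {total} lines ({share:.0f}% of all output)")

print(f"All models: {grand_total} lines")
print(f"Claude-to-Gemini ratio: {totals['Claude'] / totals['Gemini']:.1f}x")
```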

Which Approach Should You Actually Use?

The unanimous verdict tells you which plan is most complete. But in reality, different contexts need different approaches.

If you’re refactoring production code that thousands of users depend on: Use Claude’s comprehensive specification. The 1,413 lines provide surgical precision and validation frameworks. Hand it to your team with confidence. Yes, it takes time to execute, but the alternative—ambiguous changes causing production issues—costs more.

If you’re in a planning meeting and need stakeholder alignment: Use Gemini’s 189-line strategic vision. Your team doesn’t need line numbers yet. They need to understand why you’re killing PyDeck, why you’re consolidating analytics, why the 50-60% reduction target matters. Gemini’s plan fits on slides.

If you’re a solo developer wanting to make quick improvements: Use GPT-5’s 213-line action plan. It balances strategy and tactics without overwhelming you. The “Snapshot of Edit Actions” gives you copy-paste solutions for immediate wins.

If you’re wise: Use all three in sequence:

Day 1 (30 minutes): Gemini’s plan for strategic alignment
Week 1 (6-8 hours): Claude’s plan for systematic execution
Ongoing: GPT-5’s plan as your quick-reference guide

This is what I actually did: Started with Gemini to understand the “why,” used Claude’s specification for execution, kept GPT-5’s action items open for quick lookups.

Result: 1,717 lines became 682. Zero PyDeck conflicts. Consolidated analytics. Production guidance added. Shipped in three weeks.
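
As a quick sanity check, the final line count can be compared against the 600-800 line target set at the start. This is a small illustrative sketch; the variable names are mine, and the numbers are the ones quoted in this article.

```python
original_lines = 1717
final_lines = 682
target_range = (600, 800)  # target length stated at the start of the article

# Percentage of the original rule file that was removed.
reduction_pct = 100 * (original_lines - final_lines) / original_lines
within_target = target_range[0] <= final_lines <= target_range[1]

print(f"Reduction: {original_lines - final_lines} lines ({reduction_pct:.1f}%)")
print(f"Within the {target_range[0]}-{target_range[1]} line target: {within_target}")
```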

The Refactoring Paradox

Here’s the reality: I asked three AIs to consolidate 1,717 lines down to 600-800 lines. They produced 4,198 lines of analysis explaining how to do it. The irony isn’t lost on me. To fix bloated documentation, I generated more than twice as much analysis. But that analysis had genuine value: it provided the complete specifications needed for confident execution.

Claude won because someone could pick up that 1,413-line specification on Monday morning and execute it with confidence. Gemini’s strategic brilliance required interpretation. GPT-5’s actionable steps were perfect for quick wins but incomplete for major surgery. The unanimous verdict wasn’t about which AI was “best.” It was about three different approaches converging on the same insight: for high-stakes refactoring, completeness beats conciseness.

Claude won because it made the implementer’s job easier. Gemini’s tough self-critique (scoring itself 30/50 while awarding Claude 44/50) showed wisdom. GPT-5’s practical action items provided immediate value. Each was right for different contexts.

The Real Lesson

After running this entire experiment and analyzing 4,198 lines of AI output, the real lesson wasn’t about AI models or refactoring strategies.

It was simpler: conflicting guidance in a rule file creates downstream problems. The Plotly vs. PyDeck conflict wasn’t just a documentation issue—it was an unresolved decision that confused every AI assistant using the rule. Everything else—the line count, the duplicate sections, the missing production guidance—was fixable. But the conflicting recommendations required a clear choice.

Sometimes the best refactoring isn’t the most elegant plan or the fastest execution. Sometimes it’s just resolving the conflicts, making clear choices, and documenting them thoroughly.

1,717 lines can be fine. 1,717 lines with conflicting guidance? That’s what needed fixing.
