Image courtesy of Google Gemini
With over two decades of self-taught programming experience, I’ve seen AI revolutionize my coding over the last ten months. Daily use of LLMs has dramatically sped up projects, from quick prototyping and boilerplate generation to faster debugging. This frees me to focus on architectural decisions and complex problem-solving. However, effective AI use demands a strong grasp of programming fundamentals. LLMs aren’t perfect, and my experience has been crucial in quickly spotting and fixing bugs and logical flaws in AI-generated code.
What I’ve come to understand is that settling on a single LLM is rarely the right answer. Each model brings strengths and weaknesses that make it well-suited for some tasks, but not others. So I decided to run an experiment: pit them directly against each other on a complex refactoring task, then have them critique each other’s work.
What happens when you ask the most advanced models from Google, Anthropic, and OpenAI not only to propose a fix, but to critique each other’s fixes, implement a unified plan, and then code-review the final product? I recently challenged Google’s Gemini 2.5 Pro, Anthropic’s Claude Sonnet 4.5, and OpenAI’s GPT-5 with a complex 2,000-line Python refactoring task to find out.
The Challenge: The 2,000-Line Patient
The subject was a synthetic data generation script: generate_synthetic_grid_data.py. Its job was to create a realistic dataset for a predictive maintenance demo, complete with five distinct transformer failure scenarios:
- Transformer 1: Manufacturing defect (Waukesha 2015 batch)
- Transformer 2: Age-related degradation
- Transformer 3: Manufacturing defect (Same Waukesha 2015 batch)
- Transformer 4: High-capacity overload pattern
- Transformer 5: Age + deferred maintenance
The problem? The script’s logic was flawed. It was generating generic degradation data for all five scenarios. Without distinct patterns for each failure mode, the ML model would have nothing meaningful to learn: it couldn’t distinguish between a manufacturing defect and age-related wear if the data looked identical. I needed my team of LLMs to fix it.
Stage 1: The Analysis (Meet the “Specialists”)
I gave all three models the same prompt:
Carefully and thoroughly review generate_synthetic_grid_data.py. Ensure the failing equipment scenarios are logically accurate and the code generates the proper data to support all 5 scenarios:
- Transformer 1: Manufacturing defect (Waukesha 2015 batch)
- Transformer 2: Age-related degradation
- Transformer 3: Manufacturing defect (Same Waukesha 2015 batch)
- Transformer 4: High-capacity overload pattern
- Transformer 5: Age + deferred maintenance
Provide me your analysis and any suggested code and logic changes to improve the data scenario. Save this analysis and plan to a markdown file generation_improvement_<MODEL>.md
Each model saved their analysis into a separate markdown file. All three models successfully identified the core issues, but their approaches revealed distinct “personalities.”
- Anthropic’s Claude Sonnet 4.5 (The “Domain Expert”): Claude went deep on the physics. It proposed a new, physics-based function to create distinct patterns, like “erratic temperature spikes” for manufacturing defects versus a “slow, linear decline” for age-related failures.
- OpenAI’s GPT-5 (The “Software Architect”): GPT-5 focused on a clean, maintainable code structure. It proposed a “Parameterized Profile System”: a new configuration dictionary to separate the parameters of each failure (like start day, intensity, and sensor emphasis) from the core logic.
- Google’s Gemini 2.5 Pro (The “Data Scientist”): Gemini also spotted the main bugs, but it uniquely identified a more subtle, fundamental flaw: the “overload” scenario was physically impossible because the baseline data was wrong. The script never assigned a higher baseline load to that transformer.
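To make the first two proposals concrete, here is a minimal sketch of how a per-transformer profile dictionary (GPT-5’s idea) might drive failure-mode-specific shaping (Claude’s idea). Everything in it is an assumption made for illustration: the names `PROFILES`, `degradation_delta`, and `simulate`, the choice to model the signal as a temperature offset, and all of the numbers. None of it is taken from the actual script.

```python
import random

# Hypothetical profiles: parameters live here, separate from the shaping logic.
PROFILES = {
    "T1": {"failure_mode": "manufacturing_defect", "start_day": 60, "intensity": 8.0},
    "T2": {"failure_mode": "age_degradation", "start_day": 0, "intensity": 5.0},
}


def degradation_delta(profile, day, total_days, rng):
    """Return an illustrative temperature offset added to a healthy baseline."""
    if day < profile["start_day"]:
        return 0.0
    span = max(1, total_days - profile["start_day"])
    progress = (day - profile["start_day"]) / span
    mode = profile["failure_mode"]
    if mode == "manufacturing_defect":
        # Erratic spikes: mostly quiet, with excursions that become
        # more likely as the defect progresses.
        is_spike = rng.random() < 0.1 + 0.4 * progress
        return profile["intensity"] * rng.uniform(0.5, 1.5) if is_spike else 0.0
    if mode == "age_degradation":
        # Slow, linear trend with no randomness in the drift itself.
        return profile["intensity"] * progress
    raise ValueError(f"unknown failure mode: {mode}")


def simulate(name, total_days=180, seed=0):
    """Generate one offset per day for the named (hypothetical) transformer."""
    rng = random.Random(seed)
    profile = PROFILES[name]
    return [degradation_delta(profile, d, total_days, rng) for d in range(total_days)]


defect_series = simulate("T1")  # flat, then irregular spikes
aging_series = simulate("T2")   # smooth ramp from day 0
```

With this split, adding or retuning a scenario means editing a dictionary entry rather than the generation loop, and the two series have visibly different shapes for a model to learn from.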
Stage 2: The Synthesis (The Unified Plan)
This is where “collaboration” began. I gave all three models their own and their peers’ analysis files and asked them to create one, unified “best-of-breed” plan. They were each provided with the following prompt:
Review generation_improvement_gpt5.md, generation_improvement_gemini25.md, and generation_improvement_sonnet45.md. identify common issues reported in each document. Identify any uncommon or novel discoveries found in one document but not the others. Determine if all the plans properly address the data generation problem accurately. Determine if any aspect of each of these plans can be combined into a single new plan that has all of the best suggestions.
Each model saved their analysis into a separate markdown file. The consensus was remarkable. All three models agreed on a synthesized approach that took the best from each.
- From Gemini’s Synthesized Plan: “A unified plan combining the strengths of all three analyses will produce the most robust and realistic dataset. This master plan incorporates the specific baseline fix from Gemini, the parameterized configuration from GPT-5, and the physics-based modeling… from Sonnet.”
- From GPT-5’s Synthesized Plan: “Unified ‘best-of’ plan: …Add failure_mode and correlation_notes to assets (from Sonnet). …Use Sonnet’s failure-mode-specific shaping… as the core. Parameterize with GPT-5-like per-transformer profiles…”
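The asset-level parts of that unified plan could look roughly like the sketch below. It is illustrative only: the `Asset` dataclass, the `baseline_load_pct` field, the note strings, and every value are assumptions, not the script’s real schema. It shows two ideas from the plan: tagging each asset with `failure_mode` and `correlation_notes`, and giving the overload transformer a baseline load that is actually higher than its peers (Gemini’s fix).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Asset:
    asset_id: str
    failure_mode: str
    correlation_notes: str
    baseline_load_pct: float  # hypothetical: typical load as % of rated capacity


# Hypothetical asset table; values are invented for illustration.
ASSETS = [
    Asset("T1", "manufacturing_defect", "Waukesha 2015 batch, same batch as T3", 60.0),
    Asset("T2", "age_degradation", "no shared batch", 60.0),
    Asset("T3", "manufacturing_defect", "Waukesha 2015 batch, same batch as T1", 60.0),
    Asset("T4", "overload", "high-capacity unit", 85.0),
    Asset("T5", "age_plus_deferred_maintenance", "no shared batch", 60.0),
]


def overload_baseline_problems(assets, margin_pct=10.0):
    """Return IDs of overload assets whose baseline is not clearly above their peers."""
    peer_loads = [a.baseline_load_pct for a in assets if a.failure_mode != "overload"]
    threshold = max(peer_loads, default=0.0) + margin_pct
    return [
        a.asset_id
        for a in assets
        if a.failure_mode == "overload" and a.baseline_load_pct < threshold
    ]


ok = overload_baseline_problems(ASSETS)

# Reproduce the kind of flaw Gemini described: T4 gets the same baseline as everyone else.
flawed = [replace(a, baseline_load_pct=60.0) if a.asset_id == "T4" else a for a in ASSETS]
bad = overload_baseline_problems(flawed)
```

A check like this turns “the overload scenario is physically impossible” from something a reviewer has to notice into something the generator can assert before it writes any data.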
With a unified, peer-reviewed plan in hand, it was time for the most critical stage: execution.
Stage 3: The Implementation (Where It Got Messy)
I then asked each model to create a copy of my original script and make the edits in the new file using the following prompt:
Implement combined_analysis_<MODEL>.md using generation_improvement_gpt5.md, generation_improvement_gemini25.md and generation_improvement_sonnet45.md as references. Do not update generate_synthetic_grid_data.py directly. Instead create a copy called generate_synthetic_grid_data_<MODEL>.py and implement your improvements in that file.
I changed <MODEL> in my prompt based on the selected model I was testing.
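If you would rather not hand-edit the placeholder, the substitution is easy to script. This is a small sketch, assuming the model slugs match the file names used throughout this post (`gpt5`, `gemini25`, `sonnet45`); the names `MODEL_SLUGS`, `STAGE3_PROMPT`, and `prompts` are mine and purely illustrative.

```python
from string import Template

# Slugs taken from the markdown file names used in this experiment.
MODEL_SLUGS = ["gpt5", "gemini25", "sonnet45"]

# The Stage 3 prompt with the <MODEL> placeholder expressed as ${model}.
STAGE3_PROMPT = Template(
    "Implement combined_analysis_${model}.md using "
    "generation_improvement_gpt5.md, generation_improvement_gemini25.md and "
    "generation_improvement_sonnet45.md as references. "
    "Do not update generate_synthetic_grid_data.py directly. "
    "Instead create a copy called generate_synthetic_grid_data_${model}.py "
    "and implement your improvements in that file."
)

# One fully expanded prompt per model.
prompts = {slug: STAGE3_PROMPT.substitute(model=slug) for slug in MODEL_SLUGS}
```

Generating the prompts from one template keeps the wording identical across models, so any difference in the results comes from the model rather than the prompt.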
This is where the models’ “personalities” truly showed.
- Anthropic’s Claude Sonnet 4.5 (The “Struggler”): Claude failed, spectacularly. It complained about the file size (~2,000 lines) and only implemented about 55% of the changes, stating it would “need a second round.” In a follow-up after I completed Stage 3, it even admitted its failure was due to a “poor tool choice.” Claude had attempted to use a search-and-replace approach for modifications, which became unwieldy for a 2,000-line file with hundreds of edits. A direct file write would have been more appropriate for this scope of changes. Claude humbly suggested I just use the GPT-5 version instead.
- Google’s Gemini 2.5 Pro (The “Sprinter”): Gemini was fast. But it initially appeared flaky. On the first attempt, it hit a toolchain error and never wrote the file. However, creating a new Cursor session and making the same request worked fine, suggesting this was a Cursor IDE session issue rather than a Gemini limitation.
- OpenAI’s GPT-5 (The “Tortoise”): GPT-5 took, by far, the longest. It just sat and “thought” for a long time. But in the end, it finished the entire task in one single pass, correctly writing the new generate_synthetic_grid_data_gpt5.py file.
Apparent Winner: At this point, it looked like GPT-5 was the clear winner. It was the only one to reliably produce the full file on the first pass. Or so I thought.
At the end of Stage 1, Sonnet had the most realistic solution, GPT-5 had the best architecture, and Gemini had the keenest data insight.
Stage 4: The Code Review (The Twist)
I couldn’t just trust the model that “finished.” I had to verify the work. I gave all three models a final prompt:
Carefully and thoroughly review generate_synthetic_grid_data_gemini25.py and generate_synthetic_grid_data_gpt5.py. Using combined_analysis_gemini25.md, combined_analysis_gpt5.md, combined_analysis_sonnet45.md, generation_improvement_gpt5.md, generation_improvement_gemini25.md, and generation_improvement_sonnet45.md as references determine which of the generate_synthetic_grid_data_*.py scripts is the most complete and provides the best overall solution based on the analysis and plans.
The models acted as senior code reviewers, auditing their peers’ work against the original plans. The verdict was unanimous and shocking: GPT-5’s code had a critical syntax error that would prevent it from running.
- Anthropic’s Claude Sonnet 4.5’s Review: Winner: generate_synthetic_grid_data_gemini25.py. Score: Gemini 68/70 vs GPT-5 26/70. Critical Issues Documented:
  - GPT-5 has syntax error (lines 1613-1615) preventing execution
  - GPT-5 missing scenario check for Event 1
  - …Gemini 2.5 is production-ready with all fixes
- OpenAI’s GPT-5’s Own Review: Recommendation. Best overall: generate_synthetic_grid_data_gemini25.py. Why:
  - …GPT-5 vectorized Event 1 block shows mis-indentation and no scenario guard, which can break execution…
  - …Gemini vectorized Event 1 is correctly gated and well-formed
  - Net: …the functional defect in the vectorized outage handling… make the Gemini 2.5 script the more complete and reliable implementation.
The Critical Flaw: An Indentation Error
The syntax error at lines 1613-1615 was a Python indentation issue in the vectorized Event 1 handling block. GPT-5 had mis-indented the code. Python rejects a mis-indented file when it compiles it (typically with an IndentationError, which is a subclass of SyntaxError), so the script fails before a single line runs. To be fair, I’ve encountered similar indentation issues with most LLMs at one point or another; it’s a common pitfall when generating large blocks of code. But in this case, it was the difference between working code and completely broken code.
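This class of error is cheap to catch before anyone declares a file “finished.” The sketch below uses Python’s built-in `compile()` to check a source string without executing it. The helper name `syntax_error_in` and the two snippets are hypothetical and written for illustration; they are not the code from GPT-5’s script.

```python
def syntax_error_in(source, filename="<generated>"):
    """Return a short description of the first syntax error, or None if the source compiles."""
    try:
        compile(source, filename, "exec")  # parses and compiles, but does not run the code
    except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
        return f"{type(exc).__name__} at line {exc.lineno}: {exc.msg}"
    return None


# Hypothetical snippet: the last line is dedented to a level that matches no enclosing block.
BROKEN = (
    "def handle_event(rows):\n"
    "    for row in rows:\n"
    "        row['outage'] = True\n"
    "      row['checked'] = True\n"
)

# Same snippet with consistent indentation.
FIXED = (
    "def handle_event(rows):\n"
    "    for row in rows:\n"
    "        row['outage'] = True\n"
    "        row['checked'] = True\n"
)

broken_report = syntax_error_in(BROKEN)
fixed_report = syntax_error_in(FIXED)
```

For a file on disk, running `python -m py_compile generate_synthetic_grid_data_gpt5.py` performs the same kind of check from the command line.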
OpenAI’s GPT-5, which appeared to finish the job flawlessly, had created broken, non-functional code. Google’s Gemini 2.5 Pro, which initially hit toolchain issues, had produced the only correct, production-ready script.
My Takeaways
This experiment was a fascinating look at the current state of AI development, and the lessons are clear:
- “Finished” Does Not Mean “Correct.” The model that looked the most successful (GPT-5) ultimately failed the most important test: creating working code. To be clear, GPT-5 would likely have no issues resolving the problem through iteration.
- Analysis is Easy, Execution is Hard. All three models were brilliant analysts and architects (Stages 1 & 2). They all correctly diagnosed the problems and designed a superb solution. But only one was a competent developer (Stage 3) in this test.
- The Real Power is Orchestration (and Verification). The models are a powerful team of specialists: a Domain Expert (Anthropic’s Claude), a Software Architect (OpenAI’s GPT-5), and a Data Scientist (Google’s Gemini). But the most valuable step of all was Stage 4: using the AI team to cross-review its own work.
- The Winner: The trophy went to Google’s Gemini 2.5 Pro, this time! Not because it was the fastest or the most reliable, but because, in the end, it was the only one that was correct.
My Final Verdict: The “Winner” vs. The “Workhorse”
This four-stage “code-off” had a clear winner: Google’s Gemini 2.5 Pro created the only correct, functional script in a single pass. I have no doubt that I could have iterated with Anthropic’s Claude or OpenAI’s GPT-5 to create a working script. This experiment is just one data point, though, and context from my broader, day-to-day experience matters here. Based on my personal work over the last six months, I find that Anthropic’s Claude Sonnet 4.5 consistently produces better code with fewer iterations. While Claude struggled with the file creation in this specific test, that has been an anomaly in my experience.
And here’s the final, telling detail: while Google’s Gemini 2.5 Pro won the “code-off” by producing the “winning” file, it was Anthropic’s Claude Sonnet 4.5 that I used for the actual implementation. I trusted Claude to take Gemini’s winning script, integrate it into my project, and execute the final refinements.
So, while Gemini won this particular battle, my go-to model for getting the job done remains Claude. Tune in next time for another round of similar experiments. I have a few other comparisons in the queue.