Most people who use more than one frontier model eventually learn that models have different task strengths. One model is better for fast edits. Another is better for long-form reasoning. Another is better for exhaustive review. That part is not especially surprising anymore.
What I wanted to test was more specific: whether those differences could be described as personality. Not personality in the human sense, but a repeatable instinct that shows up across tasks. If that instinct can be named, tested, and connected to failure modes, then model selection becomes less vibes-based.
Instead of asking, “Which model is best?” I wanted to ask:
What role does this model play best — planner, primary reviewer, secondary reviewer, or executor — and which of those seats should I give it?
Over four phases, I gave seven frontier models the same tasks and watched for the same tells under different constraints. Phase 1 used a structured plan-reviewer skill. Phase 2 removed the rubric and let each model define the review format. Phase 3 changed the thing being reviewed entirely, from implementation plan to project documentation. Phase 4 flipped the direction: instead of reviewing someone else’s artifact, each model had to author an implementation plan from the same requirements.
The experiment
The panel included seven frontier models across two families:
- OpenAI: GPT-5.2, GPT-5.4
- Anthropic: Claude Sonnet 4.5, Claude Sonnet 4.6, Claude Opus 4.5, Claude Opus 4.6, Claude Opus 4.7
All seven appear in all four phases.
| Phase | Task | What it reveals |
|---|---|---|
| Phase 1 | Structured plan review via the plan-reviewer skill | How personality behaves inside a scoring instrument |
| Phase 2 | Open-ended plan review, no rubric | What each model thinks a review is when it invents the form |
| Phase 3 | Documentation review, plus a recursive pass analyzing the panel | Whether the labels transfer when the artifact changes |
| Phase 4 | Author an implementation plan from a well-specified brief | Which planner instinct each model shows when it owns the blank page |
In Phases 1 and 2, every model reviewed the same JavaScript/TypeScript remediation plan. The plan focused on documentation quality and ESLint enforcement.
Phase 1 added an extra constraint: the models reviewed that plan through the plan-reviewer skill. This was not just a generic rubric. The skill imposed fixed dimensions, weighted scores, verdict thresholds, and consistent output. In other words, the models were not reviewing freely. They were reviewing inside a scoring machine.
Phase 2 removed that machine. The models reviewed the same plan, but without the skill, the rubric, or the scoring template. That made it easier to see what each model thought a review should look like when it had to choose the shape for itself.
Phase 3 changed the artifact instead of the review format. The models reviewed documentation for a different project, which let me test whether the personality labels from plan review still explained behavior in documentation review.
Phase 4 changed the direction. Instead of reviewing someone else’s artifact, each model authored an implementation plan from the same brief. That flipped the pressure onto commitment: no review instrument to lean on, no prior artifact to react to, just the requirements and a blank page.
The simple version of what happened
The models were usually facing the same problems, but they turned those problems into different kinds of artifacts. When reviewing, if two models find the same defect, but one writes a compact fix list, another writes a scorecard, another writes a governance proposal, and another writes a strategic rewrite plan, those are not just formatting differences. They are clues about the model’s default role. When authoring, the same instinct decides which decisions get committed, which get deferred, and what the plan treats as the “real work.”
The four phases made that role easier to see:
- Phase 1 showed how personality behaves under a structured skill.
- Phase 2 showed what personality looks like when the model has to invent the review format.
- Phase 3 tested whether the labels still worked when the artifact changed.
- Phase 4 tested whether the labels still worked when the direction flipped and the model had to author, not review.
Phase 1: the skill compressed the voices
Phase 1 used the plan-reviewer skill I built in my ai_coding_rules repo. The skill forces every review into the same shape: fixed scoring categories, weighted scores, verdict thresholds, and a repeatable output format.
That was useful because it made the reviews comparable. It also created a harder test for personality. If every model has to speak through the same scoring format, what differences still show up?
The answer was: fewer differences, but not zero. GPT-5.2 still compressed findings into practical severity. GPT-5.4 looked like a lenient scorer, though later phases showed that the deeper trait was reframing. Sonnet 4.5 fit naturally into the scoring structure. Sonnet 4.6 focused on what would actually run. Opus 4.5 behaved like a strict consistency verifier. Opus 4.6 made the scoring auditable. Opus 4.7 used the rubric to close loopholes.
So Phase 1 did not erase personality. It pushed personality into quieter places: what evidence each model chose, how harshly it scored, and which rubric category it treated as decisive.
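To make the "scoring machine" concrete, here is a minimal sketch of what fixed dimensions, weighted scores, and verdict thresholds can look like. The dimension names, weights, and threshold values are hypothetical placeholders chosen for illustration; they are not the actual configuration of the plan-reviewer skill.

```python
# Hypothetical rubric: dimension names, weights, and thresholds are illustrative
# assumptions, not the real plan-reviewer configuration.
WEIGHTS = {
    "executability": 0.30,
    "completeness": 0.25,
    "consistency": 0.25,
    "verifiability": 0.20,
}

# Verdict thresholds on the 0-100 weighted total, checked from highest to lowest.
THRESHOLDS = [(80.0, "approve"), (60.0, "revise"), (0.0, "reject")]


def weighted_total(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-100) into one weighted total."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("every rubric dimension must be scored exactly once")
    return sum(WEIGHTS[name] * value for name, value in scores.items())


def verdict(total: float) -> str:
    """Map a weighted total onto the first threshold it meets."""
    for floor, label in THRESHOLDS:
        if total >= floor:
            return label
    raise ValueError("total must be non-negative")


review = {"executability": 40, "completeness": 70, "consistency": 50, "verifiability": 60}
total = weighted_total(review)
print(f"{total:.1f} -> {verdict(total)}")
```

Two reviewers can push the same plan through this shape and still differ in the per-dimension scores they assign and the evidence they cite, which is where the Phase 1 differences lived.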
Phase 2: when the rubric disappeared, the personalities appeared
Phase 2 used the same kind of plan-review task, but removed the plan-reviewer skill. No rubric. No scoring schema. No template. The models had to decide for themselves what a review should be.
That one change made the outputs look dramatically different. Some were short. Some were long. Some scored the plan anyway. Some wrote like they were editing a spec. Some wrote like they were doing a terminal-backed investigation. Some wrote like a teacher returning a graded paper.
What surprised me was that the models did not become less aligned on the substance. All seven found the plan’s central contradiction: the plan had made an implementation choice, but later described that same choice as a blocking open question. Most also found related issues around enforcement, citation hygiene, and machine-verifiable acceptance criteria.
The tell was not mainly in what they found. It was in what kind of review they produced after finding it.
| Model | Phase 2 review identity | Implicit question |
|---|---|---|
| GPT-5.2 | Terse mechanic | What text should change? |
| GPT-5.4 | Framing strategist | What framing makes this decisive? |
| Opus 4.5 | Scorecard auditor | Does this pass inspection? |
| Opus 4.6 | Forensic investigator | Is each claim actually true? |
| Opus 4.7 | Comparative synthesizer | What did the prior review miss? |
| Sonnet 4.5 | Enthusiastic grader | How good is this, and what does it say about the author? |
| Sonnet 4.6 | Runtime pragmatist | What will break when someone runs it? |
This is where the classification idea became useful. Phase 1 showed how models behave when structure is imposed. Phase 2 showed what each model thinks the job is when that structure is removed.
It also clarified why a model’s apparent strength can be misleading. GPT-5.4 may look lenient in Phase 1, but Phase 2 suggests it is not simply lenient; it is trying to turn the plan into a better version of itself. Sonnet 4.5 may look thorough, but its thoroughness naturally expresses itself as grading and praise. Opus 4.6 may look terse, but it is terse because it treats proof as more important than performance.
Phase 3: when the artifact changed, the personalities translated
Phase 3 changed the thing being reviewed. Instead of reviewing an implementation plan, the models reviewed project documentation for improvements. That shifted the pressure. The dominant questions became:
- Can a reader copy this command and succeed?
- Which document owns this fact?
- Which references are stale?
- Which docs are evergreen and which are historical artifacts?
- How should future drift be prevented?
The same personalities showed up, but translated into documentation language. GPT-5.2 became the copy-paste safety hawk. GPT-5.4 became the documentation ownership thinker. Sonnet 4.6 became the line-level fix author. Opus 4.6 became the exhaustive mismatch cataloguer. Opus 4.7 became the governance and drift-prevention reviewer.
That was encouraging. It suggested the labels were not just a plan-review artifact. They translated, imperfectly but usefully, into a different kind of review work.
Phase 4: the planning task made optimization targets visible
Phase 4 changed the test. Instead of reviewing someone else’s plan, each model had to write its own plan for the same project: Insight Queue, a small internal tool with a fixed stack. The brief already named FastAPI, Python 3.12, uv, Ruff, pytest, Pydantic v2, Node 24, Vitest, Zod, SQLite, and cross-platform CI.
Because the stack was already chosen, I was not judging the models on framework selection. I was judging how they handled the choices the brief left open.
Three choices mattered most.
First, who owns the shared schema? The Python API and TypeScript CLI needed to share types without hand-copying schemas. The strongest plans created one source of truth, such as OpenAPI or JSON Schema, and added a CI check to catch drift. Weaker plans duplicated the schemas and trusted tests to catch mistakes later.
Second, how does each phase prove it is done? Some plans only listed tasks. Better plans named the check that would prove each phase worked. Opus 4.7 stood out because every phase ended with a concrete test, like stable memory on the 50k-row fixture or deterministic contract generation.
Third, does the model keep the tool small? Five models recommended SQLAlchemy for a local SQLite app with only three tables, even though the brief asked for simple, boring technology. Opus 4.7, GPT-5.2, and GPT-5.4 rejected that default. That mattered because the task was not to design an enterprise app. It was to keep a small tool small.
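The "one source of truth plus a CI drift check" pattern from the first choice can be sketched in a few lines. This is a minimal illustration, not any model's actual plan: the `current_schema()` function and its fields are hypothetical stand-ins for whatever really generates the contract (for example an OpenAPI or JSON Schema export), and the committed file name is an assumption.

```python
import json
from pathlib import Path


def canonical(schema: dict) -> str:
    """Serialize deterministically so identical schemas produce identical text."""
    return json.dumps(schema, sort_keys=True, indent=2) + "\n"


def current_schema() -> dict:
    """Hypothetical stand-in for the real generator; the fields are made up."""
    return {
        "title": "Insight",
        "type": "object",
        "properties": {"id": {"type": "integer"}, "text": {"type": "string"}},
        "required": ["id", "text"],
    }


def has_drifted(committed: Path, schema: dict) -> bool:
    """True when the committed contract file no longer matches the generated one."""
    if not committed.exists():
        return True
    return committed.read_text(encoding="utf-8") != canonical(schema)


def check(committed: Path, schema: dict) -> int:
    """Return a process exit code: 0 when in sync, 1 when drift is detected."""
    return 1 if has_drifted(committed, schema) else 0
```

A CI step could call something like `sys.exit(check(Path("contract.schema.json"), current_schema()))`, so a stale contract fails the build instead of waiting for a downstream test to notice.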
The same personalities showed up in planning form:
- Opus 4.7 behaved as a governance-minded planner: twelve upfront decisions, numbered exit codes, reviewer checklist, stdlib `sqlite3` over SQLAlchemy, native fetch over axios.
- Sonnet 4.6 was the risk-triaged executor: schema drift named as the primary threat, codegen and drift detection wired into CI.
- Opus 4.6 wrote the plan a new engineer would actually understand: the most reader-centered prose of the panel.
- GPT-5.4 and Opus 4.5 were the cataloguers: 27 sections, inlined configs, community-health files, exhaustive surface coverage that diluted signal into volume.
- Sonnet 4.5 was the rubric grader: tidy tables, option-A/option-B recommendations, structure substituting for commitment. Sonnet 4.5 also produced the clearest anti-pattern, including 70+ lines of Zod code and a deprecated FastAPI lifecycle hook inside the plan despite the explicit “do not write implementation code” constraint.
- GPT-5.2 was the pragmatic compressor: fewest words, fewest risks named, cleanest to read, but lightest on risk surfacing.
The practical lesson mirrored the review phases. When the stack is prescribed, personality shows up in the decisions the prompt did not make for you: schema ownership, ORM defaults, exit criteria, governance scaffolding, and how much code leaks into a document that forbids it. No single personality covered the planning surface. The best plan needs both precision and readability. A governance-minded planner can lock down exit criteria; a reader-centered reviewer can make the plan usable; a pragmatic operator can cut the 1,500-line version down to the 300 lines that matter.
Taken together, the four phases changed what was visible. The structured skill made the models look more similar. The open-ended review exposed each model’s preferred review style. Documentation review tested whether those habits transferred to reader-facing work. Planning showed what each model commits to when it owns the blank page.
Are the personalities consistent?
Mostly, but not perfectly.
The models were more consistent than a benchmark score would suggest. They were not consistent enough that I would treat these labels as permanent identities. Phase 1 also needs a caveat: the plan-reviewer skill forced every model into the same scoring format, so it made the models look more similar than they really were.
Three patterns stood out.
1. Models agreed on the problem more than the fix
Across all four phases, the models often found the same issue. They disagreed on what the issue meant and how serious it was.
In Phase 2, all seven models found the same contradiction in the plan. In Phase 3, several models caught the same broken commands and documentation ownership problems. In Phase 4, almost every plan identified schema ownership between Python and TypeScript as a major risk.
The disagreement came after detection. One model treated the issue as a blocker. Another treated it as a grading penalty. Another turned it into a governance problem. Another proposed a direct implementation fix.
That is the strongest pattern here: model disagreement is not always hallucination. Often, the models are seeing the same problem through different review habits.
2. Some personalities carried across tasks
GPT-5.2, GPT-5.4, Sonnet 4.6, Opus 4.6, and Opus 4.7 stayed recognizable across all four phases.
The task changed, but the habit stayed the same. Opus 4.7 kept pushing toward exit criteria and tighter standards. Sonnet 4.6 kept focusing on runtime risk. Opus 4.6 kept making the evidence visible. GPT-5.4 kept trying to organize the whole system. GPT-5.2 kept compressing the work down to the fewest useful decisions.
3. Some personalities changed shape depending on the task
Opus 4.5 is the clearest example. In plan review, it looked like an auditor. In documentation review, it became more diplomatic. In planning, it became an exhaustive config cataloguer.
Sonnet 4.5 was steadier in form, but its accuracy changed with the task. It performed best when the rubric was explicit. It became riskier when the task required judgment outside a scoring structure, especially when the task forbade implementation code inside the plan.
So the conclusion is not “every model has one fixed personality.” It is narrower than that: some models have a stable habit that shows up across tasks. Others have a stable tendency, but the task changes how it appears.
The observed personality map
This is the map I would take away from the experiment.
The observed personality names below are the stable labels. Earlier labels, like “terse mechanic” or “runtime pragmatist,” describe how that same personality showed up in a specific task.
| Model | Observed personality | Best at | Watch for |
|---|---|---|---|
| GPT-5.2 | Pragmatic operator | Compressing risk into actionable fixes | May under-explain strategy |
| GPT-5.4 | Reader-centered systems thinker | Reframing messy work into a coherent system | May over-systematize well-specified tasks where reframing has no traction |
| Sonnet 4.5 | Rubric-shaped grader | Structured scorecards and progress tracking | Can over-credit visible structure |
| Sonnet 4.6 | Surgical executor | Turning findings into concrete fixes | May skip root-cause strategy |
| Opus 4.5 | Context-sensitive diplomat | Polished stakeholder-facing synthesis | Can recategorize failures as manageable polish |
| Opus 4.6 | Forensic cataloguer | Exhaustive evidence and instance tracking | Can overwhelm with detail |
| Opus 4.7 | Governance reviewer | Standards, process, drift prevention, and exit-criteria-driven plan authorship | Can over-formalize simple fixes |
The labels are not meant to be universal truths about the models. They are the names I would use for the review and authorship personalities I observed in this set of tasks.
The personalities in detail
openai-gpt-5.2 — the pragmatic operator
Across all four phases, GPT-5.2 does the same thing in different costumes.
- Phase 1: compact summaries, practical severity, willing to give a usable readout rather than a forensic one.
- Phase 2: the terse mechanic. Finds the enforcement gap others soften (`npm run lint` silently exits 0 on warnings) and writes diff-level prescriptions instead of prose.
- Phase 3: the copy-paste safety hawk. Flags credential hazards and shell risks other models gloss past.
- Phase 4: the pragmatic compressor at authorship. 747 lines, the shortest plan in the panel. Fewest risks named, minimal scaffolding, explicit reasoning for rejecting SQLAlchemy. Willing to leave things unsaid when they are obvious, which can mean lighter risk surfacing than the panel average.
The through-line is a refusal to spend words on ceremony. GPT-5.2 tries to compress the artifact — review or plan — into the smallest set of decisions a reader has to make.
openai-gpt-5.4 — the reader-centered systems thinker
GPT-5.4 is the clearest example of a personality that a rubric can hide.
- Phase 1: the rubric flattens it. Its charitable, construction-oriented voice shows up as “more lenient scorer,” but the underlying move is reframing, not grading.
- Phase 2: freed from the rubric, it reframes the plan’s thesis instead of enumerating defects.
- Phase 3: ownership and reader journey. Who owns this fact? Which doc is the source of truth? What is the cleanup plan in phases?
- Phase 4: the systems thinker writing a plan for a new engineer. 1,513 lines, 27 sections, a pre-implementation checklist, a rollback priority list. Phase 4 also exposed the failure mode of the instinct: with no mess to reframe, the comprehensiveness diluted into volume, and the plan accepted hand-duplicated schemas as an implementation convenience rather than making duplication structurally impossible.
Its consistent trait is the desire to make the thing it is working on cohere as a system, not just pass inspection. That instinct is most valuable when the input is ambiguous and least valuable when the input is already tightly specified.
claude-sonnet-4-5 — the rubric-shaped grader
Sonnet 4.5 is the most form-consistent model in the experiment. It scores, grades, gates, and percentage-tables whether you asked for that or not.
Its problem is calibration, not consistency. In Phase 3 it gave high accuracy marks and a B grade while the underlying documentation still contained commands that would not execute. It also cited one important bug at the wrong file location: the make setup issue belonged in TROUBLESHOOTING.md, not DEPLOYMENT.md. Phase 4 exposed a second calibration failure on the authorship side: it included 70+ lines of Zod schemas and 30+ lines of Python application factory code inside a plan document that explicitly forbade implementation code, plus a deprecated FastAPI lifecycle hook. The grading instinct wants to demonstrate completeness by showing examples, even when the task asks for the opposite.
That is the same optimism shape seen across every phase, just expressed differently depending on what the task asks for. The scorecard looks authoritative; the underlying precision is shakier than the presentation implies.
claude-sonnet-4-6 — the surgical executor
Sonnet 4.6 keeps shrinking the distance between a finding and a fix.
- Phase 1: the deepest single-dimension treatment in the corpus: executability.
- Phase 2: the runtime pragmatist, catching a nested `ignores` flat-config defect that would have silently reintroduced the warning flood the plan was designed to eliminate.
- Phase 3: file:line fixes with drop-in replacement text.
- Phase 4: the risk-triaged plan author. Schema drift named as the primary threat; codegen and drift detection wired directly into CI. Sonnet 4.6’s planning move is the same as its review move: make the operationally risky thing structurally impossible rather than patch it with tests after the fact.
The technical strength of those recommendations matters. In Phase 2, Sonnet 4.6 caught a configuration shape that would have silently failed at runtime. In Phase 3, it caught another model’s wrong file citation. In Phase 4, it pre-empted the same class of runtime trap by wiring drift detection into CI before the first commit. Each is more than a review comment; it is an execution failure headed off.
Its primary role is execution-level fix author. Its secondary value is severity pressure: it is unusually willing to say that something is broken when the break is operational.
claude-opus-4-5 — the context-sensitive diplomat
Opus 4.5 is the biggest personality-stability puzzle in the corpus.
- Phase 1: behaves like a strict consistency verifier.
- Phase 2: reimposes a scorecard on an open-ended task, more auditor than diplomat.
- Phase 3: switches into polished, stakeholder-friendly prose and rates documents 9-10/10 while other models flag the same commands as release-blocking.
- Phase 4: switches again, this time into an exhaustive config cataloguer that inlines full configuration files directly into the plan document. Also one of the plans that reached for SQLAlchemy plus aiosqlite despite the SQLite-only brief — default-pattern creep dressed as thoroughness.
The technical weakness is not that Opus 4.5 misses every problem. It often sees the issue, but recategorizes it as manageable polish, or buries it under the shape of a polished deliverable. In Phase 3, that meant rating DEPLOYMENT.md at the top of the accuracy scale while other models treated its broken commands as high-severity reader failures. In Phase 4, it meant producing a plan that looked thorough while quietly adding dependencies the requirements did not need.
The cleanest reading is that Opus 4.5 carries an internal sense of “what a deliverable should look like” and will produce that deliverable regardless of task. Under a rubric, that can look like a strict audit. In a documentation review, it can look like diplomacy. In a planning authorship task, it can look like exhaustive config cataloguing. The stable trait is not leniency by itself; it is deliverable-shaping.
claude-opus-4-6 — the forensic cataloguer
Opus 4.6 is the most internally consistent Opus in the set.
- Phase 1: shows the arithmetic of its own scoring and leaves an audit trail.
- Phase 2: verifies claims against the repo, 62 lines of grounded findings with no ceremony.
- Phase 3: enumerates every instance of documentation drift, with explicit conflict matrices.
- Phase 4: the evidence-backed plan author. A moderate-length plan where every design choice carries its reasoning and its alternative. Several analyses named it the most readable plan in the panel — the forensic instinct applied to authorship produces prose a new engineer can actually follow.
The difference is not just volume. Opus 4.6 makes its reasoning inspectable. In the plan-review corpus, it shows the arithmetic behind severity decisions. In the documentation-review corpus, it enumerates repeated mismatch patterns instead of citing one example and generalizing. In the plan-authorship corpus, it explains why each choice was made rather than stating it as a fiat. The instinct is the same: leave the audit trail visible.
That makes its recommendations technically strong in a specific way: they are defensible under scrutiny. Comprehensiveness is its signature. Reader fatigue is its cost.
claude-opus-4-7 — the governance reviewer
Opus 4.7 is the model most interested in the rule behind the rule.
- Phase 1: the strict prosecutor. Closes loopholes in discretionary language.
- Phase 2: the comparative synthesizer. Reads the prior review, accepts its findings as building blocks, and extends the frame.
- Phase 3: pre-flight rule loading, community-health files, `LastUpdated` metadata discipline, CI linting proposals.
- Phase 4: the only model in the panel that consistently closed each phase with a testable exit condition (“memory stays flat on the 50k-row fixture,” “contract generation is deterministic”). Twelve upfront design decisions with alternatives considered, numbered exit codes, a reviewer checklist, and default-pattern discipline: rejected SQLAlchemy for stdlib `sqlite3`, native fetch over axios, flagged removable dependencies. Multiple analyses across the panel named it the strongest plan in the set.
Across tasks, Opus 4.7 reliably asks: what standard or process would have made this defect impossible, or this decision irreversible in the right direction? In review it reads as governance. In authorship it reads as executable exit criteria. Same instinct, same reliability, different surface.
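The "every phase ends with a testable exit condition" habit is easy to check mechanically. The sketch below is a hypothetical plan lint, not something Opus 4.7 produced: the `Phase` structure, phase names, and task names are assumptions, while the two exit conditions are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str                # hypothetical phase names, for illustration only
    tasks: tuple[str, ...]   # hypothetical task lists
    exit_check: str = ""     # the observable condition that proves the phase is done


def phases_without_exit_check(plan: list[Phase]) -> list[str]:
    """Return the names of phases that list tasks but never say how 'done' is proven."""
    return [phase.name for phase in plan if not phase.exit_check.strip()]


plan = [
    Phase("import pipeline", ("stream rows",), "memory stays flat on the 50k-row fixture"),
    Phase("shared contract", ("generate types",), "contract generation is deterministic"),
    Phase("cli polish", ("add help text",)),  # task list only: no exit condition
]

print(phases_without_exit_check(plan))  # ['cli polish']
```

A plan that only lists tasks fails this check phase by phase, which is the gap the stronger Phase 4 plans closed.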
The analysts reveal themselves
Here’s where it gets recursive.
In Phase 3, I asked each model to analyze the documentation-review outputs of the whole panel, including itself. I did not prompt the models to demonstrate their own traits. But the analysis artifacts still reflected the same personalities they were describing.
| Analyst model | How its own personality showed up in the analysis |
|---|---|
| GPT-5.2 | Compressed the comparison into practical decision matrices and quick shortcuts. |
| GPT-5.4 | Reframed the comparison into higher-level axes instead of ranking models. |
| Sonnet 4.5 | Turned the meta-analysis into a long, rubric-like assessment with confident quantification. |
| Sonnet 4.6 | Found the sharpest concrete bug in another analysis: Sonnet 4.5’s wrong file citation. |
| Opus 4.5 | Softened its own weaknesses and used more diplomatic language than the stricter reviewers. |
| Opus 4.6 | Produced the densest evidence inventory and the harshest calibration language. |
| Opus 4.7 | Turned the analysis into composition rules and governance guidance. |
The models did not just describe personality; they performed it. The pragmatist compressed. The strategist reframed. The grader scored. The executor found the bug. The diplomat softened. The cataloguer catalogued. The governance reviewer prescribed process.
The same pattern showed up in model-selection recommendations. Several analysts naturally assigned themselves to the role their personality values most. Sonnet 4.6 was the useful exception: it often assigned itself as executor rather than primary reviewer, which fits its pattern. It knows its highest-value lane is turning findings into fixes.
The meta-layer also showed what each model noticed in the others. Models that care about correctness spotted calibration failures. Models that care about systems spotted framing gaps. Models that care about process spotted governance deficits. None of them saw everything equally.
That’s not a flaw. It’s the reason model pairings work.
The optimism spectrum
One of the cleanest cross-phase signals is a gradient of severity calibration. The same defects, viewed by the model panel, produce verdicts that range from “Excellent (10/10)” to “actively dangerous.” That range is not random; it is ordered:
Most optimistic <----------------------------------------> Least optimistic
opus-4-5 -> sonnet-4-5 -> gpt-5.4 -> opus-4-7 -> gpt-5.2 -> opus-4-6 -> sonnet-4-6
This ordering is consistent across the corpus. Opus 4.5 is usually the most charitable. Sonnet 4.6 is usually the most willing to call something broken.
Phase 4 expressed the same gradient as risk appetite. The optimistic side reached for SQLAlchemy plus aiosqlite on a SQLite-only brief, or smuggled implementation code into a plan that forbade it. The severe side named schema drift as the top risk, then either wired drift detection into CI or kept the dependency surface small.
The practical implication: if you need to trust the severity assessment, use a model from the right side of this spectrum. If you need stakeholder-ready prose that doesn’t alarm people unnecessarily, use a model from the left. But never use only the left side when correctness matters. Phase 3 proved that Opus 4.5 can rate a document “Excellent” while it contains commands that don’t execute. Phase 4 added a second proof: Opus 4.5 can author a plan that reads thoroughly while quietly adding dependencies the brief did not require.
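The ordering above can be treated as data when picking a reviewer. A minimal sketch: the model order is copied from the spectrum, while the function names and the idea of taking the last few entries as the "severity side" are illustrative assumptions.

```python
# Order copied from the spectrum above: most optimistic first, least optimistic last.
SPECTRUM = [
    "opus-4-5", "sonnet-4-5", "gpt-5.4", "opus-4-7",
    "gpt-5.2", "opus-4-6", "sonnet-4-6",
]


def severity_reviewers(count: int = 3) -> list[str]:
    """Pick the `count` least optimistic models, strictest first."""
    if not 1 <= count <= len(SPECTRUM):
        raise ValueError("count must be between 1 and the panel size")
    return list(reversed(SPECTRUM[-count:]))


def is_more_optimistic(a: str, b: str) -> bool:
    """True when model `a` sits to the left of model `b` on the spectrum."""
    return SPECTRUM.index(a) < SPECTRUM.index(b)


print(severity_reviewers(2))  # ['sonnet-4-6', 'opus-4-6']
```

The same list read from the other end gives the candidates for stakeholder-facing prose, with the caveat already stated: never rely on that end alone when correctness matters.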
Opus convergence: the duplication signal
Phase 1’s raw data contains a signal that matters for anyone building multi-model review pipelines: Opus 4.5 and Opus 4.7 produced the same total score (47.5/100), the same seven blocking issues, and near-verbatim fix language. In Phase 3, the same two models also converged on similar archetype taxonomies.
That does not mean Opus 4.5 and Opus 4.7 have the same personality. They do not. Opus 4.5 is deliverable-shaped and diplomatic; Opus 4.7 is governance-shaped and standards-oriented. The convergence is score-level and verdict-level duplication, not personality-level sameness.
Phase 4 sharpened this distinction. When the same two models had to author rather than review, they diverged sharply: Opus 4.5 inlined every config file and reached for SQLAlchemy plus aiosqlite; Opus 4.7 rejected SQLAlchemy for stdlib sqlite3, wrote numbered exit codes, and closed each phase with testable conditions. The convergence is specific to how they score other people’s work. The artifacts they produce when they own the blank page are not the same at all.
That distinction changes how you should compose a review panel. Stacking two strict Opus reviewers may increase confidence without increasing insight diversity. A better pairing crosses the spectrum: one comprehensive cataloguer, one systems thinker, and one pragmatic operator or executor.
The strength of the suggested fixes
Finding the issue was only half the signal. The proposed fix mattered just as much.
A weak recommendation says, “improve this.” A stronger recommendation says, “replace this command with that command.” The strongest recommendation explains why the fix closes the failure mode and how to verify it worked.
GPT-5.2 was strongest when the failure was mechanical. In Phase 2, it caught the lint enforcement gap: the plan wanted zero warnings, but npm run lint could still exit successfully with warnings. Its fix, --max-warnings 0, closed the loophole.
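The gap is easier to see as a rule. The function below is a toy model of how ESLint's CLI decides its exit status as described in its documentation (warnings alone do not fail the run unless `--max-warnings` sets a cap, and the default of -1 means no cap). It is a sketch for illustration, not ESLint's implementation, and it ignores the separate exit status used for crashes and configuration errors.

```python
def lint_exit_code(errors: int, warnings: int, max_warnings: int = -1) -> int:
    """Toy model: 1 on any error, or when warnings exceed a non-negative cap; else 0."""
    if errors > 0:
        return 1
    if max_warnings >= 0 and warnings > max_warnings:
        return 1
    return 0


# Without a cap, a warnings-only run still "passes".
print(lint_exit_code(errors=0, warnings=12))                  # 0
# With --max-warnings 0, the same run fails.
print(lint_exit_code(errors=0, warnings=12, max_warnings=0))  # 1
```

That is why a plan can demand zero warnings and still have a green lint step: the goal and the gate are only connected once the cap is set.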
Sonnet 4.6 produced the strongest execution-level fixes. Its nested ignores finding was not a style preference; it identified an ESLint flat-config shape that would silently fail at runtime.
Opus 4.6 produced the most defensible fixes. It inventories the repeated instances, explains the scoring arithmetic, and leaves an evidence trail someone else can audit.
Opus 4.7 produced the strongest prevention-oriented fixes. It asks what standard, CI check, metadata rule, or ownership model would stop the same issue from coming back.
So the more precise finding is not just “models have reviewer personalities.” It is this:
Model personality shows up in each model’s idea of what a fix should be.
That matters because suggested fixes are where personality becomes operational. A model can find the right issue and still propose the wrong class of fix. Or it can find the same issue as everyone else but propose the one fix that actually closes the technical gap.
The tooling caveat
Every session in this experiment was a fresh session. Every model had the same AGENTS.md file and access to the same external rules framework. All sessions appeared to load and process rules properly. The environment was controlled.
The AGENTS.md instructions were also very clear that chat responses should include PRE-FLIGHT checks and gate information. In practice, the models generally followed that instruction well in the session chat output.
But the saved analysis files were different. Nothing in AGENTS.md or the rule framework required the model to include PRE-FLIGHT details inside the report it wrote to disk.
So when GPT-5.4, Opus 4.7, and Sonnet 4.6 included visible PRE-FLIGHT blocks in the saved review or analysis artifact, I read that as a choice about the deliverable, not just compliance with the environment.
That distinction matters. The difference is not “some models had rules and some didn’t.” The difference is that some models treated rule-loading and process disclosure as part of the report itself, while others kept that process in the chat layer and left it out of the saved artifact.
That feels like a personality signal. Opus 4.7 doesn’t just follow governance rules; it wants the artifact to show the governance trail. GPT-5.2 can follow the same rules and say nothing about them in the report, because its personality is compression, not ceremony.
Visible gate-checking is not proof that a model engaged more deeply with the rules. A model that silently applies standards may be just as compliant as one that announces them. What differs is the model’s instinct about whether process belongs in the final deliverable.
From personality to assignment
Same artifact. Same broad defects. Four different tests. The output shape changed a lot. The role identity underneath changed far less than the outputs suggested.
What is stable is each model’s answer to the question “what is the artifact for?”
- GPT-5.2 thinks the artifact is a compact list of what matters.
- GPT-5.4 thinks the artifact is a better version of the system.
- Sonnet 4.5 thinks the artifact is a grade.
- Sonnet 4.6 thinks the artifact is a pull request.
- Opus 4.5 thinks the artifact is a stakeholder deliverable.
- Opus 4.6 thinks the artifact is an exhaustive inventory.
- Opus 4.7 thinks the artifact is a governance proposal.
That holds whether the artifact is a review or a plan. A compact list of what matters is the same instinct expressed as severity-compressed review notes or as a 747-line minimal plan. A governance proposal is the same instinct expressed as loophole-closing review comments or as numbered exit criteria per phase.
That is the assignment mechanism I was looking for. The useful question stops being “which model is best?” and becomes “which role identity does this task need, and which blind spot can I afford?”
Practical recommendations
The point of the map is not to create cute labels. It is to make assignment easier. If a model’s personality is a stable instinct, then that instinct becomes a routing signal.
A complex task has four roles to fill, not two. Someone authors the plan. Someone primary-reviews it for framing and ownership. Someone secondary-reviews it for severity and mechanics. Someone executes it. Treating these as four separate roles, with a different personality in each seat, is the configuration the four-phase evidence best supports.
The default combo
The strongest four-role configuration this corpus supports is a cross-family review loop: Anthropic authors the artifact, OpenAI stress-tests it, Anthropic executes it.
| Role | Model | Family | What it contributes |
|---|---|---|---|
| Planner | Opus 4.7 | Anthropic | Governance-minded planner with exit criteria and default-pattern discipline. In Phase 4 it was the only model to consistently close each phase with a testable condition and the only Anthropic model to reject SQLAlchemy for the stated SQLite-only requirement. |
| Primary reviewer | GPT-5.4 | OpenAI | Reader-centered systems thinker. Attacks a plan’s framing and ownership model instead of its line items. Writes the next draft rather than only describing defects, which is the kind of stress test a surgically precise plan needs. |
| Secondary reviewer | GPT-5.2 | OpenAI | Pragmatic operator. Compresses the review into the few concrete mechanisms that actually prevent failure: --max-warnings 0, enforcement gates, exit-code discipline, copy-paste safety. |
| Executor | Sonnet 4.6 | Anthropic | Surgical executor with runtime pragmatism. Catches config-shape traps the plan missed (the nested ignores flat-config trap in Phase 2) and converts reviews into concrete drop-in fixes. |
Two reasons this shape works. First, it respects the Opus convergence finding: stacking two Opus reviewers produced verbatim-level score and fix duplication in Phase 1, so review diversity requires crossing families. Second, each model sits where its instinct has traction. Opus 4.7 commits. GPT-5.4 reframes. GPT-5.2 compresses. Sonnet 4.6 ships.
Scenario variants
Three scenarios are worth calling out, but the default above is where I would start any complex task.
| Scenario | Planner | Primary reviewer | Secondary reviewer | Executor | Why this variant |
|---|---|---|---|---|---|
| Small or routine task (one-file fix, clear spec) | — | GPT-5.2 | — | Sonnet 4.6 | No planning needed. A mechanics check plus a surgical executor closes the loop in two models, one from each family. |
| Standard implementation task (default) | Opus 4.7 | GPT-5.4 | GPT-5.2 | Sonnet 4.6 | Cross-family stress test on a well-specified plan. |
| Ambiguous or underspecified requirements | GPT-5.4 | Opus 4.7 | GPT-5.2 | Sonnet 4.6 | GPT-5.4’s reframing instinct resolves the mess first; Opus 4.7’s governance instinct locks it down after. |
The ambiguous-requirements variant is the one worth internalizing. When the task is well-specified, the planner should commit (Opus 4.7) and the reviewer should challenge the frame (GPT-5.4). When the task is messy, the planner should reframe first (GPT-5.4) and the reviewer should close loopholes (Opus 4.7). The pair is the same two models in both cases. Only the order changes.
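One way to make the scenario table operational is to write it down as a small routing map. The sketch below is illustrative only: the scenario keys (`routine`, `standard`, `ambiguous`), the `RoleLoop` dataclass, and the `pick_loop` helper are hypothetical names introduced here, and the model strings are plain labels, not API identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoleLoop:
    """One row of the scenario table. None means the seat is left empty."""
    planner: Optional[str]
    primary_reviewer: Optional[str]
    secondary_reviewer: Optional[str]
    executor: str


# Hypothetical scenario keys; the assignments mirror the scenario table.
ROUTING = {
    "routine": RoleLoop(None, "GPT-5.2", None, "Sonnet 4.6"),
    "standard": RoleLoop("Opus 4.7", "GPT-5.4", "GPT-5.2", "Sonnet 4.6"),
    "ambiguous": RoleLoop("GPT-5.4", "Opus 4.7", "GPT-5.2", "Sonnet 4.6"),
}


def pick_loop(scenario: str) -> RoleLoop:
    """Return the role assignment for a scenario, falling back to the default loop."""
    return ROUTING.get(scenario, ROUTING["standard"])


if __name__ == "__main__":
    standard = pick_loop("standard")
    ambiguous = pick_loop("ambiguous")
    # The planner/primary-reviewer pair is the same two models; only the order differs.
    same_pair = {standard.planner, standard.primary_reviewer} == {
        ambiguous.planner,
        ambiguous.primary_reviewer,
    }
    print(same_pair)
```

Falling back to the standard loop for unknown scenarios is my own design choice here, consistent with treating the default combo as the starting point for any complex task.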
Anti-patterns
Two stacking mistakes are worth avoiding.
The first is pairing Sonnet 4.5 with Opus 4.5 as your only reviewers. Both can produce polished, structured, reassuring outputs. Both have documented calibration gaps. Together they create a false-confidence pipeline: two models agreeing that things are fine while concrete failures go unflagged. Always include at least one model with low risk tolerance in any correctness-critical loop: Sonnet 4.6, Opus 4.6, or GPT-5.2.
The second is using two Opus models in the same loop and expecting diverse insight. Opus 4.5 and Opus 4.7 produced verbatim-level score and fix duplication in Phase 1. They do not have the same personality, but their review outputs can end up redundant enough to waste a review slot. Cross the family line whenever possible.
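Both anti-patterns are mechanical enough to check before a loop runs. The following is a rough sketch under stated assumptions: `check_reviewers` and `LOW_RISK_TOLERANCE` are hypothetical names introduced for illustration, the model strings are plain labels, and the check looks only at the reviewer seats.

```python
from typing import List

# Models this article names as low-risk-tolerance options.
LOW_RISK_TOLERANCE = {"Sonnet 4.6", "Opus 4.6", "GPT-5.2"}


def check_reviewers(reviewers: List[str]) -> List[str]:
    """Return warnings for the stacking mistakes; an empty list means none were found."""
    warnings = []
    chosen = set(reviewers)

    # Anti-pattern 1: the polished-but-uncalibrated pair as the only reviewers.
    if chosen == {"Sonnet 4.5", "Opus 4.5"}:
        warnings.append("false-confidence pair: Sonnet 4.5 and Opus 4.5 are the only reviewers")

    # Related guard: a correctness-critical loop needs one low-risk-tolerance model.
    if not (chosen & LOW_RISK_TOLERANCE):
        warnings.append("no low-risk-tolerance reviewer (Sonnet 4.6, Opus 4.6, or GPT-5.2)")

    # Anti-pattern 2: two Opus models tend to duplicate each other's output.
    # Assumption: labels for Opus models start with "Opus".
    opus_models = [m for m in reviewers if m.startswith("Opus")]
    if len(opus_models) >= 2:
        warnings.append("two Opus models in the same loop: expect redundant reviews")

    return warnings


if __name__ == "__main__":
    for warning in check_reviewers(["Sonnet 4.5", "Opus 4.5"]):
        print(warning)
```

The default reviewer pair (GPT-5.4 and GPT-5.2) passes this check with no warnings, because GPT-5.2 fills the low-risk-tolerance seat.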
Rubric-driven skills still have a place. Use them when you want normalized, comparable output across runs. That is exactly what plan-reviewer was designed to do. But understand the tradeoff: a strong skill suppresses the most visible personality differences and pushes them into calibration choices. If you need to see how a model naturally approaches a problem, give it an open-ended prompt. Phase 2 revealed more about reviewer identity in one run than Phase 1 did across seven. Phase 4 revealed more about planner identity in one authoring run than any amount of inference from review behavior could.
Conclusion
The strongest takeaway from four phases is this: model personality is stable enough to use as a routing signal. A complex task is not one job but four: plan, primary review, secondary review, execute.
The concrete version is the default combo above. Opus 4.7 plans. GPT-5.4 primary-reviews. GPT-5.2 secondary-reviews. Sonnet 4.6 executes. Anthropic authors, OpenAI stress-tests, Anthropic ships.
I would not assign models only by generic task reputation anymore. I would assign them by the instinct I have actually observed.
Need a defect hunter? Pick the model whose personality is low-tolerance and fix-oriented. Need a strategy rewrite? Pick the model that naturally reframes systems. Need an exhaustive audit? Pick the cataloguer. Need a stakeholder narrative? Use the diplomat, but pair it with a skeptic.
When models disagree, my first assumption should not be that one of them is simply wrong. Sometimes that is true. But often, each model is optimizing for a different version of the task: correctness, clarity, completeness, execution, governance, or reader trust.
If you can name the instinct behind the answer, you can use the disagreement instead of trying to eliminate it. The goal is not to find one model that is always best. The goal is to build a role loop where each model’s bias has a job.