Most people who use more than one frontier model eventually learn that models have different task strengths. One model is better for fast edits. Another is better for long-form reasoning. Another is better for exhaustive review. That part is not especially surprising anymore.
What I wanted to test was more specific: whether those differences could be described as personality. Not personality in the human sense, but a repeatable instinct that shows up across tasks. If that instinct can be named, tested, and connected to failure modes, then model selection becomes less vibes-based.
[Read More]
Image courtesy of Google Gemini
Image courtesy of Google Gemini
Image courtesy of Google Gemini