What four experiments taught me about model personality

A practical experiment in classifying model personalities and assigning them to planner, reviewer, and executor roles

Most people who use more than one frontier model eventually learn that models have different task strengths. One model is better for fast edits. Another is better for long-form reasoning. Another is better for exhaustive review. That part is not especially surprising anymore.

What I wanted to test was more specific: whether those differences could be described as personality. Not personality in the human sense, but a repeatable instinct that shows up across tasks. If that instinct can be named, tested, and connected to failure modes, then model selection becomes less vibes-based.

Prompt Forge: Multi-Model Prompt Evaluation with Snowflake Cortex

Score, compare, and optimize your prompts across 9 dimensions

Give three frontier models the same prompt, and you’ll get three distinct interpretations of what “good” means. I discovered this gradually, over months of working with AI coding assistants on real projects.

Early on, I’d occasionally switch models mid-task, sometimes because I was curious and sometimes because I wanted a second opinion on a tricky problem. I’d notice changes in the outputs, but I couldn’t pin down what was driving them. Was it my prompt? The task itself? The model? Some combination of all three? The variables were tangled together, and I wasn’t being rigorous enough to isolate them.
