Prompt Forge: Multi-Model Prompt Evaluation with Snowflake Cortex

Give three frontier models the same prompt, and you’ll get three distinct interpretations of what “good” means. I discovered this gradually, over months of working with AI coding assistants on real projects.

Early on, I’d occasionally switch models mid-task, sometimes because I was curious and sometimes because I wanted a second opinion on a tricky problem. I’d notice changes in the outputs, but I couldn’t pin down what was driving them. Was it my prompt? The task itself? The model? Some combination of all three? The variables were tangled together, and I wasn’t being rigorous enough to isolate them.

As I started comparing models more intentionally, using the same prompt and context across different models, the patterns became harder to ignore. What surprised me wasn’t that different models gave different answers—that part was expected—but how differently they interpreted the same instructions. One model would take requirements literally and produce minimal output. Another would read between the lines and add error handling I hadn’t asked for. A third would restructure my entire approach.

None of these responses were wrong, exactly. They were just… different. And that gap fascinated me.

So I started paying closer attention. I was already aware that plenty of people were benchmarking models, testing them across tasks, and publishing leaderboards. And I think anyone who regularly uses more than one model already knows, at a practical level, that the models feel different. But I found myself wanting to understand what those differences actually were outside of a benchmark score.

When I got an output I liked from one model, I’d feed the same prompt to another just to see what changed. I’d ask models to critique each other’s work. What began as idle curiosity slowly turned into a kind of informal research project: understanding why the same words could produce such different results.

Over time, I noticed patterns. Some prompts worked reliably across models: the clear, specific, well-structured ones. Others were coin flips. The variance wasn’t random. It was a signal. The models themselves behave differently by design, but how I communicated with them mattered just as much.

I dug deeper. I researched prompt engineering best practices from frontier model vendors. I learned that context engineering—how you structure and present information—was just as important as what you asked for. My goal shifted from “get this task done” to “write prompts that work reliably, regardless of which model executes them.”

That evolution led to my AI Code-Off experiments: intentional, structured comparisons where I pitted Claude, Gemini, and GPT-5 against each other on complex tasks like 2,000-line Python refactoring, documentation improvements, and README rewrites. Every comparison taught me something new about what makes prompts work.

Along the way, I ran into an interesting problem. When I asked models to evaluate each other’s work, they’d give scores. But their criteria were completely different. I’d add guardrails like “score this on a 100-point scale,” but a 100-point scale can be made up of anything the LLM thinks is relevant. One model might weight “creativity” heavily. Another might focus on “safety.” A third might penalize verbosity while the first rewards it.

The numbers looked comparable. The criteria behind them were completely different.

I needed a rubric—a structured scoring system that forced all models to evaluate against the same dimensions. Not “rate this prompt,” but “rate this prompt on Actionability, Completeness, Token Efficiency, and six other specific criteria, using these definitions and these scoring thresholds.”

Trial and error shaped that rubric. I even asked the models themselves what would make a given prompt better. But running those experiments manually got tedious. Every comparison required loading the same context, swapping models, running the same prompt, collecting outputs, and repeating. For a four-stage experiment across three models, that’s twelve separate interactions at a minimum. Human error crept in: forgetting where I was, accidentally using yesterday’s prompt version, having to start over.

My goal was practical: get more consistent behavior from different models on the same tasks so I could get past blockers, ship working code, and move faster. But after doing this for the better part of a year, I also saw a chance to turn those lessons into something useful for other people.

The point was not to optimize for Claude, GPT, Gemini, or any other model specifically. It was to improve prompt hygiene across models: clearer criteria, fewer hidden assumptions, and more consistent output no matter which model runs the prompt. I wanted a way to help people improve their prompts without sending them to a vendor-specific tool or website that nudged them toward model-specific optimizations.

So I built something to fix that. That’s how Prompt Forge was born. The repo is coming soon!

What Prompt Forge does

Prompt Forge evaluates AI prompts across 9 weighted dimensions and generates optimized versions. Think of it as a code linter, but for prompts.

[Screenshot: Prompt Forge interface showing prompt input, model selector, and evaluation button]
The interface is intentionally simple: enter your prompt, select a model, click Evaluate.

Here’s the workflow:

1. Enter your prompt
2. Select one model (or up to four for comparison)
3. Click Evaluate
4. Get dimension-by-dimension scores, identified issues, and recommendations
5. Copy the optimized prompt

[Screenshot: Prompt entry with example loaded and model selected]
Load a built-in example or paste your own prompt. Here, a Snowflake SQL query request is ready for evaluation with Claude Sonnet 4.5 selected.

Once you click Evaluate, the tool runs all 9 dimensions in parallel:

[Screenshot: Evaluation in progress showing all 9 dimensions running]
Real-time progress tracking shows each dimension being evaluated simultaneously. The entire analysis typically completes in under 90 seconds.

For multi-model comparison, the workflow is the same, but you select up to four models and get a side-by-side analysis with comparison charts. You can tune how many models run simultaneously in Settings > Evaluation, retry only the models that failed, and export the full results as a markdown report.

[Screenshot: Multi-model comparison showing parallel evaluation of two models]
Multi-model mode evaluates the same prompt across multiple LLMs simultaneously, with real-time progress tracking for each dimension.
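To make the parallel flow concrete, here is a minimal sketch of the pattern described above: nine dimensions scored concurrently per model, a cap on how many models run at once, and a retry that touches only the models that failed. This is not Prompt Forge's actual code (the app runs on Node.js, and Python is used here purely for illustration). The names `score_dimension`, `evaluate_models`, `max_parallel_models`, and the model identifiers are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

DIMENSIONS = [
    "Actionability", "Completeness", "Execution robustness",
    "Cross-agent consistency", "Chain-of-thought clarity",
    "Consistency", "Parsability", "Context grounding", "Token efficiency",
]


def evaluate_model(model, prompt, score_dimension):
    """Score all nine dimensions for one model in parallel."""
    with ThreadPoolExecutor(max_workers=len(DIMENSIONS)) as pool:
        futures = {
            dim: pool.submit(score_dimension, model, dim, prompt)
            for dim in DIMENSIONS
        }
        # .result() re-raises any error from a dimension, failing the model.
        return {dim: future.result() for dim, future in futures.items()}


def evaluate_models(models, prompt, score_dimension, max_parallel_models=2):
    """Evaluate several models, at most max_parallel_models at a time.

    Returns (results, failed) so the caller can retry only the failures.
    """
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_parallel_models) as pool:
        futures = {
            model: pool.submit(evaluate_model, model, prompt, score_dimension)
            for model in models
        }
        for model, future in futures.items():
            try:
                results[model] = future.result()
            except Exception:
                failed.append(model)
    return results, failed


# Stand-in scorer (an assumption): a real one would call an LLM.
flaky = {"model-b"}


def fake_score(model, dimension, prompt):
    if model in flaky:
        raise RuntimeError("simulated timeout")
    return 3  # raw 0-5 score


prompt = "Write a Python function to parse JSON"
results, failed = evaluate_models(["model-a", "model-b"], prompt, fake_score)

# Retry only the models that failed, once the simulated outage clears.
flaky.clear()
retried, still_failed = evaluate_models(failed, prompt, fake_score)
results.update(retried)
```

Returning the failed models separately, rather than raising on the first error, is what makes a targeted retry possible without re-running the models that already succeeded.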

The demo is straightforward: mediocre prompt in → multi-model scores → optimized version out → watch how much more consistent your outputs become across different models.

The 9 dimensions

The evaluation rubric covers:

Dimension                   Weight   What it measures
Actionability               2.0x     Clear, executable instructions
Completeness                2.0x     All necessary information provided
Execution robustness        1.5x     Completion criteria and error handling
Cross-agent consistency     1.0x     Consistent behavior across different LLMs
Chain-of-thought clarity    1.0x     Step-by-step reasoning scaffolding
Consistency                 0.75x    No internal contradictions
Parsability                 0.75x    Easy for LLMs to parse and follow
Context grounding           0.5x     Examples and context provided
Token efficiency            0.5x     Information density without bloat

These dimensions weren’t pulled from thin air. They emerged from research into prompt engineering best practices published by frontier model vendors, trial and error from my Code-Off experiments, and, importantly, asking the models themselves what would make a given prompt better. I focused especially on prompts meant for non-human, autonomous agent executors.

[Screenshot: Detailed evaluation results showing dimension scores and timing breakdown]
Each evaluation breaks down scores by dimension. This prompt scored 35/100, revealing weaknesses in Completeness (4/20) and Execution Robustness (3/15). The timing breakdown shows how long each dimension took to evaluate.

How scoring works

Each dimension is scored 0-5, but not all dimensions contribute equally to the final score. Weights amplify or dampen each dimension’s impact:

Actionability and Completeness carry a 2.0x weight. I believe they’re the most critical for coding tasks.
Execution Robustness carries 1.5x
Cross-Agent Consistency and Chain-of-Thought Clarity carry 1.0x (baseline)
Consistency and Parsability carry 0.75x
Context Grounding and Token Efficiency carry 0.5x. They’re important but secondary.

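The weights sum to 10, and each dimension is worth up to ten times its weight (Completeness tops out at 20 points, Execution Robustness at 15), so the weighted scores combine to a 100-point maximum. The total then maps to a letter grade (A through F).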
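Here is a small sketch of that arithmetic. It assumes each dimension's maximum is its weight times 10, which is inferred from the "4/20" and "3/15" figures in the screenshot above, not taken from the app's source. The function names are hypothetical, and the letter-grade cutoffs aren't stated in this post, so the sketch stops at the numeric total.

```python
# Weights as listed in the rubric table.
WEIGHTS = {
    "Actionability": 2.0,
    "Completeness": 2.0,
    "Execution robustness": 1.5,
    "Cross-agent consistency": 1.0,
    "Chain-of-thought clarity": 1.0,
    "Consistency": 0.75,
    "Parsability": 0.75,
    "Context grounding": 0.5,
    "Token efficiency": 0.5,
}

MAX_RAW = 5             # each dimension is scored 0-5
POINTS_PER_WEIGHT = 10  # assumption: a 2.0x dimension is worth 20 points


def dimension_points(name, raw):
    """Convert a raw 0-5 score into weighted points for one dimension."""
    if not 0 <= raw <= MAX_RAW:
        raise ValueError(f"raw score for {name} must be between 0 and {MAX_RAW}")
    return raw * WEIGHTS[name] * POINTS_PER_WEIGHT / MAX_RAW


def total_score(raw_scores):
    """Sum weighted points across all scored dimensions."""
    return sum(dimension_points(name, raw) for name, raw in raw_scores.items())


# A perfect prompt scores 5 everywhere and reaches the 100-point ceiling.
perfect = {name: 5 for name in WEIGHTS}
perfect_total = total_score(perfect)

# A raw score of 1 reproduces the screenshot figures:
# Completeness 4/20 and Execution Robustness 3/15.
completeness_points = dimension_points("Completeness", 1)
robustness_points = dimension_points("Execution robustness", 1)
```

Because the weights sum to 10, changing one weight changes the ceiling unless another weight moves in the opposite direction. That is worth keeping in mind when tuning them.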

A note on these choices: the dimensions themselves should be generally universal. Things like Actionability, Completeness, and Consistency matter for almost any prompt intended to produce reliable output. But the specific weights? Those are informed by my usage and my testing. I’ve tuned them for coding and building tasks where clear instructions and complete specifications matter most.

[Screenshot: Actionability dimension analysis with issues and recommendations]
Expanding any dimension reveals detailed analysis: what criteria were met or missed, specific issues identified with severity levels, and prioritized recommendations for improvement.

The scoring serves the same goal as the project as a whole: fewer hidden assumptions, clearer criteria, and more consistent output across models.

Why Snowflake Cortex?

I work at Snowflake, so easy access was part of it. But the real reasons are about eliminating variables and simplifying the technical implementation.

One API, many models

Prompt Forge uses Snowflake’s AI_COMPLETE function under the hood. That single function gives me access to models from Anthropic, OpenAI, Google, Meta, Mistral, and others, all through the same SQL interface. I don’t have to manage separate API keys, handle different authentication schemes, or write adapter code for each vendor’s SDK.

If I’d built this on raw APIs, I’d be maintaining:

Anthropic’s messages API with its specific request format
OpenAI’s chat/completions endpoint with its own conventions
Google’s Vertex AI with yet another authentication model
Rate limiting logic that differs per provider
Error handling that varies by vendor

Instead, I write one SQL call: SELECT SNOWFLAKE.CORTEX.AI_COMPLETE(model_name, prompt). The model differences are still there—that’s the point of comparison—but the API differences are abstracted away.
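As a rough illustration of what "one SQL call" buys you, the sketch below sends the same prompt to several models using a single statement shape. The `run_query` callable is injected and is an assumption, as are the placeholder model names. With the Snowflake Python connector it could be something like `lambda sql, params: cursor.execute(sql, params).fetchone()[0]`, but the example uses a stand-in so it runs without a Snowflake account.

```python
# The statement from the post, with bind placeholders for model and prompt.
SQL = "SELECT SNOWFLAKE.CORTEX.AI_COMPLETE(%s, %s)"


def complete_with_models(run_query, models, prompt):
    """Send one prompt to each model through the same SQL statement.

    run_query(sql, params) is a hypothetical callable that executes the
    statement and returns the completion text.
    """
    return {model: run_query(SQL, (model, prompt)) for model in models}


# Stand-in executor (an assumption) so the sketch runs offline.
def fake_run_query(sql, params):
    model, prompt = params
    return f"[{model}] response to: {prompt}"


# Placeholder names; real Cortex model identifiers will differ.
models = ["model-a", "model-b", "model-c"]
responses = complete_with_models(
    fake_run_query, models, "Write a Python function to parse JSON"
)
```

Only the first bind value changes between calls, so adding a model to the comparison set means adding a string to a list, not writing a new adapter.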

Isolating what matters

If you give the same prompt to Cursor and Claude Code, each agentic coding tool brings its own system prompts and behaviors that affect how the prompt is interpreted. The model isn’t just seeing your prompt; it’s seeing your prompt filtered through the tool’s context.

I wanted to eliminate that variability. Calling models through Cortex means the only variables that should affect the output are the selected model and the prompt itself, not the APIs, tooling, or hidden system prompts used to reach the model. When you’re trying to answer “is this prompt good?”, you need to isolate the variable that actually matters.

Model availability

Cortex provides access to a growing list of frontier and open-source models. As of this writing, that includes Claude Sonnet and Opus, GPT-4o and GPT-5, Gemini Pro, Llama, Mistral, DeepSeek, and others.

Snowflake has launch partnerships with frontier model providers that enable quick access to new releases. When Anthropic released Claude Opus 4.5, it was available through Cortex AI the same day. Same with OpenAI’s GPT-5.2. For a project like Prompt Forge, this means I can add new models to the comparison set as soon as they launch, with no new integration code required. Refresh the model list, and they appear in the picker.

Some frontier model vendors offer prompt optimization tools, but they’re tuned for their specific models. I wanted something that works across vendors without lock-in.

Building with Cortex Code

I built Prompt Forge with Cortex Code, Snowflake’s agentic coding CLI. That matters because this project was not just about evaluating prompts in theory. The coding workflow was part of the experiment.

The rubrics evolved while I was using Cortex Code on real work: researching, planning, refactoring, writing docs, and debugging code. I kept noticing the same pattern. Better prompts made the agent more predictable. Vague prompts created drift. Prompt Forge came out of that loop.

The first version came together in about a day. That would not have happened without an agentic coding workflow. Cortex Code made it practical to move quickly from “I keep seeing this problem” to a working app, then keep iterating on the UI, scoring logic, and model comparison flow over the next week.

Honest limitations

Before you dive in, a few caveats. The default rubrics in Prompt Forge are tuned for building and coding tasks—the kind of work where you’re asking an LLM to generate code, create documentation, write implementation plans, or refactor existing files.

You can use Prompt Forge to evaluate a general-purpose prompt that just answers questions. But the optimizations won’t be well-suited for those tasks. The scoring dimensions assume you want consistent, executable, robust output—not conversational engagement or creative exploration.

Who this is for

That said, Prompt Forge is intentionally educational. I built it because I realized that users with less experience wouldn’t know model variance was even a potential issue. I envision several audiences:

Developers who vibe-code daily and want to get better at prompting
Teams standardizing prompts across projects and contributors
Researchers benchmarking model consistency
Anyone curious about why their prompts work better with some models than others

Will this lower the bar for effective AI-assisted development? Maybe a little. But I think it mostly shifts the workload. Instead of learning prompt engineering through trial and error, users can focus on articulating what they want and what matters, then let the tool and LLMs generate an optimized version.

I see this as a time-saver as much as an efficacy improvement.

An invitation to contribute

Here’s where I need your help.

Are the weights right? I don’t know. They reflect my priorities based on my use cases. Your priorities might be different:

If you’re building multi-step agents, Chain-of-Thought Clarity might deserve a higher weight
If you’re generating structured data, Parsability might be more critical than I’ve rated it
If you’re working in a domain where examples are essential, Context Grounding might need a boost

The rubrics themselves are tunable. Each dimension is defined in a YAML file under config/rubrics/. You can adjust:

The criteria for each score level (0-5)
The weight/boost applied to the dimension
The specific language that guides the LLM’s evaluation
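To show the kind of thing those three knobs control, here is a sketch of a rubric as an in-memory structure with a basic sanity check. The field names (`weight`, `guidance`, `levels`), the `validate` helper, and the criteria text are all hypothetical. The actual YAML layout under config/rubrics/ isn't shown in this post, so treat this as a mental model, not the file format.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    """Hypothetical in-memory shape of one dimension's rubric."""
    name: str
    weight: float   # the weight/boost applied to the dimension
    guidance: str   # language that guides the LLM's evaluation
    levels: dict    # score level (0-5) -> criteria text


def validate(rubric):
    """Return a list of problems; an empty list means the rubric looks usable."""
    problems = []
    if rubric.weight <= 0:
        problems.append("weight must be positive")
    missing = [level for level in range(6) if level not in rubric.levels]
    if missing:
        problems.append(f"missing criteria for score levels {missing}")
    if not rubric.guidance.strip():
        problems.append("guidance text is empty")
    return problems


# Example tweak: boosting Parsability for structured-data work.
# The criteria strings are placeholders, not the shipped rubric text.
parsability = Rubric(
    name="Parsability",
    weight=1.5,  # raised from the 0.75 default
    guidance="Judge how easily an LLM can parse and follow the prompt.",
    levels={level: f"placeholder criteria for score {level}" for level in range(6)},
)
problems = validate(parsability)

# A rubric with gaps in its score levels is flagged.
incomplete = Rubric("Parsability", 1.5, "Judge structure.", {0: "none", 5: "ideal"})
incomplete_problems = validate(incomplete)
```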

Prompt evaluation is still messy. There’s no industry-standard rubric, and no peer-reviewed consensus on optimal weights. I’m sharing what worked for me, but I don’t think this is finished.

So give me feedback. What weights would you change, and why? What specific use case drove that recommendation? Feedback like “I work on X, and Y weight made more sense because Z” would help refine the defaults for everyone. File issues. Submit PRs. Help the community figure out what “good prompt hygiene” actually looks like across different contexts.

Getting started

Here’s what you get: a vague prompt goes in, and a structured specification comes out.

[Screenshot: Improved prompt output with structured sections and changes list]
The payoff: an optimized prompt with clear structure (Schema Information, Requirements, Rolling Average Calculation) and a detailed list of changes made. Notice how a vague prompt became a comprehensive specification with explicit schema, date handling rules, and success criteria.

If you want to try it yourself, you’ll need:

Prerequisites:

Node.js 24+ and npm 11+
Snowflake account with Cortex enabled
~/.snowflake/connections.toml configured with your credentials

Quick start:

npm install && npm run dev

Open http://localhost:3000. Enter a prompt, even something simple like “Write a Python function to parse JSON,” select a model, and click Evaluate. Typically in under 90 seconds, you’ll get a dimension-by-dimension breakdown with scores, specific recommendations, and an optimized version ready to copy.

For production deployment to Snowpark Container Services, multi-model comparison workflows, and full configuration details, check the project README and documentation.

This post is part of my ongoing exploration of AI-assisted development. Previous experiments: Four-Model Code-Off, Streamlit Rule Refactoring, README Improvements, and The Ultimate Pair Programmer.
