Model Testing Strategy¶
This page does not explain provider fields. It answers two questions:
- which model should you use on the left side
- how should you compare versions and models on the right side
First-time users: follow this order¶
- Pick one stable optimization model for the left side
- Pick the real target model you actually care about on the right side
- Compare prompt versions first, then compare model differences
Remember these 4 lines¶
- The left-side model is for analysis, optimization, and iteration. It does not prove real output quality.
- The right-side model is for real execution. It is the source of test evidence.
- If you want to compare prompt versions, keep model and input fixed.
- If you want to compare models, keep prompt and input fixed.
How to choose the left-side optimization model¶
The left-side model is responsible for:
- analyzing prompt structure
- generating improved drafts
- continuing iterations
- handling text-side analysis tasks inside the workspace
Prioritize:
- the model you know best
- a model that is stable at reasoning and rewriting
- a cost and speed level you can afford
It does not have to match your production model exactly.
How to choose the right-side test model¶
The right-side model is responsible for:
- real prompt execution
- producing results
- supplying evidence for Result Evaluation and Compare Evaluation
If you already know your target model, use it on the right side first.
If you only want the shortest advice¶
- Use one stable text model on the left
- Use the model you truly care about on the right
- Compare versions before you compare models
In text workspaces: compare versions or compare models first?¶
If you want to know whether a prompt change actually helped, compare versions first:
- keep the right-side input fixed
- keep the test model fixed
- compare
original / workspace / vN
If you want to know whether the same prompt is stable across models, compare models:
- keep the same prompt fixed
- keep the same test input fixed
- switch right-side models
The least helpful starting point is changing both at once:
- prompt version
- test model
If both change together, it becomes hard to tell what actually caused the change.
Variable and context workspaces need extra care¶
Variable Workspace¶
When comparing prompt versions, keep the variable values the same.
Context Workspace¶
When comparing one target message version, keep the full conversation context stable.
Image workspaces are naturally dual-model¶
Image workspaces differ from text workspaces because the left and right sides already use different model types.
Left side¶
The left side still uses a text model for:
- analyzing image prompts
- optimizing image prompts
- continuing iterations
Right side¶
The right side uses an image model for:
- generating the actual image
- comparing prompt versions through real images
- comparing style differences across image models
A better testing order for image workspaces¶
Text-to-image¶
- keep one image model fixed and compare
original / workspace / vN - find the more reliable prompt version
- then keep that version fixed and compare different image models
Image-to-image¶
Keep the same input image whenever possible. If the input image changes, your comparison baseline changes too.
Browser vs desktop¶
If you mainly connect to public HTTPS APIs, the browser version is usually enough.
If you mainly connect to local or internal services that are affected by browser restrictions, the desktop app is usually more reliable.
The simplest starting strategy¶
Text workspaces¶
- left side: one familiar optimization model
- right side: one target model you actually care about
- compare versions first, then models
Image workspaces¶
- left side: one stable text model
- right side: start with one main image model
- compare prompt versions first, then image models