Testing & Evaluation¶
This page explains one thing:
what the left side edits, and what the right side proves.
Once that boundary is clear, the buttons become much easier to understand.
First-time users: remember these 4 lines¶
- Left side edits prompts
- Right side runs real outputs
- Result Evaluation checks whether one output is good enough
- Compare Evaluation checks which output is better and why
Start with this action table¶
| Action | Where it happens | Main focus | Does it modify the left workspace? |
|---|---|---|---|
| Analysis | Left side | prompt structure, clarity, constraints | can suggest edits for the workspace |
| Optimize / Iterate | Left side | rewrite or improve the prompt directly | yes |
| Test | Right side | real execution output | no |
| Result Evaluation | one right-side column | whether this one execution reached the goal | can suggest edits for the workspace |
| Compare Evaluation | multiple right-side columns | differences across real outputs | can suggest edits for the workspace |
If you only want the shortest explanation, read these 3 lines¶
- Analysis does not use right-side test input. It inspects the prompt itself.
- Result Evaluation judges one real execution.
- Compare Evaluation compares multiple real executions.
Analysis vs evaluation¶
Left-side analysis¶
Left-side analysis asks: “Is this prompt written clearly enough?”
It focuses on:
- whether the goal is clear
- whether constraints are complete
- whether the wording is stable enough for the model to follow
- whether the structure is suitable for further optimization
Right-side evaluation¶
Right-side evaluation asks: “How good was this real execution?”
It focuses on:
- whether the input and output match
- whether the output completed the task
- which constraints were satisfied or violated
- what the current workspace prompt still lacks
What left-side analysis does not read¶
To avoid semantic confusion, left-side analysis does not treat right-side test input as evidence.
That means:
- in System Prompt Workspace, left-side analysis does not read the right-side test message
- in Variable Workspace, left-side analysis does not read the current variable values
- in Context Workspace, left-side analysis does not use one previous right-side execution as a premise
If you want to judge whether a prompt actually worked on a real result, use right-side evaluation.
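One way to picture this boundary is as a rule about what data reaches analysis. The sketch below is illustrative only: `WorkspaceState`, `build_analysis_input`, the field names, and the `{{article}}` placeholder syntax are all hypothetical and not taken from the product.

```python
from dataclasses import dataclass, field


@dataclass
class WorkspaceState:
    """Hypothetical snapshot of one workspace (all names are illustrative)."""
    prompt: str                       # left side: the editable prompt
    test_message: str = ""            # right side: test message
    variable_values: dict = field(default_factory=dict)  # right side: variable form
    last_execution_output: str = ""   # right side: a previous real output


def build_analysis_input(state: WorkspaceState) -> dict:
    """Return what left-side analysis looks at: the prompt text only."""
    return {"prompt": state.prompt}


state = WorkspaceState(
    prompt="Summarize {{article}} in three bullets.",
    test_message="Here is today's article...",
    variable_values={"article": "..."},
    last_execution_output="- point one",
)
analysis_input = build_analysis_input(state)
print(sorted(analysis_input))  # only the prompt key is passed along
```

Right-side fields exist on the snapshot but never reach analysis, which is the separation described above.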
What the right side is testing in each workspace¶
| Workspace | Main right-side test input | Most important evidence during evaluation |
|---|---|---|
| System Prompt Workspace | one test message | system prompt + test message + output |
| User Prompt Workspace | usually no extra input | executed prompt + output |
| Variable Workspace | shared variable form | executed prompt + variable values + output |
| Context Workspace | full conversation + shared variables + optional tools | full execution snapshot + output |
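The table can also be read as a checklist of evidence per workspace. The following is a minimal sketch under that reading; the dictionary keys, `EVIDENCE_BY_WORKSPACE`, and `missing_evidence` are assumed names for illustration, not part of the product.

```python
# Hypothetical mapping that mirrors the table: workspace -> evidence used in evaluation.
EVIDENCE_BY_WORKSPACE = {
    "system_prompt": ["system prompt", "test message", "output"],
    "user_prompt": ["executed prompt", "output"],
    "variable": ["executed prompt", "variable values", "output"],
    "context": ["full execution snapshot", "output"],
}


def missing_evidence(workspace: str, provided: set) -> list:
    """List the evidence items from the table that were not supplied."""
    required = EVIDENCE_BY_WORKSPACE[workspace]
    return [item for item in required if item not in provided]


print(missing_evidence("variable", {"executed prompt", "output"}))
# prints ['variable values']
```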
Result Evaluation vs Compare Evaluation¶
Use Result Evaluation when you want to judge one column on its own.
Typical questions:
- Did this column drift?
- Why did it add extra explanation?
- Why did it miss the format?
- Does this one version already have obvious prompt issues?
Use Compare Evaluation when you already have two or more columns and want to compare the differences.
Typical comparisons:
- original vs workspace
- workspace vs v2
- same prompt on different models
- different saved versions on the same model
What Compare Evaluation is actually comparing¶
Compare Evaluation compares real output evidence, not version labels.
- Same model, different prompt versions: did the prompt change actually change the result?
- Same prompt, different models: which model interprets the prompt more reliably?
- Workspace draft vs saved versions: is the current draft actually worth saving?
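The three cases above differ only in which dimension varies across columns. A rough sketch of that distinction follows; the column shape (`model` and `version` keys) and `describe_comparison` are assumptions made for illustration.

```python
def describe_comparison(columns: list) -> str:
    """Name which dimension varies across right-side columns (hypothetical helper)."""
    models = {column["model"] for column in columns}
    versions = {column["version"] for column in columns}
    if len(models) == 1 and len(versions) > 1:
        return "same model, different prompt versions"
    if len(versions) == 1 and len(models) > 1:
        return "same prompt, different models"
    if len(models) > 1 and len(versions) > 1:
        return "both model and prompt version differ"
    return "identical setup"


columns = [
    {"model": "model-a", "version": "workspace"},
    {"model": "model-a", "version": "v2"},
]
print(describe_comparison(columns))
# prints same model, different prompt versions
```

In every case the comparison itself still rests on the real outputs of those columns; the labels only tell you which question the differences can answer.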
What “workspace” means¶
The Workspace option on the right means the current editable content on the left.
It is not the same as “latest saved version”.
Think of it like this:
- original: your initial input
- v1 / v2 / v3: saved versions
- workspace: what you are editing right now, even if it is not saved yet
What Focus Brief is for¶
Evaluation dialogs can include an optional Focus Brief.
If you provide something like:
- “Do not add explanation”
- “The tone is too strong”
- “Why is model A much worse than model B?”
- “Tool arguments keep missing required fields”
the evaluation will prioritize that concern instead of returning a generic summary.
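As a rough mental model, the Focus Brief acts like an optional field that switches the evaluation from a generic summary to a targeted one. The sketch below assumes a hypothetical request shape; `build_evaluation_request`, `mode`, and `focus_brief` are illustrative names only.

```python
from typing import Optional


def build_evaluation_request(output: str, focus_brief: Optional[str] = None) -> dict:
    """Build a hypothetical evaluation request; an empty brief means a generic summary."""
    request = {"output": output, "mode": "generic_summary"}
    brief = (focus_brief or "").strip()
    if brief:
        # A non-empty brief makes the evaluation prioritize that concern.
        request["mode"] = "focused"
        request["focus_brief"] = brief
    return request


generic = build_evaluation_request("Here is the answer, with some explanation.")
focused = build_evaluation_request(
    "Here is the answer, with some explanation.",
    focus_brief="Do not add explanation",
)
print(generic["mode"], focused["mode"])
# prints generic_summary focused
```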
What happens after you apply evaluation suggestions¶
Evaluation suggestions are not bound to one version branch.
The rule is:
- try to apply them to the current left workspace
- if the workspace has changed too much, the old evaluation becomes stale
- stale does not mean deleted; it means “this conclusion belongs to older content”
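The staleness rule can be sketched as a similarity check between the content an evaluation was run on and the current workspace. This page does not define what "changed too much" means, so the `difflib` similarity ratio and the `0.6` cutoff below are assumptions chosen purely for illustration, as are the `Evaluation` class and `is_stale`.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

STALE_THRESHOLD = 0.6  # assumed cutoff for illustration, not a documented value


@dataclass
class Evaluation:
    """Hypothetical record of one evaluation and the prompt it was based on."""
    evaluated_prompt: str
    suggestion: str


def is_stale(evaluation: Evaluation, current_workspace: str,
             threshold: float = STALE_THRESHOLD) -> bool:
    """Flag the evaluation as stale when the workspace has drifted too far from it."""
    similarity = SequenceMatcher(
        None, evaluation.evaluated_prompt, current_workspace
    ).ratio()
    return similarity < threshold


evaluation = Evaluation(
    evaluated_prompt="Summarize the article in three bullets.",
    suggestion="State the output format explicitly.",
)
# Unchanged workspace: the evaluation still applies.
print(is_stale(evaluation, "Summarize the article in three bullets."))  # prints False
# The evaluation object is kept either way; staleness is only a flag, not a deletion.
```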
Recommended first workflow¶
- Build one testable workspace draft on the left
- Run 2-4 real columns on the right
- Start with Result Evaluation to catch obvious single-column issues
- Then run Compare Evaluation to summarize version or model differences
- Apply the valuable suggestions back to the left workspace
- Save a new version only when the changes are worth keeping
Common mistakes¶
- Mistake 1: left-side analysis should read right-side test input
  No. Analysis focuses on the prompt itself.
- Mistake 2: right-side evaluation always knows one historical branch
  No. The current design is about improving the current editable workspace, not maintaining strict branch binding.
- Mistake 3: Compare Evaluation only compares A/B labels
  No. It compares difference patterns across real outputs.