Testing & Evaluation

This page explains one thing:

what the left side edits, and what the right side proves.

Once that boundary is clear, the buttons become much easier to understand.

First-time users: remember these 4 lines

Left side edits prompts
Right side runs real outputs
Result Evaluation checks whether one output is good enough
Compare Evaluation checks which output is better and why

Start with this action table

Action	Where it happens	Main focus	Does it modify the left workspace?
Analysis	Left side	prompt structure, clarity, constraints	not directly; it can suggest edits for the workspace
Optimize / Iterate	Left side	rewrite or improve the prompt directly	yes
Test	Right side	real execution output	no
Result Evaluation	one right-side column	whether this one execution reached the goal	not directly; it can suggest edits for the workspace
Compare Evaluation	multiple right-side columns	differences across real outputs	not directly; it can suggest edits for the workspace

If you only want the shortest explanation, read these 3 lines

Analysis does not use right-side test input. It inspects the prompt itself.
Result Evaluation judges one real execution.
Compare Evaluation compares multiple real executions.

Analysis vs evaluation

Left-side analysis

Left-side analysis asks: “Is this prompt written clearly enough?”

It focuses on:

whether the goal is clear
whether constraints are complete
whether the wording is stable enough for the model to follow
whether the structure is suitable for further optimization

Right-side evaluation

Right-side evaluation asks: “How good was this real execution?”

It focuses on:

whether the output corresponds to the input it was given
whether the output completed the task
which constraints were satisfied or violated
what the current workspace prompt still lacks

What left-side analysis does not read

To avoid semantic confusion, left-side analysis does not treat right-side test input as evidence.

That means:

in System Prompt Workspace, left-side analysis does not read the right-side test message
in Variable Workspace, left-side analysis does not read the current variable values
in Context Workspace, left-side analysis does not use one previous right-side execution as a premise

If you want to judge whether a prompt actually worked on a real result, use right-side evaluation.
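
The boundary above can be pictured as two separate input builders. This is only a minimal sketch, not the tool's actual code: `Workspace`, `Execution`, `build_analysis_input`, and `build_evaluation_input` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workspace:
    """Hypothetical stand-in for the left-side editable prompt."""
    prompt: str


@dataclass
class Execution:
    """Hypothetical stand-in for one real right-side run."""
    test_input: str
    output: str


def build_analysis_input(workspace: Workspace) -> dict:
    # Analysis inspects the prompt itself; nothing from the right side is read.
    return {"prompt": workspace.prompt}


def build_evaluation_input(workspace: Workspace, run: Execution) -> dict:
    # Evaluation judges a real execution, so input and output are both evidence.
    return {
        "prompt": workspace.prompt,
        "test_input": run.test_input,
        "output": run.output,
    }
```

The point of keeping the two builders separate is that analysis can never accidentally treat a single test message or variable value as a premise.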

What the right side is testing in each workspace

Workspace	Main right-side test input	Most important evidence during evaluation
System Prompt Workspace	one test message	system prompt + test message + output
User Prompt Workspace	usually no extra input	executed prompt + output
Variable Workspace	shared variable form	executed prompt + variable values + output
Context Workspace	full conversation + shared variables + optional tools	full execution snapshot + output
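
The evidence column of the table can be read as a small lookup. The sketch below is illustrative only: the keys, the `EVIDENCE_BY_WORKSPACE` mapping, and the `missing_evidence` helper are assumptions, not names from the tool.

```python
# Hypothetical mapping that mirrors the "Most important evidence" column above.
EVIDENCE_BY_WORKSPACE = {
    "system_prompt": ("system prompt", "test message", "output"),
    "user_prompt": ("executed prompt", "output"),
    "variable": ("executed prompt", "variable values", "output"),
    "context": ("full execution snapshot", "output"),
}


def missing_evidence(workspace_kind: str, provided: set) -> list:
    """Return the evidence items an evaluation would still need, in table order."""
    required = EVIDENCE_BY_WORKSPACE[workspace_kind]
    return [item for item in required if item not in provided]
```

For example, `missing_evidence("variable", {"executed prompt", "output"})` reports that the variable values are still missing.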

Result Evaluation vs Compare Evaluation

Use Result Evaluation when you want to judge one column on its own.

Typical questions:

Did this column drift?
Why did it add extra explanation?
Why did it miss the format?
Does this one version already have obvious prompt issues?

Use Compare Evaluation when you already have two or more columns and want to compare the differences.

Typical comparisons:

original vs workspace
workspace vs v2
same prompt on different models
different saved versions on the same model

What Compare Evaluation is actually comparing

Compare Evaluation compares real output evidence, not version labels.

Same model, different prompt versions: did the prompt change actually change the result?
Same prompt, different models: which model interprets the prompt more reliably?
Workspace draft vs saved versions: is the current draft actually worth saving?
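
One way to picture these cases is to look at what actually varies across the selected columns. The sketch below is a hedged illustration: `Column` and `comparison_axis` are hypothetical names, and the returned labels are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical right-side column: which prompt ran, on which model."""
    prompt_version: str  # e.g. "original", "v1", "workspace"
    model: str
    output: str


def comparison_axis(columns: list) -> str:
    """Name what differs across columns, so the comparison question is explicit."""
    if len(columns) < 2:
        raise ValueError("Compare Evaluation needs at least two columns")
    versions = {c.prompt_version for c in columns}
    models = {c.model for c in columns}
    if len(versions) > 1 and len(models) == 1:
        return "prompt change"      # did the prompt change alter the result?
    if len(versions) == 1 and len(models) > 1:
        return "model difference"   # which model follows the prompt more reliably?
    if len(versions) > 1 and len(models) > 1:
        return "mixed"              # both vary; conclusions are harder to attribute
    return "repeat run"             # same prompt, same model
```

Whatever the axis, the judgment still rests on the `output` of each column, not on its label.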

What “workspace” means

The Workspace option on the right means the current editable content on the left.

It is not the same as “latest saved version”.

Think of it like this:

original: your initial input
v1 / v2 / v3: saved versions
workspace: what you are editing right now, even if it is not saved yet
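
The distinction can be sketched as a lookup that treats the workspace as its own source rather than an alias for the newest saved version. This is an illustrative assumption about the behavior described above; `resolve_prompt` and its parameters are hypothetical.

```python
def resolve_prompt(selection: str, original: str,
                   saved_versions: dict, workspace_draft: str) -> str:
    """Return the prompt text a right-side column would run for a selection."""
    if selection == "workspace":
        # The live, possibly unsaved, content on the left.
        return workspace_draft
    if selection == "original":
        return original
    # Anything else is expected to be a saved version label such as "v2".
    return saved_versions[selection]
```

With saved versions `v1` and `v2` and an unsaved draft, selecting `workspace` returns the draft, not the text of `v2`.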

What Focus Brief is for

Evaluation dialogs can include an optional Focus Brief.

If you provide something like:

“Do not add explanation”
“The tone is too strong”
“Why is model A much worse than model B?”
“Tool arguments keep missing required fields”

the evaluation will prioritize that concern instead of returning a generic summary.
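
As a rough sketch of how an optional brief might travel with an evaluation request (the request shape and the name `build_evaluation_request` are assumptions, not the tool's real format):

```python
from typing import Optional


def build_evaluation_request(evidence: dict,
                             focus_brief: Optional[str] = None) -> dict:
    """Attach a Focus Brief only when the user actually wrote one."""
    request = {"evidence": evidence}
    brief = (focus_brief or "").strip()
    if brief:
        # A non-empty brief asks the evaluation to prioritize this concern.
        request["focus"] = brief
    return request
```

Leaving the brief empty produces a request without a `focus` field, which corresponds to the generic summary case.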

What happens after you apply evaluation suggestions

Evaluation suggestions are not bound to one version branch.

The rule is:

try to apply them to the current left workspace
if the workspace has changed too much, the old evaluation becomes stale
stale does not mean deleted; it means “this conclusion belongs to older content”
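
The staleness rule can be illustrated with a simple text-similarity check. This is a sketch under loud assumptions: the page does not say how "changed too much" is measured, so the use of `difflib.SequenceMatcher`, the `STALE_THRESHOLD` value, and the `Evaluation` type are all hypothetical.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher

# Assumed cutoff for illustration; the real tool may use a different criterion.
STALE_THRESHOLD = 0.6


@dataclass
class Evaluation:
    """Hypothetical record of an evaluation and the prompt it was based on."""
    evaluated_prompt: str
    suggestions: list = field(default_factory=list)


def is_stale(evaluation: Evaluation, current_workspace: str,
             threshold: float = STALE_THRESHOLD) -> bool:
    """True when the workspace has drifted too far from the evaluated prompt."""
    similarity = SequenceMatcher(
        None, evaluation.evaluated_prompt, current_workspace
    ).ratio()
    return similarity < threshold
```

A stale evaluation is kept rather than deleted in this sketch; the flag only marks that its conclusions describe older content.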

Recommended first workflow

Build one testable workspace draft on the left
Run 2-4 real columns on the right
Start with Result Evaluation to catch obvious single-column issues
Then run Compare Evaluation to summarize version or model differences
Apply the valuable suggestions back to the left workspace
Save a new version only when the changes are worth keeping

Common mistakes

Mistake 1: left-side analysis should read right-side test input
No. Analysis focuses on the prompt itself.
Mistake 2: right-side evaluation is always bound to one historical branch
No. The current design aims to improve the current editable workspace, not to maintain strict branch binding.
Mistake 3: Compare Evaluation only compares A/B labels
No. It compares difference patterns across real outputs.