Use workflow-backed metrics
Goal: configure judge-backed metrics and inspect their execution artifacts.
When to use this:
Use this guide when deterministic pure scoring is not sufficient and Themis should own an evaluation workflow.
Procedure
Use this task map when you need to confirm the minimum pieces required for judge-backed scoring.
flowchart LR
A["Reduced candidate"] --> B["Parser"]
B --> C["Workflow-backed metric"]
C --> D["Judge model(s)"]
C --> E["Workflow overrides"]
C --> F["Persisted evaluation executions"]
The runtime builds a workflow around the metric, so the important setup work is choosing the right subject, judge, and overrides.
Provide:
- one or more workflow-backed metrics
- parsers for the reduced candidate
- judge models
- optional prompt_spec for judge prompt instructions or generic prompt blocks
- any workflow overrides such as a rubric
from __future__ import annotations

from themis import Experiment, InMemoryRunStore, get_evaluation_execution
from themis.core.config import EvaluationConfig, GenerationConfig, StorageConfig
from themis.core.models import Case, Dataset


def run_example() -> dict[str, object]:
    """Execute builtin workflow-backed metrics together."""
    store = InMemoryRunStore()
    experiment = Experiment(
        generation=GenerationConfig(
            generator="builtin/demo_generator",
            candidate_policy={"num_samples": 2},
            reducer="builtin/majority_vote",
        ),
        evaluation=EvaluationConfig(
            metrics=[
                "builtin/llm_rubric",
                "builtin/panel_of_judges",
                "builtin/majority_vote_judge",
                "builtin/pairwise_judge",
            ],
            parsers=["builtin/json_identity"],
            judge_models=["builtin/demo_judge", "builtin/demo_judge"],
            workflow_overrides={"rubric": "pass if the answer is correct"},
        ),
        storage=StorageConfig(store="memory"),
        datasets=[
            Dataset(
                dataset_id="sample",
                cases=[
                    Case(
                        case_id="case-1",
                        input={"question": "2+2"},
                        expected_output={"answer": "4"},
                    )
                ],
            )
        ],
        seeds=[7, 11],
    )
    result = experiment.run(store=store)
    execution = get_evaluation_execution(
        store, result.run_id, "case-1", "builtin/llm_rubric"
    )
    return {
        "run_id": result.run_id,
        "status": result.status.value,
        "score_ids": [score.metric_id for score in result.cases[0].scores],
        "judge_calls": 0 if execution is None else len(execution.judge_calls),
    }


if __name__ == "__main__":
    print(run_example())
Workflow-backed metrics persist judge calls, prompts, responses, and aggregation outputs so scores remain inspectable later.
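One way to look at those persisted artifacts is a small helper that walks the `judge_calls` of an execution returned by `get_evaluation_execution`. The sketch below is illustrative only: `judge_calls` is the only attribute confirmed by the example above, while the per-call attribute names `prompt` and `response` are assumptions, so the helper reads them with `getattr` and falls back to a placeholder. The `SimpleNamespace` object is a stand-in used to exercise the helper, not a Themis type.

```python
from types import SimpleNamespace


def summarize_judge_calls(execution, max_chars=60):
    """Return one short line per judge call on an execution-like object.

    `execution` may be None (no persisted execution was found), which
    mirrors the None check in the run example.
    """
    if execution is None:
        return []
    lines = []
    for index, call in enumerate(execution.judge_calls):
        # `prompt` and `response` are assumed attribute names (hypothetical).
        prompt = str(getattr(call, "prompt", "<unknown>"))[:max_chars]
        response = str(getattr(call, "response", "<unknown>"))[:max_chars]
        lines.append(
            f"judge call {index}: prompt={prompt!r} response={response!r}"
        )
    return lines


# Stand-in execution, used only to demonstrate the helper.
fake_execution = SimpleNamespace(
    judge_calls=[
        SimpleNamespace(
            prompt="Rubric: pass if the answer is correct", response="pass"
        ),
        SimpleNamespace(prompt="Rubric: pass if the answer is correct"),
    ]
)

for line in summarize_judge_calls(fake_execution):
    print(line)
```

Against a real run, the same helper could be called with the `execution` object from the example, provided the assumed attribute names are adjusted to whatever the persisted judge-call records actually expose.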
Variants
| Variant | Best when | Tradeoff | Related APIs / commands |
|---|---|---|---|
| Rubric scoring | One judge and one rubric are enough | Less resilient to judge variance than panel-style setups | builtin/llm_rubric |
| Multi-judge averaging | Multiple judges should score the same output and aggregate | Higher latency and judge-model cost | builtin/panel_of_judges |
| Majority-vote judgment | The output should collapse to a categorical majority decision | Loses scalar nuance compared with averaging | builtin/majority_vote_judge |
| Pairwise selection | Two candidates should be compared directly | Not a drop-in replacement for single-output scoring | builtin/pairwise_judge |
| Heterogeneous multi-judge orchestration | Different prompts or parsing logic should run over the same response | Requires a custom workflow metric in Python | Custom workflow metric, ctx.prompt_spec |
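The tradeoff between averaging and majority voting in the table can be made concrete with a few lines of plain Python. This is a sketch of the two aggregation ideas in general, not a reproduction of how `builtin/panel_of_judges` or `builtin/majority_vote_judge` are implemented; the judge scores and the 0.5 pass threshold are made up for illustration.

```python
from collections import Counter
from statistics import mean


def average_scores(scores):
    """Panel-style aggregation: keep the scalar mean of all judge scores."""
    return mean(scores)


def majority_label(labels):
    """Majority-vote aggregation: return the most common label.

    Ties resolve to the label seen first, because Counter.most_common
    orders equal counts by first occurrence.
    """
    return Counter(labels).most_common(1)[0][0]


# Hypothetical scores from three judges for the same output.
judge_scores = [0.9, 0.6, 0.4]

# Assumed threshold for turning a scalar score into a categorical verdict.
judge_labels = ["pass" if score >= 0.5 else "fail" for score in judge_scores]

print(average_scores(judge_scores))
print(majority_label(judge_labels))
```

The average keeps the information that one judge disagreed strongly, while the majority vote reports only the winning category, which is the loss of scalar nuance noted in the table.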
Expected result
The run should persist evaluation executions with judge calls, prompts, responses, and final scores or aggregation output.
Builtin judge workflows consume PromptSpec.blocks directly. Custom workflow metrics should read ctx.prompt_spec themselves when they need benchmark-derived context, retrieved context, reference judgments, or any other prompt material that should travel with the experiment identity.
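As a rough illustration of a custom metric reading `ctx.prompt_spec` itself, the sketch below assembles a judge prompt from a prompt spec when one is present. Everything here is a stand-in: `StandInPromptSpec`, `StandInContext`, the `instructions` field, the string-valued `blocks`, and `build_judge_prompt` are hypothetical names chosen for this example, and the real Themis types and the custom-metric interface may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StandInPromptSpec:
    """Hypothetical stand-in; the real PromptSpec may be shaped differently."""

    instructions: str = ""
    blocks: list = field(default_factory=list)


@dataclass
class StandInContext:
    """Hypothetical stand-in for the context handed to a custom metric."""

    prompt_spec: Optional[StandInPromptSpec] = None


def build_judge_prompt(ctx, candidate):
    """Combine prompt-spec material with the candidate into one prompt."""
    parts = []
    spec = ctx.prompt_spec
    if spec is not None:
        if spec.instructions:
            parts.append(spec.instructions)
        # Blocks are treated as plain text here (an assumption).
        parts.extend(str(block) for block in spec.blocks)
    parts.append(f"Candidate answer: {candidate}")
    return "\n\n".join(parts)


ctx = StandInContext(
    prompt_spec=StandInPromptSpec(
        instructions="Grade strictly against the reference.",
        blocks=["Reference judgment: the answer is 4."],
    )
)
print(build_judge_prompt(ctx, "4"))
```

Reading the spec inside the metric keeps the extra prompt material attached to the experiment rather than hard-coded in the metric, which is the motivation the paragraph above gives.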