Use workflow-backed metrics¶

Goal: configure judge-backed metrics and inspect their execution artifacts.

When to use this:

Use this guide when deterministic pure scoring is not sufficient and Themis should own an evaluation workflow.

Procedure¶

Use this task map when you need to confirm the minimum pieces required for judge-backed scoring.

flowchart LR
    A["Reduced candidate"] --> B["Parser"]
    B --> C["Workflow-backed metric"]
    C --> D["Judge model(s)"]
    C --> E["Workflow overrides"]
    C --> F["Persisted evaluation executions"]

The runtime builds a workflow around the metric, so the important setup work is choosing the right subject, judge, and overrides.

Provide:

- one or more workflow-backed metrics
- parsers for the reduced candidate
- judge models
- an optional `prompt_spec` for judge prompt instructions or generic prompt blocks
- any workflow overrides, such as a rubric

from __future__ import annotations

from themis import Experiment, InMemoryRunStore, get_evaluation_execution
from themis.core.config import EvaluationConfig, GenerationConfig, StorageConfig
from themis.core.models import Case, Dataset


def run_example() -> dict[str, object]:
    """Execute builtin workflow-backed metrics together."""

    store = InMemoryRunStore()
    experiment = Experiment(
        generation=GenerationConfig(
            generator="builtin/demo_generator",
            candidate_policy={"num_samples": 2},
            reducer="builtin/majority_vote",
        ),
        evaluation=EvaluationConfig(
            metrics=[
                "builtin/llm_rubric",
                "builtin/panel_of_judges",
                "builtin/majority_vote_judge",
                "builtin/pairwise_judge",
            ],
            parsers=["builtin/json_identity"],
            judge_models=["builtin/demo_judge", "builtin/demo_judge"],
            workflow_overrides={"rubric": "pass if the answer is correct"},
        ),
        storage=StorageConfig(store="memory"),
        datasets=[
            Dataset(
                dataset_id="sample",
                cases=[
                    Case(
                        case_id="case-1",
                        input={"question": "2+2"},
                        expected_output={"answer": "4"},
                    )
                ],
            )
        ],
        seeds=[7, 11],
    )
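    # Run against the in-memory store so persisted executions can be read back below.
    result = experiment.run(store=store)
    # Look up the persisted execution for one case/metric pair; this may be None.
    execution = get_evaluation_execution(
        store, result.run_id, "case-1", "builtin/llm_rubric"
    )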
    return {
        "run_id": result.run_id,
        "status": result.status.value,
        "score_ids": [score.metric_id for score in result.cases[0].scores],
        "judge_calls": 0 if execution is None else len(execution.judge_calls),
    }


if __name__ == "__main__":
    print(run_example())

Workflow-backed metrics persist judge calls, prompts, responses, and aggregation outputs so scores remain inspectable later.
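
A small helper can make the inspection step defensive. The sketch below relies only on what the example above already uses: the lookup may return `None`, and a found execution exposes `judge_calls`. The name `summarize_execution` is hypothetical, and the `SimpleNamespace` object is a stand-in so the snippet runs without a store; it is not a Themis type.

```python
from __future__ import annotations

from types import SimpleNamespace


def summarize_execution(execution: object | None) -> dict[str, object]:
    """Summarize one persisted evaluation execution, or report that none exists."""
    if execution is None:
        # No execution was persisted for this run/case/metric combination.
        return {"found": False, "judge_calls": 0}
    # `judge_calls` is the only attribute assumed here (it appears in the example above).
    calls = list(getattr(execution, "judge_calls", []))
    return {"found": True, "judge_calls": len(calls)}


# Stand-in object for illustration only; in a real run, pass the value
# returned by get_evaluation_execution(...).
stand_in = SimpleNamespace(judge_calls=["call-1", "call-2"])
missing_summary = summarize_execution(None)
found_summary = summarize_execution(stand_in)
```

Returning a plain dictionary keeps the summary easy to print or log next to the run ID.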

Variants¶

| Variant | Best when | Tradeoff | Related APIs / commands |
| --- | --- | --- | --- |
| Rubric scoring | One judge and one rubric are enough | Less resilient to judge variance than panel-style setups | `builtin/llm_rubric` |
| Multi-judge averaging | Multiple judges should score the same output and aggregate | Higher latency and judge-model cost | `builtin/panel_of_judges` |
| Majority-vote judgment | The output should collapse to a categorical majority decision | Loses scalar nuance compared with averaging | `builtin/majority_vote_judge` |
| Pairwise selection | Two candidates should be compared directly | Not a drop-in replacement for single-output scoring | `builtin/pairwise_judge` |
| Heterogeneous multi-judge orchestration | Different prompts or parsing logic should run over the same response | Requires a custom workflow metric in Python | Custom workflow metric, `ctx.prompt_spec` |
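
The tradeoff between averaging and majority voting can be seen with a few numbers. The following is a minimal, standalone sketch of the two aggregation styles; it does not reproduce how the builtin metrics are implemented. The function names, the judge scores, and the 0.5 threshold are illustrative assumptions.

```python
from __future__ import annotations

from collections import Counter
from statistics import mean


def average_scores(scores: list[float]) -> float:
    """Panel-style aggregation: keep the scalar signal by averaging."""
    return mean(scores)


def majority_label(labels: list[str]) -> str | None:
    """Majority-vote aggregation: most common label, or None when the top labels tie."""
    ranked = Counter(labels).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # the top labels tie, so there is no single winner
    return ranked[0][0]


# Hypothetical scores from three judges for the same output.
judge_scores = [0.9, 0.6, 0.4]
# Illustrative threshold for turning a score into a categorical verdict.
judge_labels = ["pass" if s >= 0.5 else "fail" for s in judge_scores]

panel_score = average_scores(judge_scores)  # keeps how strongly the judges agreed
verdict = majority_label(judge_labels)      # keeps only the winning category
```

Here the majority verdict is "pass", while the average stays well below the top judge's score. The categorical result hides that one judge disagreed, which is the nuance the table refers to.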

Expected result¶

The run should persist evaluation executions with judge calls, prompts, responses, and final scores or aggregation output.

Builtin judge workflows consume `PromptSpec.blocks` directly. Custom workflow metrics should read `ctx.prompt_spec` themselves when they need benchmark-derived context, retrieved context, reference judgments, or any other prompt material that should travel with the experiment identity.
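
To illustrate the custom-metric case, the sketch below shows one way a metric might fold prompt-spec blocks into a judge prompt. It uses stand-in dataclasses rather than Themis types: only the names `prompt_spec` and `blocks` come from this guide, while the class names, the assumption that blocks are plain strings, and `build_judge_prompt` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StandInPromptSpec:
    """Stand-in for a prompt spec; only `blocks` is named in this guide."""

    # Assumption: blocks are plain strings. The real block shape may differ.
    blocks: list[str] = field(default_factory=list)


@dataclass
class StandInContext:
    """Stand-in for the context object a custom workflow metric receives."""

    prompt_spec: StandInPromptSpec | None = None


def build_judge_prompt(ctx: StandInContext, candidate: str, rubric: str) -> str:
    """Assemble a judge prompt from prompt-spec blocks, the rubric, and the candidate."""
    # Fall back to no extra blocks when the experiment carries no prompt spec.
    blocks = ctx.prompt_spec.blocks if ctx.prompt_spec is not None else []
    parts = [*blocks, f"Rubric: {rubric}", f"Candidate answer: {candidate}"]
    return "\n\n".join(parts)


ctx = StandInContext(
    prompt_spec=StandInPromptSpec(blocks=["Reference context: 2+2 equals 4."])
)
prompt = build_judge_prompt(ctx, candidate="4", rubric="pass if the answer is correct")
```

Reading the blocks from the context, instead of hardcoding them in the metric, keeps the prompt material tied to the experiment configuration.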
