
Use workflow-backed metrics

Goal: configure judge-backed metrics and inspect their execution artifacts.

When to use this:

Use this guide when deterministic pure scoring is not sufficient and Themis should own an evaluation workflow.

Procedure

Use this task map when you need to confirm the minimum pieces required for judge-backed scoring.

flowchart LR
    A["Reduced candidate"] --> B["Parser"]
    B --> C["Workflow-backed metric"]
    C --> D["Judge model(s)"]
    C --> E["Workflow overrides"]
    C --> F["Persisted evaluation executions"]

The runtime builds a workflow around the metric, so the important setup work is choosing the right subject, judge, and overrides.

Provide:

  • one or more workflow-backed metrics
  • parsers for the reduced candidate
  • judge models
  • optional prompt_spec for judge prompt instructions or generic prompt blocks
  • any workflow overrides such as a rubric

The following example runs the four builtin workflow-backed metrics together:

from __future__ import annotations

from themis import Experiment, InMemoryRunStore, get_evaluation_execution
from themis.core.config import EvaluationConfig, GenerationConfig, StorageConfig
from themis.core.models import Case, Dataset


def run_example() -> dict[str, object]:
    """Execute builtin workflow-backed metrics together."""

    store = InMemoryRunStore()
    experiment = Experiment(
        generation=GenerationConfig(
            generator="builtin/demo_generator",
            candidate_policy={"num_samples": 2},
            reducer="builtin/majority_vote",
        ),
        evaluation=EvaluationConfig(
            metrics=[
                "builtin/llm_rubric",
                "builtin/panel_of_judges",
                "builtin/majority_vote_judge",
                "builtin/pairwise_judge",
            ],
            parsers=["builtin/json_identity"],
            judge_models=["builtin/demo_judge", "builtin/demo_judge"],
            workflow_overrides={"rubric": "pass if the answer is correct"},
        ),
        storage=StorageConfig(store="memory"),
        datasets=[
            Dataset(
                dataset_id="sample",
                cases=[
                    Case(
                        case_id="case-1",
                        input={"question": "2+2"},
                        expected_output={"answer": "4"},
                    )
                ],
            )
        ],
        seeds=[7, 11],
    )
    result = experiment.run(store=store)
    # Look up the persisted evaluation execution for the rubric metric on case-1.
    execution = get_evaluation_execution(
        store, result.run_id, "case-1", "builtin/llm_rubric"
    )
    return {
        "run_id": result.run_id,
        "status": result.status.value,
        "score_ids": [score.metric_id for score in result.cases[0].scores],
        # Report zero judge calls when no execution was found.
        "judge_calls": 0 if execution is None else len(execution.judge_calls),
    }


if __name__ == "__main__":
    print(run_example())

Workflow-backed metrics persist judge calls, prompts, responses, and aggregation outputs so scores remain inspectable later.
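As a minimal sketch of inspecting such an execution after the run, the helper below uses only the `judge_calls` attribute that the example above reads. The `SimpleNamespace` object is a hypothetical stand-in for a persisted execution, and the shape of each judge-call entry is not described in this guide, so plain strings are used as placeholders.

```python
from types import SimpleNamespace


def count_judge_calls(execution) -> int:
    """Return how many judge calls an execution recorded, or 0 if none was found."""
    if execution is None:
        return 0
    return len(execution.judge_calls)


# Hypothetical stand-in for a persisted execution; only `judge_calls`
# comes from the guide, and the entries here are placeholder strings.
fake_execution = SimpleNamespace(judge_calls=["call-1", "call-2"])

print(count_judge_calls(fake_execution))  # 2
print(count_judge_calls(None))  # 0
```

In a real run, the value passed in would be whatever `get_evaluation_execution` returns for the metric and case of interest.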

Variants

| Variant | Best when | Tradeoff | Related APIs / commands |
| --- | --- | --- | --- |
| Rubric scoring | One judge and one rubric are enough | Less resilient to judge variance than panel-style setups | builtin/llm_rubric |
| Multi-judge averaging | Multiple judges should score the same output and aggregate | Higher latency and judge-model cost | builtin/panel_of_judges |
| Majority-vote judgment | The output should collapse to a categorical majority decision | Loses scalar nuance compared with averaging | builtin/majority_vote_judge |
| Pairwise selection | Two candidates should be compared directly | Not a drop-in replacement for single-output scoring | builtin/pairwise_judge |
| Heterogeneous multi-judge orchestration | Different prompts or parsing logic should run over the same response | Requires a custom workflow metric in Python | Custom workflow metric, ctx.prompt_spec |
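The tradeoff between averaging and majority voting can be illustrated with a small standalone sketch. It does not use Themis or reproduce how the builtin metrics aggregate internally. The function names, the 0.5 pass threshold, and the sample scores are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean


def average_scores(scores: list[float]) -> float:
    """Aggregate judge scores into one scalar by taking the mean."""
    return mean(scores)


def majority_label(labels: list[str]) -> str:
    """Collapse judge verdicts to the most common label.

    On a tie, Counter.most_common returns the label seen first.
    """
    label, _count = Counter(labels).most_common(1)[0]
    return label


# Illustrative scores from three hypothetical judges.
judge_scores = [0.9, 0.6, 0.3]
# Assumed threshold: a score of 0.5 or higher counts as "pass".
judge_labels = ["pass" if score >= 0.5 else "fail" for score in judge_scores]

print(average_scores(judge_scores))
print(majority_label(judge_labels))  # pass
```

Here the average keeps the information that one judge scored the output low, while the majority vote reports only the categorical outcome.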

Expected result

The run should persist evaluation executions with judge calls, prompts, responses, and final scores or aggregation output.

Builtin judge workflows consume PromptSpec.blocks directly. Custom workflow metrics should read ctx.prompt_spec themselves when they need benchmark-derived context, retrieved context, reference judgments, or any other prompt material that should travel with the experiment identity.
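A rough sketch of how a custom workflow metric might fold prompt blocks into its judge prompt is shown below. The two dataclasses are hypothetical stand-ins, not Themis classes. Treating `blocks` as a list of strings is an assumption, since the real `PromptSpec` may structure blocks differently, and `build_judge_prompt` is an illustrative helper name.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StandInPromptSpec:
    """Hypothetical stand-in; assumes blocks are plain strings."""

    blocks: list[str] = field(default_factory=list)


@dataclass
class StandInContext:
    """Hypothetical stand-in for the ctx a custom workflow metric receives."""

    prompt_spec: StandInPromptSpec | None = None


def build_judge_prompt(ctx: StandInContext, response: str) -> str:
    """Prepend any prompt blocks to the response being judged."""
    blocks = [] if ctx.prompt_spec is None else ctx.prompt_spec.blocks
    return "\n\n".join([*blocks, f"Response to evaluate:\n{response}"])


ctx = StandInContext(StandInPromptSpec(blocks=["Reference answer: 4"]))
print(build_judge_prompt(ctx, "4"))
```

The point of the sketch is the explicit read of `ctx.prompt_spec`: a custom metric that skips this step would judge the response without the context that travels with the experiment.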

Troubleshooting