First LLM-judged evaluation

What you will build

You will run a workflow-backed metric with builtin demo judges and inspect the stored evaluation execution.

Prerequisites

familiarity with persisted runs
base Themis install

Steps

Configure a workflow-backed metric and demo judge models.
Run the experiment into an in-memory store.
Inspect the stored evaluation execution for judge calls and scores.

from __future__ import annotations

from themis import Experiment, InMemoryRunStore, get_evaluation_execution
from themis.core.config import EvaluationConfig, GenerationConfig, StorageConfig
from themis.core.models import Case, Dataset


def run_example() -> dict[str, object]:
    """Run a workflow-backed metric with builtin demo judges."""

    store = InMemoryRunStore()
    experiment = Experiment(
        generation=GenerationConfig(
            generator="builtin/demo_generator",
            candidate_policy={"num_samples": 1},
            reducer="builtin/majority_vote",
        ),
        evaluation=EvaluationConfig(
            metrics=["builtin/llm_rubric"],
            parsers=["builtin/json_identity"],
            judge_models=["builtin/demo_judge", "builtin/demo_judge"],
            workflow_overrides={"rubric": "pass if the answer is correct"},
        ),
        storage=StorageConfig(store="memory"),
        datasets=[
            Dataset(
                dataset_id="sample",
                cases=[
                    Case(
                        case_id="case-1",
                        input={"question": "2+2"},
                        expected_output={"answer": "4"},
                    )
                ],
            )
        ],
        seeds=[7],
    )

    result = experiment.run(store=store)
    execution = get_evaluation_execution(
        store, result.run_id, "case-1", "builtin/llm_rubric"
    )
    return {
        "run_id": result.run_id,
        "status": result.status.value,
        "judge_calls": 0 if execution is None else len(execution.judge_calls),
        "score_count": 0 if execution is None else len(execution.scores),
    }


if __name__ == "__main__":
    print(run_example())

Expected results

Inspect after the run:

the workflow-backed metric produced judge executions
the stored execution keeps the per-judge calls and scores
you can inspect judge artifacts from the store later
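
The inspection step can be sketched as a small helper. This is a minimal sketch that assumes only what the example above shows: the value returned by get_evaluation_execution may be None, and otherwise exposes judge_calls and scores as sized collections. The helper name summarize_execution is hypothetical and not part of Themis, and the demo uses a stand-in object rather than a real stored execution.

```python
from __future__ import annotations

from types import SimpleNamespace


def summarize_execution(execution: object | None) -> dict[str, int]:
    """Count judge calls and scores on a stored execution (hypothetical helper)."""
    if execution is None:
        # Nothing was stored for this case/metric pair.
        return {"judge_calls": 0, "score_count": 0}
    return {
        "judge_calls": len(execution.judge_calls),
        "score_count": len(execution.scores),
    }


# Stand-in object for illustration only; a real run would pass the value
# returned by get_evaluation_execution(store, run_id, case_id, metric_id).
fake_execution = SimpleNamespace(
    judge_calls=["call-a", "call-b"],
    scores=[1.0, 1.0],
)

print(summarize_execution(fake_execution))
print(summarize_execution(None))
```

Treating a missing execution as zero counts mirrors the return value of run_example above; raising an error instead may be preferable when a missing execution should fail a test.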

Common failure points

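expecting workflow-backed metrics to behave like pure metrics: a workflow-backed metric runs judge models and stores per-judge calls and scores
forgetting to provide judge models (judge_models in EvaluationConfig) for workflow-backed scoring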
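
One way to catch the second failure point early is a check run before the experiment is built. The sketch below is an assumption-laden illustration, not a Themis feature: the function metrics_missing_judges and the WORKFLOW_BACKED set are hypothetical names, and the set must be maintained by the caller. It contains only builtin/llm_rubric, the workflow-backed metric used in this tutorial.

```python
from __future__ import annotations

# Assumption: the caller knows which metric ids are workflow-backed.
# "builtin/llm_rubric" is the one used in this tutorial.
WORKFLOW_BACKED = frozenset({"builtin/llm_rubric"})


def metrics_missing_judges(
    metrics: list[str],
    judge_models: list[str],
    workflow_backed: frozenset[str] = WORKFLOW_BACKED,
) -> list[str]:
    """Return workflow-backed metrics that have no judge model to run them."""
    if judge_models:
        # At least one judge is configured, so nothing is missing.
        return []
    return [metric for metric in metrics if metric in workflow_backed]


# No judges configured: the rubric metric is reported as a problem.
print(metrics_missing_judges(["builtin/llm_rubric"], []))

# Judges configured as in the tutorial: nothing is reported.
print(metrics_missing_judges(["builtin/llm_rubric"], ["builtin/demo_judge"]))
```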
