
First LLM-judged evaluation

What you will build

You will run a workflow-backed metric with the built-in demo judges, then inspect the evaluation execution that the run stores.

Prerequisites

  • familiarity with persisted runs
  • a base Themis install

Steps

  1. Configure a workflow-backed metric and demo judge models.
  2. Run the experiment into an in-memory store.
  3. Inspect the stored evaluation execution for judge calls and scores.
from __future__ import annotations

from themis import Experiment, InMemoryRunStore, get_evaluation_execution
from themis.core.config import EvaluationConfig, GenerationConfig, StorageConfig
from themis.core.models import Case, Dataset


def run_example() -> dict[str, object]:
    """Run a workflow-backed metric with builtin demo judges."""

    store = InMemoryRunStore()
    experiment = Experiment(
        generation=GenerationConfig(
            generator="builtin/demo_generator",
            candidate_policy={"num_samples": 1},
            reducer="builtin/majority_vote",
        ),
        evaluation=EvaluationConfig(
            metrics=["builtin/llm_rubric"],
            parsers=["builtin/json_identity"],
            judge_models=["builtin/demo_judge", "builtin/demo_judge"],
            workflow_overrides={"rubric": "pass if the answer is correct"},
        ),
        storage=StorageConfig(store="memory"),
        datasets=[
            Dataset(
                dataset_id="sample",
                cases=[
                    Case(
                        case_id="case-1",
                        input={"question": "2+2"},
                        expected_output={"answer": "4"},
                    )
                ],
            )
        ],
        seeds=[7],
    )

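    # Pass the store explicitly so the same object can be queried after the run.
    result = experiment.run(store=store)
    # Look up the stored evaluation execution for this case and metric.
    execution = get_evaluation_execution(
        store, result.run_id, "case-1", "builtin/llm_rubric"
    )
    # The lookup result may be None, so the counts fall back to 0.
    return {
        "run_id": result.run_id,
        "status": result.status.value,
        "judge_calls": 0 if execution is None else len(execution.judge_calls),
        "score_count": 0 if execution is None else len(execution.scores),
    }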


if __name__ == "__main__":
    print(run_example())

Expected results

Inspect after the run:

  • the workflow-backed metric produced judge executions
  • the stored execution keeps the per-judge calls and scores
  • you can inspect judge artifacts from the store later
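
The checks above can be turned into assertions on the summary dictionary that run_example() returns. The sketch below is a minimal illustration and not part of Themis: check_summary is a hypothetical helper, it assumes only the four keys built in the example, and the sample values are made up. It checks for at least one judge call and one score instead of exact counts, because the exact numbers depend on how the workflow fans out over the configured judges.

```python
def check_summary(summary: dict[str, object]) -> list[str]:
    """Return the problems found in a run_example() summary (hypothetical helper)."""
    problems: list[str] = []
    # The four keys below are the ones the tutorial example returns.
    for key in ("run_id", "status", "judge_calls", "score_count"):
        if key not in summary:
            problems.append(f"missing key: {key}")
    if problems:
        return problems
    # A count of 0 means no execution was found or nothing was recorded in it.
    if not summary["judge_calls"]:
        problems.append("no judge calls stored")
    if not summary["score_count"]:
        problems.append("no scores stored")
    return problems


# Illustrative values only, not real Themis output.
example = {"run_id": "run-123", "status": "completed", "judge_calls": 2, "score_count": 2}
print(check_summary(example))
```

In a real session you would pass the result of run_example() instead of the hand-written dictionary; an empty list means every check passed.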

Common failure points

  • expecting a workflow-backed metric to behave like a pure metric; it calls judge models and stores per-judge calls and scores
  • forgetting to set judge_models in EvaluationConfig for workflow-backed scoring
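
Both failure points can be illustrated without Themis at all. The following is a conceptual sketch in plain Python, not the library's implementation: exact_match, rubric_metric, and the two judge functions are hypothetical stand-ins, and a judge is modelled as a simple callable. It shows why a judged metric is not a pure function of the output: it needs at least one judge, and it yields one record per judge call in addition to an aggregate score.

```python
from typing import Callable

# Assumption for this sketch: a judge maps a prompt to a score.
# Real judges are model calls with richer inputs and outputs.
Judge = Callable[[str], float]


def exact_match(output: str, expected: str) -> float:
    """Pure metric: the score depends only on the two arguments."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def rubric_metric(output: str, rubric: str, judges: list[Judge]) -> dict[str, object]:
    """Judged metric: ask every judge, keep each call, then average the scores."""
    if not judges:
        raise ValueError("a judged metric needs at least one judge model")
    prompt = f"Rubric: {rubric}\nAnswer: {output}"
    calls = [{"judge": index, "score": judge(prompt)} for index, judge in enumerate(judges)]
    mean = sum(call["score"] for call in calls) / len(calls)
    return {"judge_calls": calls, "score": mean}


def lenient_judge(prompt: str) -> float:
    """Toy judge that passes everything."""
    return 1.0


def strict_judge(prompt: str) -> float:
    """Toy judge that passes only when the answer line is exactly 4."""
    return 1.0 if prompt.endswith("Answer: 4") else 0.0


result = rubric_metric("4", "pass if the answer is correct", [lenient_judge, strict_judge])
print(exact_match("4", "4"), len(result["judge_calls"]), result["score"])
```

Calling rubric_metric with an empty judge list raises ValueError in this sketch, which mirrors the second failure point: a judged metric has nothing to run when no judge models are provided.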

Next steps