Skip to content

Themis

First advanced run

First advanced run¶

What you will build¶

You will run multiple candidates per case, reduce them to one reduced candidate, and score the result with both pure and workflow-backed metrics.

Prerequisites¶

familiarity with workflow-backed metrics
understanding of candidate fan-out

Steps¶

Configure num_samples greater than one.
Keep a reducer in the generation stage.
Add both pure and workflow-backed metrics to evaluation.

from __future__ import annotations

from themis import Experiment, InMemoryRunStore
from themis.core.config import EvaluationConfig, GenerationConfig, StorageConfig
from themis.core.models import Case, Dataset


def run_example() -> dict[str, object]:
    """Run a multi-candidate evaluation with mixed metrics."""

    store = InMemoryRunStore()
    experiment = Experiment(
        generation=GenerationConfig(
            generator="builtin/demo_generator",
            candidate_policy={"num_samples": 2},
            reducer="builtin/majority_vote",
        ),
        evaluation=EvaluationConfig(
            metrics=[
                "builtin/exact_match",
                "builtin/llm_rubric",
                "builtin/pairwise_judge",
            ],
            parsers=["builtin/json_identity"],
            judge_models=["builtin/demo_judge", "builtin/demo_judge"],
            workflow_overrides={"rubric": "prefer correct and concise answers"},
        ),
        storage=StorageConfig(store="memory"),
        datasets=[
            Dataset(
                dataset_id="sample",
                cases=[
                    Case(
                        case_id="case-1",
                        input={"question": "2+2"},
                        expected_output={"answer": "4"},
                    )
                ],
            )
        ],
        seeds=[7, 11],
    )

    result = experiment.run(store=store)
    case_result = result.cases[0]
    return {
        "run_id": result.run_id,
        "status": result.status.value,
        "generated_candidates": len(case_result.generated_candidates),
        "score_ids": [score.metric_id for score in case_result.scores],
    }


if __name__ == "__main__":
    print(run_example())

Expected results¶

Inspect after the run:

two candidates were generated for the case
reduction happened before scoring
the final result includes both pure and workflow-backed metric scores

Common failure points¶

expecting multiple candidates to appear when num_samples is left at 1
mixing up reducer responsibilities with metric responsibilities

Next steps¶