Skip to content

Use pure metrics

Goal: configure deterministic scoring based only on parsed output and case data.

When to use this:

Use this guide when you do not need judge models or workflow execution.

Procedure

Choose one or more builtin pure metrics such as builtin/exact_match, builtin/f1, or builtin/bleu, and pair them with a parser that normalizes the reduced candidate into the shape the metric expects.

from __future__ import annotations

from themis import Experiment
from themis.core.config import EvaluationConfig, GenerationConfig, StorageConfig
from themis.core.models import Case, Dataset


def run_example() -> dict[str, object]:
    """Execute builtin pure metrics together."""

    experiment = Experiment(
        generation=GenerationConfig(
            generator="builtin/demo_generator", reducer="builtin/majority_vote"
        ),
        evaluation=EvaluationConfig(
            metrics=["builtin/exact_match", "builtin/f1", "builtin/bleu"],
            parsers=["builtin/json_identity"],
        ),
        storage=StorageConfig(store="memory"),
        datasets=[
            Dataset(
                dataset_id="sample",
                cases=[
                    Case(
                        case_id="case-1",
                        input={"question": "2+2"},
                        expected_output={"answer": "4"},
                    )
                ],
            )
        ],
    )
    result = experiment.run()
    return {
        "run_id": result.run_id,
        "status": result.status.value,
        "score_ids": [score.metric_id for score in result.cases[0].scores],
    }


if __name__ == "__main__":
    print(run_example())

Pure metrics score parsed outputs directly. They do not require judge models or workflow execution.

Variants

Variant Best when Tradeoff Related APIs / commands
Exact structured comparison Parsed output should match the expected value exactly Too strict for fuzzy or stylistic outputs builtin/exact_match
Token-overlap style scoring Partial lexical overlap is more meaningful than exact equality Still surface-form based, not semantic judging builtin/f1, builtin/bleu
Task-specific deterministic logic The scoring rule is deterministic but domain-specific Requires custom metric implementation PureMetric

Expected result

The run should complete without judge models and produce direct final scores.

Troubleshooting