Use pure metrics¶

Goal: configure deterministic scoring based only on parsed output and case data.

When to use this:

Use this guide when you do not need judge models or workflow execution.

Procedure¶

Choose one or more builtin pure metrics such as builtin/exact_match, builtin/f1, or builtin/bleu, and pair them with a parser that normalizes the reduced candidate into the shape the metric expects.

from __future__ import annotations

from themis import Experiment
from themis.core.config import EvaluationConfig, GenerationConfig, StorageConfig
from themis.core.models import Case, Dataset


def run_example() -> dict[str, object]:
    """Execute builtin pure metrics together."""

    experiment = Experiment(
        generation=GenerationConfig(
            generator="builtin/demo_generator", reducer="builtin/majority_vote"
        ),
        evaluation=EvaluationConfig(
            metrics=["builtin/exact_match", "builtin/f1", "builtin/bleu"],
            parsers=["builtin/json_identity"],
        ),
        storage=StorageConfig(store="memory"),
        datasets=[
            Dataset(
                dataset_id="sample",
                cases=[
                    Case(
                        case_id="case-1",
                        input={"question": "2+2"},
                        expected_output={"answer": "4"},
                    )
                ],
            )
        ],
    )
    result = experiment.run()
    return {
        "run_id": result.run_id,
        "status": result.status.value,
        "score_ids": [score.metric_id for score in result.cases[0].scores],
    }


if __name__ == "__main__":
    print(run_example())

Pure metrics score parsed outputs directly. They do not require judge models or workflow execution.

Variants¶

Variant	Best when	Tradeoff	Related APIs / commands
Exact structured comparison	Parsed output should match the expected value exactly	Too strict for fuzzy or stylistic outputs	`builtin/exact_match`
Token-overlap style scoring	Partial lexical overlap is more meaningful than exact equality	Still surface-form based, not semantic judging	`builtin/f1`, `builtin/bleu`
Task-specific deterministic logic	The scoring rule is deterministic but domain-specific	Requires custom metric implementation	`PureMetric`

Expected result¶

The run should complete without judge models and produce direct final scores.

Use pure metrics¶

Procedure¶

Variants¶

Expected result¶

Troubleshooting¶