Use pure metrics¶
Goal: configure deterministic scoring based only on parsed output and case data.
When to use this:
Use this guide when you do not need judge models or workflow execution.
Procedure¶
Choose one or more builtin pure metrics such as builtin/exact_match, builtin/f1, or builtin/bleu, and pair them with a parser that normalizes the reduced candidate into the shape the metric expects.
from __future__ import annotations
from themis import Experiment
from themis.core.config import EvaluationConfig, GenerationConfig, StorageConfig
from themis.core.models import Case, Dataset
def run_example() -> dict[str, object]:
"""Execute builtin pure metrics together."""
experiment = Experiment(
generation=GenerationConfig(
generator="builtin/demo_generator", reducer="builtin/majority_vote"
),
evaluation=EvaluationConfig(
metrics=["builtin/exact_match", "builtin/f1", "builtin/bleu"],
parsers=["builtin/json_identity"],
),
storage=StorageConfig(store="memory"),
datasets=[
Dataset(
dataset_id="sample",
cases=[
Case(
case_id="case-1",
input={"question": "2+2"},
expected_output={"answer": "4"},
)
],
)
],
)
result = experiment.run()
return {
"run_id": result.run_id,
"status": result.status.value,
"score_ids": [score.metric_id for score in result.cases[0].scores],
}
if __name__ == "__main__":
print(run_example())
Pure metrics score parsed outputs directly. They do not require judge models or workflow execution.
Variants¶
| Variant | Best when | Tradeoff | Related APIs / commands |
|---|---|---|---|
| Exact structured comparison | Parsed output should match the expected value exactly | Too strict for fuzzy or stylistic outputs | builtin/exact_match |
| Token-overlap style scoring | Partial lexical overlap is more meaningful than exact equality | Still surface-form based, not semantic judging | builtin/f1, builtin/bleu |
| Task-specific deterministic logic | The scoring rule is deterministic but domain-specific | Requires custom metric implementation | PureMetric |
Expected result¶
The run should complete without judge models and produce direct final scores.