First evaluate(...)¶
What you will build¶
You will run a single deterministic evaluation from Python using builtin generation, parsing, and scoring components.
Prerequisites¶
- base Themis install
- no provider extras required
- basic familiarity with running a Python script
Steps¶
- Read the example below.
- Run it as a standalone script or import run_example().
- Inspect the returned run_id and status.
from __future__ import annotations

from themis import evaluate
from themis.core.models import Case, Dataset


def run_example() -> dict[str, object]:
    """Run the smallest end-to-end evaluation through the Layer 1 API."""
    result = evaluate(
        model="builtin/demo_generator",
        data=[
            Dataset(
                dataset_id="sample",
                cases=[
                    Case(
                        case_id="case-1",
                        input={"question": "2+2"},
                        expected_output={"answer": "4"},
                    )
                ],
            )
        ],
        metric="builtin/exact_match",
        parser="builtin/json_identity",
    )
    return {"run_id": result.run_id, "status": result.status.value}


if __name__ == "__main__":
    print(run_example())
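To make the parse-then-score flow more concrete, here is a rough standalone sketch of what a JSON identity parser followed by an exact-match metric might do. It does not use Themis: the functions parse_json_identity and exact_match are hypothetical stand-ins, the raw output string is an assumed shape, and the real builtin components may behave differently.

```python
import json


def parse_json_identity(raw: str) -> object:
    """Hypothetical parser step: decode JSON and return the value unchanged."""
    return json.loads(raw)


def exact_match(parsed: object, expected: object) -> float:
    """Hypothetical metric: 1.0 when the values are equal, otherwise 0.0."""
    return 1.0 if parsed == expected else 0.0


# Assumed raw generator output for the question "2+2".
raw_output = '{"answer": "4"}'

# Same structure and same types: counts as a match.
score_hit = exact_match(parse_json_identity(raw_output), {"answer": "4"})

# Integer 4 instead of the string "4": strict equality fails.
score_miss = exact_match(parse_json_identity(raw_output), {"answer": 4})

print(score_hit, score_miss)
```

Under these assumptions, strict equality means that even a type difference between the expected output and the parsed output counts as a mismatch.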
Expected results¶
Expected result:
- status is completed
- run_id is stable for the same compiled identity inputs
- you used the shortest supported Python entry point
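One common way to get an identifier that stays stable for the same inputs is to hash a canonical serialization of those inputs. The sketch below illustrates that general idea only. It is not how Themis computes run_id: the stable_run_id helper, the choice of SHA-256, and the 16-character truncation are all assumptions made for illustration.

```python
import hashlib
import json


def stable_run_id(identity: dict) -> str:
    """Hypothetical helper: hash a canonical JSON form of the identity inputs."""
    # sort_keys and fixed separators make the serialization independent of
    # dictionary insertion order and whitespace.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


inputs = {
    "model": "builtin/demo_generator",
    "metric": "builtin/exact_match",
    "parser": "builtin/json_identity",
    "expected_output": {"answer": "4"},
}

# Same content, different key order: the identifier does not change.
reordered = {
    "parser": "builtin/json_identity",
    "expected_output": {"answer": "4"},
    "metric": "builtin/exact_match",
    "model": "builtin/demo_generator",
}

# Different content: the identifier changes.
changed = {**inputs, "expected_output": {"answer": "5"}}

print(stable_run_id(inputs) == stable_run_id(reordered))
print(stable_run_id(inputs) == stable_run_id(changed))
```

In a scheme like this, rerunning with identical inputs yields the same identifier, while any change to an identity input yields a new one.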
Common failure points¶
- using a different expected output than the builtin demo generator returns
- assuming memory storage can be reopened from another process
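The second failure point is a general property of in-memory stores: the data belongs to whoever created it. As an analogy only, the sketch below uses Python's sqlite3 module, where each ":memory:" connection gets its own private database. Whether Themis memory storage is built on SQLite is not stated here, so treat this as an illustration of the principle rather than of Themis internals. The runs table and the run-1 value are made up for the example.

```python
import sqlite3

# First connection: create a table and store a run record in memory.
first = sqlite3.connect(":memory:")
first.execute("CREATE TABLE runs (run_id TEXT)")
first.execute("INSERT INTO runs VALUES ('run-1')")
first.commit()

# Second connection: a brand-new, empty in-memory database.
second = sqlite3.connect(":memory:")
tables_seen_by_second = second.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

# The first connection can still read its own data.
rows_seen_by_first = first.execute("SELECT run_id FROM runs").fetchall()

print(tables_seen_by_second)
print(rows_seen_by_first)

first.close()
second.close()
```

The same reasoning applies across processes: anything held only in one process's memory is gone once that process exits, so a later process needs a persistent backend to reopen a run.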