First evaluate(...)

What you will build

You will run a single deterministic evaluation from Python using builtin generation, parsing, and scoring components.

Prerequisites

  • base Themis install
  • no provider extras required
  • basic familiarity with running a Python script

Steps

  1. Read the example below.
  2. Run it as a standalone script, or import run_example() and call it.
  3. Inspect the returned run_id and status.

from __future__ import annotations

from themis import evaluate
from themis.core.models import Case, Dataset


def run_example() -> dict[str, object]:
    """Run the smallest end-to-end evaluation through the Layer 1 API."""

    result = evaluate(
        model="builtin/demo_generator",
        data=[
            Dataset(
                dataset_id="sample",
                cases=[
                    Case(
                        case_id="case-1",
                        input={"question": "2+2"},
                        expected_output={"answer": "4"},
                    )
                ],
            )
        ],
        metric="builtin/exact_match",
        parser="builtin/json_identity",
    )
    return {"run_id": result.run_id, "status": result.status.value}


if __name__ == "__main__":
    print(run_example())

Expected results

  • status is completed
  • run_id is stable for the same compiled identity inputs
  • you used the shortest supported Python entry point
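
The first two expectations can be checked mechanically. The helper below is a minimal sketch, not part of Themis: check_run is a hypothetical name, and it assumes the summary dict has the shape returned by run_example() above, with status.value rendered as the lowercase string "completed".

```python
from __future__ import annotations

from typing import Callable


def check_run(run: Callable[[], dict[str, object]]) -> str:
    """Call `run` twice and verify status and run_id stability.

    Hypothetical helper: `run` is any zero-argument callable that returns
    a dict with "run_id" and "status" keys, such as run_example().
    """
    first = run()
    second = run()

    # Expectation 1: the run finished. The literal "completed" is an
    # assumption about how the status enum is rendered.
    if first["status"] != "completed":
        raise AssertionError(f"unexpected status: {first['status']!r}")

    # Expectation 2: identical inputs produce the same run_id.
    if first["run_id"] != second["run_id"]:
        raise AssertionError("run_id changed between identical runs")

    return str(first["run_id"])
```

With the example module importable, check_run(run_example) returns the run_id when both expectations hold and raises AssertionError otherwise.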

Common failure points

  • setting expected_output to a value that differs from what the builtin demo generator returns
  • assuming memory storage can be reopened from another process; in-memory state does not outlive the process that created it
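
The memory-storage pitfall is easier to see with a stand-in. The sketch below uses only the standard library's sqlite3 module as an analogy; it does not touch Themis's storage layer, and count_after_reopen and the runs table are hypothetical names. A second connection plays the role of "another process".

```python
import os
import sqlite3
import tempfile


def count_after_reopen(target: str) -> int:
    """Write one row, close the connection, reopen `target`, and count rows."""
    schema = "CREATE TABLE IF NOT EXISTS runs (run_id TEXT)"

    writer = sqlite3.connect(target)
    writer.execute(schema)
    writer.execute("INSERT INTO runs VALUES ('demo-run')")
    writer.commit()
    writer.close()

    # Reopening simulates a later process looking for the earlier run.
    reader = sqlite3.connect(target)
    reader.execute(schema)
    (count,) = reader.execute("SELECT COUNT(*) FROM runs").fetchone()
    reader.close()
    return count


with tempfile.TemporaryDirectory() as tmp:
    on_disk = count_after_reopen(os.path.join(tmp, "runs.db"))

in_memory = count_after_reopen(":memory:")

print(on_disk, in_memory)  # prints "1 0"
```

The file-backed store still holds the row after reopening, while the in-memory store starts empty each time. If a run needs to be inspected later or from a different process, an in-memory backend is the wrong choice.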

Next steps