
Run benchmarks

Goal: execute a named benchmark from the catalog instead of wiring the dataset yourself.

When to use this:

Use this guide when a shipped benchmark entry already matches the task you want to run.

Procedure

Run the shortest benchmark workflow:

themis quick-eval benchmark --name mmlu_pro

Or run the same named benchmark from Python through the catalog API using themis.catalog.run(...).

Then inspect the benchmark catalog for prerequisites such as optional dataset dependencies or adapter-specific execution constraints.
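If a prerequisite turns out to be an optional Python package, you can check whether it is importable before starting a run. The following is a minimal sketch using only the standard library. It is not a Themis API, the helper name `missing_modules` is hypothetical, and the module names you pass in must come from the catalog entry itself.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported here.

    Pass top-level names only (for example "json", not "json.decoder"):
    find_spec can raise for a dotted name whose parent package is absent.
    """
    return [name for name in module_names if find_spec(name) is None]


# Illustrative names only: "json" ships with Python, the second does not exist.
print(missing_modules(["json", "no_such_optional_dependency"]))
```

An empty list means every named module can be imported in the current environment. It says nothing about adapter-specific execution constraints, which still need to be read from the catalog.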

When you want to inspect or filter the benchmark dataset before running it:

from themis.catalog import load

benchmark = load("mmlu_pro")
dataset = benchmark.materialize_dataset()

Slicing and downsampling are done in code today; there is no declarative configuration for them. When you need a subset of a shipped benchmark, load or materialize a Dataset, then filter or sample its cases before compiling the experiment. Themis treats that filtered dataset as the benchmark you asked it to run.

One concrete pattern is:

from themis import Experiment
from themis.core.config import EvaluationConfig, GenerationConfig, StorageConfig
from themis.core.models import Dataset

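# Placeholder: in practice this is the Dataset you loaded or materialized.
source_dataset = Dataset(...)

# Keep only the cases tagged "hard", then cap the slice at the first 100 matches.
filtered_dataset = source_dataset.model_copy(
    update={
        "cases": [
            case
            for case in source_dataset.cases
            if case.metadata.get("category") == "hard"
        ][:100]
    }
)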

experiment = Experiment(
    generation=GenerationConfig(...),
    evaluation=EvaluationConfig(...),
    storage=StorageConfig(store="sqlite", parameters={"path": "runs.sqlite3"}),
    datasets=[filtered_dataset],
)

This is the currently supported way to run a slice or a downsample of a benchmark.
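The pattern above keeps the first 100 matching cases, which can skew a smoke run if the cases are stored in a meaningful order. A seeded random sample is one alternative. The sketch below assumes the cases are available as a plain list. It uses a stand-in `Case` dataclass rather than the Themis model, and the `downsample` helper is hypothetical, not part of Themis.

```python
import random
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Case:
    """Stand-in for a benchmark case; not the Themis class."""

    case_id: str
    metadata: dict[str, Any] = field(default_factory=dict)


def downsample(cases, k, seed=0):
    """Pick up to k cases with a seeded RNG, preserving the original order."""
    cases = list(cases)
    if k >= len(cases):
        return cases
    rng = random.Random(seed)
    # Sample positions rather than objects so the original order can be restored.
    chosen = sorted(rng.sample(range(len(cases)), k))
    return [cases[i] for i in chosen]


# Twenty illustrative cases, alternating between two categories.
cases = [
    Case(f"case-{i}", {"category": "hard" if i % 2 else "easy"})
    for i in range(20)
]
hard = [c for c in cases if c.metadata.get("category") == "hard"]
subset = downsample(hard, k=5, seed=13)
print([c.case_id for c in subset])
```

Fixing the seed makes the subset reproducible across runs, so scores from repeated smoke checks stay comparable. The resulting list would take the place of the sliced list in the `model_copy(update={"cases": ...})` call.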

Variants

| Variant | Best when | Tradeoff | Related APIs / commands |
| --- | --- | --- | --- |
| Quick local check | You want the shortest shell path to a shipped benchmark | Less control over filtering and experiment wiring | themis quick-eval benchmark --name ... |
| Python catalog execution | You want catalog convenience but still from Python | Less flexible than building a custom experiment after inspection | themis.catalog.run(...) |
| Custom experiment around the same dataset | You want to reuse the shipped benchmark dataset but own the experiment wiring | More code than the convenience API | themis.catalog.load(...), BenchmarkDefinition.materialize_dataset(...), Experiment(...) |
| Filtered benchmark slice | Only part of the shipped benchmark should run | Slicing is authored in code today rather than declarative config | Dataset(cases=[...]) built from a materialized benchmark dataset |
| Benchmark downsample | You want a smaller subset for smoke checks or iteration speed | No longer the full benchmark score | Sample cases before Experiment.compile() |

Expected result

You should get a completed run keyed by the named benchmark entry and know whether the benchmark requires extra setup.

Troubleshooting

Local smoke checks

Use these optional commands when you want to validate benchmark wiring against local services instead of the demo generator.
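If a local service is not listening, the smoke check cannot succeed, so it can be worth probing the endpoint first. The sketch below assumes the server follows the OpenAI convention of listing models at `<base_url>/models`. The helper names are hypothetical and nothing here is part of Themis.

```python
import urllib.error
import urllib.request


def models_url(base_url):
    """Build the model-listing URL for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def endpoint_is_up(base_url, timeout=2.0):
    """Return True if the model-listing route answers with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


if __name__ == "__main__":
    print(endpoint_is_up("http://127.0.0.1:1234/v1"))
```

A `False` result only says the route did not answer with 200 within the timeout. Some servers require an API key even for this route, in which case the probe would need an authorization header.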

Generation-model smoke check against your local OpenAI-compatible endpoint:

from themis.adapters.openai import openai
from themis.catalog import run
from themis.core.stores import InMemoryRunStore

result = run(
    "frontierscience",
    model=openai(
        "google/gemma-4-26b-a4b",
        base_url="http://127.0.0.1:1234/v1",
    ),
    store=InMemoryRunStore(),
)
print(result.status)

Code-benchmark smoke check with local sandbox services:

export THEMIS_CODE_SANDBOX_FUSION_URL=http://localhost:8080
export THEMIS_CODE_PISTON_URL=http://localhost:2000
themis quick-eval benchmark --name codeforces

These smoke checks are optional local verification only. The automated test suite should continue to use fixture-backed datasets and fake or demo components.
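Before running the code-benchmark smoke check, you may want to confirm that the sandbox variables are set and look like URLs. This is a minimal sketch that assumes both variables are needed for the benchmark you run. The helper `sandbox_env_problems` is hypothetical and not part of Themis.

```python
import os
from urllib.parse import urlparse

# Variable names taken from the smoke-check commands in this guide.
SANDBOX_VARS = ("THEMIS_CODE_SANDBOX_FUSION_URL", "THEMIS_CODE_PISTON_URL")


def sandbox_env_problems(env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    for name in SANDBOX_VARS:
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
            continue
        parsed = urlparse(value)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{name} is not an http(s) URL: {value!r}")
    return problems


if __name__ == "__main__":
    for problem in sandbox_env_problems():
        print(problem)
```

The check validates only the shape of each value. It does not contact the sandbox services, so a passing check does not guarantee that they are running.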