Run benchmarks¶

Goal: execute a named benchmark from the catalog instead of wiring the dataset yourself.

When to use this:

Use this guide when a shipped benchmark entry already matches the task you want to run.

Procedure¶

Run the shortest benchmark workflow:

themis quick-eval benchmark --name mmlu_pro

Or run the same named benchmark from Python through the catalog API using themis.catalog.run(...).

Then inspect the benchmark catalog for prerequisites such as optional dataset dependencies or adapter-specific execution constraints.

When you want to inspect or filter the benchmark dataset before running it:

from themis.catalog import load

benchmark = load("mmlu_pro")
dataset = benchmark.materialize_dataset()

Benchmark slicing and downsampling are authored in code today; there is no declarative config for them. When you need a subset of a shipped benchmark, load or materialize a Dataset, then filter or sample its cases before compiling the experiment. Themis treats that filtered dataset as the benchmark you asked it to run.

One concrete pattern is:

from themis import Experiment
from themis.core.config import EvaluationConfig, GenerationConfig, StorageConfig
from themis.core.models import Dataset

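# Placeholder: construct the Dataset here, or materialize it from a catalog entry as shown earlier.
source_dataset = Dataset(...)

# Copy the dataset with a reduced case list: only "hard" cases, capped at the first 100.
filtered_dataset = source_dataset.model_copy(
    update={
        "cases": [
            case
            for case in source_dataset.cases
            if case.metadata.get("category") == "hard"
        ][:100]
    }
)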

experiment = Experiment(
    generation=GenerationConfig(...),
    evaluation=EvaluationConfig(...),
    storage=StorageConfig(store="sqlite", parameters={"path": "runs.sqlite3"}),
    datasets=[filtered_dataset],
)

This is currently the supported way to run a slice or downsample of a benchmark.
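When a downsample should be reproducible across runs, a seeded random sample is one alternative to taking the first N matching cases. The sketch below is library-independent: `Case` is a hypothetical stand-in dataclass and `downsample_cases` is a hypothetical helper, neither of which is part of Themis. It assumes only that each case carries a `metadata` dict, as in the pattern above.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable, Optional, Sequence


@dataclass
class Case:
    """Hypothetical stand-in for a benchmark case; not the Themis model."""

    case_id: str
    metadata: dict[str, Any] = field(default_factory=dict)


def downsample_cases(
    cases: Sequence[Case],
    *,
    limit: int,
    seed: int = 0,
    keep: Optional[Callable[[Case], bool]] = None,
) -> list[Case]:
    """Filter cases with `keep`, then draw at most `limit` of them.

    The same seed and input always yield the same subset, and the
    selected cases stay in their original relative order.
    """
    selected = [case for case in cases if keep is None or keep(case)]
    if len(selected) <= limit:
        return selected
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    indices = sorted(rng.sample(range(len(selected)), limit))
    return [selected[i] for i in indices]
```

A list produced this way could stand in for the list comprehension passed as `"cases"` in the `update={...}` mapping, provided the real case objects expose `metadata` the same way.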

Variants¶

Variant	Best when	Tradeoff	Related APIs / commands
Quick local check	You want the shortest shell path to a shipped benchmark	Less control over filtering and experiment wiring	`themis quick-eval benchmark --name ...`
Python catalog execution	You want catalog convenience but still from Python	Less flexible than building a custom experiment after inspection	`themis.catalog.run(...)`
Custom experiment around the same dataset	You want to reuse the shipped benchmark dataset but own the experiment wiring	More code than the convenience API	`themis.catalog.load(...)`, `BenchmarkDefinition.materialize_dataset(...)`, `Experiment(...)`
Filtered benchmark slice	Only part of the shipped benchmark should run	Slicing is authored in code today rather than declarative config	`Dataset(cases=[...])` built from a materialized benchmark dataset
Benchmark downsample	You want a smaller subset for smoke checks or iteration speed	The resulting score no longer reflects the full benchmark	Sample cases before `Experiment.compile()`

Expected result¶

You should get a completed run keyed by the named benchmark entry and know whether the benchmark requires extra setup.

Troubleshooting¶

Local smoke checks¶

Use these optional commands when you want to validate benchmark wiring against local services instead of the demo generator.

Generation-model smoke check against your local OpenAI-compatible endpoint:

from themis.adapters.openai import openai
from themis.catalog import run
from themis.core.stores import InMemoryRunStore

result = run(
    "frontierscience",
    model=openai(
        "google/gemma-4-26b-a4b",
        base_url="http://127.0.0.1:1234/v1",
    ),
    store=InMemoryRunStore(),
)
print(result.status)

Code-benchmark smoke check with local sandbox services:

export THEMIS_CODE_SANDBOX_FUSION_URL=http://localhost:8080
export THEMIS_CODE_PISTON_URL=http://localhost:2000
themis quick-eval benchmark --name codeforces
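Before launching the code-benchmark smoke check, it can help to confirm that both sandbox variables are set and look like URLs. The following is a minimal preflight sketch, not a Themis feature: `check_sandbox_env` is a hypothetical helper, and it assumes that both variables exported above are needed for the run.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse

# Variable names taken from the export commands above.
SANDBOX_VARS = (
    "THEMIS_CODE_SANDBOX_FUSION_URL",
    "THEMIS_CODE_PISTON_URL",
)


def check_sandbox_env(environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return a list of problems; an empty list means both URLs look usable.

    This validates only the shape of each value. It does not contact the
    services, so a passing check does not prove they are running.
    """
    env = os.environ if environ is None else environ
    problems = []
    for name in SANDBOX_VARS:
        value = env.get(name, "").strip()
        if not value:
            problems.append(f"{name} is not set")
            continue
        parsed = urlparse(value)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{name} is not an http(s) URL: {value!r}")
    return problems
```

Calling `check_sandbox_env()` with no arguments reads the current process environment; passing a mapping makes the check easy to exercise in tests without touching real environment variables.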

These smoke checks are optional local verification only. The automated test suite should continue to use fixture-backed datasets and fake or demo components.