
Benchmark catalog

Catalog surfaces

| Surface | Kind | Use when | Key constraints / notes |
| --- | --- | --- | --- |
| Reusable catalog components | Loadable parsers, metrics, reducers, selectors, generators, and judge workflows | You want a shipped building block such as builtin/choice_letter without going through a named benchmark | Load with themis.catalog.load(...) or list with themis.catalog.list_component_ids(...) |
| Benchmark recipes | Named dataset-backed benchmark definitions | You want a catalog entry such as mmlu_pro to materialize a real dataset and wire the right parser and metric stack | Recipes stay cheap to inspect until materialize_dataset(...) or themis.catalog.run(...) is called |

Python entry points

| Entry point | Kind | Use when | Notes |
| --- | --- | --- | --- |
| themis.catalog.list_component_ids(...) | Discovery helper | You want to see the reusable shipped component ids before deciding what to load | Returns component ids only; benchmark discovery still starts from the benchmark manifest docs |
| themis.catalog.list_benchmark_ids(...) | Discovery helper | You want the canonical shipped benchmark ids without inspecting manifests | Returns benchmark ids only |
| themis.catalog.list_benchmarks(...) | Metadata listing | You want structured benchmark metadata such as support tier, variants, and version notes | Best source for docs, CLIs, and validation layers |
| themis.catalog.load(...) | Resolver | You want to inspect a reusable component or a BenchmarkDefinition before running anything | Use load("builtin/choice_letter") for a parser or load("mmlu_pro") for a benchmark recipe |
| themis.catalog.run(...) | Convenience executor | You want the catalog to materialize the dataset and run the benchmark in one call | Best for benchmark execution; for custom slicing, load first and build your own Dataset |
| themis.catalog.validate_benchmark(...) | Validation helper | You want to confirm a shipped benchmark loads, materializes, and is ready for score smoke checks | Ready code-execution benchmarks run a score smoke check; experimental ones report a skipped score smoke check |

Use themis.catalog.load("builtin/choice_letter") when you want a reusable parser directly. Use themis.catalog.load("mmlu_pro") when you want to inspect a benchmark definition first, including materialize_dataset(...). Use themis.catalog.run("mmlu_pro", model=..., store=...) when you want catalog convenience without going through the CLI. Use themis.catalog.list_benchmark_ids(...) or themis.catalog.list_benchmarks(...) when you need benchmark discovery or catalog metadata instead of component discovery.
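The split between component ids and benchmark ids can be sketched as a small helper. This is an illustrative sketch, not part of the catalog API: `catalog_id_kind` and `load_entry` are hypothetical names, and the rule that component ids start with `builtin/` is an assumption drawn from the ids shown on this page. Only `themis.catalog.load` comes from the documentation above.

```python
from typing import Any, Callable, Optional, Tuple


def catalog_id_kind(catalog_id: str) -> str:
    """Guess whether an id names a reusable component or a benchmark recipe.

    Assumption: component ids carry the "builtin/" prefix seen on this page;
    anything else is treated as a benchmark id.
    """
    return "component" if catalog_id.startswith("builtin/") else "benchmark"


def load_entry(
    catalog_id: str,
    loader: Optional[Callable[[str], Any]] = None,
) -> Tuple[str, Any]:
    """Return (kind, loaded object) for a catalog id.

    `loader` defaults to themis.catalog.load; passing a stub makes the
    helper usable without the package installed.
    """
    if loader is None:
        # Imported lazily so the helper can be exercised without themis.
        import themis.catalog

        loader = themis.catalog.load
    return catalog_id_kind(catalog_id), loader(catalog_id)
```

With the package installed, `kind, entry = load_entry("mmlu_pro")` would resolve the benchmark recipe, and `load_entry("builtin/choice_letter")` would resolve the parser.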

Reusable component ids

| Name | Kind | Use when | Key constraints / notes |
| --- | --- | --- | --- |
| builtin/choice_letter | Parser | The model should end in an option label such as A or B | Pair with MCQ benchmarks and metrics such as builtin/choice_accuracy |
| builtin/math_answer | Parser | You need short-answer math normalization before scoring | Pairs with builtin/math_equivalence |
| builtin/code_text | Parser | The model emits raw or fenced code that should be scored as source text | Common in code-generation benchmarks |
| builtin/choice_accuracy | Metric | You want deterministic correctness for parsed MCQ outputs | Expects parsed option labels rather than long free-form answers |
| builtin/math_equivalence | Metric | You want symbolic or normalized math equivalence instead of string equality | Best for AIME-style numeric and short-answer math |
| builtin/procbench_final_accuracy | Metric | You want deterministic final-answer checking for procbench-like outputs | Use only when the benchmark recipe is not already using a judge-backed rubric |
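To make the parser and metric pairing concrete, here is a minimal sketch of the idea behind a choice-letter parser feeding a choice-accuracy metric. It is not the shipped implementation of builtin/choice_letter or builtin/choice_accuracy: the function names, the regular expression, and the default label set `ABCD` are all assumptions chosen for illustration.

```python
import re
from typing import Optional, Sequence


def parse_choice_letter(text: str, labels: str = "ABCD") -> Optional[str]:
    """Return the last standalone option label in `text`, or None.

    Naive on purpose: any isolated capital letter from `labels` counts,
    so an article such as "A" earlier in the text is ignored only
    because a later label wins.
    """
    matches = re.findall(rf"\b([{re.escape(labels)}])\b", text)
    return matches[-1] if matches else None


def choice_accuracy(
    parsed: Sequence[Optional[str]], expected: Sequence[str]
) -> float:
    """Fraction of parsed labels that equal the expected label."""
    if len(parsed) != len(expected):
        raise ValueError("parsed and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(parsed, expected))
    return correct / len(expected)


outputs = ["The answer is (B).", "I would pick C", "no idea"]
parsed = [parse_choice_letter(o) for o in outputs]  # ["B", "C", None]
score = choice_accuracy(parsed, ["B", "D", "A"])  # one of three is correct
```

The metric only ever sees short parsed labels, which is why the table pairs a label parser with the accuracy metric instead of scoring free-form answers directly.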

Named benchmark entries

| Benchmark | Shape | Parser / Metric | Variants | Support tier | Notes |
| --- | --- | --- | --- | --- | --- |
| aime_2025 | Math short-answer | builtin/math_answer + builtin/math_equivalence | None | ready | Install themis-eval[datasets] when materializing from Hugging Face |
| aime_2026 | Math short-answer | builtin/math_answer + builtin/math_equivalence | None | ready | Install themis-eval[datasets] when materializing from Hugging Face |
| aethercode | Code generation | builtin/code_text + builtin/aethercode_pass_rate | None | ready | Requires piston or sandbox_fusion plus dataset access |
| apex_2025 | Math short-answer | builtin/math_answer + builtin/math_equivalence | None | ready | Install themis-eval[datasets] when materializing from Hugging Face |
| babe | Multiple choice | builtin/choice_letter + builtin/choice_accuracy | None | ready | Dataset access only |
| beyond_aime | Math short-answer | builtin/math_answer + builtin/math_equivalence | None | ready | Dataset access only |
| codeforces | Code generation | builtin/code_text + builtin/codeforces_pass_rate | None | ready | Requires piston or sandbox_fusion plus dataset access |
| encyclo_k | Multiple choice | builtin/choice_letter + builtin/choice_accuracy | None | ready | Dataset access only |
| frontierscience | Judge-backed QA | builtin/json_identity + builtin/llm_rubric | None | ready | Use a real judge model for non-demo scoring |
| gpqa_diamond | Multiple choice | builtin/choice_letter + builtin/choice_accuracy | None | ready | Dataset access only |
| healthbench | Judge-backed QA | builtin/json_identity + builtin/llm_rubric | None | ready | Use a real judge model for non-demo scoring |
| hle | Judge-backed expert QA | builtin/json_identity + builtin/panel_of_judges | Recipe-defined | ready | Check the recipe for supported domain variants before choosing one |
| hmmt_feb_2025 | Math short-answer | builtin/math_answer + builtin/math_equivalence | None | ready | Dataset access only |
| hmmt_nov_2025 | Math short-answer | builtin/math_answer + builtin/math_equivalence | None | ready | Dataset access only |
| humaneval_plus | Code generation | builtin/code_text + builtin/humaneval_pass_rate | None | ready | Requires piston or sandbox_fusion; uses the current upstream default split |
| imo_answerbench | Math short-answer | builtin/math_answer + builtin/math_equivalence | None | ready | Dataset access only |
| livecodebench | Code generation | builtin/code_text + builtin/livecodebench_pass_rate | None | ready | Targets LiveCodeBench release_v6. Requires piston or sandbox_fusion plus dataset access |
| lpfqa | Judge-backed QA | builtin/json_identity + builtin/llm_rubric | None | ready | Use a real judge model for non-demo scoring |
| mmlu_pro | Multiple choice | builtin/choice_letter + builtin/choice_accuracy | None | ready | Good default catalog benchmark for MCQ smoke checks |
| mmmlu | Multiple choice | builtin/choice_letter + builtin/choice_accuracy | Recipe-defined | ready | Inspect the recipe for supported language or config variants |
| phybench | Math short-answer | builtin/math_answer + builtin/math_equivalence | None | ready | Dataset access only |
| procbench | Procedural QA | builtin/text + builtin/llm_rubric | Recipe-defined | ready | Check task-specific variants before choosing a slice |
| rolebench | Role-following judged QA | builtin/json_identity + builtin/llm_rubric | instruction_generalization_eng, role_generalization_eng | ready | Use a real judge model for non-demo scoring |
| simpleqa_verified | Judge-backed QA | builtin/json_identity + builtin/panel_of_judges | None | ready | Uses a panel-style judge workflow rather than single-rubric scoring |
| superchem | Multiple choice | builtin/choice_letter + builtin/choice_accuracy | en, zh | ready | Choose the language variant before materializing the dataset |
| supergpqa | Multiple choice | builtin/choice_letter + builtin/choice_accuracy | None | ready | Dataset access only |

Benchmark recipes now materialize real benchmark datasets instead of a synthetic placeholder case at run time. Check the benchmark manifest and Benchmark adapters for adapter-specific execution requirements such as code execution backends or dataset variants.
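As a rough planning aid, the execution requirements mentioned above can be inferred from the Parser / Metric column. The sketch below transcribes a few rows from the benchmark table into a plain dictionary and applies two heuristics that are assumptions, not catalog behavior: metric ids ending in `_pass_rate` need a code execution backend, and the `builtin/llm_rubric` and `builtin/panel_of_judges` metrics need a judge model. `RECIPES`, `JUDGE_METRICS`, and `extra_requirements` are hypothetical names.

```python
from typing import Dict, List, Tuple

# A few rows transcribed from the benchmark table:
# benchmark id -> (parser id, metric id).
RECIPES: Dict[str, Tuple[str, str]] = {
    "mmlu_pro": ("builtin/choice_letter", "builtin/choice_accuracy"),
    "aime_2025": ("builtin/math_answer", "builtin/math_equivalence"),
    "humaneval_plus": ("builtin/code_text", "builtin/humaneval_pass_rate"),
    "healthbench": ("builtin/json_identity", "builtin/llm_rubric"),
    "hle": ("builtin/json_identity", "builtin/panel_of_judges"),
}

# Assumption: these two metrics are the judge-backed ones on this page.
JUDGE_METRICS = {"builtin/llm_rubric", "builtin/panel_of_judges"}


def extra_requirements(benchmark_id: str) -> List[str]:
    """List requirements beyond dataset access, inferred from the metric id."""
    _parser, metric = RECIPES[benchmark_id]
    needs: List[str] = []
    if metric.endswith("_pass_rate"):
        # Heuristic: pass-rate metrics execute generated code.
        needs.append("code execution backend")
    if metric in JUDGE_METRICS:
        needs.append("judge model")
    return needs
```

For example, `extra_requirements("humaneval_plus")` yields a code execution backend requirement, while `extra_requirements("mmlu_pro")` yields an empty list. The benchmark manifest remains the authoritative source for what each recipe actually needs.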