Benchmark catalog¶

Catalog surfaces¶

Surface	Kind	Use when	Key constraints / notes
Reusable catalog components	Loadable parsers, metrics, reducers, selectors, generators, and judge workflows	You want a shipped building block such as `builtin/choice_letter` without going through a named benchmark	Load with `themis.catalog.load(...)` or list with `themis.catalog.list_component_ids(...)`
Benchmark recipes	Named dataset-backed benchmark definitions	You want a catalog entry such as `mmlu_pro` to materialize a real dataset and wire the right parser and metric stack	Recipes stay cheap to inspect until `materialize_dataset(...)` or `themis.catalog.run(...)` is called

Python entry points¶

Entry point	Kind	Use when	Notes
`themis.catalog.list_component_ids(...)`	Discovery helper	You want to see the reusable shipped component ids before deciding what to load	Returns component ids only; benchmark discovery still starts from the benchmark manifest docs
`themis.catalog.list_benchmark_ids(...)`	Discovery helper	You want the canonical shipped benchmark ids without inspecting manifests	Returns benchmark ids only
`themis.catalog.list_benchmarks(...)`	Metadata listing	You want structured benchmark metadata such as support tier, variants, and version notes	Best source for docs, CLIs, and validation layers
`themis.catalog.load(...)`	Resolver	You want to inspect a reusable component or a `BenchmarkDefinition` before running anything	Use `load("builtin/choice_letter")` for a parser or `load("mmlu_pro")` for a benchmark recipe
`themis.catalog.run(...)`	Convenience executor	You want the catalog to materialize the dataset and run the benchmark in one call	Best for benchmark execution; for custom slicing, load first and build your own `Dataset`
`themis.catalog.validate_benchmark(...)`	Validation helper	You want to confirm a shipped benchmark loads, materializes, and is ready for score smoke checks	Ready code-execution benchmarks run a score smoke check; experimental ones report a skipped score smoke check

Use themis.catalog.load("builtin/choice_letter") when you want a reusable parser directly. Use themis.catalog.load("mmlu_pro") when you want to inspect a benchmark definition first, including materialize_dataset(...). Use themis.catalog.run("mmlu_pro", model=..., store=...) when you want catalog convenience without going through the CLI. Use themis.catalog.list_benchmark_ids(...) or themis.catalog.list_benchmarks(...) when you need benchmark discovery or catalog metadata instead of component discovery.

Reusable component ids¶

Name	Kind	Use when	Key constraints / notes
`builtin/choice_letter`	Parser	The model should end in an option label such as `A` or `B`	Pair with MCQ benchmarks and metrics such as `builtin/choice_accuracy`
`builtin/math_answer`	Parser	You need short-answer math normalization before scoring	Pairs with `builtin/math_equivalence`
`builtin/code_text`	Parser	The model emits raw or fenced code that should be scored as source text	Common in code-generation benchmarks
`builtin/choice_accuracy`	Metric	You want deterministic correctness for parsed MCQ outputs	Expects parsed option labels rather than long free-form answers
`builtin/math_equivalence`	Metric	You want symbolic or normalized math equivalence instead of string equality	Best for AIME-style numeric and short-answer math
`builtin/procbench_final_accuracy`	Metric	You want deterministic final-answer checking for procbench-like outputs	Use only when the benchmark recipe is not already using a judge-backed rubric

Named benchmark entries¶

Benchmark	Shape	Parser / Metric	Variants	Support tier	Notes
`aime_2025`	Math short-answer	`builtin/math_answer` + `builtin/math_equivalence`	None	ready	Install `themis-eval[datasets]` when materializing from Hugging Face
`aime_2026`	Math short-answer	`builtin/math_answer` + `builtin/math_equivalence`	None	ready	Install `themis-eval[datasets]` when materializing from Hugging Face
`aethercode`	Code generation	`builtin/code_text` + `builtin/aethercode_pass_rate`	None	ready	Requires `piston` or `sandbox_fusion` plus dataset access
`apex_2025`	Math short-answer	`builtin/math_answer` + `builtin/math_equivalence`	None	ready	Install `themis-eval[datasets]` when materializing from Hugging Face
`babe`	Multiple choice	`builtin/choice_letter` + `builtin/choice_accuracy`	None	ready	Dataset access only
`beyond_aime`	Math short-answer	`builtin/math_answer` + `builtin/math_equivalence`	None	ready	Dataset access only
`codeforces`	Code generation	`builtin/code_text` + `builtin/codeforces_pass_rate`	None	ready	Requires `piston` or `sandbox_fusion` plus dataset access
`encyclo_k`	Multiple choice	`builtin/choice_letter` + `builtin/choice_accuracy`	None	ready	Dataset access only
`frontierscience`	Judge-backed QA	`builtin/json_identity` + `builtin/llm_rubric`	None	ready	Use a real judge model for non-demo scoring
`gpqa_diamond`	Multiple choice	`builtin/choice_letter` + `builtin/choice_accuracy`	None	ready	Dataset access only
`healthbench`	Judge-backed QA	`builtin/json_identity` + `builtin/llm_rubric`	None	ready	Use a real judge model for non-demo scoring
`hle`	Judge-backed expert QA	`builtin/json_identity` + `builtin/panel_of_judges`	Recipe-defined	ready	Check the recipe for supported domain variants before choosing one
`hmmt_feb_2025`	Math short-answer	`builtin/math_answer` + `builtin/math_equivalence`	None	ready	Dataset access only
`hmmt_nov_2025`	Math short-answer	`builtin/math_answer` + `builtin/math_equivalence`	None	ready	Dataset access only
`humaneval_plus`	Code generation	`builtin/code_text` + `builtin/humaneval_pass_rate`	None	ready	Requires `piston` or `sandbox_fusion`; uses the current upstream default split
`imo_answerbench`	Math short-answer	`builtin/math_answer` + `builtin/math_equivalence`	None	ready	Dataset access only
`livecodebench`	Code generation	`builtin/code_text` + `builtin/livecodebench_pass_rate`	None	ready	Targets LiveCodeBench release_v6. Requires `piston` or `sandbox_fusion` plus dataset access
`lpfqa`	Judge-backed QA	`builtin/json_identity` + `builtin/llm_rubric`	None	ready	Use a real judge model for non-demo scoring
`mmlu_pro`	Multiple choice	`builtin/choice_letter` + `builtin/choice_accuracy`	None	ready	Good default catalog benchmark for MCQ smoke checks
`mmmlu`	Multiple choice	`builtin/choice_letter` + `builtin/choice_accuracy`	Recipe-defined	ready	Inspect the recipe for supported language or config variants
`phybench`	Math short-answer	`builtin/math_answer` + `builtin/math_equivalence`	None	ready	Dataset access only
`procbench`	Procedural QA	`builtin/text` + `builtin/llm_rubric`	Recipe-defined	ready	Check task-specific variants before choosing a slice
`rolebench`	Role-following judged QA	`builtin/json_identity` + `builtin/llm_rubric`	`instruction_generalization_eng`, `role_generalization_eng`	ready	Use a real judge model for non-demo scoring
`simpleqa_verified`	Judge-backed QA	`builtin/json_identity` + `builtin/panel_of_judges`	None	ready	Uses a panel-style judge workflow rather than single-rubric scoring
`superchem`	Multiple choice	`builtin/choice_letter` + `builtin/choice_accuracy`	`en`, `zh`	ready	Choose the language variant before materializing the dataset
`supergpqa`	Multiple choice	`builtin/choice_letter` + `builtin/choice_accuracy`	None	ready	Dataset access only

Benchmark recipes now materialize real benchmark datasets instead of a synthetic placeholder case at run time. Check the benchmark manifest and Benchmark adapters for adapter-specific execution requirements such as code execution backends or dataset variants.