Builtins and adapters

The catalog now exposes reusable shipped components directly through themis.catalog.load(...) and themis.catalog.list_component_ids(...).

Builtin component ids

| Name | Kind | Use when | Key constraints / notes |
| --- | --- | --- | --- |
| builtin/demo_generator | Generator | You need deterministic local output for tutorials, smoke tests, or fixture-backed examples | Not a real provider integration |
| builtin/majority_vote | Reducer | Multiple samples should collapse to the most common answer | Works best when outputs normalize to the same representation |
| builtin/best_of_n | Selector | A judge should choose the strongest candidate before reduction | Requires a judge-backed flow |
| builtin/json_identity | Parser | The model already emits structured JSON in the shape you want to score | Minimal normalization |
| builtin/text | Parser | You want the reduced output treated as plain text | Useful for rubric-style scoring or simple text metrics |
| builtin/choice_letter | Parser | The answer should resolve to a discrete option label | Best for MCQ benchmarks |
| builtin/math_answer | Parser | Math answers need normalization before deterministic scoring | Pairs with builtin/math_equivalence |
| builtin/code_text | Parser | The output is code, including fenced code blocks | Used by code-generation benchmarks and reusable execution metrics |
| builtin/exact_match | Metric | Parsed output should match the expected value exactly | Good default for deterministic tasks with stable output format |
| builtin/f1 | Metric | Token overlap is a better fit than exact string equality | Still deterministic; no judge model required |
| builtin/bleu | Metric | You need surface-form overlap for longer text outputs | Better for rough similarity than strict correctness |
| builtin/choice_accuracy | Metric | Parsed option labels should score as correct or incorrect | Expects parser output compatible with multiple choice |
| builtin/math_equivalence | Metric | Equivalent math expressions or normalized answers should count as correct | Best for math families such as AIME or HMMT |
| builtin/procbench_final_accuracy | Metric | You want deterministic final-answer checking for procedure-style tasks | Only use when the recipe is not already judge-backed |
| builtin/codeforces_pass_rate | Metric | You need Codeforces-style code execution scoring | Requires an execution backend such as piston or sandbox_fusion |
| builtin/aethercode_pass_rate | Metric | You need AetherCode-specific execution scoring | Requires an execution backend such as piston or sandbox_fusion |
| builtin/livecodebench_pass_rate | Metric | You need LiveCodeBench-style execution scoring | Requires an execution backend such as piston or sandbox_fusion |
| builtin/humaneval_pass_rate | Metric | You need HumanEval-style function execution scoring against a reference solution | Requires a Python-capable execution backend such as piston or sandbox_fusion |
| builtin/demo_judge | Judge model | You need a deterministic local judge for examples and tests | Replace with a real judge model for meaningful evaluation |
| builtin/llm_rubric | Workflow metric | One judge should score against a rubric | Requires judge models plus optional rubric overrides |
| builtin/pairwise_judge | Workflow metric | Two candidates should be compared head-to-head | Useful for selection or pairwise preference evaluation |
| builtin/panel_of_judges | Workflow metric | Multiple judges should score the same output and aggregate | Higher cost than a single-judge rubric |
| builtin/majority_vote_judge | Workflow metric | Several judge votes should collapse to a majority decision | Useful when categorical consensus matters more than scalar averaging |
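
To make the reducer, parser, and metric roles in the table above more concrete, here is a minimal standalone sketch of the pattern they describe: collapse several samples to the most common answer, resolve it to an option label, and score it as correct or incorrect. It does not use Themis and is not the shipped implementation; the function names majority_vote, parse_choice_letter, and choice_accuracy are hypothetical, and the reduce-then-parse order is an assumption inferred from the table notes.

```python
import re
from collections import Counter


def majority_vote(samples):
    """Return the most common sample after trimming whitespace and case-folding.

    Normalizing first matters: "B" and " b " only count as the same vote
    once they share a representation.
    """
    normalized = [s.strip().casefold() for s in samples]
    # most_common keeps first-seen order among equal counts,
    # so a tie resolves to the answer that appeared first.
    return Counter(normalized).most_common(1)[0][0]


def parse_choice_letter(text, letters="ABCD"):
    """Return the first standalone option letter in the text, or None."""
    match = re.search(rf"\b([{letters}])\b", text.upper())
    return match.group(1) if match else None


def choice_accuracy(predicted, expected):
    """Score 1.0 for a matching label and 0.0 otherwise (including no label)."""
    return 1.0 if predicted is not None and predicted == expected else 0.0


samples = ["B", " b ", "C", "b"]      # hypothetical model outputs
reduced = majority_vote(samples)      # reducer step
label = parse_choice_letter(reduced)  # parser step
score = choice_accuracy(label, "B")   # metric step
print(reduced, label, score)
```

The regex parser here is deliberately naive: a standalone article such as "a" in free text would be read as option A, which is one reason a purpose-built parser is preferable for real MCQ benchmarks.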

Adapter families

| Name | Kind | Use when | Key constraints / notes |
| --- | --- | --- | --- |
| OpenAI Responses API | Provider adapter | Themis should own evaluation and storage, while an OpenAI-compatible endpoint handles generation | Install the openai extra or inject a compatible client |
| vLLM OpenAI-compatible APIs | Provider adapter | You run a local or self-hosted OpenAI-compatible model endpoint | Install the vllm extra on Linux or inject a compatible client |
| LangGraph graphs | Graph adapter | A LangGraph workflow already exists and should act as the generator | Pass a graph with invoke() or ainvoke(); trace capture improves when astream_events() exists |
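
The LangGraph row describes a duck-typed contract: the graph needs invoke() or ainvoke(), and astream_events() is an optional extra that improves trace capture. The sketch below shows one way to check that contract before handing a graph to an adapter. It is a standalone illustration, not part of Themis; graph_capabilities, check_graph, and EchoGraph are hypothetical names introduced here.

```python
def graph_capabilities(graph):
    """Report which of the three method names the object exposes as callables."""
    names = ("invoke", "ainvoke", "astream_events")
    return {name: callable(getattr(graph, name, None)) for name in names}


def check_graph(graph):
    """Raise if the object offers neither invoke() nor ainvoke()."""
    caps = graph_capabilities(graph)
    if not (caps["invoke"] or caps["ainvoke"]):
        raise TypeError("graph must provide invoke() or ainvoke()")
    return caps


class EchoGraph:
    """Stand-in for a compiled graph; it only implements invoke()."""

    def invoke(self, state):
        return {"output": state["input"]}


caps = check_graph(EchoGraph())
print(caps)
```

A graph that passes this check but lacks astream_events() would still be usable as a generator under the table's description, just with less detailed traces.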

Use builtin ids for deterministic examples, smoke tests, common scoring patterns, and benchmark-family reuse. Use adapters when generation should be delegated to an external provider or graph runtime.