Builtins and adapters¶
The catalog now exposes reusable shipped components directly through
themis.catalog.load(...) and themis.catalog.list_component_ids(...).
Builtin component ids¶
| Name | Kind | Use when | Key constraints / notes |
|---|---|---|---|
builtin/demo_generator |
Generator | You need deterministic local output for tutorials, smoke tests, or fixture-backed examples | Not a real provider integration |
builtin/majority_vote |
Reducer | Multiple samples should collapse to the most common answer | Works best when outputs normalize to the same representation |
builtin/best_of_n |
Selector | A judge should choose the strongest candidate before reduction | Requires a judge-backed flow |
builtin/json_identity |
Parser | The model already emits structured JSON in the shape you want to score | Minimal normalization |
builtin/text |
Parser | You want the reduced output treated as plain text | Useful for rubric-style scoring or simple text metrics |
builtin/choice_letter |
Parser | The answer should resolve to a discrete option label | Best for MCQ benchmarks |
builtin/math_answer |
Parser | Math answers need normalization before deterministic scoring | Pairs with builtin/math_equivalence |
builtin/code_text |
Parser | The output is code, including fenced code blocks | Used by code-generation benchmarks and reusable execution metrics |
builtin/exact_match |
Metric | Parsed output should match the expected value exactly | Good default for deterministic tasks with stable output format |
builtin/f1 |
Metric | Token overlap is a better fit than exact string equality | Still deterministic; no judge model required |
builtin/bleu |
Metric | You need surface-form overlap for longer text outputs | Better for rough similarity than strict correctness |
builtin/choice_accuracy |
Metric | Parsed option labels should score as correct or incorrect | Expects parser output compatible with multiple choice |
builtin/math_equivalence |
Metric | Equivalent math expressions or normalized answers should count as correct | Best for math families such as AIME or HMMT |
builtin/procbench_final_accuracy |
Metric | You want deterministic final-answer checking for procedure-style tasks | Only use when the recipe is not already judge-backed |
builtin/codeforces_pass_rate |
Metric | You need Codeforces-style code execution scoring | Requires an execution backend such as piston or sandbox_fusion |
builtin/aethercode_pass_rate |
Metric | You need AetherCode-specific execution scoring | Requires an execution backend such as piston or sandbox_fusion |
builtin/livecodebench_pass_rate |
Metric | You need LiveCodeBench-style execution scoring | Requires an execution backend such as piston or sandbox_fusion |
builtin/humaneval_pass_rate |
Metric | You need HumanEval-style function execution scoring against a reference solution | Requires a Python-capable execution backend such as piston or sandbox_fusion |
builtin/demo_judge |
Judge model | You need a deterministic local judge for examples and tests | Replace with a real judge model for meaningful evaluation |
builtin/llm_rubric |
Workflow metric | One judge should score against a rubric | Requires judge models plus optional rubric overrides |
builtin/pairwise_judge |
Workflow metric | Two candidates should be compared head-to-head | Useful for selection or pairwise preference evaluation |
builtin/panel_of_judges |
Workflow metric | Multiple judges should score the same output and aggregate | Higher cost than a single-judge rubric |
builtin/majority_vote_judge |
Workflow metric | Several judge votes should collapse to a majority decision | Useful when categorical consensus matters more than scalar averaging |
Adapter families¶
| Name | Kind | Use when | Key constraints / notes |
|---|---|---|---|
| OpenAI Responses API | Provider adapter | Themis should own evaluation and storage, while an OpenAI-compatible endpoint handles generation | Install the openai extra or inject a compatible client |
| vLLM OpenAI-compatible APIs | Provider adapter | You run a local or self-hosted OpenAI-compatible model endpoint | Install the vllm extra on Linux or inject a compatible client |
| LangGraph graphs | Graph adapter | A LangGraph workflow already exists and should act as the generator | Pass a graph with invoke() or ainvoke(); trace capture improves when astream_events() exists |
Use builtin ids for deterministic examples, smoke tests, common scoring patterns, and benchmark-family reuse. Use adapters when generation should be delegated to an external provider or graph runtime.