Reproduce and rejudge runs¶
Goal: export/import run artifacts and replay downstream evaluation stages from stored upstream data.
When to use this:
Use this guide when generation should stay fixed but evaluation needs to move stores or be rerun from a downstream stage.
Procedure¶
Use this sequence when you need to move evidence or rerun workflow-backed evaluation without regenerating candidates.
sequenceDiagram
participant S as Source store
participant B as Bundle files
participant T as Target store
participant E as "Experiment.replay(stage='judge')"
S->>B: export generation/evaluation bundle
B->>T: import bundle
T->>E: reopen stored upstream artifacts
E-->>T: write new evaluation executions
The crucial boundary is that upstream artifacts stay fixed while downstream work moves or reruns.
Stage handoff boundaries:
- CLI-visible bundle export currently covers
generationandevaluation - reduction, parse, and score bundle handoff is Python-only today through
export_reduction_bundle(...),export_parse_bundle(...),export_score_bundle(...), and their matching import helpers - imported artifacts are normalized back into standard event history, so
resume,report,compare, and cache reuse see the imported data exactly like locally produced data
from __future__ import annotations
from themis import (
Experiment,
InMemoryRunStore,
export_evaluation_bundle,
export_generation_bundle,
import_evaluation_bundle,
import_generation_bundle,
)
from themis.core.config import EvaluationConfig, GenerationConfig, StorageConfig
from themis.core.models import Case, Dataset
def run_example() -> dict[str, object]:
"""Export bundle artifacts, import them into another store, and replay judge scoring in place."""
source_store = InMemoryRunStore()
target_store = InMemoryRunStore()
source_store.initialize()
target_store.initialize()
experiment = Experiment(
generation=GenerationConfig(
generator="builtin/demo_generator",
candidate_policy={"num_samples": 1},
reducer="builtin/majority_vote",
),
evaluation=EvaluationConfig(
metrics=["builtin/llm_rubric"],
parsers=["builtin/json_identity"],
judge_models=["builtin/demo_judge", "builtin/demo_judge"],
workflow_overrides={"rubric": "pass if the answer is correct"},
),
storage=StorageConfig(store="memory"),
datasets=[
Dataset(
dataset_id="sample",
cases=[
Case(
case_id="case-1",
input={"question": "2+2"},
expected_output={"answer": "4"},
)
],
)
],
seeds=[7],
)
initial = experiment.run(store=source_store)
import_generation_bundle(
target_store, export_generation_bundle(source_store, initial.run_id)
)
import_evaluation_bundle(
target_store, export_evaluation_bundle(source_store, initial.run_id)
)
replayed = experiment.replay(stage="judge", store=source_store)
return {
"run_id": initial.run_id,
"replayed_run_id": replayed.run_id,
"imported": target_store.resume(initial.run_id) is not None,
}
if __name__ == "__main__":
print(run_example())
Reproduce from stored artifacts when generation should remain fixed. Replay only the downstream stages that actually need rerunning.
Variants¶
| Variant | Best when | Tradeoff | Related APIs / commands |
|---|---|---|---|
| Portable generation artifacts only | Candidate outputs should move to another store or environment before any new evaluation work | Downstream stages still need to run later | export_generation_bundle(...), import_generation_bundle(...) |
| Portable reduction, parse, or score artifacts | Intermediate stages, not just generation or judging, must move across environments | Python-only today, so less convenient than CLI export | export_reduction_bundle(...), export_parse_bundle(...), export_score_bundle(...) |
| Portable evaluation artifacts too | Judge executions should move with the run | More artifact management than an in-place replay | export_evaluation_bundle(...), import_evaluation_bundle(...) |
| Rerun workflow-backed metrics in place | Generation stays fixed and only judge outputs should change | Requires stored upstream artifacts and judge access | Experiment.replay(stage="judge") |
| Rerun pure scoring from parsed outputs | Parsing is fixed and deterministic scoring should be recomputed | Only useful when upstream parsing is already good | Experiment.replay(stage="score") |
| Stop a run intentionally at a boundary first | You know ahead of time that generation or parsing should stop early for handoff | Requires a second step to continue later | Experiment.run(..., until_stage=...) |
Expected result¶
You should be able to move artifacts between stores and replay downstream stages without regenerating candidates.