Reproduce and rejudge runs

Goal: export/import run artifacts and replay downstream evaluation stages from stored upstream data.

When to use this:

Use this guide when generation should stay fixed but evaluation needs to move stores or be rerun from a downstream stage.

Procedure

Use this sequence when you need to move evidence or rerun workflow-backed evaluation without regenerating candidates.

sequenceDiagram
    participant S as Source store
    participant B as Bundle files
    participant T as Target store
    participant E as "Experiment.replay(stage='judge')"
    S->>B: export generation/evaluation bundle
    B->>T: import bundle
    T->>E: reopen stored upstream artifacts
    E-->>T: write new evaluation executions

The crucial boundary is that upstream artifacts stay fixed while downstream work moves or reruns.

Stage handoff boundaries:

CLI-visible bundle export currently covers generation and evaluation only.
Reduction, parse, and score bundle handoff is Python-only today, through export_reduction_bundle(...), export_parse_bundle(...), export_score_bundle(...), and their matching import helpers.
Imported artifacts are normalized back into standard event history, so resume, report, compare, and cache reuse treat imported data exactly like locally produced data.
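The handoff has the same shape for every stage: export a bundle from the source store, then import it into the target store. The following is a minimal sketch of that loop, not part of the library. It assumes the Python-only helpers accept the same `(store, run_id)` and `(store, bundle)` arguments as the generation and evaluation helpers used on this page, which this guide does not confirm. `hand_off`, `BundlePair`, and the parameter names are hypothetical.

```python
from __future__ import annotations

from typing import Any, Callable, Iterable

# An (export, import) helper pair for one stage.
# Assumed call shapes: export(store, run_id) -> bundle, import_(store, bundle).
BundlePair = tuple[Callable[[Any, str], Any], Callable[[Any, Any], Any]]


def hand_off(
    source_store: Any,
    target_store: Any,
    run_id: str,
    pairs: Iterable[BundlePair],
) -> int:
    """Copy each stage bundle for run_id from source_store to target_store.

    Pairs are processed in the order given; returns how many bundles moved.
    """
    moved = 0
    for export_bundle, import_bundle in pairs:
        bundle = export_bundle(source_store, run_id)
        import_bundle(target_store, bundle)
        moved += 1
    return moved
```

With the real helpers, each pair would be something like `(export_parse_bundle, <matching import helper>)`. Passing the helpers in as arguments keeps the sketch independent of their exact names.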

from __future__ import annotations

from themis import (
    Experiment,
    InMemoryRunStore,
    export_evaluation_bundle,
    export_generation_bundle,
    import_evaluation_bundle,
    import_generation_bundle,
)
from themis.core.config import EvaluationConfig, GenerationConfig, StorageConfig
from themis.core.models import Case, Dataset


def run_example() -> dict[str, object]:
    """Export bundle artifacts, import them into another store, and replay judge scoring in place."""

    source_store = InMemoryRunStore()
    target_store = InMemoryRunStore()
    source_store.initialize()
    target_store.initialize()

    experiment = Experiment(
        generation=GenerationConfig(
            generator="builtin/demo_generator",
            candidate_policy={"num_samples": 1},
            reducer="builtin/majority_vote",
        ),
        evaluation=EvaluationConfig(
            metrics=["builtin/llm_rubric"],
            parsers=["builtin/json_identity"],
            judge_models=["builtin/demo_judge", "builtin/demo_judge"],
            workflow_overrides={"rubric": "pass if the answer is correct"},
        ),
        storage=StorageConfig(store="memory"),
        datasets=[
            Dataset(
                dataset_id="sample",
                cases=[
                    Case(
                        case_id="case-1",
                        input={"question": "2+2"},
                        expected_output={"answer": "4"},
                    )
                ],
            )
        ],
        seeds=[7],
    )
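    # Run generation and evaluation once against the source store.
    initial = experiment.run(store=source_store)
    # Copy the generation and evaluation bundles into the target store.
    import_generation_bundle(
        target_store, export_generation_bundle(source_store, initial.run_id)
    )
    import_evaluation_bundle(
        target_store, export_evaluation_bundle(source_store, initial.run_id)
    )
    # Replay only the judge stage in place on the source store; generation is not rerun.
    replayed = experiment.replay(stage="judge", store=source_store)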
    return {
        "run_id": initial.run_id,
        "replayed_run_id": replayed.run_id,
        "imported": target_store.resume(initial.run_id) is not None,
    }


if __name__ == "__main__":
    print(run_example())

Reproduce from stored artifacts when generation should remain fixed. Replay only the downstream stages that actually need rerunning.

Variants

Variant	Best when	Tradeoff	Related APIs / commands
Portable generation artifacts only	Candidate outputs should move to another store or environment before any new evaluation work	Downstream stages still need to run later	`export_generation_bundle(...)`, `import_generation_bundle(...)`
Portable reduction, parse, or score artifacts	Intermediate stages, not just generation or judging, must move across environments	Python-only today, so less convenient than CLI export	`export_reduction_bundle(...)`, `export_parse_bundle(...)`, `export_score_bundle(...)`
Portable evaluation artifacts too	Judge executions should move with the run	More artifact management than an in-place replay	`export_evaluation_bundle(...)`, `import_evaluation_bundle(...)`
Rerun workflow-backed metrics in place	Generation stays fixed and only judge outputs should change	Requires stored upstream artifacts and judge access	`Experiment.replay(stage="judge")`
Rerun pure scoring from parsed outputs	Parsing is fixed and deterministic scoring should be recomputed	Only useful when upstream parsing is already good	`Experiment.replay(stage="score")`
Stop a run intentionally at a boundary first	You know ahead of time that generation or parsing should stop early for handoff	Requires a second step to continue later	`Experiment.run(..., until_stage=...)`
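The two replay rows in the table can be read as a lookup from "what changed" to the stage to replay. The sketch below illustrates that choice under stated assumptions: the labels `judge_outputs` and `deterministic_scoring` and the helper `replay_for_change` are hypothetical, and only the `replay(stage=..., store=...)` call shape is taken from the example on this page.

```python
from __future__ import annotations

from typing import Any, Protocol


class SupportsReplay(Protocol):
    """Anything with a replay(stage=..., store=...) method, such as Experiment."""

    def replay(self, *, stage: str, store: Any) -> Any: ...


# Hypothetical labels for "what changed", mapped to the replay stages in the table.
STAGE_FOR_CHANGE = {
    "judge_outputs": "judge",          # workflow-backed metrics need new judge runs
    "deterministic_scoring": "score",  # pure scoring recomputed from parsed outputs
}


def replay_for_change(experiment: SupportsReplay, store: Any, change: str) -> Any:
    """Replay the stage mapped to `change`; raise ValueError for unknown labels."""
    try:
        stage = STAGE_FOR_CHANGE[change]
    except KeyError:
        raise ValueError(f"no replay stage known for change {change!r}") from None
    return experiment.replay(stage=stage, store=store)
```

Rejecting unknown labels up front avoids replaying a stage that does not correspond to any row in the table.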

Expected result

You should be able to move artifacts between stores and replay downstream stages without regenerating candidates.
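One way to confirm this outcome is to inspect the dictionary returned by `run_example()` above. The checker below is a hedged sketch: `check_handoff` is a hypothetical helper, and it relies only on the three keys that `run_example()` returns. It does not compare `run_id` with `replayed_run_id`, because this guide does not state whether a replay keeps the original run id.

```python
from __future__ import annotations

from typing import Mapping


def check_handoff(result: Mapping[str, object]) -> list[str]:
    """Return the problems found in a run_example()-style result; empty means OK."""
    problems: list[str] = []
    # All three keys produced by run_example() should be present.
    for key in ("run_id", "replayed_run_id", "imported"):
        if key not in result:
            problems.append(f"missing key: {key}")
    # "imported" is True only when the target store could resume the run.
    if result.get("imported") is not True:
        problems.append("target store could not resume the imported run")
    # A replay that produced no run id did not complete as expected.
    if not result.get("replayed_run_id"):
        problems.append("replay did not return a run id")
    return problems
```

For example, `check_handoff(run_example())` should return an empty list when the import and the replay both succeeded.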
