Example Catalog¶
All numbered examples use the benchmark-first public surface.
| Example | Focus |
|---|---|
01_hello_world.py |
Smallest benchmark run |
02_project_file.py |
File-backed project policy |
03_custom_extractor_metric.py |
Custom parser plus metric |
04_compare_models.py |
Aggregation and paired comparison |
05_resume_run.py |
Reuse against the same storage root |
06_hooks_and_timeline.py |
Hooks and candidate timelines |
07_judge_metric.py |
Judge-backed metric |
08_external_stage_handoff.py |
External scoring handoff |
09_experiment_evolution.py |
Incremental benchmark evolution |
Intentionally Untouched¶
examples/medical_reasoning_eval remains in the repository as a handoff and
acceptance reference. It was not rewritten to the new API, and it is not part
of the recommended public example path.