Compare and Export Results
Use BenchmarkResult for benchmark-native aggregation and paired comparisons.
Aggregate
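As a rough illustration of what aggregation means here, the sketch below groups per-item scores by model, slice, and metric and takes the mean of each group. It uses plain Python only and is not the BenchmarkResult API. The row layout, the `aggregate` helper, and the score values are all hypothetical, chosen so the means agree with the example output shown under Pair.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-item scores. The row layout is an assumption for
# illustration, not the library's storage format.
rows = [
    {"model_id": "baseline", "slice_id": "qa", "metric_id": "exact_match", "score": 1.0},
    {"model_id": "baseline", "slice_id": "qa", "metric_id": "exact_match", "score": 0.0},
    {"model_id": "baseline", "slice_id": "qa", "metric_id": "exact_match", "score": 1.0},
    {"model_id": "baseline", "slice_id": "qa", "metric_id": "exact_match", "score": 0.0},
    {"model_id": "candidate", "slice_id": "qa", "metric_id": "exact_match", "score": 1.0},
    {"model_id": "candidate", "slice_id": "qa", "metric_id": "exact_match", "score": 1.0},
    {"model_id": "candidate", "slice_id": "qa", "metric_id": "exact_match", "score": 1.0},
    {"model_id": "candidate", "slice_id": "qa", "metric_id": "exact_match", "score": 1.0},
]


def aggregate(rows):
    """Group scores by (model, slice, metric) and report count and mean."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["model_id"], row["slice_id"], row["metric_id"])
        groups[key].append(row["score"])
    return {
        key: {"count": len(scores), "mean": mean(scores)}
        for key, scores in groups.items()
    }


summary = aggregate(rows)
for key, stats in sorted(summary.items()):
    print(key, stats)
```

Keying on the full (model, slice, metric) triple keeps each mean comparable across models, which is what a later paired comparison needs.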
Pair
Example output from examples/04_compare_models.py:
[{'slice_id': 'qa', 'metric_id': 'exact_match', 'baseline_model_id': 'baseline', 'treatment_model_id': 'candidate', 'pair_count': 4, 'baseline_mean': 0.5, 'treatment_mean': 1.0, 'delta_mean': 0.5, 'p_value': 0.5, 'adjusted_p_value': 0.5, 'adjustment_method': <PValueCorrection.NONE: 'none'>, 'ci_lower': 0.0, 'ci_upper': 1.0, 'ci_level': 0.95, 'method': 'bootstrap_BCa_wilcoxon'}]
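To make the fields in that output easier to read, here is a minimal sketch of how `pair_count`, `baseline_mean`, `treatment_mean`, `delta_mean`, and a confidence interval relate to one another. It is an assumption-laden illustration, not the library's implementation: the score lists and the `paired_summary` helper are hypothetical, and the interval is a plain percentile bootstrap rather than the BCa variant named in the `method` field. No p-value is computed.

```python
import random
from statistics import mean

# Hypothetical per-item scores for the same four items, in the same order.
baseline = [1.0, 0.0, 1.0, 0.0]
candidate = [1.0, 1.0, 1.0, 1.0]


def paired_summary(baseline, treatment, n_resamples=2000, ci_level=0.95, seed=0):
    """Summarise paired scores with a percentile bootstrap CI on the mean delta."""
    if len(baseline) != len(treatment):
        raise ValueError("paired comparison needs one score per item for each model")

    # Per-item difference: positive values favour the treatment model.
    deltas = [t - b for b, t in zip(baseline, treatment)]

    # Resample the deltas with replacement and record each resample's mean.
    rng = random.Random(seed)
    boot_means = sorted(
        mean(rng.choices(deltas, k=len(deltas))) for _ in range(n_resamples)
    )

    # Percentile interval: cut off alpha / 2 of the resampled means at each end.
    alpha = 1.0 - ci_level
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))

    return {
        "pair_count": len(deltas),
        "baseline_mean": mean(baseline),
        "treatment_mean": mean(treatment),
        "delta_mean": mean(deltas),
        "ci_lower": boot_means[lower_idx],
        "ci_upper": boot_means[upper_idx],
        "ci_level": ci_level,
    }


result = paired_summary(baseline, candidate)
print(result)
```

The point to take from the sketch is that `delta_mean` is the mean of per-item differences (treatment minus baseline), so with only four pairs the interval stays wide even when the difference in means looks large.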