BixBench-Verified-50


Grading Modes

File	Date	Repeats	Rows

Figure included from results/performance_analysis.png.

Question	Task Group	Direct	MCQ without refusal	MCQ with refusal	Refusal Gap	MCQ Lift

Task Group	Questions	Runs	Direct	MCQ without refusal	MCQ with refusal

Mixed-result questions broken down by file code (MCQ without refusal correct/total).

Rank	Run	Direct	MCQ with refusal	MCQ without refusal

Run python3 scripts/build_pages_data.py whenever the source CSVs or the image change.
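The rebuild step can be guarded by a simple staleness check, so the script runs only when an input is newer than the generated data. This is a minimal sketch, not the project's own tooling: the helper names needs_rebuild and rebuild_if_stale are hypothetical, and so are the input and output locations in the usage comment (CSVs under results/ and an output file docs/data.json). Only scripts/build_pages_data.py and results/performance_analysis.png are taken from this page.

```python
import subprocess
import sys
from pathlib import Path


def needs_rebuild(sources, output):
    """Return True if output is missing or older than any existing source file."""
    output = Path(output)
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    # Sources that do not exist are skipped rather than treated as errors.
    return any(
        Path(s).stat().st_mtime > out_mtime for s in sources if Path(s).exists()
    )


def rebuild_if_stale(sources, output, script="scripts/build_pages_data.py"):
    """Run the build script only when the output is stale; return whether it ran."""
    if not needs_rebuild(sources, output):
        return False
    # Use the current interpreter; check=True raises if the script fails.
    subprocess.run([sys.executable, script], check=True)
    return True


# Example call (paths other than the PNG are assumptions, adjust to the repo):
# inputs = sorted(Path("results").glob("*.csv"))
# inputs.append(Path("results/performance_analysis.png"))
# rebuild_if_stale(inputs, "docs/data.json")
```

Comparing modification times is the same heuristic make uses. It is cheap, but it can miss changes when files are restored with old timestamps, in which case running the script unconditionally is the safer choice.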