ai.bio.xyz

BixBench-Verified-50


Grading Modes

Scientific Reporting

Key Takeaways

    Scoped Source Runs

    File | Date | Repeats | Rows

    Downloads

    Existing Performance Figure

    Source image: results/performance_analysis.png.

    [Figure: performance analysis chart for benchmark runs]

    Overall Accuracy

    Per-File Comparison

    Repeat Ranking (MCQ without refusal)

    Hardest Questions

    Largest Refusal Gaps

    Consistency Mix

    Task Group Accuracy (32 groups)

    Strengths & Weaknesses

    Strengths

    Weaknesses

        All 50 Questions

        Question | Task Group | Direct | MCQ w/o refusal | MCQ w/ refusal | Refusal Gap | MCQ Lift

        All 32 Task Groups

        Task Group | Questions | Runs | Direct | MCQ w/o refusal | MCQ w/ refusal

        MCQ Rescue Detail

        Cross-Run Variability

        Mixed-result questions broken down by file code (MCQ without refusal correct/total).
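        A minimal sketch of how such a breakdown could be computed, assuming "mixed-result" means a question that was answered correctly in some runs and incorrectly in others. The row layout (question, file code, correctness flag) and the function name mixed_result_breakdown are hypothetical and are not taken from the actual source CSVs.

```python
from collections import defaultdict

# Hypothetical row layout: (question_id, file_code, is_correct).
# The real CSV columns are not shown on this page.
rows = [
    ("q1", "A", True),
    ("q1", "A", False),
    ("q1", "B", True),
    ("q2", "A", True),
    ("q2", "B", True),
]


def mixed_result_breakdown(rows):
    """Return {question: {file_code: (correct, total)}} for mixed-result questions."""
    # tally[question][file_code] holds a [correct, total] pair.
    tally = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for question, file_code, is_correct in rows:
        cell = tally[question][file_code]
        cell[0] += int(is_correct)
        cell[1] += 1

    breakdown = {}
    for question, by_file in tally.items():
        correct = sum(c for c, _ in by_file.values())
        total = sum(t for _, t in by_file.values())
        # Keep only questions with at least one correct and one incorrect run.
        if 0 < correct < total:
            breakdown[question] = {
                code: (c, t) for code, (c, t) in sorted(by_file.items())
            }
    return breakdown
```

        With the sample rows above, only q1 is kept, because q2 was answered correctly in every run.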

        Best Repeats

        Rank | Run | Direct | MCQ with refusal | MCQ without refusal

        Rebuild Data

        Run python3 scripts/build_pages_data.py whenever the source CSVs or the image change.
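        One way to avoid forgetting the rebuild is to compare modification times and rerun the script only when an input is newer than the generated data. The sketch below is an assumption-laden illustration: the helper names needs_rebuild and rebuild_if_stale are hypothetical, and the output path must be supplied by the caller because this page does not say where the script writes its data.

```python
import subprocess
import sys
from pathlib import Path


def needs_rebuild(sources, output):
    """Return True if the output is missing or older than any source file."""
    output = Path(output)
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(Path(src).stat().st_mtime > out_mtime for src in sources)


def rebuild_if_stale(sources, output, script="scripts/build_pages_data.py"):
    """Rerun the build script when needed; return True if it was run."""
    if needs_rebuild(sources, output):
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run([sys.executable, script], check=True)
        return True
    return False
```

        A caller would pass the list of source CSV paths plus results/performance_analysis.png as sources, and the generated data file (path not stated here) as output.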