3 Comments
David Gamage

Interesting! Can you also test o4-mini-high? I'd be curious to see how it does on your test.

Danny Wilf-Townsend

Great question. I hadn't looked. Running it now, o4-mini-high gets questions one and two right, bombs question three (with a hallucinated quote, no less), gets question four right, and misses question five like every other model except o3. Getting question four right puts it in good company (o1 pro, o3, and Gemini 2.5 Pro), but many other models did much better on question three.

David Gamage

Thanks!
