so far today: running the "proficiency" benchmarks against GPT-4.1 and Gemini-2.5-flash.


The headline: Google's cheap model can count letters. Gemini was substantially slower than both OpenAI and Anthropic (though that may vary day-to-day), but it scored 96% on the infamous count-the-"R"s-in-"strawberry" metric, where none of the similarly priced models got above 70%. 💡 (The only metric it did badly on was the IPA one, and that's because the response-normalization code is broken.)
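For context, scoring the counting metrics boils down to pulling one number out of free-form model text and comparing it to the expected count. A minimal sketch of that normalization step; the function names and word list here are my own assumptions, not the harness's actual code (and the broken IPA path needs different handling entirely):

```python
import re

# Map for spelled-out answers like "three" -- an assumption about what models emit.
_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
          "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def normalize_count(response: str) -> int | None:
    """Return the first count (digit or spelled-out) found in a model response."""
    for token in re.findall(r"[a-z]+|\d+", response.lower()):
        if token.isdigit():
            return int(token)
        if token in _WORDS:
            return _WORDS[token]
    return None  # unparseable responses get scored as wrong

def score(response: str, expected: int) -> bool:
    return normalize_count(response) == expected

assert score('There are three "r"s in "strawberry".', 3)
```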


for pricing ⚙️ (all prices per million tokens; quick cost arithmetic sketched after the list):

GPT-4.1-nano: 10c IN, 40c OUT

GPT-4.1-mini: 40c IN, 160c OUT

GPT-4o-mini: 30c IN, 120c OUT

Gemini-2.5-flash: 15c IN, 60c OUT

Claude-3-5-haiku: 80c IN, 400c OUT
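To make the list concrete, here's the per-request arithmetic; a throwaway sketch with the prices above hard-coded (the model keys are informal shorthand, not exact API identifiers):

```python
# Prices from the list above, in cents per million tokens: (input, output).
PRICES = {
    "gpt-4.1-nano":     (10, 40),
    "gpt-4.1-mini":     (40, 160),
    "gpt-4o-mini":      (30, 120),
    "gemini-2.5-flash": (15, 60),
    "claude-3-5-haiku": (80, 400),
}

def request_cost_cents(model: str, tokens_in: int, tokens_out: int) -> float:
    """Cost of one request, in cents."""
    cin, cout = PRICES[model]
    return (tokens_in * cin + tokens_out * cout) / 1_000_000

# e.g. a 2,000-token prompt with a 500-token reply on gemini-2.5-flash:
# request_cost_cents("gemini-2.5-flash", 2_000, 500) -> 0.06 cents
```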

⚙️ Most of these have (or will have) prompt-caching discounts of 50-90% on input tokens for repeated queries against the same long context.
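Back-of-envelope, that discount dominates for long contexts. A hedged sketch of the effect (the 75% discount and the cached-vs-fresh input split are assumptions; real billing mechanics differ per provider):

```python
def cached_cost_cents(cin: int, cout: int, cached_in: int, fresh_in: int,
                      tokens_out: int, discount: float = 0.75) -> float:
    """Cost in cents when `cached_in` input tokens get a cache discount."""
    paid_in = cached_in * cin * (1 - discount) + fresh_in * cin
    return (paid_in + tokens_out * cout) / 1_000_000

# Repeated 100k-token context on gemini-2.5-flash (15c in / 60c out),
# plus 500 fresh input tokens and 500 output tokens:
# cached_cost_cents(15, 60, 100_000, 500, 500)   -> ~0.41 cents
# same request, no cache hit:
# cached_cost_cents(15, 60, 0, 100_500, 500)     -> ~1.54 cents
```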

💡 Claude is the most expensive model at this tier, the lowest-performing, and the least recently updated.

🔥 presumably they'll have a new model at half the price next week.