Channel: Cities - Project Journal
In reply to: edgeley (part 1)
so far today: running the "proficiency" benchmarks against GPT-4.1 and Gemini-2.5-flash.
The headline: Google's cheap model can count letters. Gemini was substantially slower than both the OpenAI and Anthropic models (though that may vary day-to-day), but it scored 96% on the infamous count-the-"R"s-in-"strawberry" metric, and none of the similarly-priced models got above 70%. 💡 (the only metric it did badly on was the IPA one, and that's because the response normalization code is broken)
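for context, "response normalization" here just means the grader pulling a comparable answer out of free-form model output before scoring. a minimal sketch of the two cases (hypothetical helper names, not the actual harness code):

```python
import re
import unicodedata

def normalize_count_answer(response: str) -> int | None:
    """Pull the first integer out of a free-form reply,
    e.g. 'There are 3 "r"s in strawberry.' -> 3."""
    match = re.search(r"\d+", response)
    return int(match.group()) if match else None

def normalize_ipa_answer(response: str) -> str:
    """Strip the usual /slashes/ or [brackets] around an IPA
    transcription and Unicode-normalize it, so visually identical
    answers with different combining marks compare equal."""
    stripped = response.strip().strip("/[]")
    return unicodedata.normalize("NFC", stripped)
```

skipping the Unicode normalization step is the classic way an IPA grader breaks, which is my guess at the bug mentioned above.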
for pricing ⚙️ (all prices per million tokens; cost arithmetic sketched after the list):
GPT-4.1-nano: 10c IN, 40c OUT
GPT-4.1-mini: 40c IN, 160c OUT
GPT-4o-mini: 30c IN, 120c OUT
Gemini-2.5-flash: 15c IN, 60c OUT
Claude-3.5-haiku: 80c IN, 400c OUT
⚙️ Most of these have (or will have) "cache" discounts of 50-90% for repeated queries with the same long context.
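to make the table concrete, here's the per-request cost arithmetic (a sketch; the function name and the flat cache-discount model are my assumptions, and each provider prices caching differently):

```python
def cost_usd(tokens_in: int, tokens_out: int,
             cents_in: float, cents_out: float,
             cached_in: int = 0, cache_discount: float = 0.0) -> float:
    """Cost of one request, given per-million-token prices in cents.

    Assumes a flat discount: `cached_in` input tokens are billed at
    (1 - cache_discount) times the normal input rate (a hypothetical
    simplification of the providers' actual cache pricing)."""
    uncached_in = tokens_in - cached_in
    cents = (uncached_in * cents_in
             + cached_in * cents_in * (1 - cache_discount)
             + tokens_out * cents_out) / 1_000_000
    return cents / 100  # cents -> dollars

# e.g. Gemini-2.5-flash (15c IN, 60c OUT), 10k tokens in / 1k out:
# cost_usd(10_000, 1_000, 15, 60)  -> 0.0021  ($0.0021 per request)
```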
💡 Claude is the most expensive at this tier, the lowest-performing, and the least recently updated.
🔥 presumably they'll have a new model at half the price next week.