so far today: running the "proficiency" benchmarks against GPT-4.1 and Gemini-2.5-flash.


The headline: Google's cheap model can count letters. Gemini was substantially slower than both OpenAI and Anthropic (though that may vary day-to-day), but it scored 96% on the infamous count-the-"R"s-in-"strawberry" metric, where none of the similarly priced models got above 70%. 💡 (The only metric it did badly on was the IPA one, and that's because the response-normalization code is broken.)
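For context, scoring the counting metrics boils down to pulling one number out of free-form model text and comparing it to the expected count. A minimal sketch of that normalization step; the function names and word list here are my own assumptions, not the harness's actual code (and the broken IPA path needs different handling entirely):

```python
import re

# Map for spelled-out answers like "three" -- an assumption about what models emit.
_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
          "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def normalize_count(response: str) -> int | None:
    """Return the first count (digit or spelled-out) found in a model response."""
    for token in re.findall(r"[a-z]+|\d+", response.lower()):
        if token.isdigit():
            return int(token)
        if token in _WORDS:
            return _WORDS[token]
    return None  # unparseable responses get scored as wrong

def score(response: str, expected: int) -> bool:
    return normalize_count(response) == expected

assert score('There are three "r"s in "strawberry".', 3)
```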


for pricing ⚙️ (all prices per million tokens; quick cost arithmetic sketched after the list):

GPT-4.1-nano: 10c IN, 40c OUT

GPT-4.1-mini: 40c IN, 160c OUT

GPT-4o-mini: 30c IN, 120c OUT

Gemini-2.5-flash: 15c IN, 60c OUT

Claude-3-5-haiku: 80c IN, 400c OUT
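To make the list concrete, here's the per-request arithmetic; a throwaway sketch with the prices above hard-coded (the model keys are informal shorthand, not exact API identifiers):

```python
# Prices from the list above, in cents per million tokens: (input, output).
PRICES = {
    "gpt-4.1-nano":     (10, 40),
    "gpt-4.1-mini":     (40, 160),
    "gpt-4o-mini":      (30, 120),
    "gemini-2.5-flash": (15, 60),
    "claude-3-5-haiku": (80, 400),
}

def request_cost_cents(model: str, tokens_in: int, tokens_out: int) -> float:
    """Cost of one request, in cents."""
    cin, cout = PRICES[model]
    return (tokens_in * cin + tokens_out * cout) / 1_000_000

# e.g. a 2,000-token prompt with a 500-token reply on gemini-2.5-flash:
# request_cost_cents("gemini-2.5-flash", 2_000, 500) -> 0.06 cents
```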

⚙️ Most of these have (or will have) prompt-caching discounts of 50-90% on input tokens for repeated queries against the same long context.
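Back-of-envelope, that discount dominates for long contexts. A hedged sketch of the effect (the 75% discount and the cached-vs-fresh input split are assumptions; real billing mechanics differ per provider):

```python
def cached_cost_cents(cin: int, cout: int, cached_in: int, fresh_in: int,
                      tokens_out: int, discount: float = 0.75) -> float:
    """Cost in cents when `cached_in` input tokens get a cache discount."""
    paid_in = cached_in * cin * (1 - discount) + fresh_in * cin
    return (paid_in + tokens_out * cout) / 1_000_000

# Repeated 100k-token context on gemini-2.5-flash (15c in / 60c out),
# plus 500 fresh input tokens and 500 output tokens:
# cached_cost_cents(15, 60, 100_000, 500, 500)   -> ~0.41 cents
# same request, no cache hit:
# cached_cost_cents(15, 60, 0, 100_500, 500)     -> ~1.54 cents
```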

💡 Claude is the most expensive model at this tier, the lowest-performing, and the least recently updated.

🔥 presumably they'll have a new model at half the price next week.