{"chain":[{"channel":"cities","content":"<teal> <<< Edgeley, North Dakota, is a small rural town in LaMoure County, located in the southeastern part of the state. With a population hovering around 500 people, it's one of many prairie towns that exemplify the broader character of the upper Great Plains\u2014quiet, sparsely populated, and closely tied to agriculture. >>>\r\n\r\n----\r\n\r\nhttps://www.lesswrong.com/posts/bfHDoWLnBH9xR3YAK/ai-2027-is-a-bet-against-amdahl-s-law\r\n\r\n<red> <<< Of course the post is right.  The various << FOOM >> claims are all bullshit.  And Amdahl's Law is one of the reason why.  Just because a few things will be a hundred times faster (or a million times faster) doesn't make the whole thing that much faster. >>>\r\n\r\nAlso, AGI definitions vary so widely, from << things that have already happened >> to << things that are impossible >>, that a \"prediction market\" is nearly meaningless.\r\n\r\n----\r\n\r\nI have seen various commentary related to \"Twilight of the Edgelords\" (<resource> https://www.astralcodexten.com/p/twilight-of-the-edgelords ), a piece that I don't have access to.\r\n\r\nAnd, the response I can piece together from the fragments I can see would fall under GUILD LAW. (<resource> additional commentary at https://www.writingruxandrabio.com/p/the-edgelords-were-right-a-response and https://theahura.substack.com/p/contra-scott-and-rux-on-whos-to-blame )\r\n\r\n----\r\n\r\nhttps://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/\r\n\r\n<<< To make Gemma 3 even more accessible, we are announcing new versions optimized with Quantization-Aware Training (QAT) that dramatically reduces memory requirements while maintaining high quality. This enables you to run powerful models like Gemma 3 27B locally on consumer-grade GPUs like the NVIDIA RTX 3090. >>>\r\n\r\nIt seems pretty obvious.  A majority of the users of open-source models are using quantized models on personal hardware; might as well optimize that use-case. (<red> it is less clear that a majority of the CPU cycles are there; but a majority of the people certainly are.)\r\n\r\nMy next round of updating the << Greenland metrics >> will have to include the gemma3-12b-qat model. (<red> or, maybe the 27b.  According to Hacker News, gemma3-27b-Q4 << only uses ~22Gb (via Ollama) or ~15GB (MLX) >>.  On a 24GB machine, this clearly needs the non-Ollama approach.)\r\n\r\nAnd, also, GPT-4.1 .  And probably Gemini-2.5 . (<red> the goal for these models should be to perform at 100% accuracy.) (<orange> well, actually, a few of the \"correct\" benchmark answers right now are incorrect.)","created_at":"2025-04-21T19:45:31.800653","id":353,"is_target":false,"parent_id":null,"processed_content":"<p><div class=\"mlq color-teal\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">\ud83e\udd16</span></button><div class=\"mlq-content\"><p> Edgeley, North Dakota, is a small rural town in LaMoure County, located in the southeastern part of the state. With a population hovering around 500 people, it's one of many prairie towns that exemplify the broader character of the upper Great Plains\u2014quiet, sparsely populated, and closely tied to agriculture. </p></div></div>\r</p> <hr class=\"section-break\" /> <p><a href=\"https://www.lesswrong.com/posts/bfHDoWLnBH9xR3YAK/ai-2027-is-a-bet-against-amdahl-s-law\" target=\"_blank\" rel=\"noopener noreferrer\">https://www.lesswrong.com/posts/bfHDoWLnBH9xR3YAK/ai-2027-is-a-bet-against-amdahl-s-law</a>\r</p>\n<p><div class=\"mlq color-red\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">\ud83d\udca1</span></button><div class=\"mlq-content\"><p> Of course the post is right.  The various <span class=\"literal-text\">FOOM</span> claims are all bullshit.  And Amdahl's Law is one of the reason why.  Just because a few things will be a hundred times faster (or a million times faster) doesn't make the whole thing that much faster. </p></div></div>\r</p>\n<p>Also, AGI definitions vary so widely, from <span class=\"literal-text\">things that have already happened</span> to <span class=\"literal-text\">things that are impossible</span>, that a \"prediction market\" is nearly meaningless.\r</p> <hr class=\"section-break\" /> <p>I have seen various commentary related to \"Twilight of the Edgelords\" <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( <a href=\"https://www.astralcodexten.com/p/twilight-of-the-edgelords\" target=\"_blank\" rel=\"noopener noreferrer\">https://www.astralcodexten.com/p/twilight-of-the-edgelords</a> )</span>\n  </span>, a piece that I don't have access to.\r</p>\n<p>And, the response I can piece together from the fragments I can see would fall under GUILD LAW. <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( additional commentary at <a href=\"https://www.writingruxandrabio.com/p/the-edgelords-were-right-a-response\" target=\"_blank\" rel=\"noopener noreferrer\">https://www.writingruxandrabio.com/p/the-edgelords-were-right-a-response</a> and <a href=\"https://theahura.substack.com/p/contra-scott-and-rux-on-whos-to-blame\" target=\"_blank\" rel=\"noopener noreferrer\">https://theahura.substack.com/p/contra-scott-and-rux-on-whos-to-blame</a> )</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p><a href=\"https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/\" target=\"_blank\" rel=\"noopener noreferrer\">https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/</a>\r</p>\n<p><div class=\"mlq\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">-</span></button><div class=\"mlq-content\"><p> To make Gemma 3 even more accessible, we are announcing new versions optimized with Quantization-Aware Training (QAT) that dramatically reduces memory requirements while maintaining high quality. This enables you to run powerful models like Gemma 3 27B locally on consumer-grade GPUs like the NVIDIA RTX 3090. </p></div></div>\r</p>\n<p>It seems pretty obvious.  A majority of the users of open-source models are using quantized models on personal hardware; might as well optimize that use-case. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( it is less clear that a majority of the CPU cycles are there; but a majority of the people certainly are.)</span>\n  </span>\r</p>\n<p>My next round of updating the <span class=\"literal-text\">Greenland metrics</span> will have to include the gemma3-12b-qat model. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( or, maybe the 27b.  According to Hacker News, gemma3-27b-Q4 <span class=\"literal-text\">only uses ~22Gb (via Ollama) or ~15GB (MLX)</span>.  On a 24GB machine, this clearly needs the non-Ollama approach.)</span>\n  </span>\r</p>\n<p>And, also, GPT-4.1 .  And probably Gemini-2.5 . <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( the goal for these models should be to perform at 100% accuracy.)</span>\n  </span> <span class=\"colorblock color-orange\">\n    <span class=\"sigil\">\u2694\ufe0f</span>\n    <span class=\"colortext-content\">( well, actually, a few of the \"correct\" benchmark answers right now are incorrect.)</span>\n  </span></p>","subject":"edgeley (part 1)"},{"channel":"cities","content":"so far today: running the \"proficiency\" benchmarks against GPT-4-1 and Gemini-2.5-flash.\r\n\r\n----\r\n\r\nThe headline: Google's cheap model can count letters.  Gemini was substantially slower than both OpenAI and Anthropic (but, perhaps, that can vary day-to-day).  But it got 96% on the infamous << count how many \"R\"s in strawberry >> metric, and none of the similarly-priced models got above 70%. (<red> the only metric it did \"bad\" on was the IPA one, and that is because the response normalization code is broken)\r\n\r\n----\r\n\r\nfor pricing (<context> all prices per million tokens):\r\n<<< GPT-4-1-nano: 10c IN, 40c OUT\r\nGPT-4-1-mini: 40c IN, 160c OUT\r\nGPT-4o-mini: 30c IN, 120c OUT\r\nGemini-2.5-flash: 15c IN, 60c OUT\r\nClaude-3-5-haiku: 80c IN, 400c OUT >>>\r\n\r\n<green> Most of these have (or will have) \"cache\" discounts of 50-90% for repeated queries with the same long context.\r\n<red> Claude is both the most expensive at this tier, and the lowest-performing.  And the least-recently updated.\r\n<xantham> presumably, they will have a new model at half the price, next week.","created_at":"2025-04-24T20:00:24.398343","id":356,"is_target":false,"parent_id":353,"processed_content":"<p>so far today: running the \"proficiency\" benchmarks against GPT-4-1 and Gemini-2.5-flash.\r</p> <hr class=\"section-break\" /> <p>The headline: Google's cheap model can count letters.  Gemini was substantially slower than both OpenAI and Anthropic (but, perhaps, that can vary day-to-day).  But it got 96% on the infamous <span class=\"literal-text\">count how many \"R\"s in strawberry</span> metric, and none of the similarly-priced models got above 70%. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( the only metric it did \"bad\" on was the IPA one, and that is because the response normalization code is broken)</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>for pricing <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( all prices per million tokens)</span>\n  </span>:\r</p>\n<p><div class=\"mlq\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">-</span></button><div class=\"mlq-content\"><p> GPT-4-1-nano: 10c IN, 40c OUT\r</p>\n<p>GPT-4-1-mini: 40c IN, 160c OUT\r</p>\n<p>GPT-4o-mini: 30c IN, 120c OUT\r</p>\n<p>Gemini-2.5-flash: 15c IN, 60c OUT\r</p>\n<p>Claude-3-5-haiku: 80c IN, 400c OUT </p></div></div>\r</p>\n<p><span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\"> Most of these have (or will have) \"cache\" discounts of 50-90% for repeated queries with the same long context.\r</span>\n  </span></p>\n<p><span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\"> Claude is both the most expensive at this tier, and the lowest-performing.  And the least-recently updated.\r</span>\n  </span></p>\n<p><span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\"> presumably, they will have a new model at half the price, next week.</span>\n  </span></p>","subject":"edgeley (part 2)"},{"channel":"cities","content":"Last night, Claude wrote some code for the << qualification >> metrics. (<red> which might now be called *exemplar* tasks.  The metric is \"respond to one prompt\".) (<xantham> to some degree, the goal is to test \"changes in context\" as much as \"changes in model\")\r\n\r\n----\r\n\r\nAn earlier task (from late 2023) was to answer the question: Who was << Pablo Arosemena >>? (<context> the Wikipedia article [[Pablo Arosemena]] is about an obscure politician from Panama)\r\n\r\nThe 8b models don't know who this is.  But, they most commonly think he is an obscure painter. (<xantham> probably because of Pablo Picasso)\r\n\r\nIs there some sense this is a << true stereotype >>?  Is it more likely he was a painter than that he was a baker, a masseuse, or a peasant farmer? (<orange> well, actually ... it's more likely someone of this name had one of those jobs.  But, less likely they were written about.)\r\n\r\n----\r\n\r\nI am getting a new computer. (<xantham> \"only\" $600)  This should allow better speed comparisons between the models. (<red> the inconveniences of having an external USB drive, power demands, and heat creation \"on my lap\" grew to be too much.)\r\n\r\n----\r\n\r\nI need to do one more \"schema improvement\" pass on the \"dictionary\".\r\n\r\nThings like \"Chinese translation\", \"Korean translation\" need to be in a << dataclass >>, rather than passed as parameters everywhere.\r\n\r\nFor now, I want the indexes (and explicit NULL values), so these are database columns, rather than an \"all_translations\" JSON blob.\r\n\r\n----\r\n\r\nSome of the benchmarks should be re-written once the \"dictionary\" API is available.\r\n\r\n----\r\n\r\nClaude invented \"categories\" for the benchmarks: \"Language\", \"Reasoning\", \"Knowledge\", and \"Translation\".  These are ... decent.\r\n\r\nBut the zeroth category is << token introspection >>.  For \"how many letters is the word << triumphant >>\" questions. (<red> even the \"spell check\" tests that require repeating a misspelled word are probably in this category) (<green> you can have an LLM without << token introspection >>.  but it should be very doable.  possibly with some form of API / injection.)\r\n\r\nThen, *Language* (starting with definitions and antonyms), *Knowledge* (starting with basic geography), and *Translation* (starting with EN-FR, EN-ZH, SW-KO - word-based).\r\n\r\nThe only \"reasoning\" task so far might be \"unit conversion\".  But that would have a different name.\r\n\r\n----\r\n\r\nThe \"translation\" tasks have to deal with the different vocabulary sizes of different languages.\r\n\r\nThis is one part << technical vocabulary >> (<red> does Swahili have a word for << capacitor >> that isn't a loan word?) and one part << eccentric distinctions >> (<green> Chinese has \u4e8c and \u4e24 for \"two\")\r\n\r\nSo far, I have largely mitigated this problem by avoiding it.\r\n\r\n----\r\n\r\nThe dictionary will need some type of \"class\" system.  Specifically, I want to say \"get a random animal\" and have it do that.\r\n\r\nI am putting that off as well.  Largely because it is a morass of taxonomical hell that has stymied decades of efforts.","created_at":"2025-04-28T17:04:56.472050","id":455,"is_target":true,"parent_id":356,"processed_content":"<p>Last night, Claude wrote some code for the <span class=\"literal-text\">qualification</span> metrics. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( which might now be called <em>exemplar</em> tasks.  The metric is \"respond to one prompt\".)</span>\n  </span> <span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\">( to some degree, the goal is to test \"changes in context\" as much as \"changes in model\")</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>An earlier task (from late 2023) was to answer the question: Who was <span class=\"literal-text\">Pablo Arosemena</span>? <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( the Wikipedia article <a href=\"https://en.wikipedia.org/wiki/Pablo_Arosemena\" class=\"wikilink\" target=\"_blank\">Pablo Arosemena</a> is about an obscure politician from Panama)</span>\n  </span>\r</p>\n<p>The 8b models don't know who this is.  But, they most commonly think he is an obscure painter. <span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\">( probably because of Pablo Picasso)</span>\n  </span>\r</p>\n<p>Is there some sense this is a <span class=\"literal-text\">true stereotype</span>?  Is it more likely he was a painter than that he was a baker, a masseuse, or a peasant farmer? <span class=\"colorblock color-orange\">\n    <span class=\"sigil\">\u2694\ufe0f</span>\n    <span class=\"colortext-content\">( well, actually ... it's more likely someone of this name had one of those jobs.  But, less likely they were written about.)</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>I am getting a new computer. <span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\">( \"only\" $600)</span>\n  </span>  This should allow better speed comparisons between the models. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( the inconveniences of having an external USB drive, power demands, and heat creation \"on my lap\" grew to be too much.)</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>I need to do one more \"schema improvement\" pass on the \"dictionary\".\r</p>\n<p>Things like \"Chinese translation\", \"Korean translation\" need to be in a <span class=\"literal-text\">dataclass</span>, rather than passed as parameters everywhere.\r</p>\n<p>For now, I want the indexes (and explicit NULL values), so these are database columns, rather than an \"all_translations\" JSON blob.\r</p> <hr class=\"section-break\" /> <p>Some of the benchmarks should be re-written once the \"dictionary\" API is available.\r</p> <hr class=\"section-break\" /> <p>Claude invented \"categories\" for the benchmarks: \"Language\", \"Reasoning\", \"Knowledge\", and \"Translation\".  These are ... decent.\r</p>\n<p>But the zeroth category is <span class=\"literal-text\">token introspection</span>.  For \"how many letters is the word <span class=\"literal-text\">triumphant</span>\" questions. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( even the \"spell check\" tests that require repeating a misspelled word are probably in this category)</span>\n  </span> <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( you can have an LLM without <span class=\"literal-text\">token introspection</span>.  but it should be very doable.  possibly with some form of API / injection.)</span>\n  </span>\r</p>\n<p>Then, <em>Language</em> (starting with definitions and antonyms), <em>Knowledge</em> (starting with basic geography), and <em>Translation</em> (starting with EN-FR, EN-ZH, SW-KO - word-based).\r</p>\n<p>The only \"reasoning\" task so far might be \"unit conversion\".  But that would have a different name.\r</p> <hr class=\"section-break\" /> <p>The \"translation\" tasks have to deal with the different vocabulary sizes of different languages.\r</p>\n<p>This is one part <span class=\"literal-text\">technical vocabulary</span> <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( does Swahili have a word for <span class=\"literal-text\">capacitor</span> that isn't a loan word?)</span>\n  </span> and one part <span class=\"literal-text\">eccentric distinctions</span> <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( Chinese has <span class=\"annotated-chinese\" data-pinyin=\"\u00c8R\" data-definition=\"two\">\u4e8c</span> and <span class=\"annotated-chinese\" data-pinyin=\"L\u01cfANG\" data-definition=\"two\">\u4e24</span> for \"two\")</span>\n  </span>\r</p>\n<p>So far, I have largely mitigated this problem by avoiding it.\r</p> <hr class=\"section-break\" /> <p>The dictionary will need some type of \"class\" system.  Specifically, I want to say \"get a random animal\" and have it do that.\r</p>\n<p>I am putting that off as well.  Largely because it is a morass of taxonomical hell that has stymied decades of efforts.</p>","subject":"edgeley (part 3)"}]}
