{"channel":"cities","content":"Last night, Claude wrote some code for the << qualification >> metrics. (<red> which might now be called *exemplar* tasks.  The metric is \"respond to one prompt\".) (<xantham> to some degree, the goal is to test \"changes in context\" as much as \"changes in model\")\r\n\r\n----\r\n\r\nAn earlier task (from late 2023) was to answer the question: Who was << Pablo Arosemena >>? (<context> the Wikipedia article [[Pablo Arosemena]] is about an obscure politician from Panama)\r\n\r\nThe 8b models don't know who this is.  But, they most commonly think he is an obscure painter. (<xantham> probably because of Pablo Picasso)\r\n\r\nIs there some sense this is a << true stereotype >>?  Is it more likely he was a painter than that he was a baker, a masseuse, or a peasant farmer? (<orange> well, actually ... it's more likely someone of this name had one of those jobs.  But, less likely they were written about.)\r\n\r\n----\r\n\r\nI am getting a new computer. (<xantham> \"only\" $600)  This should allow better speed comparisons between the models. (<red> the inconveniences of having an external USB drive, power demands, and heat creation \"on my lap\" grew to be too much.)\r\n\r\n----\r\n\r\nI need to do one more \"schema improvement\" pass on the \"dictionary\".\r\n\r\nThings like \"Chinese translation\", \"Korean translation\" need to be in a << dataclass >>, rather than passed as parameters everywhere.\r\n\r\nFor now, I want the indexes (and explicit NULL values), so these are database columns, rather than an \"all_translations\" JSON blob.\r\n\r\n----\r\n\r\nSome of the benchmarks should be re-written once the \"dictionary\" API is available.\r\n\r\n----\r\n\r\nClaude invented \"categories\" for the benchmarks: \"Language\", \"Reasoning\", \"Knowledge\", and \"Translation\".  These are ... decent.\r\n\r\nBut the zeroth category is << token introspection >>.  For \"how many letters is the word << triumphant >>\" questions. (<red> even the \"spell check\" tests that require repeating a misspelled word are probably in this category) (<green> you can have an LLM without << token introspection >>.  but it should be very doable.  possibly with some form of API / injection.)\r\n\r\nThen, *Language* (starting with definitions and antonyms), *Knowledge* (starting with basic geography), and *Translation* (starting with EN-FR, EN-ZH, SW-KO - word-based).\r\n\r\nThe only \"reasoning\" task so far might be \"unit conversion\".  But that would have a different name.\r\n\r\n----\r\n\r\nThe \"translation\" tasks have to deal with the different vocabulary sizes of different languages.\r\n\r\nThis is one part << technical vocabulary >> (<red> does Swahili have a word for << capacitor >> that isn't a loan word?) and one part << eccentric distinctions >> (<green> Chinese has \u4e8c and \u4e24 for \"two\")\r\n\r\nSo far, I have largely mitigated this problem by avoiding it.\r\n\r\n----\r\n\r\nThe dictionary will need some type of \"class\" system.  Specifically, I want to say \"get a random animal\" and have it do that.\r\n\r\nI am putting that off as well.  Largely because it is a morass of taxonomical hell that has stymied decades of efforts.","created_at":"2025-04-28T17:04:56.472050","id":455,"llm_annotations":{},"parent_id":356,"processed_content":"<p>Last night, Claude wrote some code for the <span class=\"literal-text\">qualification</span> metrics. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( which might now be called <em>exemplar</em> tasks.  The metric is \"respond to one prompt\".)</span>\n  </span> <span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\">( to some degree, the goal is to test \"changes in context\" as much as \"changes in model\")</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>An earlier task (from late 2023) was to answer the question: Who was <span class=\"literal-text\">Pablo Arosemena</span>? <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( the Wikipedia article <a href=\"https://en.wikipedia.org/wiki/Pablo_Arosemena\" class=\"wikilink\" target=\"_blank\">Pablo Arosemena</a> is about an obscure politician from Panama)</span>\n  </span>\r</p>\n<p>The 8b models don't know who this is.  But, they most commonly think he is an obscure painter. <span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\">( probably because of Pablo Picasso)</span>\n  </span>\r</p>\n<p>Is there some sense this is a <span class=\"literal-text\">true stereotype</span>?  Is it more likely he was a painter than that he was a baker, a masseuse, or a peasant farmer? <span class=\"colorblock color-orange\">\n    <span class=\"sigil\">\u2694\ufe0f</span>\n    <span class=\"colortext-content\">( well, actually ... it's more likely someone of this name had one of those jobs.  But, less likely they were written about.)</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>I am getting a new computer. <span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\">( \"only\" $600)</span>\n  </span>  This should allow better speed comparisons between the models. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( the inconveniences of having an external USB drive, power demands, and heat creation \"on my lap\" grew to be too much.)</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>I need to do one more \"schema improvement\" pass on the \"dictionary\".\r</p>\n<p>Things like \"Chinese translation\", \"Korean translation\" need to be in a <span class=\"literal-text\">dataclass</span>, rather than passed as parameters everywhere.\r</p>\n<p>For now, I want the indexes (and explicit NULL values), so these are database columns, rather than an \"all_translations\" JSON blob.\r</p> <hr class=\"section-break\" /> <p>Some of the benchmarks should be re-written once the \"dictionary\" API is available.\r</p> <hr class=\"section-break\" /> <p>Claude invented \"categories\" for the benchmarks: \"Language\", \"Reasoning\", \"Knowledge\", and \"Translation\".  These are ... decent.\r</p>\n<p>But the zeroth category is <span class=\"literal-text\">token introspection</span>.  For \"how many letters is the word <span class=\"literal-text\">triumphant</span>\" questions. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( even the \"spell check\" tests that require repeating a misspelled word are probably in this category)</span>\n  </span> <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( you can have an LLM without <span class=\"literal-text\">token introspection</span>.  but it should be very doable.  possibly with some form of API / injection.)</span>\n  </span>\r</p>\n<p>Then, <em>Language</em> (starting with definitions and antonyms), <em>Knowledge</em> (starting with basic geography), and <em>Translation</em> (starting with EN-FR, EN-ZH, SW-KO - word-based).\r</p>\n<p>The only \"reasoning\" task so far might be \"unit conversion\".  But that would have a different name.\r</p> <hr class=\"section-break\" /> <p>The \"translation\" tasks have to deal with the different vocabulary sizes of different languages.\r</p>\n<p>This is one part <span class=\"literal-text\">technical vocabulary</span> <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( does Swahili have a word for <span class=\"literal-text\">capacitor</span> that isn't a loan word?)</span>\n  </span> and one part <span class=\"literal-text\">eccentric distinctions</span> <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( Chinese has <span class=\"annotated-chinese\" data-pinyin=\"\u00c8R\" data-definition=\"two\">\u4e8c</span> and <span class=\"annotated-chinese\" data-pinyin=\"L\u01cfANG\" data-definition=\"two\">\u4e24</span> for \"two\")</span>\n  </span>\r</p>\n<p>So far, I have largely mitigated this problem by avoiding it.\r</p> <hr class=\"section-break\" /> <p>The dictionary will need some type of \"class\" system.  Specifically, I want to say \"get a random animal\" and have it do that.\r</p>\n<p>I am putting that off as well.  Largely because it is a morass of taxonomical hell that has stymied decades of efforts.</p>","quotes":[],"subject":"edgeley (part 3)"}
