View All Cities Messages
2025-03-24 18:37:04

yesterday:

  • test ChatGPT's new "custom voices" feature. 🔥 ( can it do a xantham-y voice?)

The voices are ... good, but not good enough. I can't see myself building a product on top of this version. Maybe the next version. The API is simple enough that it would take about 2 days to come up with an MVP.

  • test Gemma3:4b and Gemma3:1b locally

The results are good, but not magical. They are tuned for chatting, but struggle more than Gemma2 on mechanical tasks 💡 ( copy the misspelled word from this sentence) . They do fairly well on the accuracy / time tradeoffs 💡 ( because the hardware is the same, "time" is an accurate way of measuring "cost")
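The accuracy/time tradeoff is easy to measure with a small harness. A sketch of the idea — `run_model` here is a hypothetical stand-in for the real inference call (e.g. an HTTP request to a local Ollama server), not actual code from the dashboard:

```python
import time

def run_model(prompt):
    # Hypothetical stand-in for a real local-inference call.
    return "answer"

def score_with_timing(prompts, expected):
    """Return (accuracy, total_seconds) for one model over a task set."""
    correct = 0
    start = time.perf_counter()
    for prompt, want in zip(prompts, expected):
        if run_model(prompt).strip() == want:
            correct += 1
    elapsed = time.perf_counter() - start
    return correct / len(prompts), elapsed

acc, secs = score_with_timing(
    ["Copy the misspelled word from this sentence: 'the qick fox'"],
    ["qick"],
)
```

Because every model runs on the same hardware, wall-clock seconds are directly comparable across models.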

  • get the greenland quals to be available on the public internet 🔥 ( I still haven't chosen a domain for it. Should earlyversion have file hosting? Probably not.)

What is "universe"?

It is my archive of Git repositories, or similar.

1372_ea_games % du -hs
286M    CnC_Red_Alert
673M    CnC_Renegade
51M     CnC_Tiberian_Dawn

for today:

  • a "watchdog" for Atacama? the machine has become wedged twice in 3 months. both fixes were of the "just restart the server" persuasion. ⚔️ ( well, actually, the problem is that the lexer has an infinite loop on a single unpaired asterisk.)
  • more blogging, less technical work
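The unpaired-asterisk loop is easy to reproduce in miniature. A sketch of the guard (not the actual lexer, which isn't shown here): when the closing marker is missing, emit the asterisk as literal text and advance — forgetting to advance is exactly the single-character infinite loop a watchdog can only paper over.

```python
def lex_emphasis(text):
    """Tokenize '*' emphasis markers, falling back to literal text
    for an unpaired '*' instead of re-scanning it forever."""
    tokens, i = [], 0
    while i < len(text):
        if text[i] == "*":
            close = text.find("*", i + 1)
            if close == -1:
                # Unpaired asterisk: emit it as plain text and ADVANCE.
                tokens.append(("TEXT", "*"))
                i += 1
            else:
                tokens.append(("EMPH", text[i + 1:close]))
                i = close + 1
        else:
            j = text.find("*", i)
            j = len(text) if j == -1 else j
            tokens.append(("TEXT", text[i:j]))
            i = j
    return tokens
```

Every branch strictly increases `i`, so the loop always terminates.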
2025-03-25 17:42:51

for today:

  • one investment meeting ⚙️ ( as always, further information withheld)
  • tutoring 💡 ( details withheld, but some information may make it to a post) 🌎 ( due to staff illness, tutoring was canceled)
  • maybe try again to get Claude to see the light about the lexer/parser system 💡 ( currently, the lexer is looking ahead to capture an entire text block for the emphasis tag, rather than letting the parser do this)
  • add HTML metadata linking to the RSS feed here.
  • maybe get the LLM dashboard available publicly 💡 ( but first, I would like 3 or 4 more of the "proficiency" tests on the dashboard.)
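The division of labor I want for the lexer/parser looks roughly like this — a hypothetical sketch, not the real code: the lexer emits flat STAR/TEXT tokens with no lookahead, and the parser decides how stars pair up (with an unpaired star degrading gracefully to text).

```python
def lex(text):
    # No lookahead: '*' is just a STAR token, everything else is TEXT.
    tokens, buf = [], []
    for ch in text:
        if ch == "*":
            if buf:
                tokens.append(("TEXT", "".join(buf)))
                buf = []
            tokens.append(("STAR", "*"))
        else:
            buf.append(ch)
    if buf:
        tokens.append(("TEXT", "".join(buf)))
    return tokens

def parse(tokens):
    """Pair STAR tokens into EMPH nodes; an unpaired STAR becomes text."""
    out, i = [], 0
    while i < len(tokens):
        kind, value = tokens[i]
        if (kind == "STAR" and i + 2 < len(tokens)
                and tokens[i + 1][0] == "TEXT"
                and tokens[i + 2][0] == "STAR"):
            out.append(("EMPH", tokens[i + 1][1]))
            i += 3
        else:
            out.append(("TEXT", value))
            i += 1
    return out
```

The payoff is that the emphasis policy (including the unpaired case) lives in one place, the parser, instead of being smeared across the lexer's lookahead.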

apparently https://github.com/ollama/ollama/issues/7978 is fixed. This was the "Ollama always returns structured-JSON responses in alphabetical order" bug, which, when the two fields were "thought" and "answer", was a major problem.
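For context, the reason field order mattered: generation is sequential, so the model emits whichever field is serialized first. A tiny illustration with plain `json` (not the Ollama API itself):

```python
import json

fields = ["thought", "answer"]  # intended order: reason first, then answer

alphabetical = json.dumps({k: "..." for k in fields}, sort_keys=True)
declared = json.dumps({k: "..." for k in fields})

# With alphabetical ordering, "answer" precedes "thought" -- the model is
# forced to commit to an answer before producing its reasoning.
```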

I need Claude to remove the code that was added to work around this problem. That is at least a 1-hour time commitment.

At that point, Claude should be able to construct new "proficiency" benchmarks fairly easily. 💡 ( with a perfect system, it would take only 5 minutes per test. In the current world, I am hoping for 20-30 minutes.)

2025-03-26 20:51:59

today: https://spaceship.computer/greenland/model_summary.html

These are "proficiency" metrics. 🔥 ( although, every time I use the word "proficiency" I want to change it)

They are simple tasks, currently: translate a word, choose a definition, choose an antonym, find the misspelled word. And, for the >4B models, as long as the model knows the language, it does fairly well. The 1B models do have some difficulties.

The timing data is interesting. It is, roughly, linear in model size. The 9B models are about 4 times slower than the 1B models, and Phi-4 (the largest model tested) is also very clearly the slowest model.

Some of the models I was looking at before (Granite, EXAONE, Hermes, Tulu, Mistral) did not make this round of tests. For Mistral, the 12B model is too old, and their newest release, at 24B, is too large. The others didn't distinguish themselves enough from similar Llama models to be worth my time (and hard-drive space).

remaining todo:

  • standardize the logging of prompts and responses. the full text ⚙️ ( that is, including the system prompt) should be stored.
  • fix the benchmarks. some of the definitions are too similar. ⚙️ ( previously we had kingdom and realm as choices. now the closest is honest and sincere.) some of the translations are still a bit rough. 💡 ( the translation of "beautiful" into French is beau/belle, the LLMs are very reasonably just returning "beau" as the translation)
  • fix the model warming. Even when the "warm model" function is called correctly, it doesn't do enough warming.
  • add additional tests. hopefully now it will take less than 1 hour to make new tests.
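For the rough-translation problem, the grader could simply accept any listed surface form, so gendered variants like beau/belle both count as correct. A hypothetical sketch (not the dashboard's actual grading code):

```python
def grade_translation(answer, accepted):
    # Accept any listed surface form, case- and whitespace-insensitively.
    return answer.strip().lower() in {a.strip().lower() for a in accepted}

# Hypothetical benchmark item for the beau/belle case mentioned above.
item = {
    "prompt": "Translate 'beautiful' into French.",
    "accepted": ["beau", "belle"],
}
```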

some of the suggestions regarding new tests:

  • Part of Speech Tagging - present a sentence and ask the model to identify the part of speech (noun, verb, adjective, etc.) for a specific word.
  • Unit Conversion - test the ability to convert between simple units (kilometers to miles, pounds to kilograms).
  • Analogies - simple analogies like "day is to night as hot is to ___".
  • Tense Transformation - provide a sentence in one tense and ask the model to convert it to another tense.
  • Active/Passive Voice Conversion - convert sentences between active and passive voice.
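One way these might be encoded, sketched for the unit-conversion case — a hypothetical helper in the spirit of the existing one-answer-per-item tests, not the dashboard's actual format. Numeric answers get a tolerance check rather than exact string matching:

```python
KM_PER_MILE = 1.609344

def make_unit_conversion_item(km):
    """Build one hypothetical benchmark item: km -> miles."""
    miles = km / KM_PER_MILE
    return {
        "prompt": (f"Convert {km} kilometers to miles. "
                   "Answer with a number only."),
        # Accept answers within 2% of the true value.
        "check": lambda ans, want=miles: abs(float(ans) - want) / want < 0.02,
    }

item = make_unit_conversion_item(10)
```

A tolerance like 2% avoids penalizing a model for rounding (10 km is about 6.21 miles, but "6.2" should still pass).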