{"channel":"cities","content":"today: https://spaceship.computer/greenland/model_summary.html\r\n\r\n<red> <<< These are \"proficiency\" metrics. (<xantham> although, every time I use the word \"proficiency\" I want to change it)\r\nThey are simple tasks, currently: translate a word, choose a definition, choose an antonym, find the misspelled word.  And, for the >4B models, as long as the model *knows* the language, it does fairly well.  The 1B models do have some difficulties.\r\nThe *timing* data is interesting.  It scales roughly linearly with model size.  The 9B models are about 4 times slower than the 1B models.  Phi-4 (the largest model tested) is also very clearly the slowest model.\r\nSome of the models I was looking at before (Granite, EXAONE, Hermes, Tulu, Mistral) did not make this round of tests.  For Mistral, the 12B model is too old, and their newest release, at 24B, is too large.  The others didn't distinguish themselves enough from similar Llama models to be worth my time (and hard-drive space). >>>\r\n\r\nremaining todo:\r\n> standardize the logging of prompts and responses.  the *full* text (<context> that is, including the system prompt) should be stored.\r\n> fix the benchmarks.  some of the definitions are too similar.  (<context> previously we had << kingdom >> and << realm >> as choices.  now the closest is << honest >> and << sincere >>.)  some of the translations are still a bit rough. (<red> the translation of \"beautiful\" into French is << beau/belle >>; the LLMs are very reasonably just returning \"beau\" as the translation)\r\n> fix the model warming.  Just calling the \"warm model\" function correctly doesn't do enough warming.\r\n> add additional tests.  hopefully *now* it will take less than 1 hour to make new tests.\r\n\r\nsome of the suggestions regarding new tests:\r\n<teal> <<< Part of Speech Tagging - Present a sentence and ask the model to identify the part of speech (noun, verb, adjective, etc.) for a specific word.\r\nUnit Conversion - Test the ability to convert between simple units (kilometers to miles, pounds to kilograms).\r\nAnalogies - Simple analogies like \"day is to night as hot is to ___\".\r\nTense Transformation - Provide a sentence in one tense and ask the model to convert it to another tense.\r\nActive/Passive Voice Conversion - Convert sentences between active and passive voice.\r\n>>>","created_at":"2025-03-26T20:51:59.909829","id":329,"llm_annotations":{},"parent_id":327,"processed_content":"<p>today: <a href=\"https://spaceship.computer/greenland/model_summary.html\" target=\"_blank\" rel=\"noopener noreferrer\">https://spaceship.computer/greenland/model_summary.html</a>\r</p>\n<p><div class=\"mlq color-red\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">\ud83d\udca1</span></button><div class=\"mlq-content\"><p> These are \"proficiency\" metrics. <span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\">( although, every time I use the word \"proficiency\" I want to change it)</span>\n  </span>\r</p>\n<p>They are simple tasks, currently: translate a word, choose a definition, choose an antonym, find the misspelled word.  And, for the &gt;4B models, as long as the model <em>knows</em> the language, it does fairly well.  The 1B models do have some difficulties.\r</p>\n<p>The <em>timing</em> data is interesting.  It scales roughly linearly with model size.  The 9B models are about 4 times slower than the 1B models.  Phi-4 (the largest model tested) is also very clearly the slowest model.\r</p>\n<p>Some of the models I was looking at before (Granite, EXAONE, Hermes, Tulu, Mistral) did not make this round of tests.  For Mistral, the 12B model is too old, and their newest release, at 24B, is too large.  The others didn't distinguish themselves enough from similar Llama models to be worth my time (and hard-drive space). </p></div></div>\r</p>\n<p>remaining todo:\r</p>\n<ul>\n<li class=\"arrow-list\"> standardize the logging of prompts and responses.  the <em>full</em> text <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( that is, including the system prompt)</span>\n  </span> should be stored.\r</li>\n<li class=\"arrow-list\"> fix the benchmarks.  some of the definitions are too similar.  <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( previously we had <span class=\"literal-text\">kingdom</span> and <span class=\"literal-text\">realm</span> as choices.  now the closest is <span class=\"literal-text\">honest</span> and <span class=\"literal-text\">sincere</span>.)</span>\n  </span>  some of the translations are still a bit rough. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( the translation of \"beautiful\" into French is <span class=\"literal-text\">beau/belle</span>; the LLMs are very reasonably just returning \"beau\" as the translation)</span>\n  </span>\r</li>\n<li class=\"arrow-list\"> fix the model warming.  Just calling the \"warm model\" function correctly doesn't do enough warming.\r</li>\n<li class=\"arrow-list\"> add additional tests.  hopefully <em>now</em> it will take less than 1 hour to make new tests.\r</li>\n</ul>\n<p>some of the suggestions regarding new tests:\r</p>\n<p><div class=\"mlq color-teal\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">\ud83e\udd16</span></button><div class=\"mlq-content\"><p> Part of Speech Tagging - Present a sentence and ask the model to identify the part of speech (noun, verb, adjective, etc.) for a specific word.\r</p>\n<p>Unit Conversion - Test the ability to convert between simple units (kilometers to miles, pounds to kilograms).\r</p>\n<p>Analogies - Simple analogies like \"day is to night as hot is to ___\".\r</p>\n<p>Tense Transformation - Provide a sentence in one tense and ask the model to convert it to another tense.\r</p>\n<p>Active/Passive Voice Conversion - Convert sentences between active and passive voice.\r</p></div></div></p>","quotes":[],"subject":"minot (part 3)"}
