Channel: LLM - Large Language Model discussion
In reply to: greenland, a post-mortem, part 2 (View Chain)
JSON output is almost a necessity for an LLM to be usable programmatically today. All of the major LLM platforms support it in some form. But, if you are using a model from 2023, it might not support it, or it might not work very well.
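As a concrete example, here is roughly what "some form" looks like on OpenAI's API. A minimal sketch, assuming the openai Python client and an API key in the environment; the model name and prompt are placeholders:

```python
# Minimal sketch: asking for JSON-constrained output via the OpenAI client.
# JSON mode requires the word "JSON" to appear somewhere in the messages.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4.1-nano",                     # placeholder; any JSON-mode-capable model
    messages=[
        {"role": "system", "content": "Reply in JSON."},
        {"role": "user", "content": "List three fruits with their colors."},
    ],
    response_format={"type": "json_object"},  # the structured-output switch
)
print(resp.choices[0].message.content)        # guaranteed to parse as JSON
```

Gemini and Claude have their own versions of the same knob (schema-constrained output, tool use), but the shape is similar.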
While many of the improvements over the past 2 years have come from the tooling that runs the LLM 💡 ( such as the token-selection algorithm) , some understanding of the output format does need to be trained into the model itself.
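To make "in the tools" concrete: the token-selection step lives entirely in the runtime, outside the weights. A toy top-k / temperature sampler, just to show the shape of it (all numbers made up):

```python
import math, random

def sample_top_k(logits, k=40, temperature=0.8):
    """Toy token-selection step: keep the k most likely tokens,
    rescale by temperature, softmax, and sample one index."""
    top = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:k]
    scaled = [(i, v / temperature) for i, v in top]
    m = max(v for _, v in scaled)                  # subtract max for stability
    weights = [math.exp(v - m) for _, v in scaled]
    r = random.random() * sum(weights)
    for (i, _), w in zip(scaled, weights):
        r -= w
        if r <= 0:
            return i
    return scaled[-1][0]

# Pretend logits for a 5-token vocabulary (made-up numbers):
print(sample_top_k([0.1, 2.5, -1.0, 0.7, 3.2], k=3))
```

Swap this function out and the same weights behave differently; that is the kind of improvement that needs no retraining.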
Trying to test Phi-2 (December 2023, 2.7B params) or the original Mistral 7B (September 2023) seems unlikely to be worth the time/effort. Newer models are simply better, and I'm not sure the old ones would produce usable results at all.
Does that mean the models we have today will be useless in 18 months? Probably not. Maybe there will be a GPT-4.1-nano-quality model that costs 2c IN / 5c OUT per million tokens 💡 ( currently GPT-4.1-nano is 10c IN / 40c OUT per million tokens) . For almost all personal uses, that price drop is not a substantial improvement.
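The back-of-the-envelope on why it doesn't matter (usage figures are made up to represent a heavy personal month):

```python
# Assumed heavy personal month: 2M tokens in, 0.5M tokens out.
tokens_in, tokens_out = 2_000_000, 500_000

current = tokens_in / 1e6 * 0.10 + tokens_out / 1e6 * 0.40  # $0.10 / $0.40 per M
cheaper = tokens_in / 1e6 * 0.02 + tokens_out / 1e6 * 0.05  # hypothetical future price

print(f"current: ${current:.2f}/month, cheaper: ${cheaper:.2f}/month")
# Both land well under a dollar a month; the 5x cut saves pennies.
```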
Whether Falcon 3 ⚙️ ( https://huggingface.co/blog/falcon3 ) is worth considering is a different question.
Their press release has benchmarks showing them as slightly better than earlier systems of similar size. But nothing groundbreaking; and in fact we know there can't be anything too unique, because if there were, it would already have been copied.
It is "just another model". 💬 ( if you want to build a forest, it helps to have many different trees)
What about Granite (the IBM offering)? ⚙️ ( https://www.ibm.com/granite/docs/ )
This one I happened to have already tested. The results were unremarkable: like most 8B models, it gave acceptable results for tasks that did not require deep insight or precision.
The highest-profile "local models" are Gemma ⚙️ ( Google's latest model) , Llama ⚙️ ( Meta's latest model) , Qwen ⚙️ ( Alibaba's latest model) , and Phi ⚙️ ( Microsoft's latest model) . 🔥 ( Amazon and Apple do not seem to be releasing their own models. Netflix is not, either.) 💡 ( there are others; Mistral is probably the leading European provider.) 🔥 ( I still don't care about DeepSeek; the "thought" is largely a party-trick that people will see through soon enough ... most other models do that in some way now, anyway.)
And, all of these seem to be hitting limits at the 8B param size. The latest releases are more interesting at the 24-40B param size, which can be run on a local machine ... just not the ones I own.
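The rough memory math behind "just not the ones I own" (assuming 4-bit quantization and ~20% overhead for KV cache and runtime; assumptions, not measurements):

```python
params = 32e9                  # a mid-sized "32B" model
bytes_per_param = 0.5          # 4 bits per weight
needed_gb = params * bytes_per_param * 1.2 / 2**30
print(f"~{needed_gb:.0f} GB")  # ~18 GB: more RAM/VRAM than most consumer machines spare
```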
The 1.5B parameter models are useful for speculative decoding ⚙️ ( https://research.google/blog/looking-back-at-speculative-decoding/ ) , where a small model makes cheap "guesses" that the larger model then verifies all at once, so each expensive forward pass can produce several tokens instead of one.
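A minimal sketch of the loop, with toy deterministic rules standing in for the real draft/target pair (everything here is made up for illustration; in a real system step 2 is a single batched forward pass, which is where the speedup comes from):

```python
import string

VOCAB = list(string.ascii_lowercase)

def target_rule(ctx):
    """Toy stand-in for the big model's greedy next-token choice."""
    return VOCAB[sum(map(ord, ctx)) % len(VOCAB)]

def draft_rule(ctx):
    """Toy stand-in for the small draft model: cheap, mostly right."""
    h = sum(map(ord, ctx))
    if h % 5 == 0:              # disagrees with the target ~1 time in 5
        h += 1
    return VOCAB[h % len(VOCAB)]

def speculative_step(context, k=4):
    """One round: draft proposes k tokens, target verifies them all."""
    # 1. Draft model guesses k tokens, one cheap call each.
    proposed = []
    for _ in range(k):
        proposed.append(draft_rule(context + "".join(proposed)))
    # 2. Target checks every prefix (batched into one pass in real life).
    verified = [target_rule(context + "".join(proposed[:i]))
                for i in range(k + 1)]
    # 3. Keep the longest agreeing prefix, plus one token the target
    #    is guaranteed to have chosen itself.
    n = 0
    while n < k and proposed[n] == verified[n]:
        n += 1
    return context + "".join(proposed[:n]) + verified[n]

text = "ab"
while len(text) < 30:
    text = speculative_step(text)
print(text)  # identical to decoding with target_rule alone, in fewer rounds
```

Each round yields between 1 and k+1 tokens, and (for greedy decoding) the output is exactly what the target model alone would have produced.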
Beyond that, they are largely toys. With fine-tuning and testing, you can probably use a model for a single useful task. But the 1.5B models are not general-purpose AI, and they probably never will be.
For "cloud" models, there is Gemini ⚙️ ( Google) , GPT ⚙️ ( OpenAI) , and Claude ⚙️ ( Anthropic) . And, several others that I haven't bothered with. 💡 ( Perplexity has an API called Sonar. Amazon has something called Nova. And there is still TSFKAT's offering.) ⚙️ ( TSFKAT = "the site formerly known as Twitter")
And ... without a specific work task, benchmarking / testing these models is unlikely to produce any useful data.