{"channel":"llm","content":"Gemma 3 is out: https://blog.google/technology/developers/gemma-3/\r\n\r\nAlso out, since my last evaluations: Claude 3.7, GPT-4.5, QwQ-32B.\r\n\r\n----\r\n\r\nThere are a few \"smoke tests\" I want to run.  But, beyond that, I'm not certain I will have the time or interest to do any deep evaluations.\r\n\r\nI already know that \"8B\" models can do some tasks at a reasonable speed, and can't do other tasks.  It is very unlikely that the new models will move the needle.\r\n\r\nAs far as the new very-large models are concerned: my initial impressions have not shown them to be a substantial improvement.  There is more \"DeepSeek\" style << internal narrative >>, but the results are often worse for it. (<red> was it a bad test?  do I need to change the prompts?  or are they privileging \"results that make stupid people think the *machine* is smart\" over accurate results?)","created_at":"2025-03-12T14:39:26.318220","id":299,"llm_annotations":{},"parent_id":null,"processed_content":"<p>Gemma 3 is out: <a href=\"https://blog.google/technology/developers/gemma-3/\" target=\"_blank\" rel=\"noopener noreferrer\">https://blog.google/technology/developers/gemma-3/</a>\r</p>\n<p>Also out, since my last evaluations: Claude 3.7, GPT-4.5, QwQ-32B.\r</p> <hr class=\"section-break\" /> <p>There are a few \"smoke tests\" I want to run.  But, beyond that, I'm not certain I will have the time or interest to do any deep evaluations.\r</p>\n<p>I already know that \"8B\" models can do some tasks at a reasonable speed, and can't do other tasks.  It is very unlikely that the new models will move the needle.\r</p>\n<p>As far as the new very-large models are concerned: my initial impressions have not shown them to be a substantial improvement.  There is more \"DeepSeek\" style <span class=\"literal-text\">internal narrative</span>, but the results are often worse for it. 
<span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">(was it a bad test?  do I need to change the prompts?  or are they privileging \"results that make stupid people think the <em>machine</em> is smart\" over accurate results?)</span>\n  </span></p>","quotes":[],"subject":"another generation"}