another generation

Gemma 3 is out: https://blog.google/technology/developers/gemma-3/

Also out, since my last evaluations: Claude 3.7 Sonnet, GPT-4.5, QwQ-32B.


There are a few "smoke tests" I want to run. But, beyond that, I'm not certain I will have the time or interest to do any deep evaluations.

I already know that "8B" models can do some tasks at a reasonable speed and can't do others. It is very unlikely that the new small models will move the needle.

As far as the new very-large models are concerned: my initial impressions have not shown them to be a substantial improvement. There is more "DeepSeek"-style internal narrative, but the answers are often worse for it. 💡 (Was it a bad test? Do I need to change the prompts? Or are they privileging "results that make stupid people think the machine is smart" over accurate results?)