Last night, Claude wrote some code for the qualification metrics. 💡 ( which might now be called exemplar tasks. The metric is "respond to one prompt".) 🔥 ( to some degree, the goal is to test "changes in context" as much as "changes in model")


An earlier task (from late 2023) was to answer the question: Who was Pablo Arosemena? ⚙️ ( the Wikipedia article Pablo Arosemena is about an obscure politician from Panama)

The 8b models don't know who this is. But they most commonly think he was an obscure painter. 🔥 ( probably because of Pablo Picasso)

Is there some sense this is a true stereotype? Is it more likely he was a painter than that he was a baker, a masseuse, or a peasant farmer? ⚔️ ( well, actually ... it's more likely someone of this name had one of those jobs. But, less likely they were written about.)


I am getting a new computer. 🔥 ( "only" $600) This should allow better speed comparisons between the models. 💡 ( the inconveniences of having an external USB drive, power demands, and heat creation "on my lap" grew to be too much.)


I need to do one more "schema improvement" pass on the "dictionary".

Things like "Chinese translation", "Korean translation" need to be in a dataclass, rather than passed as parameters everywhere.
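A minimal sketch of what that could look like (the field names are illustrative, not the actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Translations:
    """All per-language translations for one headword, bundled together.

    Field names here are made up; None means "no translation recorded yet".
    """
    french: Optional[str] = None
    chinese: Optional[str] = None
    korean: Optional[str] = None
    swahili: Optional[str] = None

# One object gets passed around instead of chinese_translation=..., korean_translation=...
dog = Translations(french="chien", chinese="狗", korean="개", swahili="mbwa")
```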

For now, I want the indexes (and explicit NULL values), so these are database columns, rather than an "all_translations" JSON blob.
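A sketch of the corresponding table, assuming SQLite; the column names are made up, but the point is explicit, indexable, NULL-able columns rather than one "all_translations" blob:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dictionary (
    id        INTEGER PRIMARY KEY,
    headword  TEXT NOT NULL,
    french    TEXT,   -- explicit NULL when missing, not just absent from a blob
    chinese   TEXT,
    korean    TEXT,
    swahili   TEXT
);
CREATE INDEX idx_dictionary_french  ON dictionary(french);
CREATE INDEX idx_dictionary_chinese ON dictionary(chinese);
""")
```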


Some of the benchmarks should be re-written once the "dictionary" API is available.


Claude invented "categories" for the benchmarks: "Language", "Reasoning", "Knowledge", and "Translation". These are ... decent.

But the zeroth category is token introspection: "how many letters are in the word triumphant" questions. 💡 ( even the "spell check" tests that require repeating a misspelled word are probably in this category) ⚙️ ( you can have an LLM without token introspection, but it should be very doable, possibly with some form of API / injection.)
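A minimal sketch of a token-introspection exemplar task, where the expected answer is computable in plain Python (the prompt wording and scoring rule are assumptions):

```python
def letter_count_task(word: str) -> dict:
    """Build one 'respond to one prompt' task with a checkable answer."""
    return {
        "category": "token_introspection",
        "prompt": f'How many letters are in the word "{word}"? Answer with a number.',
        "expected": str(len(word)),
    }

def score(task: dict, model_answer: str) -> bool:
    """Crude check: the expected count appears somewhere in the reply."""
    return task["expected"] in model_answer

task = letter_count_task("triumphant")   # expected: "10"
```

This harness is also where the API / injection idea would live: for a model with no token introspection, the prompt could be augmented with a letter-by-letter spelling of the word.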

Then, Language (starting with definitions and antonyms), Knowledge (starting with basic geography), and Translation (starting with word-based EN-FR, EN-ZH, and SW-KO).

The only "reasoning" task so far might be "unit conversion". But that would have a different name.
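A sketch of what that task could look like, whatever it ends up being called (the conversion factor is real; the field names and tolerance are made up):

```python
def unit_conversion_task(value_km: float) -> dict:
    """One prompt, one numerically checkable answer."""
    return {
        "prompt": f"Convert {value_km} kilometers to miles. Answer with a number.",
        "expected": value_km * 0.621371,
        "tolerance": 0.05,   # accept answers within 5% of the reference value
    }
```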


The "translation" tasks have to deal with the different vocabulary sizes of different languages.

This is one part technical vocabulary 💡 ( does Swahili have a word for capacitor that isn't a loan word?) and one part eccentric distinctions ⚙️ ( Chinese has 二 and 两 for "two")

So far, I have largely mitigated this problem by avoiding it.
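When it can no longer be avoided, one option is to score against a set of acceptable translations per item rather than a single string. A minimal sketch (the word lists are illustrative):

```python
# Several acceptable answers per item absorbs cases like Chinese having
# both 二 and 两 for "two".
ACCEPTED = {
    ("two", "zh"): {"二", "两"},
    ("dog", "fr"): {"chien"},
}

def translation_correct(source_word: str, target_lang: str, answer: str) -> bool:
    """True if the model's answer matches any accepted translation."""
    return answer.strip() in ACCEPTED.get((source_word, target_lang), set())
```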


The dictionary will need some type of "class" system. Specifically, I want to say "get a random animal" and have it do that.
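A minimal sketch of one way to do that without a full taxonomy, assuming a flat tag table alongside the hypothetical dictionary table above (all names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word_class (
    word_id  INTEGER NOT NULL,   -- references dictionary(id)
    class    TEXT NOT NULL       -- flat tags: 'animal', 'color', 'number', ...
);
""")

def random_word_id(conn: sqlite3.Connection, word_class: str):
    """Answer "get a random animal": pick one word_id carrying the given tag."""
    row = conn.execute(
        "SELECT word_id FROM word_class WHERE class = ? ORDER BY RANDOM() LIMIT 1",
        (word_class,),
    ).fetchone()
    return row[0] if row else None
```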

I am putting that off as well, largely because it is a morass of taxonomical hell that has stymied decades of effort.