Channel: Cities - Project Journal
In reply to: edgeley (part 2)
Last night, Claude wrote some code for the qualification metrics. 💡 ( which might now be called exemplar tasks. The metric is "respond to one prompt".) 🔥 ( to some degree, the goal is to test "changes in context" as much as "changes in model")
An earlier task (from late 2023) was to answer the question: Who was Pablo Arosemena? ⚙️ ( the Wikipedia article Pablo Arosemena is about an obscure politician from Panama)
The 8b models don't know who this is. But, they most commonly think he is an obscure painter. 🔥 ( probably because of Pablo Picasso)
Is there some sense in which this is a true stereotype? Is it more likely he was a painter than that he was a baker, a masseuse, or a peasant farmer? ⚔️ ( well, actually ... it's more likely someone of this name had one of those jobs. But it's less likely they were written about.)
I am getting a new computer. 🔥 ( "only" $600) This should allow better speed comparisons between the models. 💡 ( the inconveniences of having an external USB drive, power demands, and heat creation "on my lap" grew to be too much.)
I need to do one more "schema improvement" pass on the "dictionary".
Things like "Chinese translation", "Korean translation" need to be in a dataclass, rather than passed as parameters everywhere.
For now, I want the indexes (and explicit NULL values), so these are database columns, rather than an "all_translations" JSON blob.
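A minimal sketch of what that pass might look like, assuming SQLite and Python dataclasses. The table name, column names, and helper functions here are all hypothetical and not taken from the actual "dictionary" code. The point is only that a missing translation is stored as an explicit NULL in its own indexed column, so "which words have no Korean translation" is a plain query rather than a JSON scan.

```python
import sqlite3
from dataclasses import dataclass, astuple
from typing import Optional


@dataclass
class Translations:
    # Hypothetical field names. None is stored as an explicit SQL NULL.
    chinese: Optional[str] = None
    korean: Optional[str] = None
    french: Optional[str] = None
    swahili: Optional[str] = None


# One column per language (assumed layout), with indexes on two of them.
SCHEMA = """
CREATE TABLE IF NOT EXISTS dictionary (
    word    TEXT PRIMARY KEY,
    chinese TEXT,
    korean  TEXT,
    french  TEXT,
    swahili TEXT
);
CREATE INDEX IF NOT EXISTS idx_dictionary_chinese ON dictionary (chinese);
CREATE INDEX IF NOT EXISTS idx_dictionary_korean ON dictionary (korean);
"""

LANGUAGE_COLUMNS = ("chinese", "korean", "french", "swahili")


def add_entry(conn: sqlite3.Connection, word: str, tr: Translations) -> None:
    # astuple() returns the fields in declaration order, matching the columns.
    conn.execute(
        "INSERT INTO dictionary (word, chinese, korean, french, swahili) "
        "VALUES (?, ?, ?, ?, ?)",
        (word, *astuple(tr)),
    )


def words_missing(conn: sqlite3.Connection, column: str) -> list:
    # The column name is interpolated into SQL, so check it against a whitelist.
    if column not in LANGUAGE_COLUMNS:
        raise ValueError(f"unknown language column: {column!r}")
    rows = conn.execute(
        f"SELECT word FROM dictionary WHERE {column} IS NULL ORDER BY word"
    ).fetchall()
    return [row[0] for row in rows]
```

Passing one `Translations` object around replaces the long parameter lists, and adding a language means one new field plus one new column.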
Some of the benchmarks should be re-written once the "dictionary" API is available.
Claude invented "categories" for the benchmarks: "Language", "Reasoning", "Knowledge", and "Translation". These are ... decent.
But the zeroth category is token introspection, for questions like "how many letters are in the word triumphant". 💡 ( even the "spell check" tests that require repeating a misspelled word are probably in this category) ⚙️ ( you can have an LLM without token introspection. but it should be very doable. possibly with some form of API / injection.)
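The ground truth for these questions is trivial to compute outside the model, which is what makes them a clean test. A rough sketch of how scoring might work, assuming the model's reply is a plain string. The function names are made up for illustration, and the "first integer in the reply" rule is a simplifying assumption (a reply that spells out "ten" would be marked wrong).

```python
import re


def letter_count(word: str) -> int:
    # Count alphabetic characters only, ignoring hyphens, spaces, etc.
    return sum(1 for ch in word if ch.isalpha())


def score_letter_count(word: str, response: str) -> bool:
    # Assumption: the first integer in the reply is the model's answer.
    match = re.search(r"\d+", response)
    return match is not None and int(match.group()) == letter_count(word)


def score_repeat(misspelled: str, response: str) -> bool:
    # The model passes only if it reproduces the misspelling verbatim,
    # rather than silently "fixing" it.
    return misspelled in response
```

For example, `letter_count("triumphant")` is 10, so a reply of "It has 10 letters." passes and "It has 9 letters." fails.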
Then, Language (starting with definitions and antonyms), Knowledge (starting with basic geography), and Translation (starting with EN-FR, EN-ZH, SW-KO - word-based).
The only "reasoning" task so far might be "unit conversion". But that would have a different name.
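One way the ordering could be written down, as a sketch only. The enum and task names are hypothetical. `REASONING` is a placeholder (the category is expected to get a different name), and its position at the end is an assumption, since only the first four positions are implied above.

```python
from enum import IntEnum


class Category(IntEnum):
    TOKEN_INTROSPECTION = 0  # the "zeroth" category
    LANGUAGE = 1
    KNOWLEDGE = 2
    TRANSLATION = 3
    REASONING = 4  # placeholder name and position (assumption)


# Hypothetical task names for the starting points listed above.
STARTING_TASKS = {
    "letter_count": Category.TOKEN_INTROSPECTION,
    "repeat_misspelled_word": Category.TOKEN_INTROSPECTION,
    "definitions": Category.LANGUAGE,
    "antonyms": Category.LANGUAGE,
    "basic_geography": Category.KNOWLEDGE,
    "word_translation_en_fr": Category.TRANSLATION,
    "word_translation_en_zh": Category.TRANSLATION,
    "word_translation_sw_ko": Category.TRANSLATION,
    "unit_conversion": Category.REASONING,
}


def tasks_in(category: Category) -> list:
    # Names of all starting tasks in one category, sorted for stable output.
    return sorted(name for name, cat in STARTING_TASKS.items() if cat == category)
```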
The "translation" tasks have to deal with the different vocabulary sizes of different languages.
This is one part technical vocabulary 💡 ( does Swahili have a word for capacitor that isn't a loan word?) and one part eccentric distinctions ⚙️ ( Chinese has 二 and 两 for "two")
So far, I have largely mitigated this problem by avoiding it.
The dictionary will need some type of "class" system. Specifically, I want to say "get a random animal" and have it do that.
I am putting that off as well. Largely because it is a morass of taxonomical hell that has stymied decades of efforts.
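A stopgap that sidesteps the taxonomy question entirely might be flat, hand-assigned tags with no hierarchy. This is a sketch under that assumption, not a design for the real class system. The entries, tag names, and `random_entry` function are invented for illustration.

```python
import random

# Hypothetical flat tags: no hierarchy, no inheritance, just labels per word.
ENTRY_CLASSES = {
    "dog": {"animal"},
    "cat": {"animal"},
    "eagle": {"animal"},
    "oak": {"plant"},
    "capacitor": {"component"},
}


def random_entry(class_name: str, rng=None) -> str:
    # Accept a random.Random instance so benchmark runs can be seeded.
    rng = rng if rng is not None else random
    candidates = sorted(
        word for word, classes in ENTRY_CLASSES.items() if class_name in classes
    )
    if not candidates:
        raise LookupError(f"no entries tagged {class_name!r}")
    return rng.choice(candidates)
```

`random_entry("animal")` then returns one of the three tagged words. Whether "eagle" should also match "bird", and whether "bird" implies "animal", is exactly the part being put off.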