View All Cities Messages
2025-03-24 18:37:04

yesterday:

  • test ChatGPT's new "custom voices" feature. 🔥 ( can it do a xantham-y voice?)

The voices are ... good, but not good enough. I can't see myself building a product on top of this version. Maybe the next version. The API is simple enough that it would take about 2 days to come up with an MVP.

  • test Gemma3:4b and Gemma3:1b locally

The results are good, but not magical. They are tuned for chatting, but struggle more than Gemma2 on mechanical tasks 💡 ( copy the misspelled word from this sentence) . They do fairly well on the accuracy / time tradeoffs 💡 ( because the hardware is the same, "time" is an accurate way of measuring "cost")
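The accuracy/time tradeoff is easy to measure with a small harness. A sketch of the idea — `run_model` here is a hypothetical stand-in for the real inference call (e.g. an HTTP request to a local Ollama server), not actual code from the dashboard:

```python
import time

def run_model(prompt):
    # Hypothetical stand-in for a real local-inference call.
    return "answer"

def score_with_timing(prompts, expected):
    """Return (accuracy, total_seconds) for one model over a task set."""
    correct = 0
    start = time.perf_counter()
    for prompt, want in zip(prompts, expected):
        if run_model(prompt).strip() == want:
            correct += 1
    elapsed = time.perf_counter() - start
    return correct / len(prompts), elapsed

acc, secs = score_with_timing(
    ["Copy the misspelled word from this sentence: 'the qick fox'"],
    ["qick"],
)
```

Because every model runs on the same hardware, wall-clock seconds are directly comparable across models.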

  • get the greenland quals to be available on the public internet 🔥 ( I still haven't chosen a domain for it. Should earlyversion have file hosting? Probably not.)

What is "universe"?

It is my archive of Git repositories, or similar.

1372_ea_games % du -hs
286M    CnC_Red_Alert
673M    CnC_Renegade
51M     CnC_Tiberian_Dawn

for today:

  • a "watchdog" for Atacama? the machine has become wedged twice in 3 months. both fixes were of the "just restart the server" persuasion. ⚔️ ( well, actually, the problem is that the lexer has an infinite loop on a single unpaired asterisk.)
  • more blogging, less technical work
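The unpaired-asterisk loop is easy to reproduce in miniature. A sketch of the guard (not the actual lexer, which isn't shown here): when the closing marker is missing, emit the asterisk as literal text and advance — forgetting to advance is exactly the single-character infinite loop a watchdog can only paper over.

```python
def lex_emphasis(text):
    """Tokenize '*' emphasis markers, falling back to literal text
    for an unpaired '*' instead of re-scanning it forever."""
    tokens, i = [], 0
    while i < len(text):
        if text[i] == "*":
            close = text.find("*", i + 1)
            if close == -1:
                # Unpaired asterisk: emit it as plain text and ADVANCE.
                tokens.append(("TEXT", "*"))
                i += 1
            else:
                tokens.append(("EMPH", text[i + 1:close]))
                i = close + 1
        else:
            j = text.find("*", i)
            j = len(text) if j == -1 else j
            tokens.append(("TEXT", text[i:j]))
            i = j
    return tokens
```

Every branch strictly increases `i`, so the loop always terminates.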
2025-03-25 17:42:51

for today:

  • one investment meeting ⚙️ ( as always, further information withheld)
  • tutoring 💡 ( details withheld, but some information may make it to a post) 🌎 ( due to staff illness, tutoring was canceled)
  • maybe try again to get Claude to see the light about the lexer/parser system 💡 ( currently, the lexer is looking ahead to capture an entire text block for the emphasis tag, rather than letting the parser do this)
  • add HTML metadata linking to the RSS feed here.
  • maybe get the LLM dashboard available publicly 💡 ( but first, I would like 3 or 4 more of the "proficiency" tests on the dashboard.)
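The division of labor I want for the lexer/parser looks roughly like this — a hypothetical sketch, not the real code: the lexer emits flat STAR/TEXT tokens with no lookahead, and the parser decides how stars pair up (with an unpaired star degrading gracefully to text).

```python
def lex(text):
    # No lookahead: '*' is just a STAR token, everything else is TEXT.
    tokens, buf = [], []
    for ch in text:
        if ch == "*":
            if buf:
                tokens.append(("TEXT", "".join(buf)))
                buf = []
            tokens.append(("STAR", "*"))
        else:
            buf.append(ch)
    if buf:
        tokens.append(("TEXT", "".join(buf)))
    return tokens

def parse(tokens):
    """Pair STAR tokens into EMPH nodes; an unpaired STAR becomes text."""
    out, i = [], 0
    while i < len(tokens):
        kind, value = tokens[i]
        if (kind == "STAR" and i + 2 < len(tokens)
                and tokens[i + 1][0] == "TEXT"
                and tokens[i + 2][0] == "STAR"):
            out.append(("EMPH", tokens[i + 1][1]))
            i += 3
        else:
            out.append(("TEXT", value))
            i += 1
    return out
```

The payoff is that the emphasis policy (including the unpaired case) lives in one place, the parser, instead of being smeared across the lexer's lookahead.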

apparently https://github.com/ollama/ollama/issues/7978 is fixed. This was the "Ollama always returns structured-JSON responses in alphabetical order" bug, which, when the two fields were "thought" and "answer", was a major problem.
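For context, the reason field order mattered: generation is sequential, so the model emits whichever field is serialized first. A tiny illustration with plain `json` (not the Ollama API itself):

```python
import json

fields = ["thought", "answer"]  # intended order: reason first, then answer

alphabetical = json.dumps({k: "..." for k in fields}, sort_keys=True)
declared = json.dumps({k: "..." for k in fields})

# With alphabetical ordering, "answer" precedes "thought" -- the model is
# forced to commit to an answer before producing its reasoning.
```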

I need Claude to remove the code that was added to work around this problem. That is at least a 1-hour time commitment.

At that point, Claude should be able to construct new "proficiency" benchmarks fairly easily. 💡 ( with a perfect system, it would take only 5 minutes per test. In the current world, I am hoping for 20-30 minutes.)

2025-03-26 20:51:59

today: https://spaceship.computer/greenland/model_summary.html

These are "proficiency" metrics. 🔥 ( although, every time I use the word "proficiency" I want to change it)

They are simple tasks, currently: translate a word, choose a definition, choose an antonym, find the misspelled word. And, for the >4B models, as long as the model knows the language, it does fairly well. The 1B models do have some difficulties.

The timing data is interesting. It is, roughly, linear in model size. The 9B models are about 4 times slower than the 1B models, and Phi-4 (the largest model tested) is also very clearly the slowest model.

Some of the models I was looking at before (Granite, EXAONE, Hermes, Tulu, Mistral) did not make this round of tests. For Mistral, the 12B model is too old, and their newest release, at 24B, is too large. The others didn't distinguish themselves enough from similar Llama models to be worth my time (and hard-drive space).

remaining todo:

  • standardize the logging of prompts and responses. the full text ⚙️ ( that is, including the system prompt) should be stored.
  • fix the benchmarks. some of the definitions are too similar. ⚙️ ( previously we had kingdom and realm as choices. now the closest is honest and sincere.) some of the translations are still a bit rough. 💡 ( the translation of "beautiful" into French is beau/belle, the LLMs are very reasonably just returning "beau" as the translation)
  • fix the model warming. Even when the "warm model" function is called correctly, it doesn't do enough warming.
  • add additional tests. hopefully now it will take less than 1 hour to make new tests.
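For the rough-translation problem, the grader could simply accept any listed surface form, so gendered variants like beau/belle both count as correct. A hypothetical sketch (not the dashboard's actual grading code):

```python
def grade_translation(answer, accepted):
    # Accept any listed surface form, case- and whitespace-insensitively.
    return answer.strip().lower() in {a.strip().lower() for a in accepted}

# Hypothetical benchmark item for the beau/belle case mentioned above.
item = {
    "prompt": "Translate 'beautiful' into French.",
    "accepted": ["beau", "belle"],
}
```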

some of the suggestions regarding new tests:

  • Part of Speech Tagging - present a sentence and ask the model to identify the part of speech (noun, verb, adjective, etc.) for a specific word.
  • Unit Conversion - test the ability to convert between simple units (kilometers to miles, pounds to kilograms).
  • Analogies - simple analogies like "day is to night as hot is to ___".
  • Tense Transformation - provide a sentence in one tense and ask the model to convert it to another tense.
  • Active/Passive Voice Conversion - convert sentences between active and passive voice.
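One way these might be encoded, sketched for the unit-conversion case — a hypothetical helper in the spirit of the existing one-answer-per-item tests, not the dashboard's actual format. Numeric answers get a tolerance check rather than exact string matching:

```python
KM_PER_MILE = 1.609344

def make_unit_conversion_item(km):
    """Build one hypothetical benchmark item: km -> miles."""
    miles = km / KM_PER_MILE
    return {
        "prompt": (f"Convert {km} kilometers to miles. "
                   "Answer with a number only."),
        # Accept answers within 2% of the true value.
        "check": lambda ans, want=miles: abs(float(ans) - want) / want < 0.02,
    }

item = make_unit_conversion_item(10)
```

A tolerance like 2% avoids penalizing a model for rounding (10 km is about 6.21 miles, but "6.2" should still pass).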