Recent Messages

2025-03-31 22:05:35

Another round of "chainsaw coding" with Claude. 🔥 ( the generators now use generators!) 💡 ( Previously, Claude's approach was to just make one question without history. But we do want history. Use all the canned questions from text files (once), then generate files from local data (without repeating), then ask the LLM.)
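
A minimal sketch of that ordering, for my own reference. The function names and the `ask_llm` callable are hypothetical, not the actual greenland interfaces; the only point is that the question stream is a generator built out of three smaller generators.

```python
from itertools import chain, islice
from typing import Callable, Iterable, Iterator, List


def canned_questions(lines: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty canned question once, in file order."""
    for line in lines:
        text = line.strip()
        if text:
            yield text


def local_questions(words: Iterable[str]) -> Iterator[str]:
    """Build questions from local data, skipping words already used."""
    seen = set()
    for word in words:
        if word not in seen:
            seen.add(word)
            yield f"What is the opposite of '{word}'?"


def llm_questions(ask_llm: Callable[[], str]) -> Iterator[str]:
    """Last resort: keep asking the LLM. This source never runs dry."""
    while True:
        yield ask_llm()


def question_stream(lines, words, ask_llm) -> Iterator[str]:
    """Canned questions first, then local data, then the LLM."""
    return chain(canned_questions(lines), local_questions(words), llm_questions(ask_llm))


def take(stream: Iterator[str], n: int) -> List[str]:
    """Pull the first n questions from a (possibly endless) stream."""
    return list(islice(stream, n))
```

Because the LLM source is endless, the caller decides how many questions to pull, e.g. `take(question_stream(lines, words, ask_llm), 20)`.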

It mostly works. The LLM generation is slightly flaky. It gave both "sad" and "melancholy" as candidate opposites for "happy". It is including the pinyin for Chinese translations some of the time.

But at least it is (mostly) using the correct interfaces now. 💡 ( there are still some useless parameters. The "tags" for questions are meaningless. The difficulty is arbitrary. The evaluation criteria are often silly.)


The various "general knowledge" benchmarks should be quick to write 🔥 ( tomorrow) . Right now I see six categories for the initial tests:

  • History
  • Geography
  • Chemistry
  • Biology
  • Sports 🔥 ( American athletes, mostly)
  • Music 💡 ( 20th century English-language music, mostly)

These will be "easy" questions. Or, at least, multiple-choice.

2025-03-28 13:41:49

yesterday:

  • three new "greenland" benchmarks: letter count, unit conversion, part-of-speech detection. 💡 ( it should not be a surprise to the contemporary reader that the models struggle the most with "letter count" - how many "r"s are in strawberry.)
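
The scoring side of "letter count" is the easy part. A sketch, assuming the model is asked to reply with a bare integer; the function names are hypothetical and not the real benchmark code.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


def score_letter_count(word: str, letter: str, model_answer: str) -> bool:
    """Return True when the model's reply parses to the correct count."""
    try:
        return int(model_answer.strip()) == count_letter(word, letter)
    except ValueError:
        # Replies such as "two" or "There are 3" count as wrong in this sketch.
        return False
```

A stricter or looser parse of the reply is a design choice; this version rejects anything that is not a plain integer.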

still to do:

  • code cleanup 💡 ( the "run" method is written in slightly different form seven times)
  • more benchmarks
  • dashboard UI improvements

2025-03-26 20:51:59

today: https://spaceship.computer/greenland/model_summary.html

These are "proficiency" metrics. 🔥 ( although, every time I use the word "proficiency" I want to change it)

They are simple tasks, currently: translate a word, choose a definition, choose an antonym, find the misspelled word. And, for the >4B models, as long as the model knows the language, it does fairly well. The 1B models do have some difficulties.

The timing data is interesting. Response time is, roughly, linear in model size. The 9B models are about 4 times slower than the 1B models. Phi-4 (the largest model tested) is also very clearly the slowest model.

Some of the models I was looking at before (Granite, ExaONE, Hermes, Tulu, Mistral) did not make this round of tests. For Mistral, the 12B model is too old, and their newest release, at 24B, is too large. The others didn't distinguish themselves enough from similar Llama models to be worth my time (and hard-drive space).

remaining todo:

  • standardize the logging of prompts and responses. the full text ⚙️ ( that is, including the system prompt) should be stored.
  • fix the benchmarks. some of the definitions are too similar. ⚙️ ( previously we had kingdom and realm as choices. now the closest is honest and sincere.) some of the translations are still a bit rough. 💡 ( the translation of "beautiful" into French is beau/belle, the LLMs are very reasonably just returning "beau" as the translation)
  • fix the model warming. Just calling the "warm model" function correctly doesn't do enough warming.
  • add additional tests. hopefully now it will take less than 1 hour to make new tests.
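
One possible fix for the beau/belle case is to let the reference list several accepted forms. A sketch, assuming references are stored as slash-separated strings; that storage format and both function names are assumptions, not the current benchmark code.

```python
def accepted_forms(reference: str) -> set[str]:
    """Split a reference like 'beau/belle' into the set of accepted answers."""
    return {form.strip().lower() for form in reference.split("/") if form.strip()}


def translation_correct(reference: str, model_answer: str) -> bool:
    """Accept the answer if it matches any listed form, ignoring case and a trailing period."""
    answer = model_answer.strip().lower().rstrip(".")
    return answer in accepted_forms(reference)
```

With this, a model that returns just "beau" for "beautiful" is scored as correct, which matches what a reasonable grader would do.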

some of the suggestions regarding new tests:

Part of Speech Tagging - Present a sentence and ask the model to identify the part of speech (noun, verb, adjective, etc.) for a specific word.

Unit Conversion - Test ability to convert between simple units (kilometers to miles, pounds to kilograms).

Analogies - Simple analogies like "day is to night as hot is to ___".

Tense Transformation - Provide a sentence in one tense and ask the model to convert it to another tense.

Active/Passive Voice Conversion - Convert sentences between active and passive voice.

2025-03-25 17:42:51

for today:

  • one investment meeting ⚙️ ( as always, further information withheld)
  • tutoring 💡 ( details withheld, but some information may make it to a post) 🌎 ( due to staff illness, tutoring was canceled)
  • maybe try again to get Claude to see the light about the lexer/parser system 💡 ( currently, the lexer is looking ahead to capture an entire text block for the emphasis tag, rather than letting the parser do this)
  • add HTML metadata linking to the RSS feed here.
  • maybe get the LLM dashboard available publicly 💡 ( but first, I would like 3 or 4 more of the "proficiency" tests on the dashboard.)
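
What I mean by "letting the parser do this", as a sketch. The token names (STAR, TEXT, EM) are hypothetical and much simpler than the real Atacama markup. The lexer emits one token per asterisk and always advances, so it cannot hang on an unpaired asterisk; pairing is left to the parser.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text)


def lex(source: str) -> List[Token]:
    """Split text into STAR and TEXT tokens, with no lookahead for a closing star."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(source):
        if source[pos] == "*":
            tokens.append(("STAR", "*"))
            pos += 1  # always advance, even if no closing star exists
        else:
            end = source.find("*", pos)
            if end == -1:
                end = len(source)
            tokens.append(("TEXT", source[pos:end]))
            pos = end  # end > pos here, so the loop makes progress
    return tokens


def parse(tokens: List[Token]) -> List[Token]:
    """Pair STAR tokens into EM spans; an unpaired star becomes literal text."""
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        kind, text = tokens[i]
        if kind == "STAR":
            # Look for a closing star later in the token stream.
            close = next((j for j in range(i + 1, len(tokens)) if tokens[j][0] == "STAR"), None)
            if close is None:
                out.append(("TEXT", "*"))
                i += 1
            else:
                inner = "".join(t for _, t in tokens[i + 1:close])
                out.append(("EM", inner))
                i = close + 1
        else:
            out.append((kind, text))
            i += 1
    return out
```

For example, `parse(lex("5 * 3"))` keeps the lone asterisk as plain text instead of searching forever for its partner.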

apparently https://github.com/ollama/ollama/issues/7978 is fixed. This was the "Ollama always returns structured-JSON responses in alphabetical order" bug. Which, when the two fields were thought and answer, was a major problem.
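
To show why the ordering matters, here is an illustration using only Python's standard `json` module. It does not call Ollama; the field values are made up. Alphabetical ordering puts answer ahead of thought, so a model that writes its output left to right would presumably have to commit to an answer before writing any of its reasoning.

```python
import json

# Hypothetical response fields, in the order the prompt intends:
# reason first, then commit to an answer.
intended = {"thought": "The opposite of happy is sad.", "answer": "sad"}

as_designed = json.dumps(intended)                   # keeps insertion order
alphabetized = json.dumps(intended, sort_keys=True)  # mimics the alphabetical ordering


def field_order(payload: str) -> list:
    """Return the top-level keys of a JSON object in document order."""
    return list(json.loads(payload).keys())
```

Here `field_order(as_designed)` gives thought then answer, while `field_order(alphabetized)` gives answer then thought.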

I need Claude to remove the code that was added to work around this problem. Which is at least a 1-hour time commitment.

At that point, Claude should be able to construct new "proficiency" benchmarks fairly easily. 💡 ( with a perfect system, it would take only 5 minutes per test. In the current world, I am hoping for 20-30 minutes.)

2025-03-24 20:07:33

In my testing, I am starting to make a distinction between two types of "tests" for LLMs.


A proficiency test covers simple tasks. Some examples:

  • Repeat the misspelled word in this sentence.
  • Translate this word from English to French. 💡 ( the linguistic knowledge of models is an unresolved question. Should it know 5 languages, or 40, or 400? In the specific case of English/French: it is plausible to claim that one cannot truly know the English language without knowing French. The LLM should also know French.)
  • Choose the definition of this word.

The accuracy in performing these tasks is, often, surprisingly bad, compared to performance on other tasks. This may be due to a lack of training for these tasks.


On the other hand, a qual test ⚙️ ( possibly for "qualification") is more difficult.

  • Write two paragraphs about the city of Toulouse.
  • Explain the theory of relativity to a nine-year-old.
  • Answer these questions from the GRE Verbal Reasoning section.

From a technical perspective: many of these are free-form responses that are scored by a larger LLM.

The interesting question is not whether any LLM can answer these, but whether an LLM under 16GB in size can do so.


The Frontier tests are not particularly interesting.

2025-03-24 19:54:31

Today's background viewing: Satisfactory.

The commentary is regularly comparing it to Factorio.

The obvious difference is obvious: Satisfactory is a first-person view, Factorio is a top-down view.


The "production chains" are similar. And, similarly abstract.

Copper mine to Copper furnace to Copper sheet press.

Is this how copper is actually made? It doesn't matter.


The "magical multi-function assembly machines" are shared as well. It is, once again, an abstraction. If you assume you are just building the blueprint, the abstractions 🔥 ( how does the character carry 400 buildings and 20 locomotives) are manageable.


In Satisfactory, a lot of the experience is "exploring the game-world". Climb trees, swim across rivers.

I find the graphics to be slightly nauseating. But they are "good".

What are the levels of quality of graphics?

1. ASCII art.

2. Vector graphics.

3. Basic 3-D.

4. Advanced 3-D.

5. Photo-realistic.

This game would be at level 4.


Satisfactory, because of the UI, is much more time-consuming. "Run a pipe with water between two points" goes from a 1-minute task to a 10-minute task. 💡 ( task is the correct word. These games are, largely, about determining and executing on a long series of tasks.)

2025-03-24 18:37:04

yesterday:

  • test ChatGPT's new "custom voices" feature. 🔥 ( can it do a xantham-y voice?)

The voices are ... good, but not good enough. I can't see myself building a product on top of this version. Maybe the next version. The API is simple enough that it would take about 2 days to come up with an MVP.

  • test Gemma3:4b and Gemma3:1b locally

The results are good, but not magical. They are tuned for chatting, but struggle more than Gemma2 on mechanical tasks 💡 ( copy the misspelled word from this sentence) . They do fairly well on the accuracy / time tradeoffs 💡 ( because the hardware is the same, "time" is an accurate way of measuring "cost")

  • get the greenland quals to be available on the public internet 🔥 ( I still haven't chosen a domain for it. Should earlyversion have file hosting? Probably not.)

What is universe?

It is my archive of Git repositories, or similar.

1372_ea_games % du -hs

286M CnC_Red_Alert

673M CnC_Renegade

51M CnC_Tiberian_Dawn


for today:

  • a "watchdog" for Atacama? the machine has become wedged twice in 3 months. both fixes were of the "just restart the server" persuasion. ⚔️ ( well, actually, the problem is that the lexer has an infinite loop on a single unpaired asterisk.)
  • more blogging, less technical work

2025-03-22 17:28:35

Trying to get the machine to explain a distinction 🔥 ( I find this debate too annoying to write up myself)

Democracy and institutional independence represent distinct dimensions of societal organization that can exist separately or together. Democracy fundamentally concerns how political power is allocated—through free elections with broad suffrage, competing parties, and peaceful power transitions—while institutional independence relates to the autonomy that non-governmental entities maintain from state control, regardless of how that state is governed. A democratic system can feature strong governmental oversight of universities, companies, and civic organizations if voters support such arrangements, while conversely, an authoritarian regime might grant significant autonomy to certain institutions (particularly economic ones) while denying citizens meaningful political choice. The clearest distinction emerges when considering that democracy addresses who holds and transfers political power, while institutional independence concerns the boundaries of where that power reaches—two separate questions that societies answer in various combinations rather than a single unified concept.


The relationship between democracy and institutional independence is complex and not strictly correlated.

Democracy primarily concerns how political leaders are selected (through free and fair elections) and how government power is distributed and constrained. Institutional independence relates to how much autonomy non-governmental organizations have from state control.

These dimensions can combine in various ways:

Democratic systems with high institutional independence: Countries like the United States, Canada, and Germany maintain democratic elections while allowing significant autonomy for companies, universities, and civic organizations.

Democratic systems with more state involvement: Countries like France or the Nordic nations have robust democratic processes but may feature more state coordination in certain sectors (like education or healthcare) while still maintaining institutional autonomy in many areas.

Non-democratic systems with limited institutional independence: China, Russia, and Iran allow few or no meaningful democratic choices while also maintaining strong state control over institutions.

More rare: Non-democratic systems with some institutional independence: Singapore has limited political competition but grants significant autonomy to businesses and universities in many domains. The UAE similarly restricts political rights while allowing considerable economic freedom.


What does it mean?

Democracy doesn't guarantee institutional independence. Democratically elected governments can still choose to exercise strong control over universities, businesses, or the press.

Institutional independence doesn't guarantee democracy. Some authoritarian states grant limited autonomy to certain sectors (especially economic ones) while restricting political rights.

Many democracies balance institutional independence with public accountability through regulation rather than direct control.

The most comprehensive liberty tends to exist in systems that combine democratic political rights with substantial institutional independence, creating multiple centers of power and preventing excessive concentration of authority.

The degree of institutional independence reflects fundamental choices about how a society is organized and the proper relationship between state and civil society, which can vary even among functioning democracies.


What is "institutional independence"?

Several terms are commonly used to describe the concept of institutional independence in different contexts:

Institutional Autonomy: Perhaps the most direct alternative, emphasizing an institution's ability to govern itself and make decisions without external control.

Civil Society: Refers to the network of independent organizations and associations that exist outside government control.

Private Sector Independence: Specifically for businesses operating with minimal government intervention.

Academic Freedom: Used particularly for educational institutions' right to determine research and teaching agendas.

Separation of Powers: While typically referring to branches of government, this concept extends to the separation between government and non-governmental institutions.

Limited Government: The philosophy that government power should be constrained, allowing other institutions to operate independently.

Subsidiarity: The principle that matters should be handled by the smallest, lowest, or least centralized competent authority.

Market Economy: Economic system where businesses operate independently with minimal state direction.

Rule of Law: Legal framework that protects institutions from arbitrary government intervention.

Pluralism: The existence of multiple independent centers of power and decision-making in society.

Liberalism: In the classical sense, emphasizing limited government and protection of civil liberties, allowing institutions to operate freely.

These terms often overlap conceptually with institutional independence but may emphasize different aspects or apply to specific domains of society.

2025-03-22 15:15:09

https://ossaiceejay.substack.com/p/he-was-sentenced-to-a-luxury-resort

It ... is a very long article. But that's what the machine is here for:

In 2013, 16-year-old Ethan Couch killed four people and severely injured two others while driving drunk in Texas. Despite the severity of his actions, Couch avoided prison time when his attorneys successfully argued that he suffered from "affluenza" - a made-up condition suggesting his wealthy upbringing prevented him from understanding consequences. Instead of jail, he received 10 years probation and was sent to an expensive rehabilitation facility.

Two years later, after a video surfaced showing Couch apparently violating his probation by drinking, he and his mother Tonya fled to Mexico. They were eventually caught, and Couch served nearly two years in jail for the probation violation before being released in 2018. He violated probation again in 2020 when he was found with THC, but received only additional community service rather than jail time.

Couch completed his probation in April 2024, having largely avoided serious consequences for his actions. Meanwhile, the victims' families continue to suffer, including Sergio Molina who was left paralyzed from the crash and requires constant care. The case became a national symbol of how wealth and privilege can influence the justice system, with critics pointing out that a less affluent defendant would likely have faced decades in prison for the same crimes.

2025-03-19 18:31:47

💡 the RSS feed has issues. It is doing a quick "remove the HTML" pass. What is needed, instead, is a separate "xml_generator.py" file.
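
A rough sketch of what "xml_generator.py" could look like, using the standard library's `xml.etree.ElementTree` so that escaping is handled by the serializer rather than by a strip-the-HTML pass. The `build_rss` name and the item dictionary keys (`title`, `link`, `body`, `published`) are hypothetical, not the existing message schema.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title: str, link: str, description: str, items: list) -> str:
    """Build a minimal RSS 2.0 document from plain-text message fields."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for entry in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
        # ElementTree escapes <, > and & when the tree is serialized.
        ET.SubElement(item, "description").text = entry["body"]
        # RSS expects RFC 822 style dates; format_datetime produces that form.
        ET.SubElement(item, "pubDate").text = format_datetime(entry["published"])
    return ET.tostring(rss, encoding="unicode")


example_item = {
    "title": "RSS notes",
    "link": "https://example.com/messages/1",
    "body": "a < b & c",
    "published": datetime(2025, 3, 19, 18, 31, 47, tzinfo=timezone.utc),
}
feed_xml = build_rss("Example feed", "https://example.com/", "Recent messages", [example_item])
```

The emoji annotations would still need their own plain-text rendering before they reach `body`; this sketch only covers the XML side.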

also: the earlyversion mail server is finally working. now I just have to configure the mailboxes. and maybe have a mailbox that earlyversion reads via IMAP. 🤖 ( A mail server requires SMTP-in, SMTP-out, and IMAP to handle email communication efficiently. SMTP-in is responsible for receiving emails from external senders, listening on port 25 and storing incoming messages in the recipient’s mailbox. SMTP-out, on the other hand, is used to send emails from local users to external recipients, typically operating on port 587 (or 465 for SSL).
IMAP allows users to access and manage their stored emails from multiple devices while keeping everything synchronized. Unlike POP3, IMAP ensures that messages remain on the server, providing features like folder management and search functionality. It operates on port 143 (or 993 for secure connections) )
⚙️ ( one problem with an email API is that it is difficult to verify the sender of an email is who it says it is. not impossible, but hard.)

as far as CSS classes: Claude (after some poking) suggests using "msg", "ui", and "layout" prefixes to cut through the naming complexity. and, possibly, "atacama" for the colortext blocks. 💡 ( it is a good idea ... if I ever do the renaming)
