That link is https://spaceship.computer/greenland/ .


Nobody particularly cares about the "space-time tradeoff" with these models. 💡 ( which is a shame, because it is very relevant to both industrial uses and AI safety concerns)

If an 8B model does 5% better because of "chain-of-thought" but takes 15 times longer, it's generally not actually better than a 14B model would have been.
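
A back-of-the-envelope version of that tradeoff (all numbers here are illustrative assumptions, not measurements of any real model):

```python
# Back-of-the-envelope space-time comparison. The numbers are
# illustrative assumptions, not benchmarks of any real model.

params_8b, params_14b = 8e9, 14e9

# Decode speed scales roughly inversely with parameter count, so the
# 14B model generates each token ~1.75x slower than the 8B model.
relative_speed_14b = params_8b / params_14b  # ~0.57x the 8B token rate

# Assume chain-of-thought makes the 8B answer ~15x slower end-to-end.
cot_slowdown = 15

# Wall-clock cost of one answer, in units of "plain 8B answer time":
cost_8b_cot = 1 * cot_slowdown           # 15.0
cost_14b_plain = 1 / relative_speed_14b  # ~1.75

print(f"8B+CoT: {cost_8b_cot:.1f}x  14B plain: {cost_14b_plain:.2f}x")
# The 14B answer arrives roughly 8x sooner, for quality that is
# plausibly better than "8B plus a 5% chain-of-thought bump".
```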

And, a lot of the "thought" should be tools, rather than the illusion-of-thought that the small LLMs, at least, seem to love. 💡 ( the prime example is the "what's the capital of Spain?" "oh, I think I heard once that it is Madrid!" style of bullshit.)
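
A minimal sketch of what "thought as tools" looks like, with a toy knowledge base; `lookup` and `ask_model` are hypothetical stand-ins for a real tool and a real LLM call:

```python
# Minimal sketch of "thought as tools": instead of letting a small model
# free-associate, route factual questions to a deterministic lookup.
# CAPITALS, lookup, and ask_model are toy stand-ins, not a real API.

CAPITALS = {"Spain": "Madrid", "France": "Paris"}  # toy knowledge base

def lookup(country: str) -> str | None:
    return CAPITALS.get(country)

def ask_model(question: str) -> str:
    # Placeholder for an actual LLM call.
    return "(the model free-associates here)"

def answer(question: str) -> str:
    # A real router would be the LLM deciding which tool to call;
    # the pattern is hard-coded here for illustration.
    prefix = "What is the capital of "
    if question.startswith(prefix):
        fact = lookup(question.removeprefix(prefix).rstrip("?"))
        if fact is not None:
            return fact  # grounded answer, zero "reasoning" tokens
    return ask_model(question)

print(answer("What is the capital of Spain?"))  # Madrid, via the tool
```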


We don't need a mythical super-human AI to generate mass unemployment in knowledge-workers.

We don't need models that have a desire to "escape" or "replicate". We don't need to worry about "alignment". We certainly don't need "By 2035, trillions of tons of planetary material have been launched into space and turned into rings of satellites orbiting the sun."

The ordinary-intelligence AI that I can already run on my computer is enough to trigger mass unemployment. ⚔️ ( well, actually, the 8B models aren't quite good enough or fast enough. but the GPT-4.1-nano-size models are cheap enough and good enough, once the tools and the workflows are improved.)

But, this social change is not something that an AI Safety Team can address. The myth-making of the all-powerful AI is, for lack of a better word, dumb. If you really want there to be meaning to it, you can apply enough "it's a metaphor" to make the arguments somewhat match the future. But you can't kill a metaphor with a shotgun.


There is an insidious meme in the LLM community that a benchmark where models can get 100% is a bad benchmark.

This could not be further from the truth.

If your only concern is "how advanced is the state-of-the-art model", there is a slight amount of sense to this. But, the new benchmarks are often mind-bogglingly stupid.

When the questions are obscure trivia that shouldn't even be in the training set, deliberately obfuscated mathematical puzzles, or "evaluate this complicated Python function" without using Python, it is arguable that getting the question right (from memory, in a short response) is the wrong behavior. The machine shouldn't know, or it should have to spend more time/effort than the benchmark allows. 💡 ( the machine isn't magic. if you ask it to solve a computational task that takes O(n^3) time in O(n) time, it won't do it. at best, it will make guesses that evade your spot-checking.)
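
To make the "evaluate this function" case concrete, here is a deliberately convoluted function (my own illustrative example, not from any benchmark). The honest, tool-using answer is one line of execution; predicting the result from token statistics means simulating thirty iterations of modular arithmetic inside a forward pass:

```python
# A deliberately convoluted function of the kind benchmarks ask models
# to evaluate "in their head". Illustrative example only.

def f(n: int) -> int:
    acc = 1
    for i in range(1, n + 1):
        acc = (acc * i ^ (i << 2)) % 997  # mixes *, XOR, shift, and mod
    return acc

# The honest answer is to run it:
print(f(30))

# Getting this "right from memory", in a short response, would be the
# suspicious outcome, not the impressive one.
```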

I affirmatively want benchmarks that GPT-4.1-mini gets a perfect score on. I want to know which tasks the machine can do perfectly, and at what point it starts being able to do them.


One approach I have considered, but not found any good outcomes from, is the consensus-of-mediocre-models approach.

If you take seven 8B models, ask them all the same question, and then "merge" the outputs, will you get a better result?
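
The simplest "merge" that can be written down is a majority vote over normalized answers; here is a sketch of that version (`query` is a hypothetical stand-in for an actual inference call against llama.cpp, an OpenAI-compatible endpoint, or similar):

```python
# A sketch of consensus-of-mediocre-models: ask several small models the
# same question and majority-vote the answers. `query` is a hypothetical
# stand-in for an actual inference call (llama.cpp, an OpenAI-compatible
# endpoint, etc.).

from collections import Counter

MODELS = [f"model-8b-{i}" for i in range(7)]  # seven hypothetical 8B models

def query(model: str, question: str) -> str:
    raise NotImplementedError("wire this up to your inference backend")

def consensus(question: str) -> str:
    answers = [query(m, question) for m in MODELS]
    # Normalize so trivial variation ("Madrid" vs "madrid.") doesn't
    # split the vote; free-form answers need a smarter merge than this.
    normalized = [a.strip().lower().rstrip(".") for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return f"{winner} ({votes}/{len(MODELS)} models agree)"
```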

This is not exactly the same as the "mixture of experts" architecture used inside various models. But, there are similarities. ... Perhaps the difference is that Mixture of Experts is beneficial, and mixing general-purpose models is not.