Another round of "chainsaw coding" with Claude. 🔥 ( the generators now use generators!) 💡 ( Previously, Claude's approach was to just make one question without history. But we do want history. Use all the canned questions from text files (once), then generate questions from local data (without repeating), then ask the LLM.)
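Roughly what the chained generators look like, as a Python sketch. The names here (`canned_dir`, `local_facts`, `ask_llm`) are placeholders, not the actual interfaces:

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator

def question_stream(
    canned_dir: Path,
    local_facts: Iterable[str],
    ask_llm: Callable[[list[str]], str],
) -> Iterator[str]:
    """Yield canned questions once, then questions built from local data
    (no repeats), then fall back to the LLM with the history so far."""
    asked: list[str] = []
    seen: set[str] = set()

    # 1. Canned questions from text files, each used exactly once.
    for path in sorted(canned_dir.glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            question = line.strip()
            if question and question not in seen:
                seen.add(question)
                asked.append(question)
                yield question

    # 2. Questions generated from local data, skipping repeats.
    for fact in local_facts:
        question = f"What is the opposite of {fact!r}?"  # placeholder template
        if question not in seen:
            seen.add(question)
            asked.append(question)
            yield question

    # 3. Ask the LLM, passing the history so it can avoid duplicates.
    while True:
        question = ask_llm(asked)
        asked.append(question)
        yield question
```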
It mostly works. The LLM generation is slightly flaky: it gave both "sad" and "melancholy" as candidate opposites for "happy", and it sometimes includes pinyin with the Chinese translations.
But at least it is (mostly) using the correct interfaces now. 💡 ( there are still some useless parameters. The "tags" for questions are meaningless. The difficulty is arbitrary. The evaluation criteria are often silly.)
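Something like this is what I picture the question interface looking like. The field names are guesses for illustration; only the tags, difficulty, and evaluation criteria come from the note above:

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    # Illustrative only; not the real schema.
    prompt: str
    answer: str
    tags: list[str] = field(default_factory=list)  # currently meaningless
    difficulty: int = 1                            # currently arbitrary
    evaluation_criteria: str = ""                  # often silly
```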
The various "general knowledge" benchmarks should be quick to write 🔥 ( tomorrow). Right now I see six categories for the initial tests:
- History
- Geography
- Chemistry
- Biology
- Sports 🔥 ( American athletes, mostly)
- Music 💡 ( 20th century English-language music, mostly)
These will be "easy" questions. Or, at least, multiple-choice.
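A hedged sketch of what one "easy" multiple-choice item might look like, using the guessed fields from above plus a category and choices (all hypothetical):

```python
# Hypothetical multiple-choice item for the Geography category.
example_item = {
    "category": "Geography",
    "prompt": "Which river flows through Paris?",
    "choices": ["Seine", "Thames", "Danube", "Rhine"],
    "answer": "Seine",
    "difficulty": 1,  # "easy"
    "tags": ["geography"],
    "evaluation_criteria": "Exact match against the listed choices.",
}
```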