jamestown (part 4)

This system does demonstrate the flaws of some of the smaller models. When I asked Gemma3:4b to define artillery, it said that one definition was an alternative for artilegia, a (non-existent) word for "fingers or toes".


The next goal is to get a Chinese word-frequency distribution.

This isn't too hard.

  • Download a zhwiki dump.
  • Generate a list of the "top 2500" pages. 💡 (I have some old code that uses a PageRank-like algorithm to find the top pages. For enwiki, the results are quite good, with a few anomalies such as Oxford University Press being highly ranked because of citations.) ⚙️ (It is convenient that most of the wiki templates are in English.)
  • Use jieba to tokenize the text.
  • Get a word count.
  • "Merge" this with the English lists.
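
The tokenizing and counting steps above might look roughly like the sketch below. It assumes the page text has already been extracted from the dump as plain strings. The function names, the tokenizer parameter, and the filter that keeps only tokens containing a Chinese character are my own illustrative choices, not settled parts of the pipeline. jieba.lcut is jieba's list-returning tokenizer; it is imported lazily so the counting logic can be tried with any other tokenizer.

```python
import re
from collections import Counter

# Assumption: a token counts as a "word" if it contains at least one
# character from the basic CJK Unified Ideographs block.
HAN = re.compile(r"[\u4e00-\u9fff]")


def count_words(texts, tokenize=None):
    """Count word frequencies over an iterable of plain-text strings."""
    if tokenize is None:
        import jieba  # third-party package: pip install jieba
        tokenize = jieba.lcut  # returns a list of token strings
    counts = Counter()
    for text in texts:
        # Drop punctuation, digits, and Latin-only tokens.
        counts.update(tok for tok in tokenize(text) if HAN.search(tok))
    return counts


def relative_frequencies(counts):
    """Turn raw counts into relative frequencies, most common first."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.most_common()}
```

Calling count_words(pages) on the extracted page texts and passing the result to relative_frequencies gives a distribution that could later be lined up against the English lists. Dropping Latin-only tokens is a judgment call; zhwiki text contains a fair amount of English that probably shouldn't count as Chinese vocabulary.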

The obvious problem is that words don't match 1-1 between languages. Greek, for example, has eros, agape, and philia, all of which are commonly translated into English as "love".

I can't predict the details of these problems at this time.