jamestown (part 4)
Posted by Alexander Power
Received: 2025-04-13 18:27:23
Channel: Cities - Project Journal
In reply to: jamestown (part 3)
This system does demonstrate the flaws of some of the smaller models. When I asked Gemma3:4b to define "artillery", it said that one definition was an alternative for "artilegia", a (non-existent) word for "fingers or toes".
The next goal is to get a Chinese word-frequency distribution.
This isn't too hard.
- Download a zhwiki dump.
- Generate a list of the "top 2500" pages. 💡 (I have some old code that uses a PageRank-like algorithm to find the top pages. For enwiki, the results are quite good, with a few anomalies such as Oxford University Press ranking highly because of citations.) ⚙️ (Conveniently, most of the wiki templates are in English.)
- Use jieba to tokenize the text.
- Get a word count.
- "Merge" this with the English lists.
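The tokenize-and-count steps above might look roughly like the sketch below. It assumes the page text has already been extracted from the dump as plain strings. `word_counts` and `jieba_tokenize` are names made up for illustration, and dropping tokens that contain no Han character is my own assumption about what should count as a "word". The only jieba call used is `jieba.lcut`, which returns a list of tokens.

```python
import re
from collections import Counter

# Assumption: only tokens with at least one CJK unified ideograph count as
# words; punctuation, digits, and Latin-only tokens are dropped.
_HAN = re.compile(r"[\u4e00-\u9fff]")


def jieba_tokenize(text):
    """Segment Chinese text with jieba (third-party: pip install jieba)."""
    import jieba  # imported lazily so the counting logic works without it

    return jieba.lcut(text)


def word_counts(texts, tokenize=jieba_tokenize):
    """Return a Counter of word frequencies over an iterable of page texts."""
    counts = Counter()
    for text in texts:
        counts.update(tok for tok in tokenize(text) if _HAN.search(tok))
    return counts
```

With a hypothetical `page_texts` list holding the top pages, `word_counts(page_texts).most_common(20)` would give the head of the distribution. The tokenizer is a parameter so the counting can be checked without jieba installed.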
The obvious problem is that words don't match 1-1 between languages. For example, Greek has eros, agape, and philia, all of which English usually renders as "love".
I can't predict the details of these problems yet.