{"channel":"cities","content":"This system does demonstrate the flaws of some of the smaller models.  Gemma3:4b, when I asked it to define << artillery >>, said that one definition was an alternative for << artilegia >>, a (non-existent) word for \"fingers or toes\".\r\n\r\n----\r\n\r\nThe next goal is to get a *Chinese* word-frequency distribution.\r\n\r\nThis isn't too hard.\r\n\r\n# Download a zhwiki dump.\r\n# Generate a list of the \"top 2500\" pages. (<red> I have some old code that does a PageRank-like algorithm to find the top pages.  For << enwiki >>, the results are quite good, with a few anomalies such as [[Oxford University Press]] being highly ranked because of citations.) (<green> It is convenient that most of the Wiki templates are in English.)\r\n# Use << jieba >> to tokenize the text.\r\n# Get a word count.\r\n# \"Merge\" this with the English lists.\r\n\r\n----\r\n\r\nThe obvious problem is that words don't match 1-1 between languages.  For example, Greek << eros >>, << agape >>, and << philia >> are all commonly translated as the single English word \"love\".\r\n\r\nThe specifics of these mismatches are hard to predict at this point.\r\n\r\n----","created_at":"2025-04-13T18:27:23.864613","id":349,"llm_annotations":{},"parent_id":347,"processed_content":"<p>This system does demonstrate the flaws of some of the smaller models.  Gemma3:4b, when I asked it to define <span class=\"literal-text\">artillery</span>, said that one definition was an alternative for <span class=\"literal-text\">artilegia</span>, a (non-existent) word for \"fingers or toes\".\r</p> <hr class=\"section-break\" /> <p>The next goal is to get a <em>Chinese</em> word-frequency distribution.\r</p>\n<p>This isn't too hard.\r</p>\n<ul>\n<li class=\"number-list\"> Download a zhwiki dump.\r</li>\n<li class=\"number-list\"> Generate a list of the \"top 2500\" pages. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( I have some old code that does a PageRank-like algorithm to find the top pages.  
For <span class=\"literal-text\">enwiki</span>, the results are quite good, with a few anomalies such as <a href=\"https://en.wikipedia.org/wiki/Oxford_University_Press\" class=\"wikilink\" target=\"_blank\">Oxford University Press</a> being highly ranked because of citations.)</span>\n  </span> <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( It is convenient that most of the Wiki templates are in English.)</span>\n  </span>\r</li>\n<li class=\"number-list\"> Use <span class=\"literal-text\">jieba</span> to tokenize the text.\r</li>\n<li class=\"number-list\"> Get a word count.\r</li>\n<li class=\"number-list\"> \"Merge\" this with the English lists.\r</li>\n</ul> <hr class=\"section-break\" /> <p>The obvious problem is that words don't match 1-1 between languages.  For example, Greek <span class=\"literal-text\">eros</span>, <span class=\"literal-text\">agape</span>, and <span class=\"literal-text\">philia</span> are all commonly translated as the single English word \"love\".\r</p>\n<p>The specifics of these mismatches are hard to predict at this point.\r</p>","quotes":[],"subject":"jamestown (part 4)"}