{"channel":"cities","content":"the project has been at the << liminal >> point between \"interesting\" and \"we already have dictionaries\".\r\n\r\nsome things are easier for the LLM than others.  for \"get an IPA pronunciation\", I eventually determined that the best choice was to give up and download a flat-text file.  (and, maybe, get the *machine* to deal with the ambiguous cases)\r\n\r\n----\r\n\r\nthe \"merge various word-frequency lists into one list\" code, after a few rounds of telling the *machine* it was wrong, now works fine.\r\n\r\n<red> the top-level differences are predictable.  words like << you >> are less common on Wikipedia.\r\n<green> because of some bug, words like \"vernacular\" and \"justification\" are showing up in the top 250.  also, not all the lists manage contractions correctly, giving \"words\" like << isn >> and << doesn >> ('t).\r\n\r\n----\r\n\r\nperhaps the next task is \"generate an annotated version of text, where the text is *colored* based on the word frequency\".\r\n<red> and, possibly, the uncommon words get Chinese translations added.\r\n<xantham> that will certainly help the English monoglot who is confused!\r\n\r\n----\r\n\r\nthe variance in the word frequencies (mostly) says something about the \"cultural loading\" of the words.\r\n\r\nmost of the words that are more common in 19th century books are low-cultural-loading.  (<red> perhaps the prevalence of << gentleman >> is cultural.  but words like << rain >> and << hat >> are more frequent just because other words (<< geometry >>, << organic >>) are less frequent.)","created_at":"2025-04-12T15:42:24.374769","id":347,"llm_annotations":{},"parent_id":344,"processed_content":"<p>the project has been at the <span class=\"literal-text\">liminal</span> point between \"interesting\" and \"we already have dictionaries\".\r</p>\n<p>some things are easier for the LLM than others.  for \"get an IPA pronunciation\", I eventually determined that the best choice was to give up and download a flat-text file.  (and, maybe, get the <em>machine</em> to deal with the ambiguous cases)\r</p> <hr class=\"section-break\" /> <p>the \"merge various word-frequency lists into one list\" code, after a few rounds of telling the <em>machine</em> it was wrong, now works fine.\r</p>\n<p><span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\"> the top-level differences are predictable.  words like <span class=\"literal-text\">you</span> are less common on Wikipedia.\r</span>\n  </span></p>\n<p><span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\"> because of some bug, words like \"vernacular\" and \"justification\" are showing up in the top 250.  also, not all the lists manage contractions correctly, giving \"words\" like <span class=\"literal-text\">isn</span> and <span class=\"literal-text\">doesn</span> ('t).\r</span>\n  </span></p> <hr class=\"section-break\" /> <p>perhaps the next task is \"generate an annotated version of text, where the text is <em>colored</em> based on the word frequency\".\r</p>\n<p><span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\"> and, possibly, the uncommon words get Chinese translations added.\r</span>\n  </span></p>\n<p><span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\"> that will certainly help the English monoglot who is confused!\r</span>\n  </span></p> <hr class=\"section-break\" /> <p>the variance in the word frequencies (mostly) says something about the \"cultural loading\" of the words.\r</p>\n<p>most of the words that are more common in 19th century books are low-cultural-loading.  <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( perhaps the prevalence of <span class=\"literal-text\">gentleman</span> is cultural.  but words like <span class=\"literal-text\">rain</span> and <span class=\"literal-text\">hat</span> are more frequent just because other words (<span class=\"literal-text\">geometry</span>, <span class=\"literal-text\">organic</span>) are less frequent.)</span>\n  </span></p>","quotes":[],"subject":"jamestown (part 3)"}