{"chain":[{"channel":"cities","content":"<green> Jamestown, North Dakota, is located along I-94 in the eastern half of the state.\r\n\r\nToday's focus is on \"word frequency\".\r\n\r\n----\r\n\r\n<gray> <<< I started with two corpuses: one of 19th century literature (from Project Gutenberg), one of 20th century \"sci-fi\" literature.  I got a rough word-rank for each, and combined them (<green> using the harmonic mean) to get a combined word-list. >>>\r\n\r\n<teal> <<< While many high-frequency function words such as *the*, *and*, and *of* maintain consistent rankings, others like *said*, *her*, *she*, and *me* show substantial divergence, suggesting notable stylistic or thematic shifts between the two periods and genres. >>>\r\n\r\n----\r\n\r\nIt is also a word-list at all.  Some of the notes:\r\n\r\n# The word \"whale\" shows up a lot more in the 19th century corpus.  This is because one of the books is [[Moby Dick]].\r\n# I am hoping to run an exhaustive listing of a few attributes.  These include:\r\n> polysemy. (<green> I am less concerned with words like << get >> which have so many meanings as-to be indefinable, but instead words like << saw >> (\u770b or \u952f\u5b50) or << face >> (\u9762\u5411 or \u8138))\r\n> by lemma.  \"went\" (108th) v. \"go\" (80th).\r\n> by part-of-speech.  
defined as \"what the LLMs define as part-of-speech\".\r\n> a \"second-level\" of word-type details.","created_at":"2025-04-06T20:39:52.199751","id":343,"is_target":false,"parent_id":null,"processed_content":"<p><span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\"> Jamestown, North Dakota, is located along I-94 in the eastern half of the state.\r</span>\n  </span></p>\n<p>Today's focus is on \"word frequency\".\r</p> <hr class=\"section-break\" /> <p><div class=\"mlq color-gray\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">\ud83d\udcad</span></button><div class=\"mlq-content\"><p> I started with two corpuses: one of 19th century literature (from Project Gutenberg), one of 20th century \"sci-fi\" literature.  I got a rough word-rank for each, and combined them <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( using the harmonic mean)</span>\n  </span> to get a combined word-list. </p></div></div>\r</p>\n<p><div class=\"mlq color-teal\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">\ud83e\udd16</span></button><div class=\"mlq-content\"><p> While many high-frequency function words such as <em>the</em>, <em>and</em>, and <em>of</em> maintain consistent rankings, others like <em>said</em>, <em>her</em>, <em>she</em>, and <em>me</em> show substantial divergence, suggesting notable stylistic or thematic shifts between the two periods and genres. </p></div></div>\r</p> <hr class=\"section-break\" /> <p>It is also a word-list at all.  Some of the notes:\r</p>\n<ul>\n<li class=\"number-list\"> The word \"whale\" shows up a lot more in the 19th century corpus.  
This is because one of the books is <a href=\"https://en.wikipedia.org/wiki/Moby_Dick\" class=\"wikilink\" target=\"_blank\">Moby Dick</a>.\r</li>\n<li class=\"number-list\"> I am hoping to run an exhaustive listing of a few attributes.  These include:\r</li>\n</ul>\n<ul>\n<li class=\"arrow-list\"> polysemy. <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( I am less concerned with words like <span class=\"literal-text\">get</span> which have so many meanings as-to be indefinable, but instead words like <span class=\"literal-text\">saw</span> (<span class=\"annotated-chinese\" data-pinyin=\"K\u00c0N\" data-definition=\"to see; to look at\">\u770b</span> or <span class=\"annotated-chinese\" data-pinyin=\"J\u00d9 ZI\" data-definition=\"a saw\">\u952f\u5b50</span>) or <span class=\"literal-text\">face</span> (<span class=\"annotated-chinese\" data-pinyin=\"M\u00ccAN X\u00ccANG\" data-definition=\"to face\">\u9762\u5411</span> or <span class=\"annotated-chinese\" data-pinyin=\"L\u01cfAN\" data-definition=\"face\">\u8138</span>))</span>\n  </span>\r</li>\n<li class=\"arrow-list\"> by lemma.  \"went\" (108th) v. \"go\" (80th).\r</li>\n<li class=\"arrow-list\"> by part-of-speech.  defined as \"what the LLMs define as part-of-speech\".\r</li>\n<li class=\"arrow-list\"> a \"second-level\" of word-type details.</li>\n</ul>","subject":"jamestown (part 1)"},{"channel":"cities","content":"This seems to be \"evolving\" into a dictionary.  Somewhat by accident; the *machine* wrote a command-line frontend for me (without my asking) that looks a *lot* like a dictionary.\r\n\r\n----\r\n\r\nWhen are two definitions the same? (<xantham> when they are not *different*.  and they are different if the part-of-speech differs, or the Chinese translation differs.  
and some other times.)\r\n\r\n----\r\n\r\nThese three seem to be the same.\r\n\r\n<<< Word: eternal\r\nRank: 2547\r\n\r\nDefinitions:\r\n  [1] Existing or continuing without end; everlasting.\r\n    Confidence: 0.95\r\n    Part of speech: adjective\r\n    Lemma: eternal\r\n    Chinese: \u6c38\u6052\r\n    Subtype: duration\r\n    Notes: Often used in philosophical or religious contexts to describe the nature of existence or the divine.\r\n    Examples:\r\n      - The concept of eternal life is central to many religions.\r\n\r\n  [2] Lasting or existing forever; without beginning or end.\r\n    Confidence: 0.90\r\n    Part of speech: adjective\r\n    Lemma: eternal\r\n    Chinese: \u6c38\u6052\r\n    Subtype: duration\r\n    Notes: Can refer to abstract ideas, such as love or truth, that are considered timeless.\r\n    Examples:\r\n      - Her love for him felt eternal, transcending time and space.\r\n\r\n  [3] Relating to or being a part of eternity; timeless.\r\n    Confidence: 0.85\r\n    Part of speech: adjective\r\n    Lemma: eternal\r\n    Chinese: \u6c38\u6052\r\n    Subtype: duration\r\n    Notes: Often used in literary or poetic contexts to evoke a sense of infinity.\r\n    Examples:\r\n      - The stars in the night sky seemed to whisper eternal secrets.\r\n>>>\r\n\r\nOn the other hand, these three might be different:\r\n\r\n<<< Word: transform\r\nRank: 2646\r\n\r\nDefinitions:\r\n  [1] To change in form, appearance, or structure; to metamorphose.\r\n    Confidence: 0.95\r\n    Part of speech: verb\r\n    Lemma: transform\r\n    Chinese: \u53d8\u5f62\r\n    Subtype: change\r\n    Notes: Often used in contexts involving significant change or conversion.\r\n    Examples:\r\n      - The caterpillar will transform into a butterfly.\r\n\r\n  [2] To cause to change in character or condition; to convert.\r\n    Confidence: 0.90\r\n    Part of speech: verb\r\n    Lemma: transform\r\n    Chinese: \u8f6c\u53d8\r\n    Subtype: change\r\n    Notes: Commonly used in 
contexts such as technology, business, or personal development.\r\n    Examples:\r\n      - The new software will transform the way we manage our projects.\r\n\r\n  [3] In mathematics, to change the coordinates of a point or the representation of a function.\r\n    Confidence: 0.85\r\n    Part of speech: verb\r\n    Lemma: transform\r\n    Chinese: \u53d8\u6362\r\n    Subtype: change\r\n    Notes: Used in various branches of mathematics, including geometry and calculus.\r\n    Examples:\r\n      - We can transform the equation into a simpler form. >>>\r\n\r\n----\r\n\r\nand these are *definitely* different:\r\n\r\n<<<   [1] A piece of furniture with a flat top and one or more legs, used for placing items on or for working at.\r\n    Confidence: 1.00\r\n    Part of speech: noun\r\n    Lemma: table\r\n    Chinese: \u684c\u5b50\r\n    Subtype: small_movable_object\r\n    Notes: Commonly used in homes and offices.\r\n    Examples:\r\n      - We gathered around the dining table for dinner.\r\n\r\n  [2] A systematic arrangement of data, usually in rows and columns, for easy reference.\r\n    Confidence: 1.00\r\n    Part of speech: noun\r\n    Lemma: table\r\n    Chinese: \u8868\u683c\r\n    Subtype: small_movable_object\r\n    Notes: Often used in academic and professional contexts.\r\n    Examples:\r\n      - The research paper included a table summarizing the results of the experiment.\r\n\r\n  [3] To postpone consideration of a motion or proposal in a meeting or legislative assembly.\r\n    Confidence: 0.90\r\n    Part of speech: verb\r\n    Lemma: table\r\n    Chinese: \u6401\u7f6e\r\n    Subtype: other\r\n    Notes: Usage can vary by region; in some places, it means to bring a motion forward for discussion.\r\n    Examples:\r\n      - The committee decided to table the discussion until next week.\r\n>>>","created_at":"2025-04-07T18:21:15.867594","id":344,"is_target":false,"parent_id":343,"processed_content":"<p>This seems to be \"evolving\" into a dictionary.  
Somewhat by accident; the <em>machine</em> wrote a command-line frontend for me (without my asking) that looks a <em>lot</em> like a dictionary.\r</p> <hr class=\"section-break\" /> <p>When are two definitions the same? <span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\">( when they are not <em>different</em>.  and they are different if the part-of-speech differs, or the Chinese translation differs.  and some other times.)</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>These three seem to be the same.\r</p>\n<p><div class=\"mlq\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">-</span></button><div class=\"mlq-content\"><p> Word: eternal\r</p>\n<p>Rank: 2547\r</p>\n<p>\r</p>\n<p>Definitions:\r</p>\n<p>  [1] Existing or continuing without end; everlasting.\r</p>\n<p>    Confidence: 0.95\r</p>\n<p>    Part of speech: adjective\r</p>\n<p>    Lemma: eternal\r</p>\n<p>    Chinese: <span class=\"annotated-chinese\" data-pinyin=\"Y\u01d1NG H\u00c9NG\" data-definition=\"eternal\">\u6c38\u6052</span>\r</p>\n<p>    Subtype: duration\r</p>\n<p>    Notes: Often used in philosophical or religious contexts to describe the nature of existence or the divine.\r</p>\n<p>    Examples:\r</p>\n<p>      - The concept of eternal life is central to many religions.\r</p>\n<p>\r</p>\n<p>  [2] Lasting or existing forever; without beginning or end.\r</p>\n<p>    Confidence: 0.90\r</p>\n<p>    Part of speech: adjective\r</p>\n<p>    Lemma: eternal\r</p>\n<p>    Chinese: <span class=\"annotated-chinese\" data-pinyin=\"Y\u01d1NG H\u00c9NG\" data-definition=\"eternal\">\u6c38\u6052</span>\r</p>\n<p>    Subtype: duration\r</p>\n<p>    Notes: Can refer to abstract ideas, such as love or truth, that are considered timeless.\r</p>\n<p>    Examples:\r</p>\n<p>      - Her love for him felt eternal, transcending time and space.\r</p>\n<p>\r</p>\n<p>  
[3] Relating to or being a part of eternity; timeless.\r</p>\n<p>    Confidence: 0.85\r</p>\n<p>    Part of speech: adjective\r</p>\n<p>    Lemma: eternal\r</p>\n<p>    Chinese: <span class=\"annotated-chinese\" data-pinyin=\"Y\u01d1NG H\u00c9NG\" data-definition=\"eternal\">\u6c38\u6052</span>\r</p>\n<p>    Subtype: duration\r</p>\n<p>    Notes: Often used in literary or poetic contexts to evoke a sense of infinity.\r</p>\n<p>    Examples:\r</p>\n<p>      - The stars in the night sky seemed to whisper eternal secrets.\r</p></div></div>\r</p>\n<p>On the other hand, these three might be different:\r</p>\n<p><div class=\"mlq\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">-</span></button><div class=\"mlq-content\"><p> Word: transform\r</p>\n<p>Rank: 2646\r</p>\n<p>\r</p>\n<p>Definitions:\r</p>\n<p>  [1] To change in form, appearance, or structure; to metamorphose.\r</p>\n<p>    Confidence: 0.95\r</p>\n<p>    Part of speech: verb\r</p>\n<p>    Lemma: transform\r</p>\n<p>    Chinese: <span class=\"annotated-chinese\" data-pinyin=\"B\u00ccAN X\u00cdNG\" data-definition=\"to become deformed; to change shape; to morph\">\u53d8\u5f62</span>\r</p>\n<p>    Subtype: change\r</p>\n<p>    Notes: Often used in contexts involving significant change or conversion.\r</p>\n<p>    Examples:\r</p>\n<p>      - The caterpillar will transform into a butterfly.\r</p>\n<p>\r</p>\n<p>  [2] To cause to change in character or condition; to convert.\r</p>\n<p>    Confidence: 0.90\r</p>\n<p>    Part of speech: verb\r</p>\n<p>    Lemma: transform\r</p>\n<p>    Chinese: <span class=\"annotated-chinese\" data-pinyin=\"ZH\u01d3AN B\u00ccAN\" data-definition=\"to change\">\u8f6c\u53d8</span>\r</p>\n<p>    Subtype: change\r</p>\n<p>    Notes: Commonly used in contexts such as technology, business, or personal development.\r</p>\n<p>    Examples:\r</p>\n<p>      - The new software will transform the way we manage our 
projects.\r</p>\n<p>\r</p>\n<p>  [3] In mathematics, to change the coordinates of a point or the representation of a function.\r</p>\n<p>    Confidence: 0.85\r</p>\n<p>    Part of speech: verb\r</p>\n<p>    Lemma: transform\r</p>\n<p>    Chinese: <span class=\"annotated-chinese\" data-pinyin=\"B\u00ccAN H\u00d9AN\" data-definition=\"to transform\">\u53d8\u6362</span>\r</p>\n<p>    Subtype: change\r</p>\n<p>    Notes: Used in various branches of mathematics, including geometry and calculus.\r</p>\n<p>    Examples:\r</p>\n<p>      - We can transform the equation into a simpler form. </p></div></div>\r</p> <hr class=\"section-break\" /> <p>and these are <em>definitely</em> different:\r</p>\n<p><div class=\"mlq\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">-</span></button><div class=\"mlq-content\"><p>   [1] A piece of furniture with a flat top and one or more legs, used for placing items on or for working at.\r</p>\n<p>    Confidence: 1.00\r</p>\n<p>    Part of speech: noun\r</p>\n<p>    Lemma: table\r</p>\n<p>    Chinese: <span class=\"annotated-chinese\" data-pinyin=\"ZH\u016aO ZI\" data-definition=\"table\">\u684c\u5b50</span>\r</p>\n<p>    Subtype: small_movable_object\r</p>\n<p>    Notes: Commonly used in homes and offices.\r</p>\n<p>    Examples:\r</p>\n<p>      - We gathered around the dining table for dinner.\r</p>\n<p>\r</p>\n<p>  [2] A systematic arrangement of data, usually in rows and columns, for easy reference.\r</p>\n<p>    Confidence: 1.00\r</p>\n<p>    Part of speech: noun\r</p>\n<p>    Lemma: table\r</p>\n<p>    Chinese: <span class=\"annotated-chinese\" data-pinyin=\"B\u01cfAO G\u00c9\" data-definition=\"form; table\">\u8868\u683c</span>\r</p>\n<p>    Subtype: small_movable_object\r</p>\n<p>    Notes: Often used in academic and professional contexts.\r</p>\n<p>    Examples:\r</p>\n<p>      - The research paper included a table summarizing the results of the 
experiment.\r</p>\n<p>\r</p>\n<p>  [3] To postpone consideration of a motion or proposal in a meeting or legislative assembly.\r</p>\n<p>    Confidence: 0.90\r</p>\n<p>    Part of speech: verb\r</p>\n<p>    Lemma: table\r</p>\n<p>    Chinese: <span class=\"annotated-chinese\" data-pinyin=\"G\u0112 ZH\u00cc\" data-definition=\"to shelve\">\u6401\u7f6e</span>\r</p>\n<p>    Subtype: other\r</p>\n<p>    Notes: Usage can vary by region; in some places, it means to bring a motion forward for discussion.\r</p>\n<p>    Examples:\r</p>\n<p>      - The committee decided to table the discussion until next week.\r</p></div></div></p>","subject":"jamestown (part 2)"},{"channel":"cities","content":"the project has been at the << liminal >> point between \"interesting\" and \"we already have dictionaries\".\r\n\r\nsome things are easier for the LLM than others.  for \"get an IPA pronunciation\", I eventually determined that the best choice was to give up and download a flat-text file.  (and, maybe, get the *machine* to deal with the ambiguous cases)\r\n\r\n----\r\n\r\nthe \"merge various word-frequency lists into one list\" code, after a few rounds of telling the *machine* it was wrong, now works fine.\r\n\r\n<red> the top-level differences are predictable.  words like << you >> are less common on Wikipedia.\r\n<green> because of some bug, words like \"vernacular\" and \"justification\" are showing up in the top 250.  
also, not all the lists manage contractions correctly, giving \"words\" like << isn >> and << doesn >> ('t).\r\n\r\n----\r\n\r\nperhaps the next task is \"generate an annotated version of text, where the text is *colored* based on the word frequency\".\r\n<red> and, possibly, the uncommon words get Chinese translations added.\r\n<xantham> that will certainly help the English monoglot who is confused!\r\n\r\n----\r\n\r\nthe variance in the word frequencies (mostly) says something about the \"cultural loading\" of the words.\r\n\r\nmost of the words that are more common in 19th century books are low-cultural-loading.  (<red> perhaps the prevalence of << gentleman >> is cultural.  but words like << rain >> and << hat >> are more frequent just because other words (<< geometry >>, << organic >>) are less frequent.)","created_at":"2025-04-12T15:42:24.374769","id":347,"is_target":false,"parent_id":344,"processed_content":"<p>the project has been at the <span class=\"literal-text\">liminal</span> point between \"interesting\" and \"we already have dictionaries\".\r</p>\n<p>some things are easier for the LLM than others.  for \"get an IPA pronunciation\", I eventually determined that the best choice was to give up and download a flat-text file.  (and, maybe, get the <em>machine</em> to deal with the ambiguous cases)\r</p> <hr class=\"section-break\" /> <p>the \"merge various word-frequency lists into one list\" code, after a few rounds of telling the <em>machine</em> it was wrong, now works fine.\r</p>\n<p><span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\"> the top-level differences are predictable.  
words like <span class=\"literal-text\">you</span> are less common on Wikipedia.\r</span>\n  </span></p>\n<p><span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\"> because of some bug, words like \"vernacular\" and \"justification\" are showing up in the top 250.  also, not all the lists manage contractions correctly, giving \"words\" like <span class=\"literal-text\">isn</span> and <span class=\"literal-text\">doesn</span> ('t).\r</span>\n  </span></p> <hr class=\"section-break\" /> <p>perhaps the next task is \"generate an annotated version of text, where the text is <em>colored</em> based on the word frequency\".\r</p>\n<p><span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\"> and, possibly, the uncommon words get Chinese translations added.\r</span>\n  </span></p>\n<p><span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\"> that will certainly help the English monoglot who is confused!\r</span>\n  </span></p> <hr class=\"section-break\" /> <p>the variance in the word frequencies (mostly) says something about the \"cultural loading\" of the words.\r</p>\n<p>most of the words that are more common in 19th century books are low-cultural-loading.  <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( perhaps the prevalence of <span class=\"literal-text\">gentleman</span> is cultural.  
but words like <span class=\"literal-text\">rain</span> and <span class=\"literal-text\">hat</span> are more frequent just because other words (<span class=\"literal-text\">geometry</span>, <span class=\"literal-text\">organic</span>) are less frequent.)</span>\n  </span></p>","subject":"jamestown (part 3)"},{"channel":"cities","content":"This system does demonstrate the flaws of some of the smaller models.  Gemma3:4b, when I asked it to define << artillery >>, said that one definition was an alternative for << artilegia >>, a (non-existent) word for \"fingers or toes\".\r\n\r\n----\r\n\r\nThe next goal is to get a *Chinese* word-frequency distribution.\r\n\r\nThis isn't too hard.\r\n\r\n# Download a zhwiki dump.\r\n# Generate a list of the \"top 2500\" pages. (<red> I have some old code that does a PageRank-like algorithm to find the top pages.  For << enwiki >>, the results are quite good; with a few anomalies like [[Oxford University Press]] being highly ranked because of citations.) (<green> it is convenient that most of the Wiki templates are in English.)\r\n# Use << jieba >> to tokenize the text.\r\n# Get a word count.\r\n# \"Merge\" this with the English lists.\r\n\r\n----\r\n\r\nThe obvious problem is that words don't match 1-1 between languages.  << eros >>, << agape >>, and << philia >>.\r\n\r\nThe details of these problems are unpredictable at this time.\r\n\r\n----","created_at":"2025-04-13T18:27:23.864613","id":349,"is_target":false,"parent_id":347,"processed_content":"<p>This system does demonstrate the flaws of some of the smaller models.  
Gemma3:4b, when I asked it to define <span class=\"literal-text\">artillery</span>, said that one definition was an alternative for <span class=\"literal-text\">artilegia</span>, a (non-existent) word for \"fingers or toes\".\r</p> <hr class=\"section-break\" /> <p>The next goal is to get a <em>Chinese</em> word-frequency distribution.\r</p>\n<p>This isn't too hard.\r</p>\n<ul>\n<li class=\"number-list\"> Download a zhwiki dump.\r</li>\n<li class=\"number-list\"> Generate a list of the \"top 2500\" pages. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( I have some old code that does a PageRank-like algorithm to find the top pages.  For <span class=\"literal-text\">enwiki</span>, the results are quite good; with a few anomalies like <a href=\"https://en.wikipedia.org/wiki/Oxford_University_Press\" class=\"wikilink\" target=\"_blank\">Oxford University Press</a> being highly ranked because of citations.)</span>\n  </span> <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( it is convenient that most of the Wiki templates are in English.)</span>\n  </span>\r</li>\n<li class=\"number-list\"> Use <span class=\"literal-text\">jieba</span> to tokenize the text.\r</li>\n<li class=\"number-list\"> Get a word count.\r</li>\n<li class=\"number-list\"> \"Merge\" this with the English lists.\r</li>\n</ul> <hr class=\"section-break\" /> <p>The obvious problem is that words don't match 1-1 between languages.  <span class=\"literal-text\">eros</span>, <span class=\"literal-text\">agape</span>, and <span class=\"literal-text\">philia</span>.\r</p>\n<p>The details of these problems are unpredictable at this time.\r</p>","subject":"jamestown (part 4)"},{"channel":"cities","content":"5/5\r\n\r\nit has been a drudge to get work done.\r\n\r\n----\r\n\r\nthe end-goal is within sight.  
one more round of prompt-tuning, a $1 gpt-4.1-nano run, and some rounds of \"LLM consensus checking\".\r\n\r\nthe goal then becomes *applications*.\r\n\r\n# LLM benchmarks. (<xantham> use the *machine* to feed the *machine*) (<red> \"which word has this definition\" questions will be possible at some point.  but not yet.)\r\n# Elementary education. (<red> which words should a 3rd/5th grader know?  be studying?) (<green> so far no useful progress on \"how easy/hard is it to spell this word\")\r\n# Second-language learning. (<red> a \"which is the Chinese for this word in this sentence\" app.)\r\n# Text difficulty.  A smarter metric than Flesch-Kincaid.\r\n\r\n----\r\n\r\nThe \"cosine similarity\" question of \"how similar are these word definitions\" is not yet solved.  I'm not sure I can solve it.\r\n\r\nI can test it; I have several ways of generating embeddings.  And (probably) these can include a sentence as context.\r\n\r\n----\r\n\r\nI also have no solutions for the \"group different word-forms with the same meaning\".  For << jump >>, << jumps >>, << jumped >>, for example.\r\n\r\nThis would be a much more substantial problem for more highly-conjugated languages.  With English, it is almost avoidable.\r\n\r\n----\r\n\r\nAround word 5000, I am seeing << raft >>, << yield >>, << algebra >>, and << pizza >>.\r\n\r\nThis seems correct enough?  \"Algebra\" is more common in encyclopedic contexts, and \"pizza\" doesn't show up in the 19th century corpus at all.\r\n\r\n----\r\n\r\nBut, as far as \"exploration\" is concerned, I am reaching diminishing returns.\r\n\r\nI have one more list of \"see if Claude can do this quickly\".  After that, \"glenora\" will become an inactive project.","created_at":"2025-04-16T18:57:28.551568","id":351,"is_target":true,"parent_id":349,"processed_content":"<p>5/5\r</p>\n<p>it has been a drudge to get work done.\r</p> <hr class=\"section-break\" /> <p>the end-goal is within sight.  
one more round of prompt-tuning, a $1 gpt-4.1-nano run, and some rounds of \"LLM consensus checking\".\r</p>\n<p>the goal then becomes <em>applications</em>.\r</p>\n<ul>\n<li class=\"number-list\"> LLM benchmarks. <span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\">( use the <em>machine</em> to feed the <em>machine</em>)</span>\n  </span> <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( \"which word has this definition\" questions will be possible at some point.  but not yet.)</span>\n  </span>\r</li>\n<li class=\"number-list\"> Elementary education. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( which words should a 3rd/5th grader know?  be studying?)</span>\n  </span> <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( so far no useful progress on \"how easy/hard is it to spell this word\")</span>\n  </span>\r</li>\n<li class=\"number-list\"> Second-language learning. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( a \"which is the Chinese for this word in this sentence\" app.)</span>\n  </span>\r</li>\n<li class=\"number-list\"> Text difficulty.  A smarter metric than Flesch-Kincaid.\r</li>\n</ul> <hr class=\"section-break\" /> <p>The \"cosine similarity\" question of \"how similar are these word definitions\" is not yet solved.  I'm not sure I can solve it.\r</p>\n<p>I can test it; I have several ways of generating embeddings.  And (probably) these can include a sentence as context.\r</p> <hr class=\"section-break\" /> <p>I also have no solutions for the \"group different word-forms with the same meaning\".  
For <span class=\"literal-text\">jump</span>, <span class=\"literal-text\">jumps</span>, <span class=\"literal-text\">jumped</span>, for example.\r</p>\n<p>This would be a much more substantial problem for more highly-conjugated languages.  With English, it is almost avoidable.\r</p> <hr class=\"section-break\" /> <p>Around word 5000, I am seeing <span class=\"literal-text\">raft</span>, <span class=\"literal-text\">yield</span>, <span class=\"literal-text\">algebra</span>, and <span class=\"literal-text\">pizza</span>.\r</p>\n<p>This seems correct enough?  \"Algebra\" is more common in encyclopedic contexts, and \"pizza\" doesn't show up in the 19th century corpus at all.\r</p> <hr class=\"section-break\" /> <p>But, as far as \"exploration\" is concerned, I am reaching diminishing returns.\r</p>\n<p>I have one more list of \"see if Claude can do this quickly\".  After that, \"glenora\" will become an inactive project.</p>","subject":"jamestown (part 5)"}]}
