jamestown (part 3)
Channel: Cities - Project Journal
In reply to: jamestown (part 2)
the project has been at the liminal point between "interesting" and "we already have dictionaries".
some things are easier for the LLM than others. for "get an IPA pronunciation", I eventually determined that the best choice was to give up and download a flat-text file. (and, maybe, get the machine to deal with the ambiguous cases)
the "merge various word-frequency lists into one list" code, after a few rounds of telling the machine it was wrong, now works fine.
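for reference, a minimal sketch of one way such a merge could work. this is not the project's actual code: the function names (`merge_frequency_lists`, `top_words`) and the input shape are made up for illustration. each list is normalized to relative frequencies first, so a big corpus doesn't drown out a small one, and a word that is missing from a list counts as zero there.

```python
from collections import defaultdict


def merge_frequency_lists(lists):
    """Merge {source_name: {word: raw_count}} into {word: mean relative frequency}.

    Hypothetical helper: the real lists may have a different shape.
    """
    merged = defaultdict(float)
    for counts in lists.values():
        total = sum(counts.values())
        if total == 0:
            continue  # an empty list contributes nothing
        for word, count in counts.items():
            # lowercase so "The" and "the" land in the same bucket
            merged[word.lower()] += count / total
    n = len(lists)
    # divide by the number of lists: absent words are treated as frequency 0
    return {word: share / n for word, share in merged.items()}


def top_words(merged, k):
    """Return the k most frequent words, ties broken alphabetically."""
    return sorted(merged, key=lambda w: (-merged[w], w))[:k]


# made-up counts, only to show the call
example = {
    "wiki": {"the": 6, "you": 1, "city": 3},
    "subs": {"the": 5, "you": 5},
}
merged_example = merge_frequency_lists(example)
```

averaging the shares is only one choice. a median or a rank-based merge would be less sensitive to a single odd list.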
💡 the top-level differences are predictable. words like "you" are less common on Wikipedia.
⚙️ because of some bug, words like "vernacular" and "justification" are showing up in the top 250. also, not all the lists handle contractions correctly, giving "words" like "isn" and "doesn" (with the "'t" split off).
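the contraction problem usually comes from tokenizing on every non-letter. a minimal sketch of a tokenizer that keeps the apostrophe inside a word, assuming the counts are built from raw text (it can't repair a list that was already split upstream). `tokenize` and `WORD_RE` are illustrative names, and this says nothing about the "vernacular" bug, whose cause isn't known here.

```python
import re

# a word is a run of letters, optionally followed by 'letters groups
# (so "isn't" and "o'clock" stay whole, but a trailing quote is dropped)
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Lowercase the text and return word tokens, keeping contractions intact."""
    # treat the typographic apostrophe (U+2019) like the plain one
    text = text.lower().replace("\u2019", "'")
    return WORD_RE.findall(text)


tokens_example = tokenize("It isn't here; she doesn\u2019t know.")
```

this only covers ASCII letters, which is an assumption that fits an English word list and nothing else.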
perhaps the next task is "generate an annotated version of text, where the text is colored based on the word frequency".
💡 and, possibly, the uncommon words get Chinese translations added.
🔥 that will certainly help the English monoglot who is confused!
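a rough sketch of the coloring idea, as HTML output. everything specific here is an assumption: the rank cutoffs, the colors, the names `annotate` and `color_for`, and the idea that the merged list has been turned into a `{word: rank}` dict. translations are left out, since they would need a bilingual dictionary that this sketch doesn't have.

```python
import html
import re

# split text into words and everything-else, so spacing and punctuation survive
TOKEN_RE = re.compile(r"(?P<word>[A-Za-z]+(?:'[A-Za-z]+)*)|(?P<other>[^A-Za-z]+)")

# (max rank, color) bands; arbitrary cutoffs chosen for illustration
BANDS = [(1000, "black"), (5000, "steelblue"), (20000, "darkorange")]
RARE_COLOR = "crimson"  # beyond the last band, or not in the list at all


def color_for(word, rank):
    """Pick a color from the word's frequency rank (1 = most common)."""
    r = rank.get(word.lower())
    if r is None:
        return RARE_COLOR
    for limit, color in BANDS:
        if r <= limit:
            return color
    return RARE_COLOR


def annotate(text, rank):
    """Return an HTML fragment with each word wrapped in a colored span."""
    parts = []
    for m in TOKEN_RE.finditer(text):
        word = m.group("word")
        if word:
            color = color_for(word, rank)
            parts.append(f'<span style="color:{color}">{html.escape(word)}</span>')
        else:
            parts.append(html.escape(m.group("other")))
    return "".join(parts)


rank_example = {"the": 1, "rain": 3000}
annotated_example = annotate("The rain & vernacular", rank_example)
```

a continuous color scale based on log-rank would probably look nicer than fixed bands, but bands are easier to read at a glance.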
the variance in a word's frequency across the lists (mostly) says something about the "cultural loading" of that word.
most of the words that are more common in 19th-century books are low-cultural-loading. 💡 (perhaps the prevalence of "gentleman" is cultural. but words like "rain" and "hat" are more frequent just because other words ("geometry", "organic") are less frequent.)
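the "more frequent just because other words are less frequent" effect falls out of relative frequencies summing to 1. a toy illustration with made-up counts (none of these numbers come from the real lists, and `relative` and `log_ratio` are hypothetical helpers): "rain" has the same raw count in both corpora, but its share is slightly higher in the corpus that lacks "geometry" and "organic", while "gentleman" shows a much larger gap.

```python
import math


def relative(counts):
    """Convert raw counts to shares of the corpus (assumes a non-empty corpus)."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def log_ratio(word, counts_a, counts_b, floor=1e-9):
    """log2 of (share in A / share in B); the floor avoids dividing by zero."""
    a = relative(counts_a).get(word, 0.0)
    b = relative(counts_b).get(word, 0.0)
    return math.log2(max(a, floor) / max(b, floor))


# invented counts: "other" stands in for the rest of the vocabulary
old_books = {"rain": 10, "hat": 10, "gentleman": 20, "other": 960}
modern = {"rain": 10, "hat": 10, "gentleman": 2,
          "geometry": 40, "organic": 40, "other": 960}

rain_shift = log_ratio("rain", old_books, modern)            # small, positive
gentleman_shift = log_ratio("gentleman", old_books, modern)  # much larger
```

so a small positive log-ratio shared by lots of everyday words is probably just this dilution effect, and only the outliers are candidates for "cultural loading".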