jamestown (part 1)

Jamestown, North Dakota, is located along I-94 in the eastern half of the state.

Today's focus is on "word frequency".


I started with two corpora: one of 19th century literature (from Project Gutenberg), one of 20th century "sci-fi" literature. I got a rough word-rank for each, and combined them (using the harmonic mean of the two ranks) to get a combined word-list.
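
A minimal sketch of that combining step, to make the arithmetic concrete. The function names and the toy sentences are hypothetical, and two details are assumptions the notes above do not settle: ties are broken alphabetically, and a word missing from either corpus is simply dropped.

```python
from collections import Counter


def rank_words(tokens):
    """Map each word to its 1-based frequency rank (1 = most frequent)."""
    counts = Counter(tokens)
    # Descending count, then alphabetical, so ties are deterministic (assumption).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}


def combine_ranks(ranks_a, ranks_b):
    """Order the shared words by the harmonic mean of their two ranks."""
    # Assumption: words that appear in only one corpus are left out.
    shared = ranks_a.keys() & ranks_b.keys()
    combined = {
        w: 2 * ranks_a[w] * ranks_b[w] / (ranks_a[w] + ranks_b[w])
        for w in shared
    }
    return sorted(combined, key=lambda w: (combined[w], w))


# Toy stand-ins for the two corpora (hypothetical data).
old_tokens = "the whale and the sea and the ship".split()
new_tokens = "the ship and the robot said the ship".split()
combined_list = combine_ranks(rank_words(old_tokens), rank_words(new_tokens))
```

The harmonic mean is pulled toward the smaller (better) of the two ranks, so a word that is very common in one corpus stays fairly high in the combined list even if it is only moderately common in the other.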

While many high-frequency function words such as "the", "and", and "of" maintain consistent rankings, others like "said", "her", "she", and "me" show substantial divergence, suggesting notable stylistic or thematic shifts between the two periods and genres.
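
One way to put a number on that divergence is the absolute log-ratio of a word's two ranks. This is only a sketch: the measure is my choice rather than something the lists above were built with, the function name is hypothetical, and the ranks in the example are made up for illustration.

```python
import math


def rank_divergence(ranks_a, ranks_b):
    """Return (word, score) pairs, most divergent first.

    score = |log2(rank_a / rank_b)|, so 0.0 means identical ranks and
    1.0 means one rank is double the other.
    """
    shared = ranks_a.keys() & ranks_b.keys()
    scores = {w: abs(math.log2(ranks_a[w] / ranks_b[w])) for w in shared}
    # Highest score first; alphabetical order breaks ties.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up ranks, not the real ones from either corpus.
example = rank_divergence(
    {"the": 1, "said": 40, "she": 10},
    {"the": 1, "said": 10, "she": 40},
)
```

Using a ratio rather than a plain difference keeps a move from rank 10 to rank 40 on the same footing as a move from rank 1000 to rank 4000.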


It is also a word-list at all. Some of the notes:

  • The word "whale" shows up a lot more in the 19th century corpus. This is because one of the books is Moby Dick.
  • I am hoping to run an exhaustive listing of a few attributes. These include:
      • polysemy. (I am less concerned with words like "get", which have so many meanings as to be indefinable, than with words like "saw" ( or 锯子) or "face" (面向 or ).)
      • by lemma. "went" (108th) v. "go" (80th).
      • by part-of-speech, defined as "what the LLMs define as part-of-speech".
      • a "second-level" of word-type details.
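
For the "by lemma" attribute, a rough sketch of what the grouping could look like. The lemma table here is a tiny hand-written stand-in (a real run would need a proper lemmatizer), the function name is hypothetical, and the only ranks used are the two quoted above.

```python
from collections import defaultdict


def group_by_lemma(ranks, lemma_of):
    """Collect surface forms under their lemma, best-ranked form first."""
    groups = defaultdict(list)
    for word, rank in ranks.items():
        # Assumption: a word without a table entry is its own lemma.
        groups[lemma_of.get(word, word)].append((word, rank))
    return {lemma: sorted(forms, key=lambda f: f[1]) for lemma, forms in groups.items()}


# Hand-written stand-in for a lemmatizer (hypothetical).
lemma_table = {"went": "go"}
by_lemma = group_by_lemma({"go": 80, "went": 108}, lemma_table)
```

With the two ranks from the list, this puts "go" (80th) and "went" (108th) side by side under the single lemma "go".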