Top Related Projects
google-10000-english: A list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.
english-words: A text file containing 479k English words for all your dictionary/word-based projects, e.g. auto-completion / autosuggestion.
Quick Overview
FrequencyWords is a GitHub repository that provides word frequency lists for various languages. These lists are derived from subtitles and contain the most common words in each language, along with their frequency of occurrence. The project aims to assist language learners, researchers, and developers working on natural language processing tasks.
Pros
- Covers a wide range of languages (over 40)
- Provides both raw and processed word lists
- Includes frequency information for each word
- Regularly updated with new languages and improvements
Cons
- Some languages have limited word counts
- Data quality may vary depending on the source of subtitles
- Lacks advanced linguistic features (e.g., part-of-speech tagging)
- May not accurately represent formal or academic language use
Code Examples
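FrequencyWords is primarily a data repository (a word list generator plus the generated files) rather than a library with an API, so there are no API usage examples.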
Getting Started
As this is not a code library, there are no specific getting started instructions. However, users can access the word frequency lists by following these steps:
- Visit the GitHub repository: https://github.com/hermitdave/FrequencyWords
- Navigate to the desired language folder in the content directory
- Download the raw text file containing the word list
- Use the downloaded file in your project or analysis as needed
The word lists contain one entry per line: a word followed by its frequency count. For example, the README quotes these opening lines of en_50k.txt:
you 22484400
i 19975318
the 17594291
to 13200962
Users can easily parse these files using standard text processing tools or programming languages of their choice.
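As a minimal sketch of such parsing, assuming the `{word}{space}{count}` layout described in the README: the function names `parse_frequency_lines` and `load_frequency_file` and the local path are illustrative choices, not part of the repository.

```python
from pathlib import Path
from typing import Iterable, Iterator


def parse_frequency_lines(lines: Iterable[str]) -> Iterator[tuple[str, int]]:
    """Yield (word, count) pairs from lines shaped like 'word count'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last space so the count is always the final field.
        word, _, count = line.rpartition(" ")
        if not word:
            continue  # no space found, so the line lacks a count
        try:
            value = int(count)
        except ValueError:
            continue  # the final field is not an integer
        yield word, value


def load_frequency_file(path: str) -> list[tuple[str, int]]:
    """Read a downloaded list from a local path (hypothetical) into memory."""
    with Path(path).open(encoding="utf-8") as handle:
        return list(parse_frequency_lines(handle))


# Sample lines taken from the en_50k.txt excerpt in the README.
sample = ["you 22484400", "i 19975318", "the 17594291"]
print(list(parse_frequency_lines(sample)))
```

Malformed lines are skipped rather than raising, which suits exploratory use; a stricter pipeline might prefer to fail loudly instead.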
Competitor Comparisons
google-10000-english: A list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.
Pros of google-10000-english
- Focused on the most common English words, making it ideal for basic language learning and simple NLP tasks
- Includes multiple lists of varying sizes (1k, 3k, 5k, 10k words)
- Simple, clean format with one word per line
Cons of google-10000-english
- Limited to English language only
- Lacks frequency information for each word
- Not regularly updated (last commit in 2016)
Code comparison
FrequencyWords (en_50k.txt, as quoted in the README):
you 22484400
i 19975318
the 17594291
to 13200962
google-10000-english:
the
of
to
and
a
Summary
FrequencyWords offers a more comprehensive dataset with frequency information for multiple languages, while google-10000-english provides a simpler, focused list of common English words. FrequencyWords is better suited for more advanced linguistic analysis and multi-language applications, whereas google-10000-english is ideal for basic English language tasks and learning.
english-words: A text file containing 479k English words for all your dictionary/word-based projects, e.g. auto-completion / autosuggestion.
Pros of english-words
- Larger word list with over 466,000 English words
- Includes various word forms (plurals, verb conjugations, etc.)
- Simple text file format, easy to parse and use in different applications
Cons of english-words
- Lacks frequency information for words
- May include rare or obsolete words, potentially less relevant for practical use
- No additional metadata or categorization of words
Code comparison
FrequencyWords (en_50k.txt, as quoted in the README):
you 22484400
i 19975318
the 17594291
to 13200962
english-words:
A
a
aa
aal
aalii
Summary
FrequencyWords focuses on providing word frequency data for multiple languages, while english-words offers a comprehensive list of English words without frequency information. FrequencyWords is more suitable for applications requiring word usage statistics, whereas english-words is better for tasks needing a broad vocabulary reference. The choice between the two depends on the specific requirements of the project and whether word frequency data is necessary.
README
FrequencyWords
Repository for Frequency Word List Generator and processed files
In the early days I hosted the generated files on OneDrive, with my blog https://invokeit.wordpress.com/frequency-word-lists/ linking to them. Going forward, the code and the generated outputs are on GitHub.
OpenSubtitle tokenized source
The data used to generate the 2016 lists can be found at http://opus.lingfil.uu.se/OpenSubtitles2016.php
The data used to generate the 2018 lists can be found at http://opus.nlpl.eu/OpenSubtitles2018.php
Format
Frequency lists use the format {word}{space}{number_of_occurrences_in_corpus}. For example, in the file en_50k.txt:
you 22484400
i 19975318
the 17594291
to 13200962
...
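Raw counts like these are often easier to compare as shares of a total. The sketch below is only an illustration: `relative_frequencies` is a hypothetical helper, and the shares are computed over the four quoted entries alone, not over the full corpus.

```python
def relative_frequencies(entries):
    """Return (word, share) pairs, where share = count / sum of all given counts."""
    total = sum(count for _, count in entries)
    if total == 0:
        return []  # nothing to normalise
    return [(word, count / total) for word, count in entries]


# The four entries quoted above from en_50k.txt; the real file has many more lines.
entries = [("you", 22484400), ("i", 19975318), ("the", 17594291), ("to", 13200962)]

for word, share in relative_frequencies(entries):
    print(f"{word}\t{share:.3f}")
```

To get corpus-level shares, the same function would need every line of the file, since the total depends on all counts.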
Usages
These data are reused by various widely used open-source projects, including Wikipedia, input methods, and autocomplete keyboards.
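To illustrate how frequency counts can drive autocomplete, here is a small sketch that ranks prefix matches by count. It is not how any of the projects above implement it; the `PrefixCompleter` class and its methods are hypothetical names introduced for this example.

```python
import bisect


class PrefixCompleter:
    """Suggest words for a prefix, most frequent first (illustrative only)."""

    def __init__(self, entries):
        # Keep the highest count if a word appears more than once.
        counts = {}
        for word, count in entries:
            counts[word] = max(count, counts.get(word, 0))
        self._counts = counts
        self._words = sorted(counts)  # sorted order makes prefix matches contiguous

    def complete(self, prefix, limit=3):
        # Binary search for the first word that is >= prefix.
        start = bisect.bisect_left(self._words, prefix)
        matches = []
        for word in self._words[start:]:
            if not word.startswith(prefix):
                break  # past the contiguous run of matches
            matches.append(word)
        # Rank by descending count, breaking ties alphabetically.
        matches.sort(key=lambda w: (-self._counts[w], w))
        return matches[:limit]


# Entries quoted in the README excerpt of en_50k.txt.
completer = PrefixCompleter(
    [("you", 22484400), ("i", 19975318), ("the", 17594291), ("to", 13200962)]
)
print(completer.complete("t"))
```

A sorted list with binary search keeps the sketch short; a trie or a precomputed top-k per prefix would be the usual choice when lookups must be fast on large lists.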
License
MIT License for code.
CC-by-sa-4.0 for content.