google-10000-english
This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.
Top Related Projects
- FrequencyWords: Repository for Frequency Word List Generator and processed files
- english-words: :memo: A text file containing 479k English words for all your dictionary/word-based projects e.g. auto-completion / autosuggestion
- wordlists: 📜 A collection of wordlists for many different usages
- dictionaries: Hunspell dictionaries in UTF-8
Quick Overview
The "google-10000-english" repository is a collection of word lists containing the most common English words. It's based on the Google's Trillion Word Corpus and provides lists of varying lengths (1,000, 3,000, 5,000, 10,000, 20,000, and 30,000 words) sorted by frequency of occurrence.
Pros
- Provides a comprehensive list of common English words for various applications
- Multiple list sizes available to suit different needs
- Words are sorted by frequency, making it easy to prioritize the most common terms
- Open-source and freely available for use in projects
Cons
- Not actively maintained (last update was in 2016)
- Limited to English language only
- May not include newer terms or slang that have become popular since the last update
- No additional context or metadata provided for the words (e.g., part of speech, definitions)
Code Examples
This repository does not contain a code library, but rather text files with word lists. Therefore, there are no code examples to provide.
Getting Started
As this is not a code library, there are no specific getting started instructions. However, to use the word lists in your project, you can follow these general steps:
- Clone or download the repository from GitHub: https://github.com/first20hours/google-10000-english
- Choose the appropriate word list file based on your needs (e.g., `google-10000-english-no-swears.txt` for a list of 10,000 words without profanity)
- Import the chosen file into your project and process it as needed, e.g., read the file line by line to create an array or set of words (see the sketch below)
Note that the usage of these word lists will depend on your specific application and programming language.
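For illustration, here is a minimal Python sketch of the last step. It assumes the repository has been cloned locally and uses the swear-free list; any of the .txt files works the same way.

```python
# Minimal sketch: load one of the word lists into a set for fast lookups.
# Assumes the repo has been cloned so the .txt file is in the working directory.
with open("google-10000-english-no-swears.txt", encoding="utf-8") as f:
    common_words = {line.strip() for line in f if line.strip()}

print(len(common_words))           # roughly 10,000 entries
print("example" in common_words)   # simple membership test
```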
Competitor Comparisons
FrequencyWords: Repository for Frequency Word List Generator and processed files
Pros of FrequencyWords
- Covers multiple languages (50+) with frequency data
- Includes larger word lists (up to 300k words per language)
- Provides a frequency value alongside each word
Cons of FrequencyWords
- Less focused on English-specific data
- May include less common or specialized words
- Potentially more complex to use due to multiple files and formats
Code Comparison
FrequencyWords:
```
1 the 25.1462
2 be 12.4661
3 and 9.9776
4 of 9.7826
5 a 8.0604
```
google-10000-english:
```
the
of
to
and
a
```
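As a hedged sketch, loading each format might look like the following, assuming the rank/word/frequency layout shown in the FrequencyWords sample above (the exact column layout may vary between its language files):

```python
# Hypothetical loaders for the two formats shown above.

def load_frequencywords(path):
    """Parse lines like '1 the 25.1462' into (word, frequency) pairs."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3:
                # parts[0] is the rank, parts[1] the word, parts[2] the frequency value
                entries.append((parts[1], float(parts[2])))
    return entries

def load_google_10000(path):
    """Plain word-per-line list, already sorted by frequency."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```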
Summary
FrequencyWords offers a more comprehensive, multi-language approach with additional metadata, while google-10000-english provides a simpler, focused list of common English words. The choice between them depends on the specific needs of the project, such as language requirements, desired word count, and the importance of frequency data.
english-words: :memo: A text file containing 479k English words for all your dictionary/word-based projects e.g. auto-completion / autosuggestion
Pros of english-words
- Much larger vocabulary (466k+ words vs. 10k)
- Includes various word forms and specialized terms
- Regularly updated and maintained
Cons of english-words
- Larger file size, potentially slower to load
- May include obscure or rarely used words
- Less focused on common, everyday vocabulary
Code Comparison
english-words:
```python
with open('words_alpha.txt', 'r') as f:
    words = [word.strip() for word in f]
```
google-10000-english:
```python
with open('google-10000-english.txt', 'r') as f:
    words = [word.strip() for word in f]
```
The code to read and process the word lists is essentially identical for both repositories. The main difference lies in the content and size of the word lists themselves.
english-words provides a comprehensive collection of English words, suitable for applications requiring an extensive vocabulary. google-10000-english offers a concise list of common words, ideal for projects focusing on everyday language or where processing speed is crucial.
Choose english-words for broad coverage or google-10000-english for a streamlined, frequently-used word set.
wordlists: 📜 A collection of wordlists for many different usages
Pros of wordlists
- More comprehensive, containing multiple wordlists for various purposes
- Includes specialized lists like common passwords and usernames
- Regularly updated and maintained
Cons of wordlists
- Larger file size and repository, potentially slower to download
- May contain inappropriate or offensive words in some lists
- Less focused on general English vocabulary
Code comparison
google-10000-english:
```
the
of
to
and
a
```
wordlists:
```
123456
password
12345678
qwerty
123456789
```
The google-10000-english repository contains a single list of the 10,000 most common English words, sorted by frequency. It's primarily useful for language learning, text analysis, and basic password strength checking.
The wordlists repository offers a variety of lists for different purposes, including common passwords, usernames, and domain names. It's more suitable for security testing, password cracking, and penetration testing scenarios.
While google-10000-english focuses on general English vocabulary, wordlists provides a broader range of word collections for various applications, making it more versatile but less specialized for language-related tasks.
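As a hedged illustration of the basic password strength check mentioned above (the file name and the single-word check are assumptions for this sketch, not part of either repository):

```python
# Sketch: flag passwords that are just a single common English word.
# Assumes google-10000-english.txt has been downloaded to the working directory.
with open("google-10000-english.txt", encoding="utf-8") as f:
    common_words = {line.strip().lower() for line in f if line.strip()}

def is_weak(password: str) -> bool:
    return password.lower() in common_words

print(is_weak("password"))   # likely True: a very common word
print(is_weak("x9#Lq2!r"))   # False: not a dictionary word
```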
dictionaries: Hunspell dictionaries in UTF-8
Pros of dictionaries
- Offers a wide variety of dictionaries in multiple languages
- Includes specialized dictionaries (e.g., medical terms, profanities)
- Regularly updated and maintained
Cons of dictionaries
- Larger file sizes due to comprehensive word lists
- May require more processing time for applications
- Less focused on common English words
Code comparison
google-10000-english:
```
the
of
to
and
a
```
dictionaries:
```json
{
  "name": "en",
  "words": [
    "a",
    "aa",
    "aaa",
    "aaron",
    "ab"
  ]
}
```
Summary
google-10000-english provides a simple list of the most common English words, ideal for basic language processing tasks. dictionaries offers a more comprehensive and diverse set of word lists across multiple languages and domains. While google-10000-english is more lightweight and focused, dictionaries provides greater flexibility and depth for various linguistic applications.
README
About This Repo
This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.
According to the Google Machine Translation Team:
> Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google's datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more - resulting in a training corpus of one trillion words from public Web pages.
>
> We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That's why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.
This repo is derived from Peter Norvig's compilation of the 1/3 million most frequent English words. I limited this file to the 10,000 most common words, then removed the appended frequency counts by running this sed command in my text editor:
```
sed 's/[0-9]*//g'
```
Special thanks to koseki for de-duplicating the list.
Swear-free lists
There are two additional lists which are identical to the original 10,000 word list, but with swear words removed. Swear words were removed based on these lists:
- reimertz/curse-words
- MauriceButler/badwords
- LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
Word length lists
Three of the lists (all derived from the US English list) are split by word length:
- Short: 1-4 characters
- Medium: 5-8 characters
- Long: 9+ characters
Each list retains the original sorting (by frequency, descending).
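A short sketch of how these splits can be reproduced from a base list (the file name is an assumption; substitute whichever US English list you use):

```python
# Partition a word list into the short/medium/long buckets described above.
with open("google-10000-english-usa.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

short_words  = [w for w in words if len(w) <= 4]        # 1-4 characters
medium_words = [w for w in words if 5 <= len(w) <= 8]   # 5-8 characters
long_words   = [w for w in words if len(w) >= 9]        # 9+ characters

# Each slice keeps the original frequency (descending) order.
```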
Usage
This repo is useful as a corpus for typing training programs. According to analysis of the Oxford English Corpus, the 7,000 most common English lemmas account for approximately 90% of usage, so a 10,000 word training corpus is more than sufficient for practical training applications.
To use this list as a training corpus in Amphetype, paste the contents into the "Lesson Generator" tab with the following settings:
- Make **3** copies of the list
- Divide into sublists of size **3**
- Add to sources as **google-10000-english**
In the "Sources" tab, you should see google-10000-english available for training. Set WPM at 10 more than your current average, set accuracy to 98%, and you're set to train.
Enjoy!