google-10000-english
This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.
Top Related Projects
- FrequencyWords: Repository for Frequency Word List Generator and processed files
- english-words: :memo: A text file containing 479k English words for all your dictionary/word-based projects e.g. auto-completion / autosuggestion
- wordlists: 📜 A collection of wordlists for many different usages
- dictionaries: Hunspell dictionaries in UTF-8
Quick Overview
The "google-10000-english" repository is a collection of word lists containing the most common English words. It's based on the Google's Trillion Word Corpus and provides lists of varying lengths (1,000, 3,000, 5,000, 10,000, 20,000, and 30,000 words) sorted by frequency of occurrence.
Pros
- Provides a comprehensive list of common English words for various applications
- Multiple list sizes available to suit different needs
- Words are sorted by frequency, making it easy to prioritize the most common terms
- Open-source and freely available for use in projects
Cons
- Not actively maintained (last update was in 2016)
- Limited to English language only
- May not include newer terms or slang that have become popular since the last update
- No additional context or metadata provided for the words (e.g., part of speech, definitions)
Code Examples
This repository does not contain a code library, but rather text files with word lists. Therefore, there are no code examples to provide.
Getting Started
As this is not a code library, there are no specific getting started instructions. However, to use the word lists in your project, you can follow these general steps:
- Clone or download the repository from GitHub: https://github.com/first20hours/google-10000-english
- Choose the appropriate word list file based on your needs (e.g., `google-10000-english-no-swears.txt` for a list of 10,000 words without profanity)
- Import the chosen file into your project and process it as needed, e.g., read the file line by line to create an array or set of words (see the sketch below)
Note that the usage of these word lists will depend on your specific application and programming language.
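For illustration, here is a minimal Python sketch of the last step. It assumes the repository has been cloned locally and uses the swear-free list; any of the .txt files works the same way.

```python
# Minimal sketch: load one of the word lists into a set for fast lookups.
# Assumes the repo has been cloned so the .txt file is in the working directory.
with open("google-10000-english-no-swears.txt", encoding="utf-8") as f:
    common_words = {line.strip() for line in f if line.strip()}

print(len(common_words))           # roughly 10,000 entries
print("example" in common_words)   # simple membership test
```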
Competitor Comparisons
FrequencyWords: Repository for Frequency Word List Generator and processed files
Pros of FrequencyWords
- Covers multiple languages (50+) with frequency data
- Includes larger word lists (up to 300k words per language)
- Provides a frequency value alongside each word
Cons of FrequencyWords
- Less focused on English-specific data
- May include less common or specialized words
- Potentially more complex to use due to multiple files and formats
Code Comparison
FrequencyWords:
```
1 the 25.1462
2 be 12.4661
3 and 9.9776
4 of 9.7826
5 a 8.0604
```
google-10000-english:
```
the
of
to
and
a
```
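As a hedged sketch, loading each format might look like the following, assuming the rank/word/frequency layout shown in the FrequencyWords sample above (the exact column layout may vary between its language files):

```python
# Hypothetical loaders for the two formats shown above.

def load_frequencywords(path):
    """Parse lines like '1 the 25.1462' into (word, frequency) pairs."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3:
                # parts[0] is the rank, parts[1] the word, parts[2] the frequency value
                entries.append((parts[1], float(parts[2])))
    return entries

def load_google_10000(path):
    """Plain word-per-line list, already sorted by frequency."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```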
Summary
FrequencyWords offers a more comprehensive, multi-language approach with additional metadata, while google-10000-english provides a simpler, focused list of common English words. The choice between them depends on the specific needs of the project, such as language requirements, desired word count, and the importance of frequency data.
english-words: :memo: A text file containing 479k English words for all your dictionary/word-based projects e.g. auto-completion / autosuggestion
Pros of english-words
- Much larger vocabulary (466k+ words vs. 10k)
- Includes various word forms and specialized terms
- Regularly updated and maintained
Cons of english-words
- Larger file size, potentially slower to load
- May include obscure or rarely used words
- Less focused on common, everyday vocabulary
Code Comparison
english-words:
```python
with open('words_alpha.txt', 'r') as f:
    words = [word.strip() for word in f]
```
google-10000-english:
```python
with open('google-10000-english.txt', 'r') as f:
    words = [word.strip() for word in f]
```
The code to read and process the word lists is essentially identical for both repositories. The main difference lies in the content and size of the word lists themselves.
english-words provides a comprehensive collection of English words, suitable for applications requiring an extensive vocabulary. google-10000-english offers a concise list of common words, ideal for projects focusing on everyday language or where processing speed is crucial.
Choose english-words for broad coverage or google-10000-english for a streamlined, frequently-used word set.
wordlists: 📜 A collection of wordlists for many different usages
Pros of wordlists
- More comprehensive, containing multiple wordlists for various purposes
- Includes specialized lists like common passwords and usernames
- Regularly updated and maintained
Cons of wordlists
- Larger file size and repository, potentially slower to download
- May contain inappropriate or offensive words in some lists
- Less focused on general English vocabulary
Code comparison
google-10000-english:
```
the
of
to
and
a
```
wordlists:
```
123456
password
12345678
qwerty
123456789
```
The google-10000-english repository contains a single list of the 10,000 most common English words, sorted by frequency. It's primarily useful for language learning, text analysis, and basic password strength checking.
The wordlists repository offers a variety of lists for different purposes, including common passwords, usernames, and domain names. It's more suitable for security testing, password cracking, and penetration testing scenarios.
While google-10000-english focuses on general English vocabulary, wordlists provides a broader range of word collections for various applications, making it more versatile but less specialized for language-related tasks.
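As a hedged illustration of the basic password strength check mentioned above (the file name and the single-word check are assumptions for this sketch, not part of either repository):

```python
# Sketch: flag passwords that are just a single common English word.
# Assumes google-10000-english.txt has been downloaded to the working directory.
with open("google-10000-english.txt", encoding="utf-8") as f:
    common_words = {line.strip().lower() for line in f if line.strip()}

def is_weak(password: str) -> bool:
    return password.lower() in common_words

print(is_weak("password"))   # likely True: a very common word
print(is_weak("x9#Lq2!r"))   # False: not a dictionary word
```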
dictionaries: Hunspell dictionaries in UTF-8
Pros of dictionaries
- Offers a wide variety of dictionaries in multiple languages
- Includes specialized dictionaries (e.g., medical terms, profanities)
- Regularly updated and maintained
Cons of dictionaries
- Larger file sizes due to comprehensive word lists
- May require more processing time for applications
- Less focused on common English words
Code comparison
google-10000-english:
```
the
of
to
and
a
```
dictionaries:
```json
{
  "name": "en",
  "words": [
    "a",
    "aa",
    "aaa",
    "aaron",
    "ab"
  ]
}
```
Summary
google-10000-english provides a simple list of the most common English words, ideal for basic language processing tasks. dictionaries offers a more comprehensive and diverse set of word lists across multiple languages and domains. While google-10000-english is more lightweight and focused, dictionaries provides greater flexibility and depth for various linguistic applications.
README
About This Repo
This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.
According to the Google Machine Translation Team:
> Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google's datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more - resulting in a training corpus of one trillion words from public Web pages.
>
> We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That's why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.
This repo is derived from Peter Norvig's compilation of the 1/3 million most frequent English words. I limited this file to the 10,000 most common words, then removed the appended frequency counts by running this sed command in my text editor:
```
sed 's/[0-9]*//g'
```
Special thanks to koseki for de-duplicating the list.
Swear-free lists
There are two additional lists which are identical to the original 10,000 word list, but with swear words removed. Swear words were removed based on these lists:
- reimertz/curse-words
- MauriceButler/badwords
- LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
Word length lists
Three of the lists (all derived from the US English list) are split by word length:
- Short: 1-4 characters
- Medium: 5-8 characters
- Long: 9+ characters
Each list retains the original sorting (by frequency, descending).
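A short sketch of how these splits can be reproduced from a base list (the file name is an assumption; substitute whichever US English list you use):

```python
# Partition a word list into the short/medium/long buckets described above.
with open("google-10000-english-usa.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

short_words  = [w for w in words if len(w) <= 4]        # 1-4 characters
medium_words = [w for w in words if 5 <= len(w) <= 8]   # 5-8 characters
long_words   = [w for w in words if len(w) >= 9]        # 9+ characters

# Each slice keeps the original frequency (descending) order.
```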
Usage
This repo is useful as a corpus for typing training programs. According to analysis of the Oxford English Corpus, the 7,000 most common English lemmas account for approximately 90% of usage, so a 10,000 word training corpus is more than sufficient for practical training applications.
To use this list as a training corpus in Amphetype, paste the contents into the "Lesson Generator" tab with the following settings:
- Make **3** copies of the list
- Divide into sublists of size **3**
- Add to sources as **google-10000-english**
In the "Sources" tab, you should see google-10000-english available for training. Set WPM at 10 more than your current average, set accuracy to 98%, and you're set to train.
Enjoy!