first20hours / google-10000-english

This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.

Top Related Projects

Repository for Frequency Word List Generator and processed files

:memo: A text file containing 479k English words for all your dictionary/word-based projects e.g: auto-completion / autosuggestion

📜 A collection of wordlists for many different usages

Hunspell dictionaries in UTF-8

Quick Overview

The "google-10000-english" repository is a collection of plain-text word lists containing the most common English words, derived from Google's Trillion Word Corpus. Alongside the main 10,000-word list, it provides swear-free variants and lists split by word length, all sorted by frequency of occurrence.

Pros

  • Provides a comprehensive list of common English words for various applications
  • Multiple list sizes available to suit different needs
  • Words are sorted by frequency, making it easy to prioritize the most common terms
  • Open-source and freely available for use in projects

Cons

  • Not actively maintained (last update was in 2016)
  • Limited to English language only
  • May not include newer terms or slang that have become popular since the last update
  • No additional context or metadata provided for the words (e.g., part of speech, definitions)

Code Examples

This repository does not contain a code library, only plain-text word lists, so it ships no code of its own.

Getting Started

As this is not a code library, there are no specific getting started instructions. However, to use the word lists in your project, you can follow these general steps:

  1. Clone or download the repository from GitHub: https://github.com/first20hours/google-10000-english
  2. Choose the appropriate word list file based on your needs (e.g., google-10000-english-no-swears.txt for the 10,000-word list with profanity removed)
  3. Import the chosen file into your project and process it as needed (e.g., read the file line by line to create an array or set of words)

Note that the usage of these word lists will depend on your specific application and programming language.
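
As a minimal sketch of step 3, assuming the chosen file holds one word per line in frequency order (as the README describes), the list can be read into an ordered list plus a rank lookup. The helper name `load_ranked_words` and the in-memory sample are illustrative and not part of the repository.

```python
import io


def load_ranked_words(lines):
    """Return (words, rank) from an iterable of lines, one word per line.

    `words` keeps the file order (most frequent first);
    `rank` maps each word to its 1-based position.
    """
    words = []
    rank = {}
    for line in lines:
        word = line.strip()
        if word and word not in rank:  # skip blank lines and repeated words
            rank[word] = len(words) + 1
            words.append(word)
    return words, rank


# Stand-in for: open("google-10000-english.txt", encoding="utf-8")
sample = io.StringIO("the\nof\nto\nand\na\n")
words, rank = load_ranked_words(sample)
```

Any iterable of lines works here, so the same function accepts an open file handle in place of the `io.StringIO` sample.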

Competitor Comparisons

Repository for Frequency Word List Generator and processed files

Pros of FrequencyWords

  • Covers multiple languages (50+) with frequency data
  • Includes larger word lists (up to 300k words per language)
  • Provides additional metadata like part of speech

Cons of FrequencyWords

  • Less focused on English-specific data
  • May include less common or specialized words
  • Potentially more complex to use due to multiple files and formats

Code Comparison

FrequencyWords:

1 the 25.1462
2 be 12.4661
3 and 9.9776
4 of 9.7826
5 a 8.0604

google-10000-english:

the
of
to
and
a

Summary

FrequencyWords offers a more comprehensive, multi-language approach with additional metadata, while google-10000-english provides a simpler, focused list of common English words. The choice between them depends on the specific needs of the project, such as language requirements, desired word count, and the importance of frequency data.

:memo: A text file containing 479k English words for all your dictionary/word-based projects e.g: auto-completion / autosuggestion

Pros of english-words

  • Much larger vocabulary (466k+ words vs. 10k)
  • Includes various word forms and specialized terms
  • Regularly updated and maintained

Cons of english-words

  • Larger file size, potentially slower to load
  • May include obscure or rarely used words
  • Less focused on common, everyday vocabulary

Code Comparison

english-words:

with open('words_alpha.txt', 'r') as f:
    words = [word.strip() for word in f]

google-10000-english:

with open('google-10000-english.txt', 'r') as f:
    words = [word.strip() for word in f]

The code to read and process the word lists is essentially identical for both repositories. The main difference lies in the content and size of the word lists themselves.

english-words provides a comprehensive collection of English words, suitable for applications requiring an extensive vocabulary. google-10000-english offers a concise list of common words, ideal for projects focusing on everyday language or where processing speed is crucial.

Choose english-words for broad coverage or google-10000-english for a streamlined, frequently-used word set.

📜 A collection of wordlists for many different usages

Pros of wordlists

  • More comprehensive, containing multiple wordlists for various purposes
  • Includes specialized lists like common passwords and usernames
  • Regularly updated and maintained

Cons of wordlists

  • Larger file size and repository, potentially slower to download
  • May contain inappropriate or offensive words in some lists
  • Less focused on general English vocabulary

Code Comparison

google-10000-english:

the
of
to
and
a

wordlists:

123456
password
12345678
qwerty
123456789

The google-10000-english repository centers on a list of the 10,000 most common English words, sorted by frequency, along with a few filtered variants of that list. It's primarily useful for language learning, text analysis, and basic password strength checking.

The wordlists repository offers a variety of lists for different purposes, including common passwords, usernames, and domain names. It's more suitable for security testing, password cracking, and penetration testing scenarios.

While google-10000-english focuses on general English vocabulary, wordlists provides a broader range of word collections for various applications, making it more versatile but less specialized for language-related tasks.

Hunspell dictionaries in UTF-8

Pros of dictionaries

  • Offers a wide variety of dictionaries in multiple languages
  • Includes specialized dictionaries (e.g., medical terms, profanities)
  • Regularly updated and maintained

Cons of dictionaries

  • Larger file sizes due to comprehensive word lists
  • May require more processing time for applications
  • Less focused on common English words

Code Comparison

google-10000-english:

the
of
to
and
a

dictionaries:

{
  "name": "en",
  "words": [
    "a",
    "aa",
    "aaa",
    "aaron",
    "ab"
  ]
}

Summary

google-10000-english provides a simple list of the most common English words, ideal for basic language processing tasks. dictionaries offers a more comprehensive and diverse set of word lists across multiple languages and domains. While google-10000-english is more lightweight and focused, dictionaries provides greater flexibility and depth for various linguistic applications.

README

Not Maintained

About This Repo

This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.

According to the Google Machine Translation Team:

Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google's datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more - resulting in a training corpus of one trillion words from public Web pages.

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That's why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.

This repo is derived from Peter Norvig's compilation of the 1/3 million most frequent English words. I limited this file to the 10,000 most common words, then removed the appended frequency counts by running this sed command in my text editor:

sed 's/[0-9]*//g'

Special thanks to koseki for de-duplicating the list.
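
The derivation described above (truncate to the top 10,000, drop the counts, de-duplicate) could be sketched in Python roughly as follows. This is not the author's actual procedure: it assumes each input line is a word followed by whitespace and a count, and the function name `top_words` and the sample counts are made up for illustration.

```python
def top_words(count_lines, limit=10_000):
    """Keep the first `limit` distinct words from 'word<whitespace>count' lines."""
    seen = set()
    result = []
    for line in count_lines:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        word = parts[0]  # drop the count that follows the word
        if word in seen:
            continue  # de-duplicate, keeping the first (most frequent) entry
        seen.add(word)
        result.append(word)
        if len(result) == limit:
            break
    return result


# Illustrative input only; the counts are placeholders, not real corpus data.
sample = ["the\t300", "of\t200", "the\t5", "and\t100", "to\t90"]
top3 = top_words(sample, limit=3)
```

Unlike the sed expression, which deletes every digit on a line, this sketch splits on whitespace, so a token that itself contains digits would be kept intact.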

Swear-free lists

There are two additional lists which are identical to the original 10,000 word list, but with swear words removed. Swear words were removed based on these lists:

Word length lists

Three of the lists (all derived from the US English list) are split by word length:

  • Short: 1-4 characters
  • Medium: 5-8 characters
  • Long: 9+ characters

Each list retains the original list sorting (by frequency, descending).
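
A minimal sketch of how such length-based lists could be produced from a frequency-ordered list, using the three ranges given above. The function name `split_by_length` and the sample words are assumptions for illustration; the repository itself only ships the resulting text files.

```python
def split_by_length(words):
    """Split a frequency-ordered word list into short/medium/long buckets.

    Each bucket preserves the input order, so it stays sorted by frequency.
    """
    buckets = {"short": [], "medium": [], "long": []}
    for word in words:
        n = len(word)
        if n <= 4:        # Short: 1-4 characters
            key = "short"
        elif n <= 8:      # Medium: 5-8 characters
            key = "medium"
        else:             # Long: 9+ characters
            key = "long"
        buckets[key].append(word)
    return buckets


ranked = ["the", "information", "about", "of", "business", "available"]
buckets = split_by_length(ranked)
```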

Usage

This repo is useful as a corpus for typing training programs. According to analysis of the Oxford English Corpus, the 7,000 most common English lemmas account for approximately 90% of usage, so a 10,000 word training corpus is more than sufficient for practical training applications.
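
One way to check how well a word list covers a given text is to measure the share of tokens that appear in it. The sketch below is a rough approximation: it counts surface forms rather than lemmas (so it will tend to understate coverage compared with the Oxford figure), and the tokenizer, the function name `coverage`, and the tiny vocabulary are all illustrative assumptions.

```python
import re


def coverage(text, vocabulary):
    """Return the fraction of word tokens in `text` found in `vocabulary`."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


# In practice `vocab` would be a set built from one of the word list files.
vocab = {"the", "of", "to", "and", "a", "cat", "sat", "on"}
ratio = coverage("The cat sat on the mat", vocab)
```

Here five of the six tokens are in the vocabulary, so `ratio` is 5/6.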

To use this list as a training corpus in Amphetype, paste the contents into the "Lesson Generator" tab with the following settings:

  • Make 3 copies of the list
  • Divide into sublists of size 3
  • Add to sources as google-10000-english

In the "Sources" tab, you should see google-10000-english available for training. Set WPM at 10 more than your current average, set accuracy to 98%, and you're set to train.

Enjoy!