FrequencyWords

Repository for Frequency Word List Generator and processed files

Top Related Projects

google-10000-english

4,063

This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.

english-words

11,167

A text file containing 479k English words for all your dictionary/word-based projects, e.g. auto-completion / autosuggestion

Quick Overview

FrequencyWords is a GitHub repository that provides word frequency lists for many languages. The lists are generated from the OpenSubtitles corpus (2016 and 2018 releases) and pair each word with its number of occurrences in that corpus. The project aims to assist language learners, researchers, and developers working on natural language processing tasks.

Pros

Covers a wide range of languages (over 40)
Provides both raw and processed word lists
Includes frequency information for each word
Regularly updated with new languages and improvements

Cons

Some languages have limited word counts
Data quality may vary depending on the source of subtitles
Lacks advanced linguistic features (e.g., part-of-speech tagging)
May not accurately represent formal or academic language use

Code Examples

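FrequencyWords is primarily a data repository: alongside the source of the list generator, it ships plain-text word lists rather than a library, so there is no API to demonstrate.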

Getting Started

There is nothing to install. To use the word frequency lists:

Visit the GitHub repository: https://github.com/hermitdave/FrequencyWords
Navigate to the desired language folder in the content directory
Download the raw text file containing the word list
Use the downloaded file in your project or analysis as needed
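The download step can also be scripted. The sketch below is only an illustration: the raw-file URL pattern (`content/<year>/<lang>/<lang>_50k.txt` on the `master` branch) is an assumption about the repository layout and should be checked against the repository before use, and the helper names `build_list_url` and `fetch_lines` are hypothetical.

```python
from typing import List
from urllib.request import urlopen

# ASSUMPTION: the lists live under content/<year>/<lang>/ on the master branch.
# Verify this layout in the repository; it is not stated in the README.
RAW_BASE = "https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content"


def build_list_url(lang: str, year: str = "2018", variant: str = "50k") -> str:
    """Build the (assumed) raw URL of one word list, e.g. en_50k.txt."""
    return f"{RAW_BASE}/{year}/{lang}/{lang}_{variant}.txt"


def fetch_lines(url: str, timeout: float = 30.0) -> List[str]:
    """Download a word list and return its lines as decoded text."""
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return text.splitlines()
```

Calling `fetch_lines(build_list_url("en"))` would then return the lines of the English list, provided the assumed path exists.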

Each line of a word list holds one word followed by a space and its number of occurrences in the corpus. For example, en_50k.txt begins:

you 22484400
i 19975318
the 17594291
to 13200962

Users can easily parse these files using standard text processing tools or programming languages of their choice.
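As a minimal sketch of such parsing in Python, the function below turns lines of the form "word count" into `(word, count)` pairs. The function name `parse_frequency_lines` is hypothetical, and the decision to silently skip malformed lines is an assumption made for this example, not behaviour defined by the project.

```python
from typing import Iterable, Iterator, Tuple


def parse_frequency_lines(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (word, count) pairs from lines shaped like 'word count'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last space so the count is always the final field.
        word, _, raw_count = line.rpartition(" ")
        if not word:
            continue  # no separator found: not a 'word count' line
        try:
            count = int(raw_count)
        except ValueError:
            continue  # final field is not a number
        yield word, count


# Sample lines in the documented format. A real file would be opened with
# open(path, encoding="utf-8") and passed to the function in the same way.
sample = ["you 22484400", "i 19975318", "the 17594291", "to 13200962"]
entries = list(parse_frequency_lines(sample))
print(entries[0])
```

Because the function accepts any iterable of strings, an open file object can be passed directly, which avoids loading a large list into memory before parsing.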

Competitor Comparisons

google-10000-english

4,063

This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.

Pros of google-10000-english

Focused on the most common English words, making it ideal for basic language learning and simple NLP tasks
Includes several variants of the list
Simple, clean format with one word per line

Cons of google-10000-english

Limited to English language only
Lacks frequency information for each word
Not regularly updated

Code comparison

FrequencyWords (en_50k.txt):

you 22484400
i 19975318
the 17594291
to 13200962

google-10000-english:

the
of
and
to
a

Summary

FrequencyWords offers a more comprehensive dataset with frequency information for multiple languages, while google-10000-english provides a simpler, focused list of common English words. FrequencyWords is better suited for more advanced linguistic analysis and multi-language applications, whereas google-10000-english is ideal for basic English language tasks and learning.

english-words

11,167

A text file containing 479k English words for all your dictionary/word-based projects, e.g. auto-completion / autosuggestion

Pros of english-words

Larger word list with over 466,000 English words
Includes various word forms (plurals, verb conjugations, etc.)
Simple text file format, easy to parse and use in different applications

Cons of english-words

Lacks frequency information for words
May include rare or obsolete words, potentially less relevant for practical use
No additional metadata or categorization of words

Code comparison

FrequencyWords (en_50k.txt):

you 22484400
i 19975318
the 17594291
to 13200962

english-words:

A
a
aa
aal
aalii

Summary

FrequencyWords provides word frequency data for multiple languages, while english-words offers a comprehensive list of English words without frequency information. FrequencyWords suits applications that need word usage statistics; english-words suits tasks that need a broad vocabulary reference.

README

In the early days, I hosted the generated files on OneDrive, with my blog (https://invokeit.wordpress.com/frequency-word-lists/) linking to them. Going forward, both the code and the generated output are on GitHub.

OpenSubtitles tokenized source

The data used to generate the 2016 lists can be found at http://opus.lingfil.uu.se/OpenSubtitles2016.php
The data used to generate the 2018 lists can be found at http://opus.nlpl.eu/OpenSubtitles2018.php

Format

Frequency lists use the format {word}{space}{number_of_occurrences_in_corpus}. For example, the file en_50k.txt begins:

you 22484400
i 19975318
the 17594291
to 13200962
...
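One common use of this format is estimating how much of the corpus the most frequent words cover. The sketch below is an illustration under stated assumptions: the helper names `load_counts` and `coverage` are hypothetical, and the coverage figure is computed only over the words that were loaded, so on a truncated list such as en_50k.txt it would slightly overstate coverage of the full corpus.

```python
from typing import Dict, Iterable


def load_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Read '{word} {count}' lines into a dictionary, skipping other lines."""
    counts: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # not exactly a word and a count
        word, raw_count = parts
        try:
            counts[word] = int(raw_count)
        except ValueError:
            continue  # count field is not an integer
    return counts


def coverage(counts: Dict[str, int], top_n: int) -> float:
    """Share of all loaded occurrences accounted for by the top_n words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sorted(counts.values(), reverse=True)[:top_n]
    return sum(top) / total


# The four lines quoted from en_50k.txt above, used here as sample input.
sample = ["you 22484400", "i 19975318", "the 17594291", "to 13200962"]
counts = load_counts(sample)
top_two_share = coverage(counts, 2)
print(f"{top_two_share:.2%}")
```

With a full list loaded, `coverage(counts, 1000)` would give a rough sense of how far a learner gets with the 1,000 most frequent words.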

Usage

This data is reused by various widely used open-source projects, including Wikipedia, input methods, and autocomplete keyboards.

License

MIT License for code.
CC-by-sa-4.0 for content.
