fastText

Library for fast text representation and classification.


Top Related Projects

spaCy

31,840

💫 Industrial-strength Natural Language Processing (NLP) in Python

transformers

146,142

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

bert

39,267

TensorFlow code and pre-trained models for BERT

stanza

7,500

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages

flair

14,239

A very simple framework for state-of-the-art Natural Language Processing (NLP)

Quick Overview

FastText is an open-source library developed by Facebook Research for efficient learning of word representations and sentence classification. It extends the word2vec model with subword information, allowing it to handle out-of-vocabulary words and work effectively with morphologically rich languages.

Pros

Fast and efficient training and inference, even on large datasets
Supports both supervised and unsupervised learning tasks
Handles out-of-vocabulary words through subword information
Lightweight and easy to integrate into existing projects

Cons

May not capture complex semantic relationships as well as more advanced models like BERT
Limited in handling context-dependent word meanings
Requires careful preprocessing and hyperparameter tuning for optimal performance
Not suitable for tasks requiring deep language understanding or generation

Code Examples

Loading a pre-trained model and getting word vectors:

import fasttext

# Load pre-trained model
model = fasttext.load_model('cc.en.300.bin')

# Get word vector
word_vector = model.get_word_vector('example')
print(word_vector)

Training a text classification model:

import fasttext

# Train a classifier
model = fasttext.train_supervised(input="train.txt", lr=0.5, epoch=25, wordNgrams=2)

# Predict labels for new texts
predictions = model.predict(["example text to classify"])
print(predictions)
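The train.txt file used above is expected to hold one example per line, with each label written as a word prefixed by __label__ (the default prefix). A minimal sketch of producing such lines from Python follows. The helper names to_fasttext_line and write_training_file and the sample data are illustrative assumptions, not part of the fastText API.

```python
def to_fasttext_line(labels, text, prefix="__label__"):
    """Format one training example as '<prefix>label ... text' on a single line."""
    tags = " ".join(f"{prefix}{label}" for label in labels)
    # Collapse newlines and repeated whitespace so each example stays on one line.
    clean_text = " ".join(text.split())
    return f"{tags} {clean_text}"


def write_training_file(path, examples):
    """Write (labels, text) pairs to `path`, one example per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        for labels, text in examples:
            handle.write(to_fasttext_line(labels, text) + "\n")


# Hypothetical sample data, for illustration only.
examples = [
    (["positive"], "Great movie,\nloved it"),
    (["negative", "boring"], "Too long and too slow"),
]
lines = [to_fasttext_line(labels, text) for labels, text in examples]
print(lines[0])
```

Calling write_training_file("train.txt", examples) would then produce a file in the shape the classifier example reads.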

Finding similar words:

import fasttext

# Load pre-trained model
model = fasttext.load_model('cc.en.300.bin')

# Find similar words
similar_words = model.get_nearest_neighbors('computer', k=5)
print(similar_words)

Getting Started

To get started with FastText, follow these steps:

Install FastText using pip:
```
pip install fasttext
```
Download a pre-trained model or prepare your training data.
Use the code examples above to load a model, train a classifier, or perform word vector operations.
For more advanced usage, refer to the official FastText documentation and examples in the GitHub repository.

Competitor Comparisons

spaCy

31,840

💫 Industrial-strength Natural Language Processing (NLP) in Python

Pros of spaCy

More comprehensive NLP toolkit with advanced features like named entity recognition and dependency parsing
Optimized for production use with efficient memory usage and fast processing speeds
Extensive documentation and community support

Cons of spaCy

Steeper learning curve due to its more complex architecture
Larger model sizes, which may impact loading times and memory usage
Less suitable for simple text classification tasks compared to fastText

Code Comparison

spaCy:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

fastText:

import fasttext

model = fasttext.train_supervised("train.txt")
result = model.predict("This is a text sample")
print(result)

spaCy offers more advanced NLP capabilities out-of-the-box, while fastText excels in simple and efficient text classification tasks. spaCy provides a more comprehensive toolkit for various NLP tasks, whereas fastText focuses on fast and lightweight text classification and word representation learning.

transformers

146,142

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Pros of transformers

Supports a wide range of state-of-the-art models for various NLP tasks
Offers easy-to-use APIs for fine-tuning and inference
Regularly updated with new models and features

Cons of transformers

Higher computational requirements and slower inference
Steeper learning curve for beginners
Larger model sizes, requiring more storage and memory

Code Comparison

fastText:

import fasttext

model = fasttext.train_supervised("train.txt")
result = model.predict("example text")

transformers:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
inputs = tokenizer("example text", return_tensors="pt")
outputs = model(**inputs)

gensim

16,122

Topic Modelling for Humans

Pros of Gensim

Broader range of algorithms and models (e.g., LSA, LDA, word2vec)
More extensive documentation and tutorials
Seamless integration with NumPy and SciPy

Cons of Gensim

Slower performance for large-scale text classification tasks
Less optimized for production environments
Steeper learning curve for beginners

Code Comparison

FastText:

import fasttext

model = fasttext.train_supervised("train.txt")
result = model.predict("example text")

Gensim:

from gensim.models import Word2Vec

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)
similar_words = model.wv.most_similar("dog")

Both libraries offer efficient text processing capabilities, but FastText excels in speed and simplicity for text classification tasks, while Gensim provides a wider range of algorithms and more flexibility for various NLP applications. FastText is more suitable for production environments and large-scale tasks, whereas Gensim is better for exploratory analysis and research purposes.

bert

39,267

TensorFlow code and pre-trained models for BERT

Pros of BERT

More advanced contextual understanding of language
Better performance on complex NLP tasks like question answering
Supports fine-tuning for specific downstream tasks

Cons of BERT

Significantly higher computational requirements
Longer training and inference times
More complex model architecture, harder to implement and understand

Code Comparison

FastText example:

import fasttext
model = fasttext.train_supervised("train.txt")
result = model.predict("example text")

BERT example:

from transformers import BertTokenizer, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
inputs = tokenizer("example text", return_tensors="pt")
outputs = model(**inputs)

FastText is simpler to use and faster to train, making it suitable for quick text classification tasks. BERT, while more complex, offers superior performance on a wider range of NLP tasks, especially those requiring deep contextual understanding. FastText is better for large-scale, resource-constrained applications, while BERT excels in scenarios where accuracy is paramount and computational resources are available.

stanza

7,500

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages

Pros of Stanza

Offers a wider range of NLP tasks, including tokenization, POS tagging, named entity recognition, and dependency parsing
Supports over 60 languages with pre-trained models
Provides a more comprehensive NLP pipeline for advanced language processing tasks

Cons of Stanza

Generally slower processing speed compared to fastText
Requires more computational resources due to its comprehensive nature
Has a steeper learning curve for beginners due to its broader feature set

Code Comparison

Stanza:

import stanza
nlp = stanza.Pipeline('en')
doc = nlp("Hello world!")
print([(word.text, word.lemma, word.pos) for sent in doc.sentences for word in sent.words])

fastText:

import fasttext
model = fasttext.load_model("model.bin")
text = "Hello world!"
print(model.predict(text))

The code snippets demonstrate that Stanza provides more detailed linguistic analysis, while fastText focuses on efficient text classification and word representation. Stanza offers a full NLP pipeline, whereas fastText is more specialized for specific tasks like text classification and word embeddings.

flair

14,239

A very simple framework for state-of-the-art Natural Language Processing (NLP)

Pros of Flair

More advanced and flexible NLP capabilities, including state-of-the-art sequence labeling and text classification
Supports a wider range of pre-trained models and embeddings
Easier integration with PyTorch for deep learning tasks

Cons of Flair

Slower training and inference times compared to FastText
Higher memory requirements due to more complex models
Steeper learning curve for beginners

Code Comparison

FastText:

import fasttext

model = fasttext.train_supervised("train.txt")
result = model.predict("example text")

Flair:

from flair.data import Sentence
from flair.models import TextClassifier

classifier = TextClassifier.load('en-sentiment')
sentence = Sentence('example text')
classifier.predict(sentence)

Both libraries offer straightforward ways to train and use text classification models, but Flair's approach is more aligned with modern deep learning practices. FastText is generally simpler and faster for basic tasks, while Flair provides more advanced features and flexibility for complex NLP tasks.


README

fastText

fastText is a library for efficient learning of word representations and sentence classification.

Resources
Requirements
Building fastText
Example use cases
Full documentation
References
Join the fastText community
License

Resources

Models

Recent state-of-the-art English word vectors.
Word vectors for 157 languages trained on Wikipedia and Crawl.
Models for language identification and various supervised tasks.

Supplementary data

The preprocessed YFCC100M data used in [2].

FAQ

You can find answers to frequently asked questions on our website.

Cheatsheet

We also provide a cheatsheet full of useful one-liners.

Requirements

We are continuously building and testing our library, CLI and Python bindings under various Docker images using CircleCI.

Generally, fastText builds on modern Mac OS and Linux distributions. Since it uses some C++11 features, it requires a compiler with good C++11 support. These include:

(g++-4.7.2 or newer) or (clang-3.3 or newer)

Compilation is carried out using a Makefile, so you will need to have a working make. If you want to use cmake you need at least version 2.8.9.

One of the oldest distributions we successfully built and tested the CLI under is Debian jessie.

For the word-similarity evaluation script you will need:

Python 2.6 or newer
NumPy & SciPy

For the python bindings (see the subdirectory python) you will need:

Python version 2.7 or >=3.4
NumPy & SciPy
pybind11

One of the oldest distributions we successfully built and tested the Python bindings under is Debian jessie.

If these requirements make it impossible for you to use fastText, please open an issue and we will try to accommodate you.

Building fastText

We discuss building the latest stable version of fastText.

Getting the source code

You can find our latest stable release in the usual place.

There is also the master branch that contains all of our most recent work, but comes along with all the usual caveats of an unstable branch. You might want to use this if you are a developer or power-user.

Building fastText using make (preferred)

$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make

This will produce object files for all the classes as well as the main binary fasttext. If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Building fastText using cmake

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install

This will create the fasttext binary and also all relevant libraries (shared, static, PIC).

Building fastText for Python

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

For further information and an introduction, see python/README.md.

Example use cases

This library has two main use cases: word representation learning and text classification. These were described in the two papers [1] and [2].

Word representation learning

In order to learn word vectors, as described in [1], do:

$ ./fasttext skipgram -input data.txt -output model

where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyper parameters. The binary file can be used later to compute word vectors or to restart the optimization.
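To make the character n-gram idea concrete, here is a minimal sketch of extracting n-grams of 3 to 6 characters from a word. Wrapping the word in < and > boundary markers follows the description in [1]; the function name char_ngrams is an illustrative assumption, not a fastText API, and the ordering of the returned n-grams is arbitrary.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return all character n-grams of length minn..maxn for a word.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams inside the word.
    """
    token = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n across the marked-up token.
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


# Only 3-grams, to keep the output short.
print(char_ngrams("where", minn=3, maxn=3))
```

With the default range of 3 to 6, the same word yields 14 n-grams, which is why even a rare word usually shares several subword units with words seen in training.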

Obtaining word vectors for out-of-vocabulary words

The previously trained model can be used to compute word vectors for out-of-vocabulary words. Provided you have a text file queries.txt containing words for which you want to compute vectors, use the following command:

$ ./fasttext print-word-vectors model.bin < queries.txt

This will output word vectors to the standard output, one vector per line. This can also be used with pipes:

$ cat queries.txt | ./fasttext print-word-vectors model.bin
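Conceptually, an out-of-vocabulary word can still get a vector because some of its character n-grams were seen during training. The sketch below illustrates the idea with a toy n-gram table. Averaging the n-gram vectors, the helper names, and the table contents are simplifying assumptions; the real library hashes n-grams into a fixed number of buckets rather than storing them in a dictionary.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    token = f"<{word}>"
    return [
        token[start:start + n]
        for n in range(minn, maxn + 1)
        for start in range(len(token) - n + 1)
    ]


def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's n-grams that exist in the table."""
    known = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        # Nothing recognisable in the word: fall back to a zero vector.
        return [0.0] * dim
    return [
        sum(ngram_vectors[g][i] for g in known) / len(known)
        for i in range(dim)
    ]


# Toy 2-dimensional table (hypothetical values, for illustration only).
toy_table = {"<ca": [1.0, 0.0], "cat": [0.0, 1.0]}
print(oov_vector("cats", toy_table, dim=2))
```

Here "cats" is not in the table as a whole word, but it shares the n-grams "<ca" and "cat" with known entries, so it still receives a non-zero vector.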

See the provided scripts for an example. For instance, running:

$ ./word-vector-example.sh

will compile the code, download data, compute word vectors and evaluate them on the rare words similarity dataset RW [Thang et al. 2013].

Text classification

This library can also be used to train supervised text classifiers, for instance for sentiment analysis. In order to train a text classifier using the method described in [2], use:

$ ./fasttext supervised -input train.txt -output model

where train.txt is a text file containing one training sentence per line along with its labels. By default, we assume that labels are words prefixed by the string __label__. This will output two files: model.bin and model.vec. Once the model has been trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:

$ ./fasttext test model.bin test.txt k

The argument k is optional, and is equal to 1 by default.
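For readers unfamiliar with P@k and R@k, the following sketch shows one common way to compute them over a labelled test set: precision divides the number of correct labels among the top k predictions by the number of predictions made, and recall divides it by the number of gold labels. The function name and data layout are illustrative assumptions, and the result is not guaranteed to match the CLI output digit for digit.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """Compute micro-averaged precision and recall at k.

    gold:      list of sets of true labels, one set per example
    predicted: list of label lists ranked from most to least likely
    """
    n_correct = n_predicted = n_gold = 0
    for gold_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in gold_labels)
        n_predicted += len(top_k)
        n_gold += len(gold_labels)
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical test set with two examples.
gold = [{"a", "b"}, {"c"}]
predicted = [["a", "x"], ["y", "c"]]
print(precision_recall_at_k(gold, predicted, k=1))
print(precision_recall_at_k(gold, predicted, k=2))
```

Raising k typically trades precision for recall, since more labels are proposed per example.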

In order to obtain the k most likely labels for a piece of text, use:

$ ./fasttext predict model.bin test.txt k

or use predict-prob to also get the probability for each label:

$ ./fasttext predict-prob model.bin test.txt k

where test.txt contains one piece of text to classify per line. Doing so will print to the standard output the k most likely labels for each line. The argument k is optional, and equal to 1 by default. See classification-example.sh for an example use case. To reproduce the results from the paper [2], run classification-results.sh; this will download all the datasets and reproduce the results from Table 1.
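When the predictions are consumed by another program, each output line has to be split back into labels and probabilities. The sketch below assumes that predict-prob prints alternating label and probability tokens separated by spaces on each line; that layout, the function name, and the sample line are assumptions to check against your own output.

```python
def parse_predict_prob_line(line, prefix="__label__"):
    """Turn 'label prob label prob ...' into a list of (label, prob) pairs."""
    tokens = line.split()
    pairs = []
    # Even positions hold labels, odd positions hold their probabilities.
    for label, prob in zip(tokens[0::2], tokens[1::2]):
        if label.startswith(prefix):
            label = label[len(prefix):]
        pairs.append((label, float(prob)))
    return pairs


# Hypothetical output line for k = 2.
sample = "__label__pos 0.9 __label__neg 0.1"
print(parse_predict_prob_line(sample))
```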

If you want to compute vector representations of sentences or paragraphs, please use:

$ ./fasttext print-sentence-vectors model.bin < text.txt

This assumes that the text.txt file contains the paragraphs that you want to get vectors for. The program will output one vector representation per line in the file.

You can also quantize a supervised model to reduce its memory usage with the following command:

$ ./fasttext quantize -output model

This will create a .ftz file with a smaller memory footprint. All the standard functionality, such as test or predict, works the same way on the quantized models:

$ ./fasttext test model.ftz test.txt

The quantization procedure follows the steps described in [3]. You can run the script quantization-example.sh for an example.
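As a rough intuition for why quantization shrinks a model, the arithmetic below compares storing a vector as raw floats with storing one short code per sub-vector, which is the general idea behind product quantization. It assumes 32-bit floats and one 8-bit code per sub-vector; both are assumptions rather than values read from the fastText source, and the estimate ignores codebooks and any separately stored norms.

```python
import math


def dense_bytes(dim, bytes_per_float=4):
    """Bytes needed to store one vector as raw floats."""
    return dim * bytes_per_float


def pq_code_bytes(dim, dsub=2, bits_per_code=8):
    """Bytes needed to store one vector as product-quantization codes.

    The vector is cut into sub-vectors of size dsub, and each sub-vector
    is replaced by the index of its nearest centroid.
    """
    n_subvectors = math.ceil(dim / dsub)
    return math.ceil(n_subvectors * bits_per_code / 8)


dim = 100  # matches the default -dim listed in the documentation
print(dense_bytes(dim), pq_code_bytes(dim))
```

Under these assumptions a 100-dimensional vector drops from 400 bytes to 50 bytes of codes, and a larger -dsub reduces the code size further at the cost of a coarser approximation.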

Full documentation

Invoke a command without arguments to list available arguments and their default values:

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)
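If you drive the CLI from a script, the options listed above can be assembled into an argument list programmatically. The sketch below only builds the list and does not run anything; the function name build_supervised_command and the ./fasttext binary path are illustrative assumptions, and option names are passed through unchecked.

```python
def build_supervised_command(input_path, output_path, binary="./fasttext", **options):
    """Build an argv list for `fasttext supervised` from keyword options.

    Each keyword becomes a '-name value' pair, e.g. lr=0.5 -> ['-lr', '0.5'].
    """
    command = [binary, "supervised", "-input", input_path, "-output", output_path]
    for name, value in options.items():
        command += [f"-{name}", str(value)]
    return command


cmd = build_supervised_command("train.txt", "model", lr=0.5, epoch=25, wordNgrams=2)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the binary has been built.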

References

Please cite [1] if using this code for learning word representations or [2] if using it for text classification.

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}

FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

(* These authors contributed equally.)

Join the fastText community

Facebook page: https://www.facebook.com/groups/1174547215919768
Google group: https://groups.google.com/forum/#!forum/fasttext-library
Contact: egrave@fb.com, bojanowski@fb.com, ajoulin@fb.com, tmikolov@fb.com

See the CONTRIBUTING file for information about how to help out.

License

fastText is MIT-licensed.
