text
Models, data loaders and abstractions for language processing, powered by PyTorch
Top Related Projects
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Making text a first-class citizen in TensorFlow.
💫 Industrial-strength Natural Language Processing (NLP) in Python
NLTK Source
Topic Modelling for Humans
A very simple framework for state-of-the-art Natural Language Processing (NLP)
Quick Overview
PyTorch Text (torchtext) is a library for natural language processing (NLP) tasks in PyTorch. It provides utilities for text and language processing, including data loading, text preprocessing, and common datasets for NLP tasks. The library aims to make it easier for researchers and practitioners to work with text data in PyTorch.
Pros
- Seamless integration with PyTorch ecosystem
- Efficient data loading and batching for NLP tasks
- Pre-built datasets and vocabularies for common NLP benchmarks
- Flexible text preprocessing pipelines
Cons
- Learning curve for users new to PyTorch
- Documentation can be sparse in some areas
- Some features may be less performant compared to specialized libraries
- Occasional breaking changes between versions
Code Examples
- Loading and preprocessing a dataset:
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
# Load the IMDB dataset (an iterator of (label, text) pairs)
train_iter = IMDB(split='train')
# Create a tokenizer and build the vocabulary from the text field only
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(
    (tokenizer(text) for _, text in train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])
# Text pipeline: raw string -> list of token indices
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
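For example, the pipeline maps a raw string to vocabulary indices (the exact values depend on the vocabulary built above):
text_pipeline("this movie was great")
# -> a list of token indices, e.g. [14, 32, 18, 130]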
- Creating a DataLoader:
from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset
# Convert iterator to map-style dataset
train_dataset = to_map_style_dataset(train_iter)
# Create DataLoader
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
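By default the DataLoader leaves the text field as a list of raw strings; a common next step (a sketch, reusing vocab and text_pipeline from the previous example and assuming IMDB's integer labels) is to numericalize and pad inside a collate function:
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    # batch: list of (label, text) pairs from the IMDB iterator
    label_list, ids_list = [], []
    for label, text in batch:
        label_list.append(int(label) - 1)  # recent torchtext IMDB uses labels 1 (neg) / 2 (pos)
        ids_list.append(torch.tensor(text_pipeline(text), dtype=torch.int64))
    # Pad variable-length sequences to the longest in the batch
    # (a dedicated '<pad>' special is more common; '<unk>' is reused here because only '<unk>' was added above)
    padded = pad_sequence(ids_list, batch_first=True, padding_value=vocab['<unk>'])
    return torch.tensor(label_list), padded

train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=collate_batch)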
- Using pre-trained word embeddings:
from torchtext.vocab import GloVe
# Load pre-trained GloVe embeddings
glove = GloVe(name='6B', dim=100)
# Get embedding for a word
word_embedding = glove['example']
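A common follow-up (a sketch; it reuses the vocab built earlier, and the layer settings are illustrative) is to copy the pre-trained vectors into an embedding layer that a model can consume:
import torch.nn as nn

# Fetch a GloVe vector for every token in the vocabulary (unknown tokens get zero vectors)
pretrained_vectors = glove.get_vecs_by_tokens(vocab.get_itos())
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)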
Getting Started
To get started with torchtext, first install it using pip:
pip install torchtext
Then, you can import and use the library in your Python code:
import torchtext
# Load a dataset
from torchtext.datasets import IMDB
train_iter = IMDB(split='train')
# Create a tokenizer
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('basic_english')
# Build vocabulary
from torchtext.vocab import build_vocab_from_iterator
vocab = build_vocab_from_iterator(
    (tokenizer(text) for _, text in train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])
# Use the vocabulary to process text
text = "This is an example sentence."
processed = [vocab[token] for token in tokenizer(text)]
print(processed)
This basic example demonstrates how to load a dataset, create a tokenizer, build a vocabulary, and process text using torchtext.
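From here, a minimal classifier can be built on top of the vocabulary, for instance with an EmbeddingBag; the sketch below is illustrative (the embedding size and class count are arbitrary):
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids, offsets):
        return self.fc(self.embedding(token_ids, offsets))

model = TextClassifier(vocab_size=len(vocab), embed_dim=64, num_classes=2)
# Treat the processed sentence from above as a single "bag" starting at offset 0
logits = model(torch.tensor(processed), torch.tensor([0]))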
Competitor Comparisons
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Extensive pre-trained model library with easy-to-use APIs
- Comprehensive documentation and active community support
- Seamless integration with popular deep learning frameworks
Cons of Transformers
- Larger package size and potentially higher resource requirements
- Steeper learning curve for beginners due to its extensive features
- May include unnecessary components for simple NLP tasks
Code Comparison
Text:
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Build a tokenizer and vocabulary over a raw-text dataset
train_iter = AG_NEWS(split='train')
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator((tokenizer(text) for _, text in train_iter))
Transformers:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")
Text focuses on data processing and dataset creation, while Transformers emphasizes pre-trained model usage and tokenization. Transformers offers a more streamlined approach for working with state-of-the-art models, whereas Text provides greater flexibility in data handling and preprocessing for custom NLP tasks.
Making text a first-class citizen in TensorFlow.
Pros of TensorFlow Text
- More comprehensive text processing capabilities, including tokenization, normalization, and text segmentation
- Better integration with TensorFlow ecosystem and TensorFlow Extended (TFX) for production pipelines
- Supports multiple languages and scripts out-of-the-box
Cons of TensorFlow Text
- Steeper learning curve due to more complex API
- Less flexibility in customizing text processing operations
- Slower development cycle compared to PyTorch Text
Code Comparison
PyTorch Text:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
tokenizer = get_tokenizer("basic_english")
vocab = build_vocab_from_iterator(map(tokenizer, text_iter))
TensorFlow Text:
import tensorflow_text as text
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['hello world', 'how are you'])
bert_tokenizer = text.BertTokenizer(vocab_file_path)  # subword tokenizer built from a WordPiece vocab file
Both libraries offer tokenization and vocabulary building, but TensorFlow Text provides more advanced options for text processing and supports a wider range of languages. PyTorch Text has a simpler API, making it easier to get started with basic text processing tasks.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Pros of spaCy
- More comprehensive NLP toolkit with pre-trained models and pipelines
- Faster processing speed, especially for large-scale text analysis
- Better documentation and community support
Cons of spaCy
- Less flexibility for custom model architectures
- Steeper learning curve for beginners
- Limited support for deep learning tasks compared to PyTorch ecosystem
Code Comparison
spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sample sentence.")
for token in doc:
    print(token.text, token.pos_, token.dep_)
torchtext:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("This is a sample sentence.")
vocab = build_vocab_from_iterator([tokens])
Both libraries offer text processing capabilities, but spaCy provides a more complete out-of-the-box solution for various NLP tasks, while torchtext focuses on providing building blocks for deep learning models in PyTorch. spaCy excels in performance and ease of use for standard NLP tasks, whereas torchtext offers more flexibility for custom model development within the PyTorch ecosystem.
NLTK Source
Pros of NLTK
- Comprehensive suite of text processing libraries and tools
- Extensive documentation and educational resources
- Large, established community with long-term support
Cons of NLTK
- Slower performance compared to PyTorch Text
- Less integration with deep learning frameworks
- More complex setup and usage for certain tasks
Code Comparison
NLTK:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # one-time download of the tokenizer models
text = "Hello, world!"
tokens = word_tokenize(text)
PyTorch Text:
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer(text)
Summary
NLTK is a comprehensive, well-documented library with a large community, making it ideal for educational purposes and traditional NLP tasks. However, it may be slower and less integrated with modern deep learning frameworks compared to PyTorch Text. PyTorch Text offers better performance and seamless integration with PyTorch, making it more suitable for deep learning-based NLP tasks. The choice between the two depends on the specific requirements of your project and your familiarity with each library's ecosystem.
Topic Modelling for Humans
Pros of Gensim
- More mature and established library with a larger ecosystem of tools and models
- Focuses on topic modeling and document similarity, offering specialized algorithms
- Generally faster and more memory-efficient for large-scale text processing tasks
Cons of Gensim
- Less integrated with deep learning frameworks compared to torchtext
- May require more manual preprocessing and data preparation
- Limited support for newer transformer-based models and techniques
Code Comparison
Gensim example (word2vec training):
from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)
torchtext example (text classification dataset):
from torchtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')
Summary
Gensim is better suited for traditional NLP tasks and topic modeling, while torchtext integrates seamlessly with PyTorch for deep learning-based NLP. Gensim offers more specialized algorithms and is generally faster, but torchtext provides easier integration with neural network models and newer NLP techniques.
A very simple framework for state-of-the-art Natural Language Processing (NLP)
Pros of Flair
- More user-friendly and intuitive API for NLP tasks
- Offers pre-trained models for various languages and domains
- Provides easy-to-use named entity recognition (NER) capabilities
Cons of Flair
- Less flexible for custom model architectures
- Smaller community and fewer contributors compared to PyTorch Text
- May have slower performance for large-scale tasks
Code Comparison
Flair:
from flair.data import Sentence
from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner')
sentence = Sentence('John Doe works at Microsoft.')
tagger.predict(sentence)
PyTorch Text:
import torch.nn as nn
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator([tokenizer('John Doe works at Microsoft.')])
model = nn.LSTM(input_size=100, hidden_size=256, num_layers=1)
Both libraries offer NLP functionalities, but Flair provides a higher-level API for common tasks, while PyTorch Text offers more flexibility for custom implementations. Flair is better suited for quick prototyping and out-of-the-box NLP solutions, whereas PyTorch Text is more appropriate for researchers and developers who need fine-grained control over their models and data processing pipelines.
README
.. image:: docs/source/_static/img/torchtext_logo.png

.. image:: https://circleci.com/gh/pytorch/text.svg?style=svg
   :target: https://circleci.com/gh/pytorch/text

.. image:: https://codecov.io/gh/pytorch/text/branch/main/graph/badge.svg
   :target: https://codecov.io/gh/pytorch/text

.. image:: https://img.shields.io/badge/dynamic/json.svg?label=docs&url=https%3A%2F%2Fpypi.org%2Fpypi%2Ftorchtext%2Fjson&query=%24.info.version&colorB=brightgreen&prefix=v
   :target: https://pytorch.org/text/
torchtext
+++++++++
WARNING: TorchText development is stopped and the 0.18
release (April 2024) will be the last stable release of the library.
This repository consists of:
* `torchtext.datasets <https://github.com/pytorch/text/tree/main/torchtext/datasets>`_: The raw text iterators for common NLP datasets
* `torchtext.data <https://github.com/pytorch/text/tree/main/torchtext/data>`_: Some basic NLP building blocks
* `torchtext.transforms <https://github.com/pytorch/text/tree/main/torchtext/transforms.py>`_: Basic text-processing transformations
* `torchtext.models <https://github.com/pytorch/text/tree/main/torchtext/models>`_: Pre-trained models
* `torchtext.vocab <https://github.com/pytorch/text/tree/main/torchtext/vocab>`_: Vocab and Vectors related classes and factory functions
* `examples <https://github.com/pytorch/text/tree/main/examples>`_: Example NLP workflows with PyTorch and the torchtext library
Installation
We recommend Anaconda as a Python package management system. Please refer to `pytorch.org <https://pytorch.org/>`_ for the details of PyTorch installation. The following are the corresponding torchtext versions and supported Python versions.
.. csv-table:: Version Compatibility
   :header: "PyTorch version", "torchtext version", "Supported Python version"
   :widths: 10, 10, 10

   nightly build, main, ">=3.8, <=3.11"
   2.3.0, 0.18.0, ">=3.8, <=3.11"
   2.2.0, 0.17.0, ">=3.8, <=3.11"
   2.1.0, 0.16.0, ">=3.8, <=3.11"
   2.0.0, 0.15.0, ">=3.8, <=3.11"
   1.13.0, 0.14.0, ">=3.7, <=3.10"
   1.12.0, 0.13.0, ">=3.7, <=3.10"
   1.11.0, 0.12.0, ">=3.6, <=3.9"
   1.10.0, 0.11.0, ">=3.6, <=3.9"
   1.9.1, 0.10.1, ">=3.6, <=3.9"
   1.9, 0.10, ">=3.6, <=3.9"
   1.8.1, 0.9.1, ">=3.6, <=3.9"
   1.8, 0.9, ">=3.6, <=3.9"
   1.7.1, 0.8.1, ">=3.6, <=3.9"
   1.7, 0.8, ">=3.6, <=3.8"
   1.6, 0.7, ">=3.6, <=3.8"
   1.5, 0.6, ">=3.5, <=3.8"
   1.4, 0.5, "2.7, >=3.5, <=3.8"
   0.4 and below, 0.2.3, "2.7, >=3.5, <=3.8"
Using conda::
conda install -c pytorch torchtext
Using pip::
pip install torchtext
Optional requirements
If you want to use the English tokenizer from `SpaCy <http://spacy.io/>`_, you need to install SpaCy and download its English model::
pip install spacy
python -m spacy download en_core_web_sm
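Once installed, the spaCy tokenizer can be used through ``get_tokenizer`` (a brief sketch)::

    from torchtext.data.utils import get_tokenizer

    spacy_tokenizer = get_tokenizer("spacy", language="en_core_web_sm")
    print(spacy_tokenizer("You can now install torchtext using pip!"))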
Alternatively, you might want to use the `Moses <http://www.statmt.org/moses/>`_ tokenizer port in `SacreMoses <https://github.com/alvations/sacremoses>`_ (split from `NLTK <http://nltk.org/>`_). You have to install SacreMoses::
pip install sacremoses
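The Moses tokenizer is then available in the same way (a brief sketch)::

    from torchtext.data.utils import get_tokenizer

    moses_tokenizer = get_tokenizer("moses")
    print(moses_tokenizer("Hello, world!"))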
For torchtext 0.5 and below, ``sentencepiece`` is additionally required::
conda install -c powerai sentencepiece
Building from source
To build torchtext from source, you need ``git``, ``CMake`` and a C++11 compiler such as ``g++``::
git clone https://github.com/pytorch/text torchtext
cd torchtext
git submodule update --init --recursive
# Linux
python setup.py clean install
# OSX
CC=clang CXX=clang++ python setup.py clean install
# or ``python setup.py develop`` if you are making modifications.
Note
When building from source, make sure that you have the same C++ compiler as the one used to build PyTorch. A simple way is to build PyTorch from source and use the same environment to build torchtext.
If you are using the nightly build of PyTorch, check out the environment it was built with `conda (here) <https://github.com/pytorch/builder/tree/main/conda>`_ and `pip (here) <https://github.com/pytorch/builder/tree/main/manywheel>`_.
Additionally, datasets in torchtext are implemented using the torchdata library. Please take a look at the `installation instructions <https://github.com/pytorch/data#installation>`_ to download the latest nightlies or install from source.
Documentation
Find the documentation `here <https://pytorch.org/text/>`_.
Datasets
The datasets module currently contains:
- Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
- Machine translation: IWSLT2016, IWSLT2017, Multi30k
- Sequence tagging (e.g. POS/NER): UDPOS, CoNLL2000Chunking
- Question answering: SQuAD1, SQuAD2
- Text classification: SST2, AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
- Model pre-training: CC-100
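Each dataset exposes raw text iterators over its splits; for example (a minimal sketch using AG_NEWS)::

    from torchtext.datasets import AG_NEWS

    train_iter = AG_NEWS(split="train")
    label, text = next(iter(train_iter))  # an integer class label and the raw news text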
Models
The library currently consists of the following pre-trained models:
- RoBERTa: `Base and Large Architecture <https://github.com/pytorch/fairseq/tree/main/examples/roberta#pre-trained-models>`_, `DistilRoBERTa <https://github.com/huggingface/transformers/blob/main/examples/research_projects/distillation/README.md>`_
- XLM-RoBERTa: `Base and Large Architecture <https://github.com/pytorch/fairseq/tree/main/examples/xlmr#pre-trained-models>`_
- T5: `Small, Base, Large, 3B, and 11B Architecture <https://github.com/google-research/text-to-text-transfer-transformer>`_
- Flan-T5: `Base, Large, XL, and XXL Architecture <https://github.com/google-research/t5x>`_
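The pre-trained weights are exposed as bundles in ``torchtext.models``; loading one looks roughly like this (a sketch; weights are downloaded on first use)::

    import torch
    from torchtext.models import ROBERTA_BASE_ENCODER

    model = ROBERTA_BASE_ENCODER.get_model()
    transform = ROBERTA_BASE_ENCODER.transform()
    model_input = torch.tensor(transform(["Hello world"]))
    features = model(model_input)  # contextual token embeddings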
Tokenizers
The transforms module currently supports the following scriptable tokenizers:

- `SentencePiece <https://github.com/google/sentencepiece>`_
- `GPT-2 BPE <https://github.com/openai/gpt-2/blob/master/src/encoder.py>`_
- `CLIP <https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py>`_
- `RE2 <https://github.com/google/re2>`_
- `BERT <https://arxiv.org/pdf/1810.04805.pdf>`_
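These tokenizers live in ``torchtext.transforms`` and can be scripted with TorchScript alongside a model; a brief sketch (the model path is a placeholder for a pre-trained SentencePiece model)::

    import torch
    from torchtext.transforms import SentencePieceTokenizer

    tokenizer = SentencePieceTokenizer("path/to/spm.model")  # placeholder path
    print(tokenizer(["hello world", "attention is all you need"]))
    scripted_tokenizer = torch.jit.script(tokenizer)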
Tutorials
To get started with torchtext, users may refer to the following tutorials available on the PyTorch website.

- `SST-2 binary text classification using XLM-R pre-trained model <https://pytorch.org/text/stable/tutorials/sst2_classification_non_distributed.html>`_
- `Text classification with AG_NEWS dataset <https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html>`_
- `Translation trained with Multi30k dataset using transformers and torchtext <https://pytorch.org/tutorials/beginner/translation_transformer.html>`_
- `Language modeling using transforms and torchtext <https://pytorch.org/tutorials/beginner/transformer_tutorial.html>`_
Disclaimer on Datasets
This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!