pytorch/text

Models, data loaders and abstractions for language processing, powered by PyTorch

Quick Overview

PyTorch Text (torchtext) is a library for natural language processing (NLP) tasks in PyTorch. It provides utilities for text and language processing, including data loading, text preprocessing, and common datasets for NLP tasks. The library aims to make it easier for researchers and practitioners to work with text data in PyTorch.

Pros

  • Seamless integration with PyTorch ecosystem
  • Efficient data loading and batching for NLP tasks
  • Pre-built datasets and vocabularies for common NLP benchmarks
  • Flexible text preprocessing pipelines

Cons

  • Learning curve for users new to PyTorch
  • Documentation can be sparse in some areas
  • Some features may be less performant compared to specialized libraries
  • Breaking changes between versions (for example, the legacy Field-based API was removed in 0.12)

Code Examples

  1. Loading and preprocessing a dataset:
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Load IMDB dataset; each item is a (label, text) pair
train_iter = IMDB(split='train')

# Create tokenizer
tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    # Tokenize only the text part of each (label, text) pair
    for _, text in data_iter:
        yield tokenizer(text)

# Build vocabulary; unknown tokens map to '<unk>'
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

# Text pipeline: raw string -> list of token ids
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
  2. Creating a DataLoader:
from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset

# Convert iterator to map-style dataset
train_dataset = to_map_style_dataset(train_iter)

# Create DataLoader
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
  3. Using pre-trained word embeddings:
from torchtext.vocab import GloVe

# Load pre-trained GloVe embeddings
glove = GloVe(name='6B', dim=100)

# Get embedding for a word
word_embedding = glove['example']
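
The text pipeline in example 1 returns lists of token ids of varying length, so a batch usually has to be padded before it can be stacked into tensors. The following is a minimal, pure-Python sketch of that step. `pad_batch` and the padding id of 0 are hypothetical choices made for illustration and are not part of torchtext; in a real pipeline the same logic would typically live in a `collate_fn` passed to `DataLoader` and return tensors.

```python
from typing import List, Tuple


def pad_batch(
    batch: List[Tuple[int, List[int]]], pad_id: int = 0
) -> Tuple[List[int], List[List[int]], List[int]]:
    """Pad (label, token_ids) pairs to the length of the longest sequence.

    Hypothetical helper for illustration; not a torchtext function.
    """
    labels = [label for label, _ in batch]
    lengths = [len(ids) for _, ids in batch]
    max_len = max(lengths, default=0)
    # Right-pad every sequence with pad_id so all rows have equal length.
    padded = [ids + [pad_id] * (max_len - len(ids)) for _, ids in batch]
    return labels, padded, lengths


# Toy batch: labels paired with already-numericalized text.
toy_batch = [(1, [5, 8, 2]), (2, [7]), (1, [4, 9])]
labels, padded, lengths = pad_batch(toy_batch)
print(padded)
```

Returning the original lengths alongside the padded rows lets a downstream model mask out or skip the padding positions.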

Getting Started

To get started with torchtext, first install it using pip:

pip install torchtext

Then, you can import and use the library in your Python code:

import torchtext

# Load a dataset
from torchtext.datasets import IMDB
train_iter = IMDB(split='train')

# Create a tokenizer
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('basic_english')

# Build vocabulary from the text part of each (label, text) pair
from torchtext.vocab import build_vocab_from_iterator
vocab = build_vocab_from_iterator(
    (tokenizer(text) for _, text in train_iter), specials=['<unk>']
)
vocab.set_default_index(vocab['<unk>'])

# Use the vocabulary to process text
text = "This is an example sentence."
processed = [vocab[token] for token in tokenizer(text)]
print(processed)

This basic example demonstrates how to load a dataset, create a tokenizer, build a vocabulary, and process text using torchtext.
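
To make the role of the default index more concrete, here is a small pure-Python sketch of a token-to-index mapping with a fallback for unknown tokens. `TinyVocab` is a hypothetical stand-in written for this illustration; it is not torchtext's `Vocab` class, and its ordering rule and error type are simplifying assumptions.

```python
from collections import Counter
from typing import Iterable, List, Optional


class TinyVocab:
    """Illustrative token-to-index mapping; not torchtext's Vocab."""

    def __init__(self, token_lists: Iterable[List[str]], specials: List[str]):
        counts = Counter(tok for tokens in token_lists for tok in tokens)
        # Specials come first, then tokens by descending frequency
        # (ties broken alphabetically).
        ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        self.itos = list(specials) + [t for t, _ in ordered if t not in specials]
        self.stoi = {tok: idx for idx, tok in enumerate(self.itos)}
        self.default_index: Optional[int] = None

    def set_default_index(self, index: int) -> None:
        self.default_index = index

    def __getitem__(self, token: str) -> int:
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            # Without a default index, unknown tokens are an error.
            raise KeyError(token)
        return self.default_index


corpus = [["the", "cat", "sat"], ["the", "dog"]]
vocab = TinyVocab(corpus, specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
ids = [vocab[tok] for tok in ["the", "bird"]]
print(ids)
```

In this sketch the unseen token "bird" falls back to the index of `<unk>`, which is the behavior the `set_default_index` call in the example above is there to enable.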

Competitor Comparisons

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of Transformers

  • Extensive pre-trained model library with easy-to-use APIs
  • Comprehensive documentation and active community support
  • Seamless integration with popular deep learning frameworks

Cons of Transformers

  • Larger package size and potentially higher resource requirements
  • Steeper learning curve for beginners due to its extensive features
  • May include unnecessary components for simple NLP tasks

Code Comparison

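torchtext (legacy API):

# Field and TabularDataset belong to the legacy API: they moved to
# torchtext.legacy in 0.9 and were removed in 0.12.
from torchtext.legacy.data import Field, TabularDataset

text_field = Field(tokenize='spacy')
dataset = TabularDataset(path='data.csv', format='csv', fields=[('text', text_field)])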

Transformers:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")

torchtext focuses on data processing and dataset creation, while Transformers emphasizes pre-trained model usage and tokenization. Transformers offers a more streamlined approach for working with state-of-the-art models, whereas torchtext provides greater flexibility in data handling and preprocessing for custom NLP tasks.

Making text a first-class citizen in TensorFlow.

Pros of TensorFlow Text

  • More comprehensive text processing capabilities, including tokenization, normalization, and text segmentation
  • Better integration with TensorFlow ecosystem and TensorFlow Extended (TFX) for production pipelines
  • Supports multiple languages and scripts out-of-the-box

Cons of TensorFlow Text

  • Steeper learning curve due to more complex API
  • Less flexibility in customizing text processing operations
  • Slower development cycle compared to PyTorch Text

Code Comparison

PyTorch Text:

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

text_iter = ["hello world", "how are you"]  # any iterable of raw strings
tokenizer = get_tokenizer("basic_english")
vocab = build_vocab_from_iterator(map(tokenizer, text_iter))

TensorFlow Text:

import tensorflow_text as text

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['hello world', 'how are you'])

# Vocabulary-based tokenization; vocab_file_path must point to a WordPiece vocab file
bert_tokenizer = text.BertTokenizer(vocab_file_path)

Both libraries offer tokenization and vocabulary building, but TensorFlow Text provides more advanced options for text processing and supports a wider range of languages. PyTorch Text has a simpler API, making it easier to get started with basic text processing tasks.

💫 Industrial-strength Natural Language Processing (NLP) in Python

Pros of spaCy

  • More comprehensive NLP toolkit with pre-trained models and pipelines
  • Faster processing speed, especially for large-scale text analysis
  • Better documentation and community support

Cons of spaCy

  • Less flexibility for custom model architectures
  • Steeper learning curve for beginners
  • Limited support for deep learning tasks compared to PyTorch ecosystem

Code Comparison

spaCy:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sample sentence.")
for token in doc:
    print(token.text, token.pos_, token.dep_)

torchtext:

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("This is a sample sentence.")
vocab = build_vocab_from_iterator([tokens])

Both libraries offer text processing capabilities, but spaCy provides a more complete out-of-the-box solution for various NLP tasks, while torchtext focuses on providing building blocks for deep learning models in PyTorch. spaCy excels in performance and ease of use for standard NLP tasks, whereas torchtext offers more flexibility for custom model development within the PyTorch ecosystem.

NLTK Source

Pros of NLTK

  • Comprehensive suite of text processing libraries and tools
  • Extensive documentation and educational resources
  • Large, established community with long-term support

Cons of NLTK

  • Slower performance compared to PyTorch Text
  • Less integration with deep learning frameworks
  • More complex setup and usage for certain tasks

Code Comparison

NLTK:

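import nltk
from nltk.tokenize import word_tokenize

# One-time download of tokenizer data; newer NLTK releases may ask for "punkt_tab" instead
nltk.download("punkt")

text = "Hello, world!"
tokens = word_tokenize(text)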

PyTorch Text:

from torchtext.data.utils import get_tokenizer

text = "Hello, world!"
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer(text)

Summary

NLTK is a comprehensive, well-documented library with a large community, making it ideal for educational purposes and traditional NLP tasks. However, it may be slower and less integrated with modern deep learning frameworks compared to PyTorch Text. PyTorch Text offers better performance and seamless integration with PyTorch, making it more suitable for deep learning-based NLP tasks. The choice between the two depends on the specific requirements of your project and your familiarity with each library's ecosystem.

Topic Modelling for Humans

Pros of Gensim

  • More mature and established library with a larger ecosystem of tools and models
  • Focuses on topic modeling and document similarity, offering specialized algorithms
  • Generally faster and more memory-efficient for large-scale text processing tasks

Cons of Gensim

  • Less integrated with deep learning frameworks compared to torchtext
  • May require more manual preprocessing and data preparation
  • Limited support for newer transformer-based models and techniques

Code Comparison

Gensim example (word2vec training):

from gensim.models import Word2Vec

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)

torchtext example (text classification dataset):

from torchtext.datasets import AG_NEWS

train_iter = AG_NEWS(split='train')

Summary

Gensim is better suited for traditional NLP tasks and topic modeling, while torchtext integrates seamlessly with PyTorch for deep learning-based NLP. Gensim offers more specialized algorithms and is generally faster, but torchtext provides easier integration with neural network models and newer NLP techniques.

A very simple framework for state-of-the-art Natural Language Processing (NLP)

Pros of Flair

  • More user-friendly and intuitive API for NLP tasks
  • Offers pre-trained models for various languages and domains
  • Provides easy-to-use named entity recognition (NER) capabilities

Cons of Flair

  • Less flexible for custom model architectures
  • Smaller community and fewer contributors compared to PyTorch Text
  • May have slower performance for large-scale tasks

Code Comparison

Flair:

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner')
sentence = Sentence('John Doe works at Microsoft.')
tagger.predict(sentence)

PyTorch Text:

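import torch.nn as nn
from torchtext.data.utils import get_tokenizer

# torchtext supplies the tokenizer; it ships no LSTM class, so the
# recurrent layer comes from torch.nn. Sizes below are illustrative.
tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
model = nn.LSTM(input_size=100, hidden_size=128, num_layers=2)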

Both libraries offer NLP functionalities, but Flair provides a higher-level API for common tasks, while PyTorch Text offers more flexibility for custom implementations. Flair is better suited for quick prototyping and out-of-the-box NLP solutions, whereas PyTorch Text is more appropriate for researchers and developers who need fine-grained control over their models and data processing pipelines.

README

.. image:: docs/source/_static/img/torchtext_logo.png

.. image:: https://circleci.com/gh/pytorch/text.svg?style=svg
    :target: https://circleci.com/gh/pytorch/text

.. image:: https://codecov.io/gh/pytorch/text/branch/main/graph/badge.svg
    :target: https://codecov.io/gh/pytorch/text

.. image:: https://img.shields.io/badge/dynamic/json.svg?label=docs&url=https%3A%2F%2Fpypi.org%2Fpypi%2Ftorchtext%2Fjson&query=%24.info.version&colorB=brightgreen&prefix=v
    :target: https://pytorch.org/text/

torchtext
+++++++++

WARNING: TorchText development is stopped and the 0.18 release (April 2024) will be the last stable release of the library.

This repository consists of:

  • `torchtext.datasets <https://github.com/pytorch/text/tree/main/torchtext/datasets>`_: The raw text iterators for common NLP datasets
  • `torchtext.data <https://github.com/pytorch/text/tree/main/torchtext/data>`_: Some basic NLP building blocks
  • `torchtext.transforms <https://github.com/pytorch/text/tree/main/torchtext/transforms.py>`_: Basic text-processing transformations
  • `torchtext.models <https://github.com/pytorch/text/tree/main/torchtext/models>`_: Pre-trained models
  • `torchtext.vocab <https://github.com/pytorch/text/tree/main/torchtext/vocab>`_: Vocab and Vectors related classes and factory functions
  • `examples <https://github.com/pytorch/text/tree/main/examples>`_: Example NLP workflows with PyTorch and torchtext library.

Installation

We recommend Anaconda as a Python package management system. Please refer to `pytorch.org <https://pytorch.org/>`_ for the details of PyTorch installation. The following are the corresponding torchtext versions and supported Python versions.

.. csv-table:: Version Compatibility
   :header: "PyTorch version", "torchtext version", "Supported Python version"
   :widths: 10, 10, 10

   nightly build, main, ">=3.8, <=3.11"
   2.3.0, 0.18.0, ">=3.8, <=3.11"
   2.2.0, 0.17.0, ">=3.8, <=3.11"
   2.1.0, 0.16.0, ">=3.8, <=3.11"
   2.0.0, 0.15.0, ">=3.8, <=3.11"
   1.13.0, 0.14.0, ">=3.7, <=3.10"
   1.12.0, 0.13.0, ">=3.7, <=3.10"
   1.11.0, 0.12.0, ">=3.6, <=3.9"
   1.10.0, 0.11.0, ">=3.6, <=3.9"
   1.9.1, 0.10.1, ">=3.6, <=3.9"
   1.9, 0.10, ">=3.6, <=3.9"
   1.8.1, 0.9.1, ">=3.6, <=3.9"
   1.8, 0.9, ">=3.6, <=3.9"
   1.7.1, 0.8.1, ">=3.6, <=3.9"
   1.7, 0.8, ">=3.6, <=3.8"
   1.6, 0.7, ">=3.6, <=3.8"
   1.5, 0.6, ">=3.5, <=3.8"
   1.4, 0.5, "2.7, >=3.5, <=3.8"
   0.4 and below, 0.2.3, "2.7, >=3.5, <=3.8"
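
Because each torchtext release is paired with one PyTorch release, it can help to encode the pairing as a lookup when pinning dependencies. The sketch below copies the more recent rows from the table above into a dictionary; `TORCHTEXT_FOR_PYTORCH` and `matching_torchtext` are hypothetical names for this illustration, not part of either library.

```python
from typing import Optional

# PyTorch release -> matching torchtext release, copied from the
# compatibility table (recent rows only).
TORCHTEXT_FOR_PYTORCH = {
    "2.3.0": "0.18.0",
    "2.2.0": "0.17.0",
    "2.1.0": "0.16.0",
    "2.0.0": "0.15.0",
    "1.13.0": "0.14.0",
    "1.12.0": "0.13.0",
    "1.11.0": "0.12.0",
    "1.10.0": "0.11.0",
}


def matching_torchtext(pytorch_version: str) -> Optional[str]:
    """Return the torchtext version paired with a PyTorch version, or None.

    Hypothetical helper; strips a local build suffix such as "+cu121".
    """
    base = pytorch_version.split("+", 1)[0]
    return TORCHTEXT_FOR_PYTORCH.get(base)


print(matching_torchtext("2.3.0+cu121"))
```

A version string as reported by an installed PyTorch build can be passed in directly; versions missing from the dictionary return `None` rather than a guess.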

Using conda::

    conda install -c pytorch torchtext

Using pip::

    pip install torchtext

Optional requirements

If you want to use English tokenizer from `SpaCy <http://spacy.io/>`_, you need to install SpaCy and download its English model::

    pip install spacy
    python -m spacy download en_core_web_sm

Alternatively, you might want to use the `Moses <http://www.statmt.org/moses/>`_ tokenizer port in `SacreMoses <https://github.com/alvations/sacremoses>`_ (split from `NLTK <http://nltk.org/>`_). You have to install SacreMoses::

    pip install sacremoses

For torchtext 0.5 and below, sentencepiece::

    conda install -c powerai sentencepiece

Building from source

To build torchtext from source, you need git, CMake and a C++11 compiler such as g++::

    git clone https://github.com/pytorch/text torchtext
    cd torchtext
    git submodule update --init --recursive

    # Linux
    python setup.py clean install

    # OSX
    CC=clang CXX=clang++ python setup.py clean install

    # or ``python setup.py develop`` if you are making modifications.

Note

When building from source, make sure that you have the same C++ compiler as the one used to build PyTorch. A simple way is to build PyTorch from source and use the same environment to build torchtext. If you are using the nightly build of PyTorch, check out the environment it was built with: `conda (here) <https://github.com/pytorch/builder/tree/main/conda>`_ and `pip (here) <https://github.com/pytorch/builder/tree/main/manywheel>`_.

Additionally, datasets in torchtext are implemented using the torchdata library. Please take a look at the `installation instructions <https://github.com/pytorch/data#installation>`_ to download the latest nightlies or install from source.

Documentation

Find the documentation `here <https://pytorch.org/text/>`_.

Datasets

The datasets module currently contains:

  • Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
  • Machine translation: IWSLT2016, IWSLT2017, Multi30k
  • Sequence tagging (e.g. POS/NER): UDPOS, CoNLL2000Chunking
  • Question answering: SQuAD1, SQuAD2
  • Text classification: SST2, AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
  • Model pre-training: CC-100

Models

The library currently consists of the following pre-trained models:

  • RoBERTa: `Base and Large Architecture <https://github.com/pytorch/fairseq/tree/main/examples/roberta#pre-trained-models>`_
  • `DistilRoBERTa <https://github.com/huggingface/transformers/blob/main/examples/research_projects/distillation/README.md>`_
  • XLM-RoBERTa: `Base and Large Architecture <https://github.com/pytorch/fairseq/tree/main/examples/xlmr#pre-trained-models>`_
  • T5: `Small, Base, Large, 3B, and 11B Architecture <https://github.com/google-research/text-to-text-transfer-transformer>`_
  • Flan-T5: `Base, Large, XL, and XXL Architecture <https://github.com/google-research/t5x>`_

Tokenizers

The transforms module currently supports the following scriptable tokenizers:

  • `SentencePiece <https://github.com/google/sentencepiece>`_
  • `GPT-2 BPE <https://github.com/openai/gpt-2/blob/master/src/encoder.py>`_
  • `CLIP <https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py>`_
  • `RE2 <https://github.com/google/re2>`_
  • `BERT <https://arxiv.org/pdf/1810.04805.pdf>`_

Tutorials

To get started with torchtext, users may refer to the following tutorials available on the PyTorch website.

  • `SST-2 binary text classification using XLM-R pre-trained model <https://pytorch.org/text/stable/tutorials/sst2_classification_non_distributed.html>`_
  • `Text classification with AG_NEWS dataset <https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html>`_
  • `Translation trained with Multi30k dataset using transformers and torchtext <https://pytorch.org/tutorials/beginner/translation_transformer.html>`_
  • `Language modeling using transforms and torchtext <https://pytorch.org/tutorials/beginner/transformer_tutorial.html>`_

Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!