text
Models, data loaders and abstractions for language processing, powered by PyTorch
Top Related Projects
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Making text a first-class citizen in TensorFlow.
💫 Industrial-strength Natural Language Processing (NLP) in Python
NLTK Source
Topic Modelling for Humans
A very simple framework for state-of-the-art Natural Language Processing (NLP)
Quick Overview
PyTorch Text (torchtext) is a library for natural language processing (NLP) tasks in PyTorch. It provides utilities for text and language processing, including data loading, text preprocessing, and common datasets for NLP tasks. The library aims to make it easier for researchers and practitioners to work with text data in PyTorch.
Pros
- Seamless integration with PyTorch ecosystem
- Efficient data loading and batching for NLP tasks
- Pre-built datasets and vocabularies for common NLP benchmarks
- Flexible text preprocessing pipelines
Cons
- Learning curve for users new to PyTorch
- Documentation can be sparse in some areas
- Some features may be less performant compared to specialized libraries
- Occasional breaking changes between versions
Code Examples
- Loading and preprocessing a dataset:
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
# Load the IMDB dataset (an iterator of (label, text) pairs)
train_iter = IMDB(split='train')
# Create a tokenizer and build the vocabulary from the text field only
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(
    (tokenizer(text) for _, text in train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])
# Text pipeline: raw string -> list of token indices
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
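For example, the pipeline maps a raw string to vocabulary indices (the exact values depend on the vocabulary built above):
text_pipeline("this movie was great")
# -> a list of token indices, e.g. [14, 32, 18, 130]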
- Creating a DataLoader:
from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset
# Convert iterator to map-style dataset
train_dataset = to_map_style_dataset(train_iter)
# Create DataLoader
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
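By default the DataLoader leaves the text field as a list of raw strings; a common next step (a sketch, reusing vocab and text_pipeline from the previous example and assuming IMDB's integer labels) is to numericalize and pad inside a collate function:
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    # batch: list of (label, text) pairs from the IMDB iterator
    label_list, ids_list = [], []
    for label, text in batch:
        label_list.append(int(label) - 1)  # recent torchtext IMDB uses labels 1 (neg) / 2 (pos)
        ids_list.append(torch.tensor(text_pipeline(text), dtype=torch.int64))
    # Pad variable-length sequences to the longest in the batch
    # (a dedicated '<pad>' special is more common; '<unk>' is reused here because only '<unk>' was added above)
    padded = pad_sequence(ids_list, batch_first=True, padding_value=vocab['<unk>'])
    return torch.tensor(label_list), padded

train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=collate_batch)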
- Using pre-trained word embeddings:
from torchtext.vocab import GloVe
# Load pre-trained GloVe embeddings
glove = GloVe(name='6B', dim=100)
# Get embedding for a word
word_embedding = glove['example']
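A common follow-up (a sketch; it reuses the vocab built earlier, and the layer settings are illustrative) is to copy the pre-trained vectors into an embedding layer that a model can consume:
import torch.nn as nn

# Fetch a GloVe vector for every token in the vocabulary (unknown tokens get zero vectors)
pretrained_vectors = glove.get_vecs_by_tokens(vocab.get_itos())
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)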
Getting Started
To get started with torchtext, first install it using pip:
pip install torchtext
Then, you can import and use the library in your Python code:
import torchtext
# Load a dataset
from torchtext.datasets import IMDB
train_iter = IMDB(split='train')
# Create a tokenizer
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('basic_english')
# Build vocabulary
from torchtext.vocab import build_vocab_from_iterator
vocab = build_vocab_from_iterator(
    (tokenizer(text) for _, text in train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])
# Use the vocabulary to process text
text = "This is an example sentence."
processed = [vocab[token] for token in tokenizer(text)]
print(processed)
This basic example demonstrates how to load a dataset, create a tokenizer, build a vocabulary, and process text using torchtext.
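From here, a minimal classifier can be built on top of the vocabulary, for instance with an EmbeddingBag; the sketch below is illustrative (the embedding size and class count are arbitrary):
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids, offsets):
        return self.fc(self.embedding(token_ids, offsets))

model = TextClassifier(vocab_size=len(vocab), embed_dim=64, num_classes=2)
# Treat the processed sentence from above as a single "bag" starting at offset 0
logits = model(torch.tensor(processed), torch.tensor([0]))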
Competitor Comparisons
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Extensive pre-trained model library with easy-to-use APIs
- Comprehensive documentation and active community support
- Seamless integration with popular deep learning frameworks
Cons of Transformers
- Larger package size and potentially higher resource requirements
- Steeper learning curve for beginners due to its extensive features
- May include unnecessary components for simple NLP tasks
Code Comparison
Text:
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Build a tokenizer and vocabulary over a raw-text dataset
train_iter = AG_NEWS(split='train')
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator((tokenizer(text) for _, text in train_iter))
Transformers:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")
Text focuses on data processing and dataset creation, while Transformers emphasizes pre-trained model usage and tokenization. Transformers offers a more streamlined approach for working with state-of-the-art models, whereas Text provides greater flexibility in data handling and preprocessing for custom NLP tasks.
Making text a first-class citizen in TensorFlow.
Pros of TensorFlow Text
- More comprehensive text processing capabilities, including tokenization, normalization, and text segmentation
- Better integration with TensorFlow ecosystem and TensorFlow Extended (TFX) for production pipelines
- Supports multiple languages and scripts out-of-the-box
Cons of TensorFlow Text
- Steeper learning curve due to more complex API
- Less flexibility in customizing text processing operations
- Slower development cycle compared to PyTorch Text
Code Comparison
PyTorch Text:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
tokenizer = get_tokenizer("basic_english")
vocab = build_vocab_from_iterator(map(tokenizer, text_iter))
TensorFlow Text:
import tensorflow_text as text
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['hello world', 'how are you'])
bert_tokenizer = text.BertTokenizer(vocab_file_path)  # subword tokenizer built from a WordPiece vocab file
Both libraries offer tokenization and vocabulary building, but TensorFlow Text provides more advanced options for text processing and supports a wider range of languages. PyTorch Text has a simpler API, making it easier to get started with basic text processing tasks.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Pros of spaCy
- More comprehensive NLP toolkit with pre-trained models and pipelines
- Faster processing speed, especially for large-scale text analysis
- Better documentation and community support
Cons of spaCy
- Less flexibility for custom model architectures
- Steeper learning curve for beginners
- Limited support for deep learning tasks compared to PyTorch ecosystem
Code Comparison
spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sample sentence.")
for token in doc:
    print(token.text, token.pos_, token.dep_)
torchtext:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("This is a sample sentence.")
vocab = build_vocab_from_iterator([tokens])
Both libraries offer text processing capabilities, but spaCy provides a more complete out-of-the-box solution for various NLP tasks, while torchtext focuses on providing building blocks for deep learning models in PyTorch. spaCy excels in performance and ease of use for standard NLP tasks, whereas torchtext offers more flexibility for custom model development within the PyTorch ecosystem.
NLTK Source
Pros of NLTK
- Comprehensive suite of text processing libraries and tools
- Extensive documentation and educational resources
- Large, established community with long-term support
Cons of NLTK
- Slower performance compared to PyTorch Text
- Less integration with deep learning frameworks
- More complex setup and usage for certain tasks
Code Comparison
NLTK:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # one-time download of the tokenizer models
text = "Hello, world!"
tokens = word_tokenize(text)
PyTorch Text:
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer(text)
Summary
NLTK is a comprehensive, well-documented library with a large community, making it ideal for educational purposes and traditional NLP tasks. However, it may be slower and less integrated with modern deep learning frameworks compared to PyTorch Text. PyTorch Text offers better performance and seamless integration with PyTorch, making it more suitable for deep learning-based NLP tasks. The choice between the two depends on the specific requirements of your project and your familiarity with each library's ecosystem.
Topic Modelling for Humans
Pros of Gensim
- More mature and established library with a larger ecosystem of tools and models
- Focuses on topic modeling and document similarity, offering specialized algorithms
- Generally faster and more memory-efficient for large-scale text processing tasks
Cons of Gensim
- Less integrated with deep learning frameworks compared to torchtext
- May require more manual preprocessing and data preparation
- Limited support for newer transformer-based models and techniques
Code Comparison
Gensim example (word2vec training):
from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)
torchtext example (text classification dataset):
from torchtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')
Summary
Gensim is better suited for traditional NLP tasks and topic modeling, while torchtext integrates seamlessly with PyTorch for deep learning-based NLP. Gensim offers more specialized algorithms and is generally faster, but torchtext provides easier integration with neural network models and newer NLP techniques.
A very simple framework for state-of-the-art Natural Language Processing (NLP)
Pros of Flair
- More user-friendly and intuitive API for NLP tasks
- Offers pre-trained models for various languages and domains
- Provides easy-to-use named entity recognition (NER) capabilities
Cons of Flair
- Less flexible for custom model architectures
- Smaller community and fewer contributors compared to PyTorch Text
- May have slower performance for large-scale tasks
Code Comparison
Flair:
from flair.data import Sentence
from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner')
sentence = Sentence('John Doe works at Microsoft.')
tagger.predict(sentence)
PyTorch Text:
import torch.nn as nn
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator([tokenizer('John Doe works at Microsoft.')])
model = nn.LSTM(input_size=100, hidden_size=256, num_layers=1)
Both libraries offer NLP functionalities, but Flair provides a higher-level API for common tasks, while PyTorch Text offers more flexibility for custom implementations. Flair is better suited for quick prototyping and out-of-the-box NLP solutions, whereas PyTorch Text is more appropriate for researchers and developers who need fine-grained control over their models and data processing pipelines.
README
.. image:: docs/source/_static/img/torchtext_logo.png

.. image:: https://circleci.com/gh/pytorch/text.svg?style=svg
   :target: https://circleci.com/gh/pytorch/text

.. image:: https://codecov.io/gh/pytorch/text/branch/main/graph/badge.svg
   :target: https://codecov.io/gh/pytorch/text

.. image:: https://img.shields.io/badge/dynamic/json.svg?label=docs&url=https%3A%2F%2Fpypi.org%2Fpypi%2Ftorchtext%2Fjson&query=%24.info.version&colorB=brightgreen&prefix=v
   :target: https://pytorch.org/text/
torchtext
+++++++++
WARNING: TorchText development is stopped and the 0.18
release (April 2024) will be the last stable release of the library.
This repository consists of:
* `torchtext.datasets <https://github.com/pytorch/text/tree/main/torchtext/datasets>`_: The raw text iterators for common NLP datasets
* `torchtext.data <https://github.com/pytorch/text/tree/main/torchtext/data>`_: Some basic NLP building blocks
* `torchtext.transforms <https://github.com/pytorch/text/tree/main/torchtext/transforms.py>`_: Basic text-processing transformations
* `torchtext.models <https://github.com/pytorch/text/tree/main/torchtext/models>`_: Pre-trained models
* `torchtext.vocab <https://github.com/pytorch/text/tree/main/torchtext/vocab>`_: Vocab and Vectors related classes and factory functions
* `examples <https://github.com/pytorch/text/tree/main/examples>`_: Example NLP workflows with PyTorch and the torchtext library
Installation
We recommend Anaconda as a Python package management system. Please refer to `pytorch.org <https://pytorch.org/>`_ for the details of PyTorch installation. The following are the corresponding torchtext versions and supported Python versions.
.. csv-table:: Version Compatibility
   :header: "PyTorch version", "torchtext version", "Supported Python version"
   :widths: 10, 10, 10

   nightly build, main, ">=3.8, <=3.11"
   2.3.0, 0.18.0, ">=3.8, <=3.11"
   2.2.0, 0.17.0, ">=3.8, <=3.11"
   2.1.0, 0.16.0, ">=3.8, <=3.11"
   2.0.0, 0.15.0, ">=3.8, <=3.11"
   1.13.0, 0.14.0, ">=3.7, <=3.10"
   1.12.0, 0.13.0, ">=3.7, <=3.10"
   1.11.0, 0.12.0, ">=3.6, <=3.9"
   1.10.0, 0.11.0, ">=3.6, <=3.9"
   1.9.1, 0.10.1, ">=3.6, <=3.9"
   1.9, 0.10, ">=3.6, <=3.9"
   1.8.1, 0.9.1, ">=3.6, <=3.9"
   1.8, 0.9, ">=3.6, <=3.9"
   1.7.1, 0.8.1, ">=3.6, <=3.9"
   1.7, 0.8, ">=3.6, <=3.8"
   1.6, 0.7, ">=3.6, <=3.8"
   1.5, 0.6, ">=3.5, <=3.8"
   1.4, 0.5, "2.7, >=3.5, <=3.8"
   0.4 and below, 0.2.3, "2.7, >=3.5, <=3.8"
Using conda::
conda install -c pytorch torchtext
Using pip::
pip install torchtext
Optional requirements
If you want to use the English tokenizer from `SpaCy <http://spacy.io/>`_, you need to install SpaCy and download its English model::
pip install spacy
python -m spacy download en_core_web_sm
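Once installed, the spaCy tokenizer can be used through ``get_tokenizer`` (a brief sketch)::

    from torchtext.data.utils import get_tokenizer

    spacy_tokenizer = get_tokenizer("spacy", language="en_core_web_sm")
    print(spacy_tokenizer("You can now install torchtext using pip!"))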
Alternatively, you might want to use the `Moses <http://www.statmt.org/moses/>`_ tokenizer port in `SacreMoses <https://github.com/alvations/sacremoses>`_ (split from `NLTK <http://nltk.org/>`_). You have to install SacreMoses::
pip install sacremoses
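The Moses tokenizer is then available in the same way (a brief sketch)::

    from torchtext.data.utils import get_tokenizer

    moses_tokenizer = get_tokenizer("moses")
    print(moses_tokenizer("Hello, world!"))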
For torchtext 0.5 and below, ``sentencepiece`` is additionally required::
conda install -c powerai sentencepiece
Building from source
To build torchtext from source, you need ``git``, ``CMake`` and a C++11 compiler such as ``g++``::
git clone https://github.com/pytorch/text torchtext
cd torchtext
git submodule update --init --recursive
# Linux
python setup.py clean install
# OSX
CC=clang CXX=clang++ python setup.py clean install
# or ``python setup.py develop`` if you are making modifications.
Note
When building from source, make sure that you have the same C++ compiler as the one used to build PyTorch. A simple way is to build PyTorch from source and use the same environment to build torchtext.
If you are using the nightly build of PyTorch, check out the environment it was built with `conda (here) <https://github.com/pytorch/builder/tree/main/conda>`_ and `pip (here) <https://github.com/pytorch/builder/tree/main/manywheel>`_.
Additionally, datasets in torchtext are implemented using the torchdata library. Please take a look at the `installation instructions <https://github.com/pytorch/data#installation>`_ to download the latest nightlies or install from source.
Documentation
Find the documentation `here <https://pytorch.org/text/>`_.
Datasets
The datasets module currently contains:
- Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
- Machine translation: IWSLT2016, IWSLT2017, Multi30k
- Sequence tagging (e.g. POS/NER): UDPOS, CoNLL2000Chunking
- Question answering: SQuAD1, SQuAD2
- Text classification: SST2, AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
- Model pre-training: CC-100
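Each dataset exposes raw text iterators over its splits; for example (a minimal sketch using AG_NEWS)::

    from torchtext.datasets import AG_NEWS

    train_iter = AG_NEWS(split="train")
    label, text = next(iter(train_iter))  # an integer class label and the raw news text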
Models
The library currently consists of the following pre-trained models:
- RoBERTa: `Base and Large Architecture <https://github.com/pytorch/fairseq/tree/main/examples/roberta#pre-trained-models>`_, `DistilRoBERTa <https://github.com/huggingface/transformers/blob/main/examples/research_projects/distillation/README.md>`_
- XLM-RoBERTa: `Base and Large Architecture <https://github.com/pytorch/fairseq/tree/main/examples/xlmr#pre-trained-models>`_
- T5: `Small, Base, Large, 3B, and 11B Architecture <https://github.com/google-research/text-to-text-transfer-transformer>`_
- Flan-T5: `Base, Large, XL, and XXL Architecture <https://github.com/google-research/t5x>`_
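The pre-trained weights are exposed as bundles in ``torchtext.models``; loading one looks roughly like this (a sketch; weights are downloaded on first use)::

    import torch
    from torchtext.models import ROBERTA_BASE_ENCODER

    model = ROBERTA_BASE_ENCODER.get_model()
    transform = ROBERTA_BASE_ENCODER.transform()
    model_input = torch.tensor(transform(["Hello world"]))
    features = model(model_input)  # contextual token embeddings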
Tokenizers
The transforms module currently supports the following scriptable tokenizers:

- `SentencePiece <https://github.com/google/sentencepiece>`_
- `GPT-2 BPE <https://github.com/openai/gpt-2/blob/master/src/encoder.py>`_
- `CLIP <https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py>`_
- `RE2 <https://github.com/google/re2>`_
- `BERT <https://arxiv.org/pdf/1810.04805.pdf>`_
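These tokenizers live in ``torchtext.transforms`` and can be scripted with TorchScript alongside a model; a brief sketch (the model path is a placeholder for a pre-trained SentencePiece model)::

    import torch
    from torchtext.transforms import SentencePieceTokenizer

    tokenizer = SentencePieceTokenizer("path/to/spm.model")  # placeholder path
    print(tokenizer(["hello world", "attention is all you need"]))
    scripted_tokenizer = torch.jit.script(tokenizer)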
Tutorials
To get started with torchtext, users may refer to the following tutorials available on the PyTorch website.

- `SST-2 binary text classification using XLM-R pre-trained model <https://pytorch.org/text/stable/tutorials/sst2_classification_non_distributed.html>`_
- `Text classification with AG_NEWS dataset <https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html>`_
- `Translation trained with Multi30k dataset using transformers and torchtext <https://pytorch.org/tutorials/beginner/translation_transformer.html>`_
- `Language modeling using transforms and torchtext <https://pytorch.org/tutorials/beginner/transformer_tutorial.html>`_
Disclaimer on Datasets
This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!