Top Related Projects
- microsoft/DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
- google/sentencepiece: an unsupervised text tokenizer for neural-network-based text generation.
- allenai/allennlp: an open-source NLP research library, built on PyTorch.
- explosion/spaCy: 💫 industrial-strength Natural Language Processing (NLP) in Python.
- google-research/bert: TensorFlow code and pre-trained models for BERT.
Quick Overview
The huggingface/tokenizers repository is a fast, modern tokenization library that implements today's most used tokenizers. It offers a flexible, easy-to-use API for training and running various tokenization algorithms efficiently. The core is implemented in Rust with Python bindings, combining native-code performance with an accessible Python interface.
Pros
- Extremely fast tokenization, significantly outperforming pure Python implementations
- Supports a wide range of tokenization algorithms, including BPE, WordPiece, and Unigram
- Provides a unified API for different tokenizers, making it easy to switch between algorithms (see the sketch after this list)
- Offers both pre-trained tokenizers and the ability to train custom tokenizers
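As a sketch of that unified API: the same Tokenizer wrapper, pre-tokenizer, and trainer pattern works across model types, so swapping algorithms means changing two classes (the corpus path here is hypothetical):
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer, WordPieceTrainer
# Identical pipeline for two different algorithms; only model and trainer change.
for model, trainer in [
    (BPE(unk_token="[UNK]"), BpeTrainer(special_tokens=["[UNK]"])),
    (WordPiece(unk_token="[UNK]"), WordPieceTrainer(special_tokens=["[UNK]"])),
]:
    tokenizer = Tokenizer(model)
    tokenizer.pre_tokenizer = Whitespace()
    tokenizer.train(["path/to/corpus.txt"], trainer)  # hypothetical corpus path
    print(tokenizer.encode("Hello, world!").tokens)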
Cons
- Learning curve for users unfamiliar with Rust or advanced tokenization concepts
- Limited support for languages other than Python (though Rust and Node.js APIs are available)
- May require additional setup steps compared to pure Python libraries
- Some advanced features might be overwhelming for beginners
Code Examples
- Loading and using a pre-trained tokenizer:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)
- Training a new tokenizer:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace before learning merges
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["path/to/file1.txt", "path/to/file2.txt"], trainer)
tokenizer.save("path/to/new_tokenizer.json")
- Encoding and decoding with a tokenizer:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("path/to/tokenizer.json")
encoded = tokenizer.encode("Hello, world!")
print(encoded.ids)     # token IDs
print(encoded.tokens)  # token strings
decoded = tokenizer.decode(encoded.ids)
print(decoded)         # reconstructed text
Getting Started
To get started with huggingface/tokenizers:
- Install the library:
pip install tokenizers
- Import and use a pre-trained tokenizer:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer.encode("Hello, world!")
print(encoded.tokens)
- For more advanced usage, such as the batch-encoding sketch below, refer to the official documentation and examples in the repository.
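For instance, a minimal sketch of batch encoding with padding enabled (the tokenizer file path is hypothetical):
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("path/to/tokenizer.json")  # hypothetical path
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")  # pad each batch to its longest sequence
batch = tokenizer.encode_batch(["Hello, world!", "A somewhat longer second sentence."])
for encoding in batch:
    print(encoding.tokens, encoding.attention_mask)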
Competitor Comparisons
microsoft/DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Pros of DeepSpeed
- Focuses on optimizing and scaling deep learning training, offering significant speed improvements for large models
- Provides a comprehensive suite of optimization techniques, including ZeRO, pipeline parallelism, and 3D parallelism
- Integrates well with popular deep learning frameworks like PyTorch and supports various hardware configurations
Cons of DeepSpeed
- Has a steeper learning curve due to its complexity and wide range of features
- Primarily designed for training optimization, while tokenizers is more specialized for text processing tasks
- May require more setup and configuration for smaller projects or simpler use cases
Code Comparison
DeepSpeed (model initialization):
import deepspeed
# args and model come from the surrounding training script
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters()
)
tokenizers (tokenizer creation):
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=30000)  # vocab_size is a trainer option, not a train() argument
tokenizer.train(files=["path/to/file1.txt", "path/to/file2.txt"], trainer=trainer)
While both libraries serve different purposes, this comparison highlights DeepSpeed's focus on model training optimization and tokenizers' specialization in text processing and tokenization tasks.
google/sentencepiece: an unsupervised text tokenizer for neural-network-based text generation.
Pros of SentencePiece
- Language-agnostic and unsupervised text tokenization
- Supports subword units (byte-pair-encoding and unigram language model)
- Directly trainable from raw sentences
Cons of SentencePiece
- Slower tokenization speed compared to Tokenizers
- Less flexibility in customizing tokenization rules
- Limited support for advanced features like BPE dropout (see the sketch after the code comparison)
Code Comparison
SentencePiece:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("model.model")
tokens = sp.EncodeAsPieces("Hello world")
Tokenizers:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
tokens = tokenizer.encode("Hello world").tokens
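On the BPE-dropout point above: tokenizers exposes dropout as an argument of the BPE model itself (a minimal sketch; the vocab and merges file paths are hypothetical):
from tokenizers import Tokenizer
from tokenizers.models import BPE
# BPE-dropout randomly skips merges at encoding time, so repeated calls can
# yield different segmentations of the same text (a regularization trick).
tokenizer = Tokenizer(BPE.from_file("vocab.json", "merges.txt", dropout=0.1))
print(tokenizer.encode("Hello world").tokens)
print(tokenizer.encode("Hello world").tokens)  # may differ from the call above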
Both libraries offer efficient tokenization solutions, but Tokenizers provides faster processing and more advanced features. SentencePiece excels in language-agnostic tokenization and is particularly useful for Asian languages. Tokenizers offers greater flexibility and customization options, making it more suitable for complex NLP tasks and fine-tuning tokenization strategies.
allenai/allennlp: an open-source NLP research library, built on PyTorch.
Pros of AllenNLP
- Comprehensive NLP toolkit with a wide range of pre-built models and tasks
- Flexible and extensible architecture for building custom NLP pipelines
- Strong integration with PyTorch and support for distributed training
Cons of AllenNLP
- Steeper learning curve due to its broader scope and complexity
- Potentially slower performance for specific tokenization tasks
- Less focus on tokenization compared to Tokenizers' specialized approach
Code Comparison
AllenNLP tokenization example:
from allennlp.data.tokenizers import SpacyTokenizer
tokenizer = SpacyTokenizer()
tokens = tokenizer.tokenize("Hello, world!")
Tokenizers tokenization example:
from tokenizers import Tokenizer
# A WordPiece model needs a vocabulary before it can encode, so load a
# pre-trained tokenizer rather than instantiating an empty WordPiece().
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.encode("Hello, world!").tokens
While both libraries offer tokenization capabilities, Tokenizers provides a more specialized and efficient approach to tokenization, whereas AllenNLP offers a broader range of NLP functionalities beyond just tokenization. AllenNLP is better suited for complex NLP pipelines, while Tokenizers excels in fast and customizable tokenization tasks.
explosion/spaCy: 💫 industrial-strength Natural Language Processing (NLP) in Python.
Pros of spaCy
- Comprehensive NLP library with a wide range of features beyond tokenization
- Offers pre-trained models for various languages and tasks
- Provides a user-friendly API and extensive documentation
Cons of spaCy
- Can be slower for simple tokenization tasks compared to specialized tokenizers
- Larger library size and potential overhead for projects only needing tokenization
- May require more system resources for full functionality
Code Comparison
spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello, world!")
tokens = [token.text for token in doc]
tokenizers:
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split into words before building the vocab
trainer = WordLevelTrainer(special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["Hello, world!"], trainer)
tokens = tokenizer.encode("Hello, world!").tokens
Both libraries offer tokenization capabilities, but spaCy provides a more comprehensive NLP toolkit, while tokenizers focuses specifically on efficient tokenization. The choice between them depends on the project's requirements and scope.
google-research/bert: TensorFlow code and pre-trained models for BERT.
Pros of BERT
- Pioneered bidirectional transformer pre-training for language understanding
- Provides pre-trained models for various languages and tasks
- Extensive documentation and research papers available
Cons of BERT
- Less flexible tokenization options compared to tokenizers
- Slower tokenization process, especially for large datasets
- Limited to specific vocabulary sizes and model architectures
Code Comparison
BERT tokenization:
import tokenization  # tokenization.py from the google-research/bert repository
tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
tokens = tokenizer.tokenize("Hello, how are you?")
tokenizers tokenization:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
tokenizer = Tokenizer(WordPiece.from_file("vocab.json", unk_token="[UNK]"))
tokens = tokenizer.encode("Hello, how are you?").tokens
Key Differences
- tokenizers offers more customizable tokenization options
- BERT focuses on the entire model architecture, while tokenizers specializes in tokenization
- tokenizers provides faster tokenization, especially for large datasets
- BERT includes pre-trained models, while tokenizers is primarily a tokenization library
- tokenizers supports various tokenization algorithms, whereas BERT uses a specific WordPiece implementation (see the Unigram sketch below)
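To illustrate the last point, training a Unigram tokenizer uses the same API as the other algorithms (a sketch; the corpus path is hypothetical):
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import UnigramTrainer
tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = Whitespace()
trainer = UnigramTrainer(vocab_size=8000, special_tokens=["[UNK]"], unk_token="[UNK]")
tokenizer.train(["path/to/corpus.txt"], trainer)  # hypothetical path
print(tokenizer.encode("Hello, how are you?").tokens)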
README
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking: it's always possible to recover the part of the original sentence that corresponds to a given token (see the offsets sketch after this list).
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
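A minimal sketch of that alignment tracking, using a pre-trained tokenizer (each offset is a (start, end) character span into the original string):
from tokenizers import Tokenizer
text = "Hello, y'all!"
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
output = tokenizer.encode(text)
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, "->", repr(text[start:end]))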
Performances
Performance can vary depending on hardware, but running the benchmark in ~/bindings/python/benches/test_tiktoken.py on a g6 AWS instance gives representative throughput numbers.
Bindings
We provide bindings to the following languages (more to come!):
- Rust (the original implementation)
- Python
- Node.js
- Ruby (community-contributed)
Quick example using Python:
Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE())
You can customize how pre-tokenization (e.g., splitting into words) is done:
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
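Normalization can be customized in the same way, for example by composing NFD Unicode normalization with lowercasing (NFD, Lowercase, and Sequence are built-in normalizers):
from tokenizers.normalizers import NFD, Lowercase, Sequence
tokenizer.normalizer = Sequence([NFD(), Lowercase()])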
Then training your tokenizer on a set of files takes just two lines of code:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
Once your tokenizer is trained, encode any text with just one line:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
Check the documentation or the quicktour to learn more!