Top Related Projects
- microsoft/DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
- google/sentencepiece: an unsupervised text tokenizer for neural-network-based text generation.
- allenai/allennlp: an open-source NLP research library, built on PyTorch.
- explosion/spaCy: 💫 industrial-strength Natural Language Processing (NLP) in Python.
- google-research/bert: TensorFlow code and pre-trained models for BERT.
Quick Overview
The huggingface/tokenizers repository is a fast, modern tokenization library that implements today's most used tokenizers. It offers a flexible, easy-to-use API for training and running various tokenization algorithms efficiently. The core is implemented in Rust with Python bindings, combining native-code performance with an accessible Python interface.
Pros
- Extremely fast tokenization, significantly outperforming pure Python implementations
- Supports a wide range of tokenization algorithms, including BPE, WordPiece, and Unigram
- Provides a unified API for different tokenizers, making it easy to switch between algorithms (see the sketch after this list)
- Offers both pre-trained tokenizers and the ability to train custom tokenizers
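As a sketch of that unified API: the same Tokenizer wrapper, pre-tokenizer, and trainer pattern works across model types, so swapping algorithms means changing two classes (the corpus path here is hypothetical):
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer, WordPieceTrainer
# Identical pipeline for two different algorithms; only model and trainer change.
for model, trainer in [
    (BPE(unk_token="[UNK]"), BpeTrainer(special_tokens=["[UNK]"])),
    (WordPiece(unk_token="[UNK]"), WordPieceTrainer(special_tokens=["[UNK]"])),
]:
    tokenizer = Tokenizer(model)
    tokenizer.pre_tokenizer = Whitespace()
    tokenizer.train(["path/to/corpus.txt"], trainer)  # hypothetical corpus path
    print(tokenizer.encode("Hello, world!").tokens)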
Cons
- Learning curve for users unfamiliar with Rust or advanced tokenization concepts
- Limited support for languages other than Python (though Rust and Node.js APIs are available)
- May require additional setup steps compared to pure Python libraries
- Some advanced features might be overwhelming for beginners
Code Examples
- Loading and using a pre-trained tokenizer:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)
- Training a new tokenizer:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace before learning merges
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["path/to/file1.txt", "path/to/file2.txt"], trainer)
tokenizer.save("path/to/new_tokenizer.json")
- Encoding and decoding with a tokenizer:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("path/to/tokenizer.json")
encoded = tokenizer.encode("Hello, world!")
print(encoded.ids)     # token IDs
print(encoded.tokens)  # token strings
decoded = tokenizer.decode(encoded.ids)
print(decoded)         # reconstructed text
Getting Started
To get started with huggingface/tokenizers:
- Install the library:
pip install tokenizers
- Import and use a pre-trained tokenizer:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer.encode("Hello, world!")
print(encoded.tokens)
- For more advanced usage, such as the batch-encoding sketch below, refer to the official documentation and examples in the repository.
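For instance, a minimal sketch of batch encoding with padding enabled (the tokenizer file path is hypothetical):
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("path/to/tokenizer.json")  # hypothetical path
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")  # pad each batch to its longest sequence
batch = tokenizer.encode_batch(["Hello, world!", "A somewhat longer second sentence."])
for encoding in batch:
    print(encoding.tokens, encoding.attention_mask)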
Competitor Comparisons
microsoft/DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Pros of DeepSpeed
- Focuses on optimizing and scaling deep learning training, offering significant speed improvements for large models
- Provides a comprehensive suite of optimization techniques, including ZeRO, pipeline parallelism, and 3D parallelism
- Integrates well with popular deep learning frameworks like PyTorch and supports various hardware configurations
Cons of DeepSpeed
- Has a steeper learning curve due to its complexity and wide range of features
- Primarily designed for training optimization, while tokenizers is more specialized for text processing tasks
- May require more setup and configuration for smaller projects or simpler use cases
Code Comparison
DeepSpeed (model initialization):
import deepspeed
# args and model come from the surrounding training script
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters()
)
tokenizers (tokenizer creation):
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=30000)  # vocab_size is a trainer option, not a train() argument
tokenizer.train(files=["path/to/file1.txt", "path/to/file2.txt"], trainer=trainer)
While both libraries serve different purposes, this comparison highlights DeepSpeed's focus on model training optimization and tokenizers' specialization in text processing and tokenization tasks.
google/sentencepiece: an unsupervised text tokenizer for neural-network-based text generation.
Pros of SentencePiece
- Language-agnostic and unsupervised text tokenization
- Supports subword units (byte-pair-encoding and unigram language model)
- Directly trainable from raw sentences
Cons of SentencePiece
- Slower tokenization speed compared to Tokenizers
- Less flexibility in customizing tokenization rules
- Limited support for advanced features like BPE dropout (see the sketch after the code comparison)
Code Comparison
SentencePiece:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("model.model")
tokens = sp.EncodeAsPieces("Hello world")
Tokenizers:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
tokens = tokenizer.encode("Hello world").tokens
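On the BPE-dropout point above: tokenizers exposes dropout as an argument of the BPE model itself (a minimal sketch; the vocab and merges file paths are hypothetical):
from tokenizers import Tokenizer
from tokenizers.models import BPE
# BPE-dropout randomly skips merges at encoding time, so repeated calls can
# yield different segmentations of the same text (a regularization trick).
tokenizer = Tokenizer(BPE.from_file("vocab.json", "merges.txt", dropout=0.1))
print(tokenizer.encode("Hello world").tokens)
print(tokenizer.encode("Hello world").tokens)  # may differ from the call above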
Both libraries offer efficient tokenization solutions, but Tokenizers provides faster processing and more advanced features. SentencePiece excels in language-agnostic tokenization and is particularly useful for Asian languages. Tokenizers offers greater flexibility and customization options, making it more suitable for complex NLP tasks and fine-tuning tokenization strategies.
allenai/allennlp: an open-source NLP research library, built on PyTorch.
Pros of AllenNLP
- Comprehensive NLP toolkit with a wide range of pre-built models and tasks
- Flexible and extensible architecture for building custom NLP pipelines
- Strong integration with PyTorch and support for distributed training
Cons of AllenNLP
- Steeper learning curve due to its broader scope and complexity
- Potentially slower performance for specific tokenization tasks
- Less focus on tokenization compared to Tokenizers' specialized approach
Code Comparison
AllenNLP tokenization example:
from allennlp.data.tokenizers import SpacyTokenizer
tokenizer = SpacyTokenizer()
tokens = tokenizer.tokenize("Hello, world!")
Tokenizers tokenization example:
from tokenizers import Tokenizer
# A WordPiece model needs a vocabulary before it can encode, so load a
# pre-trained tokenizer rather than instantiating an empty WordPiece().
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.encode("Hello, world!").tokens
While both libraries offer tokenization capabilities, Tokenizers provides a more specialized and efficient approach to tokenization, whereas AllenNLP offers a broader range of NLP functionalities beyond just tokenization. AllenNLP is better suited for complex NLP pipelines, while Tokenizers excels in fast and customizable tokenization tasks.
explosion/spaCy: 💫 industrial-strength Natural Language Processing (NLP) in Python.
Pros of spaCy
- Comprehensive NLP library with a wide range of features beyond tokenization
- Offers pre-trained models for various languages and tasks
- Provides a user-friendly API and extensive documentation
Cons of spaCy
- Can be slower for simple tokenization tasks compared to specialized tokenizers
- Larger library size and potential overhead for projects only needing tokenization
- May require more system resources for full functionality
Code Comparison
spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello, world!")
tokens = [token.text for token in doc]
tokenizers:
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split into words before building the vocab
trainer = WordLevelTrainer(special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["Hello, world!"], trainer)
tokens = tokenizer.encode("Hello, world!").tokens
Both libraries offer tokenization capabilities, but spaCy provides a more comprehensive NLP toolkit, while tokenizers focuses specifically on efficient tokenization. The choice between them depends on the project's requirements and scope.
google-research/bert: TensorFlow code and pre-trained models for BERT.
Pros of BERT
- Pioneered bidirectional transformer pre-training for language understanding
- Provides pre-trained models for various languages and tasks
- Extensive documentation and research papers available
Cons of BERT
- Less flexible tokenization options compared to tokenizers
- Slower tokenization process, especially for large datasets
- Limited to specific vocabulary sizes and model architectures
Code Comparison
BERT tokenization:
import tokenization  # tokenization.py from the google-research/bert repository
tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
tokens = tokenizer.tokenize("Hello, how are you?")
tokenizers tokenization:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
tokenizer = Tokenizer(WordPiece.from_file("vocab.json", unk_token="[UNK]"))
tokens = tokenizer.encode("Hello, how are you?").tokens
Key Differences
- tokenizers offers more customizable tokenization options
- BERT focuses on the entire model architecture, while tokenizers specializes in tokenization
- tokenizers provides faster tokenization, especially for large datasets
- BERT includes pre-trained models, while tokenizers is primarily a tokenization library
- tokenizers supports various tokenization algorithms, whereas BERT uses a specific WordPiece implementation (see the Unigram sketch below)
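To illustrate the last point, training a Unigram tokenizer uses the same API as the other algorithms (a sketch; the corpus path is hypothetical):
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import UnigramTrainer
tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = Whitespace()
trainer = UnigramTrainer(vocab_size=8000, special_tokens=["[UNK]"], unk_token="[UNK]")
tokenizer.train(["path/to/corpus.txt"], trainer)  # hypothetical path
print(tokenizer.encode("Hello, how are you?").tokens)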
README
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking: it's always possible to recover the part of the original sentence that corresponds to a given token (see the offsets sketch after this list).
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
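A minimal sketch of that alignment tracking, using a pre-trained tokenizer (each offset is a (start, end) character span into the original string):
from tokenizers import Tokenizer
text = "Hello, y'all!"
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
output = tokenizer.encode(text)
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, "->", repr(text[start:end]))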
Performances
Performance can vary depending on hardware, but running the benchmark in ~/bindings/python/benches/test_tiktoken.py on a g6 AWS instance gives representative throughput numbers.
Bindings
We provide bindings to the following languages (more to come!):
- Rust (the original implementation)
- Python
- Node.js
- Ruby (community-contributed)
Quick example using Python:
Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE())
You can customize how pre-tokenization (e.g., splitting into words) is done:
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
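Normalization can be customized in the same way, for example by composing NFD Unicode normalization with lowercasing (NFD, Lowercase, and Sequence are built-in normalizers):
from tokenizers.normalizers import NFD, Lowercase, Sequence
tokenizer.normalizer = Sequence([NFD(), Lowercase()])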
Then training your tokenizer on a set of files takes just two lines of code:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
Once your tokenizer is trained, encode any text with just one line:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
Check the documentation or the quicktour to learn more!