
huggingface/tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production


Top Related Projects

  • DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
  • SentencePiece: unsupervised text tokenizer for Neural Network-based text generation.
  • AllenNLP: an open-source NLP research library, built on PyTorch.
  • spaCy: 💫 Industrial-strength Natural Language Processing (NLP) in Python.
  • BERT: TensorFlow code and pre-trained models for BERT.

Quick Overview

The huggingface/tokenizers repository is a fast and modern tokenization library, providing an implementation of today's most used tokenizers. It offers a flexible and easy-to-use API, allowing users to train and use various tokenization algorithms efficiently. The library is implemented in Rust with Python bindings, ensuring high performance and cross-language compatibility.

Pros

  • Extremely fast tokenization, significantly outperforming pure Python implementations
  • Supports a wide range of tokenization algorithms, including BPE, WordPiece, and Unigram
  • Provides a unified API for different tokenizers, making it easy to switch between algorithms (see the sketch after these lists)
  • Offers both pre-trained tokenizers and the ability to train custom tokenizers

Cons

  • Learning curve for users unfamiliar with Rust or advanced tokenization concepts
  • Limited support for languages other than Python (though Rust API is available)
  • May require additional setup steps compared to pure Python libraries
  • Some advanced features might be overwhelming for beginners
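
The unified API noted above means different algorithms plug into the same Tokenizer object. A minimal sketch, where the tiny in-memory training corpus is only a placeholder:

from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece
from tokenizers.trainers import BpeTrainer, WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Swap the model/trainer pair; everything else stays the same (Unigram works likewise)
for model, trainer in [
    (BPE(unk_token="[UNK]"), BpeTrainer(special_tokens=["[UNK]"])),
    (WordPiece(unk_token="[UNK]"), WordPieceTrainer(special_tokens=["[UNK]"])),
]:
    tokenizer = Tokenizer(model)
    tokenizer.pre_tokenizer = Whitespace()
    tokenizer.train_from_iterator(["Hello, how are you?"], trainer=trainer)
    print(tokenizer.encode("Hello, how are you?").tokens)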

Code Examples

  1. Loading and using a pre-trained tokenizer:
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)
  2. Training a new tokenizer:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

tokenizer.train(["path/to/file1.txt", "path/to/file2.txt"], trainer)
tokenizer.save("path/to/new_tokenizer.json")
  3. Encoding and decoding with a tokenizer:
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("path/to/tokenizer.json")

encoded = tokenizer.encode("Hello, world!")
print(encoded.ids)  # Token IDs
print(encoded.tokens)  # Original tokens

decoded = tokenizer.decode(encoded.ids)
print(decoded)  # Original text

Getting Started

To get started with huggingface/tokenizers:

  1. Install the library:
pip install tokenizers
  2. Import and use a pre-trained tokenizer:
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer.encode("Hello, world!")
print(encoded.tokens)
  3. For more advanced usage, refer to the official documentation and examples in the repository.

Competitor Comparisons

DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • Focuses on optimizing and scaling deep learning training, offering significant speed improvements for large models
  • Provides a comprehensive suite of optimization techniques, including ZeRO, pipeline parallelism, and 3D parallelism
  • Integrates well with popular deep learning frameworks like PyTorch and supports various hardware configurations

Cons of DeepSpeed

  • Has a steeper learning curve due to its complexity and wide range of features
  • Primarily designed for training optimization, while tokenizers is more specialized for text processing tasks
  • May require more setup and configuration for smaller projects or simpler use cases

Code Comparison

DeepSpeed (model initialization):

model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters()
)

tokenizers (tokenizer creation):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
# vocab_size is configured on the trainer, not on Tokenizer.train
trainer = BpeTrainer(vocab_size=30000)
tokenizer.train(files=["path/to/file1.txt", "path/to/file2.txt"], trainer=trainer)

The two libraries serve different purposes: DeepSpeed focuses on model-training optimization, while tokenizers specializes in text processing and tokenization.

SentencePiece

Unsupervised text tokenizer for Neural Network-based text generation.

Pros of SentencePiece

  • Language-agnostic and unsupervised text tokenization
  • Supports subword units (byte-pair-encoding and unigram language model)
  • Directly trainable from raw sentences (see the training sketch at the end of this comparison)

Cons of SentencePiece

  • Slower tokenization speed compared to Tokenizers
  • Less flexibility in customizing tokenization rules
  • Limited support for advanced features like BPE dropout

Code Comparison

SentencePiece:

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("model.model")
tokens = sp.EncodeAsPieces("Hello world")

Tokenizers:

from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
tokens = tokenizer.encode("Hello world").tokens

Both libraries offer efficient tokenization solutions, but Tokenizers provides faster processing and more advanced features. SentencePiece excels in language-agnostic tokenization and is particularly useful for Asian languages. Tokenizers offers greater flexibility and customization options, making it more suitable for complex NLP tasks and fine-tuning tokenization strategies.
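
To make the "directly trainable from raw sentences" point above concrete, SentencePiece is typically trained straight from a plain-text corpus. A minimal sketch in which corpus.txt and the vocabulary size are placeholders:

import sentencepiece as spm

# Train a model directly on raw text; writes m.model and m.vocab
spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="m", vocab_size=8000)

sp = spm.SentencePieceProcessor()
sp.Load("m.model")
print(sp.EncodeAsPieces("Hello world"))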

AllenNLP

An open-source NLP research library, built on PyTorch.

Pros of AllenNLP

  • Comprehensive NLP toolkit with a wide range of pre-built models and tasks
  • Flexible and extensible architecture for building custom NLP pipelines
  • Strong integration with PyTorch and support for distributed training

Cons of AllenNLP

  • Steeper learning curve due to its broader scope and complexity
  • Potentially slower performance for specific tokenization tasks
  • Less focus on tokenization compared to Tokenizers' specialized approach

Code Comparison

AllenNLP tokenization example:

from allennlp.data.tokenizers import SpacyTokenizer

tokenizer = SpacyTokenizer()
tokens = tokenizer.tokenize("Hello, world!")

Tokenizers tokenization example:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# A WordPiece model needs a trained vocabulary; vocab.txt is a placeholder path
tokenizer = Tokenizer(WordPiece.from_file("vocab.txt", unk_token="[UNK]"))
tokens = tokenizer.encode("Hello, world!").tokens

While both libraries offer tokenization capabilities, Tokenizers provides a more specialized and efficient approach to tokenization, whereas AllenNLP offers a broader range of NLP functionalities beyond just tokenization. AllenNLP is better suited for complex NLP pipelines, while Tokenizers excels in fast and customizable tokenization tasks.

spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Pros of spaCy

  • Comprehensive NLP library with a wide range of features beyond tokenization
  • Offers pre-trained models for various languages and tasks
  • Provides a user-friendly API and extensive documentation

Cons of spaCy

  • Can be slower for simple tokenization tasks compared to specialized tokenizers
  • Larger library size and potential overhead for projects only needing tokenization
  • May require more system resources for full functionality

Code Comparison

spaCy:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello, world!")
tokens = [token.text for token in doc]

tokenizers:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordLevelTrainer(special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["Hello, world!"], trainer=trainer)
tokens = tokenizer.encode("Hello, world!").tokens

Both libraries offer tokenization capabilities, but spaCy provides a more comprehensive NLP toolkit, while tokenizers focuses specifically on efficient tokenization. The choice between them depends on the project's requirements and scope.

BERT

TensorFlow code and pre-trained models for BERT

Pros of BERT

  • Pioneered bidirectional pre-training for transformer-based language models
  • Provides pre-trained models for various languages and tasks
  • Extensive documentation and research papers available

Cons of BERT

  • Less flexible tokenization options compared to tokenizers
  • Slower tokenization process, especially for large datasets
  • Limited to specific vocabulary sizes and model architectures

Code Comparison

BERT tokenization:

from tokenization import FullTokenizer  # tokenization.py ships with the google-research/bert repo

tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
tokens = tokenizer.tokenize("Hello, how are you?")

tokenizers tokenization:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Load an existing WordPiece vocabulary file (e.g., a BERT vocab.txt)
tokenizer = Tokenizer(WordPiece.from_file("vocab.txt", unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokens = tokenizer.encode("Hello, how are you?").tokens

Key Differences

  • tokenizers offers more customizable tokenization options
  • BERT focuses on the entire model architecture, while tokenizers specializes in tokenization
  • tokenizers provides faster tokenization, especially for large datasets
  • BERT includes pre-trained models, while tokenizers is primarily a tokenization library
  • tokenizers supports various tokenization algorithms, whereas BERT uses a specific WordPiece implementation (see the sketch below)
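
For completeness, a BERT-style vocabulary can also be reused directly through the ready-made implementation shipped with tokenizers. A minimal sketch where vocab.txt is a placeholder path:

from tokenizers import BertWordPieceTokenizer

# Bundles WordPiece with BERT-style normalization, pre-tokenization and special tokens
tokenizer = BertWordPieceTokenizer("vocab.txt", lowercase=True)
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # e.g. ["[CLS]", "hello", ",", "how", "are", "you", "?", "[SEP]"]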


README




Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

  • Train new vocabularies and tokenize, using today's most used tokenizers.
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, add the special tokens your model needs (these last two points are illustrated in the sketch after this list).
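
A minimal sketch of the last two points, assuming a tokenizer that was already trained and saved as tokenizer.json (placeholder path):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Alignment tracking: offsets map every token back to the original string
output = tokenizer.encode("Hello, y'all!")
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, "->", "Hello, y'all!"[start:end])

# Built-in pre-processing: truncation and padding applied to (batches of) encodings
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(pad_token="[PAD]", length=8)
batch = tokenizer.encode_batch(["Hello, y'all!", "How are you?"])
print([encoding.ids for encoding in batch])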

Performances

Performance can vary depending on hardware, but running ~/bindings/python/benches/test_tiktoken.py should give the following results on a g6 AWS instance: [benchmark chart shown in the repository README]

Bindings

We provide bindings to the following languages (more to come!):

  • Rust (the original implementation)
  • Python
  • Node.js
  • Ruby (external contribution)

Quick example using Python:

Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())

You can customize how pre-tokenization (e.g., splitting into words) is done:

from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()

Then training your tokenizer on a set of files just takes two lines of code:

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)

Once your tokenizer is trained, encode any text with just one line:

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]

Check the documentation or the quicktour to learn more!