subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation

2,244

470

2,244

View on GitHub

Top Related Projects

sentencepiece

11,136

Unsupervised text tokenizer for Neural Network-based text generation.

tokenizers

9,965

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

bert

39,267

TensorFlow code and pre-trained models for BERT

Quick Overview

Subword-nmt is a Python library for subword segmentation, primarily used in neural machine translation. It implements the Byte Pair Encoding (BPE) algorithm, which helps address the open vocabulary problem in machine translation by breaking down words into smaller, more manageable units.

Pros

Improves handling of rare and out-of-vocabulary words in machine translation
Reduces vocabulary size, leading to more efficient models
Applicable to various languages and can be used with different NMT frameworks
Well-documented and easy to integrate into existing NLP pipelines

Cons

May introduce segmentation errors that affect translation quality
Requires careful tuning of the number of merge operations
Can increase sequence length, potentially impacting model performance
Limited to BPE algorithm, while other subword segmentation methods exist

Code Examples

Learning BPE codes:

from subword_nmt.learn_bpe import learn_bpe
from io import StringIO

input_file = StringIO("Hello world\nHow are you?")
output_file = StringIO()
learn_bpe(input_file, output_file, num_symbols=10, min_frequency=2)
print(output_file.getvalue())

Applying BPE encoding:

from subword_nmt.apply_bpe import BPE

bpe = BPE(StringIO("e l\nl lo\nlo _\n_ w\nw o\no r\nr l\nl d\n"))
encoded = bpe.process_line("Hello world")
print(encoded)

Segmenting a file:

from subword_nmt.apply_bpe import BPE
import sys

with open('codes.bpe', 'r') as codes_file:
    bpe = BPE(codes_file)

for line in sys.stdin:
    print(bpe.process_line(line.strip()))

Getting Started

Install the library:

pip install subword-nmt

Learn BPE codes:

subword-nmt learn-bpe -s 10000 < train.txt > codes.bpe

Apply BPE encoding:

subword-nmt apply-bpe -c codes.bpe < test.txt > test.bpe

To use in Python scripts, import and use as shown in the code examples above.

Competitor Comparisons

sentencepiece

11,136

Unsupervised text tokenizer for Neural Network-based text generation.

Pros of SentencePiece

Supports multiple tokenization algorithms (BPE, unigram, char, word)
Offers direct integration with TensorFlow and PyTorch
Provides language-agnostic tokenization without pre-tokenization

Cons of SentencePiece

May require more computational resources for training
Less flexible for custom modifications compared to simpler implementations

Code Comparison

SentencePiece:

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load('model.model')
encoded = sp.encode('Hello, world!', out_type=str)

subword-nmt:

from subword_nmt import apply_bpe
bpe = apply_bpe.BPE(open('codes.txt'))
encoded = bpe.process_line('Hello, world!')

Key Differences

SentencePiece offers a more comprehensive solution with built-in training and various algorithms
subword-nmt focuses specifically on BPE and requires separate training steps
SentencePiece provides easier integration with popular deep learning frameworks
subword-nmt may be simpler to understand and modify for specific use cases

Both tools are valuable for subword tokenization, with SentencePiece offering more features and integration options, while subword-nmt provides a straightforward implementation of BPE.

tokenizers

9,965

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Pros of tokenizers

Faster tokenization and training due to Rust implementation
More diverse tokenization algorithms (BPE, Unigram, WordPiece, etc.)
Extensive pre-processing and post-processing options

Cons of tokenizers

More complex setup and usage compared to subword-nmt
Steeper learning curve for users unfamiliar with Rust or advanced NLP concepts

Code comparison

subword-nmt:

from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

learn_bpe(open('input.txt'), open('codes.txt', 'w'), num_symbols=10000)
bpe = BPE(open('codes.txt'))
output = bpe.process_line('Hello, world!')

tokenizers:

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
tokenizer.train(["input.txt"], vocab_size=10000)
output = tokenizer.encode("Hello, world!")

The tokenizers library offers more flexibility and performance but requires more setup. subword-nmt is simpler to use for basic BPE tokenization tasks but lacks advanced features and speed optimizations.

bert

39,267

TensorFlow code and pre-trained models for BERT

Pros of BERT

More advanced pre-training technique using masked language modeling and next sentence prediction
Achieves state-of-the-art performance on a wide range of NLP tasks
Supports fine-tuning for specific downstream tasks

Cons of BERT

Significantly more complex and resource-intensive to train and use
Requires more computational power and memory
Less flexible for custom tokenization schemes

Code Comparison

BERT (Python):

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

subword-nmt (Python):

from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

with open('train.txt', 'r') as train_file, open('codes.txt', 'w') as codes_file:
    learn_bpe(train_file, codes_file, num_symbols=10000)

bpe = BPE(open('codes.txt'))
encoded = bpe.process_line('Hello, my dog is cute')

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

README

Subword Neural Machine Translation

This repository contains preprocessing scripts to segment text into subword units. The primary purpose is to facilitate the reproduction of our experiments on Neural Machine Translation with subword units (see below for reference).

INSTALLATION

install via pip (from PyPI):

pip install subword-nmt

install via pip (from Github):

pip install https://github.com/rsennrich/subword-nmt/archive/master.zip

alternatively, clone this repository; the scripts are executable stand-alone.

USAGE INSTRUCTIONS

Check the individual files for usage instructions.

To apply byte pair encoding to word segmentation, invoke these commands:

subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}
subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}

To segment rare words into character n-grams, do the following:

subword-nmt get-vocab --train_file {train_file} --vocab_file {vocab_file}
subword-nmt segment-char-ngrams --vocab {vocab_file} -n {order} --shortlist {size} < {test_file} > {out_file}

The original segmentation can be restored with a simple replacement:

sed -r 's/(@@ )|(@@ ?$)//g'

If you cloned the repository and did not install a package, you can also run the individual commands as scripts:

./subword_nmt/learn_bpe.py -s {num_operations} < {train_file} > {codes_file}

BEST PRACTICE ADVICE FOR BYTE PAIR ENCODING IN NMT

We found that for languages that share an alphabet, learning BPE on the concatenation of the (two or more) involved languages increases the consistency of segmentation, and reduces the problem of inserting/deleting characters when copying/transliterating names.

However, this introduces undesirable edge cases in that a word may be segmented in a way that has only been observed in the other language, and is thus unknown at test time. To prevent this, apply_bpe.py accepts a --vocabulary and a --vocabulary-threshold option so that the script will only produce symbols which also appear in the vocabulary (with at least some frequency).

To use this functionality, we recommend the following recipe (assuming L1 and L2 are the two languages):

Learn byte pair encoding on the concatenation of the training text, and get resulting vocabulary for each:

cat {train_file}.L1 {train_file}.L2 | subword-nmt learn-bpe -s {num_operations} -o {codes_file}
subword-nmt apply-bpe -c {codes_file} < {train_file}.L1 | subword-nmt get-vocab > {vocab_file}.L1
subword-nmt apply-bpe -c {codes_file} < {train_file}.L2 | subword-nmt get-vocab > {vocab_file}.L2

more conventiently, you can do the same with with this command:

subword-nmt learn-joint-bpe-and-vocab --input {train_file}.L1 {train_file}.L2 -s {num_operations} -o {codes_file} --write-vocabulary {vocab_file}.L1 {vocab_file}.L2

re-apply byte pair encoding with vocabulary filter:

subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {train_file}.L1 > {train_file}.BPE.L1
subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L2 --vocabulary-threshold 50 < {train_file}.L2 > {train_file}.BPE.L2

as a last step, extract the vocabulary to be used by the neural network. Example with Nematus:

nematus/data/build_dictionary.py {train_file}.BPE.L1 {train_file}.BPE.L2

[you may want to take the union of all vocabularies to support multilingual systems]

for test/dev data, re-use the same options for consistency:

subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {test_file}.L1 > {test_file}.BPE.L1

ADVANCED FEATURES

On top of the basic BPE implementation, this repository supports:

BPE dropout (Provilkov, Emelianenko and Voita, 2019): https://arxiv.org/abs/1910.13267 use the argument --dropout 0.1 for subword-nmt apply-bpe to randomly drop out possible merges. Doing this on the training corpus can improve quality of the final system; at test time, use BPE without dropout. In order to obtain reproducible results, argument --seed can be used to set the random seed.

Note: In the original paper, the authors used BPE-Dropout on each new batch separately. You can copy the training corpus several times to get similar behavior to obtain multiple segmentations for the same sentence.
support for glossaries: use the argument --glossaries for subword-nmt apply-bpe to provide a list of subwords and/or regular expressions that should always be passed to the output without subword segmentation

echo "I am flying to <country>Switzerland</country> at noon ." | subword-nmt apply-bpe --codes subword_nmt/tests/data/bpe.ref
I am fl@@ y@@ ing to <@@ coun@@ tr@@ y@@ >@@ S@@ w@@ it@@ z@@ er@@ l@@ and@@ <@@ /@@ coun@@ tr@@ y@@ > at no@@ on .

echo "I am flying to <country>Switzerland</country> at noon ." | subword-nmt apply-bpe --codes subword_nmt/tests/data/bpe.ref --glossaries "<country>\w*</country>" "fly"
I am fly@@ ing to <country>Switzerland</country> at no@@ on .

byte-level BPE: while BPE uses characters as basic units in Sennrich et al., 2016), Radford et al., 2019 use bytes as basic units. This can be enabled with the argument --bytes for subword-nmt learn-bpe. When applying BPE with subword-nmt apply-bpe, no argument is necessary: whether characters or bytes are the basic units is stored in the first line of the BPE file.

PUBLICATIONS

The segmentation methods are described in:

@inproceedings{sennrich-etal-2016-neural,
    title = "Neural Machine Translation of Rare Words with Subword Units",
    author = "Sennrich, Rico  and
      Haddow, Barry  and
      Birch, Alexandra",
    booktitle = "Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2016",
    address = "Berlin, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P16-1162",
    doi = "10.18653/v1/P16-1162",
    pages = "1715--1725",
}

The best practice advice is described in:

@inproceedings{sennrich-etal-2017-university,
    title = "The University of {E}dinburgh{'}s Neural {MT} Systems for {WMT}17",
    author = "Sennrich, Rico  and
      Birch, Alexandra  and
      Currey, Anna  and
      Germann, Ulrich  and
      Haddow, Barry  and
      Heafield, Kenneth  and
      Miceli Barone, Antonio Valerio  and
      Williams, Philip",
    booktitle = "Proceedings of the Second Conference on Machine Translation",
    month = sep,
    year = "2017",
    address = "Copenhagen, Denmark",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W17-4739",
    doi = "10.18653/v1/W17-4739",
    pages = "389--399",
}

HOW IMPLEMENTATION DIFFERS FROM Sennrich et al. (2016)

This repository implements the subword segmentation as described in Sennrich et al. (2016), but since version 0.2, there is one core difference related to end-of-word tokens.

In Sennrich et al. (2016), the end-of-word token </w> is initially represented as a separate token, which can be merged with other subwords over time:

u n d </w>
f u n d </w>

Since 0.2, end-of-word tokens are initially concatenated with the word-final character:

u n d</w>
f u n d</w>

The new representation ensures that when BPE codes are learned from the above examples and then applied to new text, it is clear that a subword unit und is unambiguously word-final, and un is unambiguously word-internal, preventing the production of up to two different subword units from each BPE merge operation.

apply_bpe.py is backward-compatible and continues to accept old-style BPE files. New-style BPE files are identified by having the following first line: #version: 0.2

ACKNOWLEDGMENTS

This project has received funding from Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland, and from the European Unionâs Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).

Top Related Projects

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot