Top Related Projects
- BERT: TensorFlow code and pre-trained models for BERT
- 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX
- fastText: Library for fast text representation and classification
- SentencePiece: Unsupervised text tokenizer for Neural Network-based text generation
- UniLM: Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
- spaCy: 💫 Industrial-strength Natural Language Processing (NLP) in Python
Quick Overview
LASER (Language-Agnostic SEntence Representations) is a library for multilingual sentence embeddings developed by Facebook Research. It allows for the encoding of sentences from 93 different languages into a single vector space, enabling cross-lingual transfer learning and similarity comparison across languages.
Pros
- Supports a wide range of languages (93), including low-resource languages
- Produces language-agnostic sentence embeddings, allowing for cross-lingual applications
- Achieves state-of-the-art performance on various multilingual tasks
- Relatively small model size compared to other multilingual models
Cons
- Requires significant computational resources for training and fine-tuning
- May not perform as well on domain-specific tasks without additional fine-tuning
- Limited documentation and examples for advanced use cases
- Dependency on specific versions of libraries, which may cause compatibility issues
Code Examples
- Encoding sentences:
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(
['Hello, world!', 'Bonjour le monde!'],
lang=['en', 'fr']
)
- Computing sentence similarity:
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
similarity = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity: {similarity}")
- Cross-lingual document retrieval:
documents = [
"This is an English document.",
"Ceci est un document français.",
"Dies ist ein deutsches Dokument."
]
query = "Find me documents about language"
doc_embeddings = laser.embed_sentences(documents, lang=['en', 'fr', 'de'])
query_embedding = laser.embed_sentences([query], lang=['en'])[0]
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
most_similar_index = np.argmax(similarities)
print(f"Most similar document: {documents[most_similar_index]}")
Getting Started
- Install LASER and its dependencies:
pip install laserembeddings
python -m laserembeddings download-models
- Use LASER in your Python code:
from laserembeddings import Laser
laser = Laser()
sentences = ["Hello, world!", "Bonjour le monde!", "Hola mundo!"]
languages = ['en', 'fr', 'es']
embeddings = laser.embed_sentences(sentences, lang=languages)
print(f"Shape of embeddings: {embeddings.shape}")
This will encode the given sentences in their respective languages and output the shape of the resulting embeddings array.
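The embeddings come back as a plain NumPy float array with one 1024-dimensional row per sentence (for the default laserembeddings model), so embeddings.shape above is (3, 1024), and the array can be saved and reloaded with standard NumPy calls, as in this small follow-up sketch:
import numpy as np
np.save("embeddings.npy", embeddings)  # persist the (3, 1024) array to disk
reloaded = np.load("embeddings.npy")
print(reloaded.shape)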
Competitor Comparisons
TensorFlow code and pre-trained models for BERT
Pros of BERT
- Pre-trained on a massive corpus of English text, allowing for excellent performance on various NLP tasks
- Supports fine-tuning for specific tasks, enabling adaptation to different domains
- Widely adopted and supported by the research community, with numerous extensions and improvements
Cons of BERT
- Primarily focused on English; the multilingual BERT variant covers about 100 languages, but its representations are not explicitly aligned across languages
- Computationally intensive, requiring significant resources for training and fine-tuning
- Limited to a fixed maximum input length (512 tokens), which makes long documents awkward to handle
Code Comparison
BERT example:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
LASER example:
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(['Hello world'], lang='en')
BERT focuses on contextual word embeddings for English, while LASER aims to provide language-agnostic sentence embeddings for multiple languages. BERT offers more flexibility for task-specific fine-tuning, whereas LASER provides a simpler API for generating multilingual sentence embeddings out-of-the-box.
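To illustrate the difference, the sketch below derives a sentence vector from BERT by mean-pooling its token embeddings; this pooling heuristic is our own addition rather than part of the BERT release, and unlike LASER embeddings the resulting vectors are not aligned across languages.
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer('Hello world', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
# Mean-pool the token embeddings (ignoring padding) to get one vector per sentence
mask = inputs['attention_mask'].unsqueeze(-1)
sentence_embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)  # torch.Size([1, 768])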
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Broader scope: Supports a wide range of NLP tasks and models
- Active community: Frequent updates and contributions
- Extensive documentation and examples
Cons of transformers
- Larger library size: May have a steeper learning curve
- Higher computational requirements for some models
Code comparison
LASER:
from laser_encoders import LaserEncoderPipeline
encoder = LaserEncoderPipeline(lang="eng_Latn")
embeddings = encoder.encode_sentences(['Hello, world!'])
transformers:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
Key differences
- LASER focuses on multilingual sentence embeddings
- transformers offers a broader range of pre-trained models and tasks
- LASER provides a simpler API for specific use cases
- transformers allows for more customization and fine-tuning
Use cases
- LASER: Cross-lingual document classification, multilingual semantic search (see the sketch after this list)
- transformers: Various NLP tasks like text classification, named entity recognition, question answering
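A minimal sketch of the first LASER use case, assuming the laserembeddings package from the earlier examples plus scikit-learn; the training texts and labels are made up. A classifier fitted on LASER embeddings of English documents can be applied directly to French documents because the embedding space is shared across languages:
from laserembeddings import Laser
from sklearn.linear_model import LogisticRegression
laser = Laser()
# Hypothetical English training documents and labels
train_texts = ["The central bank raised interest rates.", "The team won the championship."]
train_labels = ["economy", "sports"]
X_train = laser.embed_sentences(train_texts, lang='en')
clf = LogisticRegression().fit(X_train, train_labels)
# Apply the same classifier to French text without any French training data
X_test = laser.embed_sentences(["L'équipe a gagné le match."], lang='fr')
print(clf.predict(X_test))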
Community and support
- LASER: Maintained by Facebook Research
- transformers: Large community, frequent updates, extensive documentation
Library for fast text representation and classification.
Pros of fastText
- Simpler and more lightweight, focusing on efficient text classification and word representation
- Faster training and inference times, especially for large datasets
- Better performance on tasks involving rare words or out-of-vocabulary terms
Cons of fastText
- Limited to single language processing, unlike LASER's multilingual capabilities
- Less effective for complex sentence-level tasks or cross-lingual transfer learning
- Provides only simple sentence vectors (averages of word and subword vectors) rather than trained sentence representations
Code Comparison
LASER example:
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(
['Hello world', 'Bonjour le monde'],
lang=['en', 'fr']
)
fastText example:
import fasttext
# train.txt: one example per line, with labels prefixed by __label__
model = fasttext.train_supervised(input="train.txt")
result = model.predict("example sentence")
LASER focuses on multilingual sentence embeddings, while fastText is primarily used for text classification and word representations. LASER requires the input language to be specified, whereas each fastText model is trained for a single language. fastText's API is simpler, reflecting its more focused functionality compared to LASER's broader multilingual capabilities.
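As a rough illustration of that word-level focus, the sketch below trains unsupervised fastText vectors and queries a vector for a word that never appeared in training, which works because fastText composes vectors from subword n-grams (the corpus path and word are placeholders):
import fasttext
# Train skip-gram word vectors on a plain-text corpus (placeholder path)
model = fasttext.train_unsupervised("corpus.txt", model="skipgram")
# Subword n-grams allow a vector even for out-of-vocabulary words
vec = model.get_word_vector("unseenword")
print(vec.shape)  # (100,) with the default dimension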
Unsupervised text tokenizer for Neural Network-based text generation.
Pros of SentencePiece
- Lightweight and efficient tokenization algorithm
- Supports a wide range of languages without language-specific rules
- Easy integration with various deep learning frameworks
Cons of SentencePiece
- Limited to tokenization and doesn't provide multilingual sentence embeddings
- May require additional preprocessing for certain languages or tasks
Code Comparison
SentencePiece:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load('model.model')
encoded = sp.encode('Hello, world!', out_type=str)
LASER:
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(['Hello, world!'], lang='en')
Key Differences
LASER focuses on multilingual sentence embeddings, while SentencePiece specializes in subword tokenization. LASER provides a more complete solution for cross-lingual tasks such as multilingual similarity search and bitext mining. SentencePiece, on the other hand, offers a flexible tokenization approach that can be used as a preprocessing step for various NLP models.
Both projects have their strengths and can be complementary in multilingual NLP pipelines. LASER is better suited for tasks requiring semantic understanding across languages, while SentencePiece excels in efficient tokenization for a wide range of languages and models.
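For example, a SentencePiece model trained on raw text can serve as the tokenization step in front of a multilingual encoder; a minimal sketch, with the corpus path and vocabulary size as placeholders:
import sentencepiece as spm
# Train a BPE subword model on a raw text corpus (placeholder path and settings)
spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="bpe", vocab_size=8000, model_type="bpe")
# Load the trained model and split a sentence into subword pieces
sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(sp.encode("Hello, world!", out_type=str))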
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Pros of UniLM
- Supports a wider range of NLP tasks, including text generation, summarization, and question answering
- Utilizes a unified pre-training approach for both natural language understanding and generation
- Offers better performance on various downstream tasks due to its versatile architecture
Cons of UniLM
- May require more computational resources for training and fine-tuning
- Has a steeper learning curve due to its more complex architecture
- Less focused on multilingual capabilities compared to LASER
Code Comparison
LASER example:
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(["Hello world"], lang='en')
UniLM example:
# Note: there are no UniLM-specific classes in the transformers library; UniLM is
# released via the microsoft/unilm repository. A hedged sketch using the generic
# Auto classes (exact loading depends on the checkpoint):
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("microsoft/unilm-base-cased")
model = AutoModel.from_pretrained("microsoft/unilm-base-cased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
Summary
While LASER focuses primarily on multilingual sentence embeddings, UniLM offers a more versatile approach to various NLP tasks. UniLM's unified pre-training method allows it to handle both understanding and generation tasks effectively. However, this versatility comes at the cost of increased complexity and potentially higher computational requirements. LASER remains a strong choice for multilingual applications, while UniLM excels in scenarios requiring a broader range of NLP capabilities.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Pros of spaCy
- Comprehensive NLP library with a wide range of features (tokenization, POS tagging, named entity recognition, etc.)
- Optimized for production use with fast performance
- Extensive documentation and community support
Cons of spaCy
- Primarily focused on English and a limited number of other languages
- Requires more setup and configuration for specific tasks
- Larger memory footprint due to its comprehensive nature
Code Comparison
spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
for token in doc:
    print(token.text, token.pos_)
LASER:
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(
["This is a sentence."],
lang="en"
)
print(embeddings.shape)
Key Differences
- spaCy is a full-featured NLP library, while LASER focuses on multilingual sentence embeddings
- LASER supports 93+ languages, whereas spaCy has more limited language support
- spaCy provides detailed linguistic analysis, while LASER generates vector representations for cross-lingual tasks
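Because the two serve different purposes, they can also be combined; a minimal sketch, assuming the laserembeddings package from the examples above, that uses spaCy to split a text into sentences and LASER to embed each one (the text is made up):
import spacy
from laserembeddings import Laser
nlp = spacy.load("en_core_web_sm")
laser = Laser()
doc = nlp("LASER embeds whole sentences. spaCy segments and analyses them.")
sentences = [sent.text for sent in doc.sents]
# One 1024-dimensional LASER vector per spaCy-detected sentence
embeddings = laser.embed_sentences(sentences, lang="en")
print(len(sentences), embeddings.shape)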
README
LASER Language-Agnostic SEntence Representations
LASER is a library to calculate and use multilingual sentence embeddings.
NEWS
- 2023/11/30 Released P-xSIM, a dual approach extension to multilingual similarity search (xSIM)
- 2023/11/16 Released laser_encoders, a pip-installable package supporting LASER-2 and LASER-3 models
- 2023/06/26 xSIM++ evaluation pipeline and data released
- 2022/07/06 Updated LASER models with support for over 200 languages are now available
- 2022/07/06 Multilingual similarity search (xSIM) evaluation pipeline released
- 2022/05/03 Librivox S2S is available: Speech-to-Speech translations automatically mined in Librivox [9]
- 2019/11/08 CCMatrix is available: Mining billions of high-quality parallel sentences on the WEB [8]
- 2019/07/31 Gilles Bodard and Jérémy Rapin provided a Docker environment to use LASER
- 2019/07/11 WikiMatrix is available: bitext extraction for 1620 language pairs in Wikipedia [7]
- 2019/03/18 switch to BSD license
- 2019/02/13 The code to perform bitext mining is now available
CURRENT VERSION:
- We now provide updated LASER models which support over 200 languages. Please see here for more details including how to download the models and perform inference.
In our experience, the sentence encoder also supports code-switching, i.e. the same sentence can contain words in several different languages.
We also have some evidence that the encoder can generalize to languages that were not seen during training but belong to a language family that is covered by other training languages.
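As an illustration of the code-switching behaviour, here is a small sketch using the laser_encoders package described below; the mixed-language sentence is made up and the exact similarity value will vary:
import numpy as np
from laser_encoders import LaserEncoderPipeline
encoder = LaserEncoderPipeline(lang="eng_Latn")
# A code-switched sentence mixing English and French, and an all-English paraphrase
mixed = encoder.encode_sentences(["I really enjoyed le petit déjeuner this morning."])[0]
english = encoder.encode_sentences(["I really enjoyed the breakfast this morning."])[0]
# Cosine similarity between the two variants; they should end up close in the embedding space
print(np.dot(mixed, english) / (np.linalg.norm(mixed) * np.linalg.norm(english)))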
A detailed description of how the multilingual sentence embeddings are trained can be found here, together with an experimental evaluation.
The core sentence embedding package: laser_encoders
We provide a package laser_encoders with minimal dependencies.
It supports LASER-2 (a single encoder for the languages listed below) and LASER-3 (147 language-specific encoders described here).
The package can be installed simply with pip install laser_encoders and used as below:
from laser_encoders import LaserEncoderPipeline
encoder = LaserEncoderPipeline(lang="eng_Latn")
embeddings = encoder.encode_sentences(["Hi!", "This is a sentence encoder."])
print(embeddings.shape) # (2, 1024)
The laser_encoders readme file provides more examples of its installation and usage.
The full LASER kit
Apart from the laser_encoders package, we provide support for LASER-1 (the original multilingual encoder) and for various LASER applications listed below.
Dependencies
- Python >= 3.7
- PyTorch 1.0
- NumPy, tested with 1.15.4
- Cython, needed by the Python wrapper of FastBPE, tested with 0.29.6
- Faiss, for fast similarity search and bitext mining
- transliterate 1.10.2 (pip install transliterate)
- jieba 0.39, Chinese segmenter (pip install jieba)
- mecab 0.996, Japanese segmenter
- tokenization from the Moses encoder (installed automatically)
- FastBPE, fast C++ implementation of byte-pair encoding (installed automatically)
- Fairseq, sequence modeling toolkit (pip install fairseq==0.12.1)
- tabulate, pretty-print tabular data (pip install tabulate)
- pandas, data analysis toolkit (pip install pandas)
- Sentencepiece, subword tokenization (installed automatically)
Installation
- install the laser_encoders package, e.g. pip install -e . for installing it in editable mode
- set the environment variable 'LASER' to the root of the installation, e.g. export LASER="${HOME}/projects/laser"
- download encoders from Amazon S3, e.g. bash ./nllb/download_models.sh
- download third party software by bash ./install_external_tools.sh
- download the data used in the example tasks (see description for each task)
Applications
We showcase several applications of multilingual sentence embeddings with code to reproduce our results (in the directory "tasks").
- Cross-lingual document classification using the MLDoc corpus [2,6]
- WikiMatrix Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia [7]
- Bitext mining using the BUCC corpus [3,5]
- Cross-lingual NLI using the XNLI corpus [4,5,6]
- Multilingual similarity search [1,6]
- Sentence embedding of text files: an example of how to calculate sentence embeddings for arbitrary text files in any of the supported languages
For all tasks, we use exactly the same multilingual encoder, without any task specific optimization or fine-tuning.
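As a concrete, simplified version of the multilingual similarity search setup (not the tasks/ pipeline itself, which additionally applies margin-based scoring), the sketch below indexes French sentences with Faiss and retrieves the nearest neighbour of an English query in the shared LASER space; the sentences are made up:
import faiss
import numpy as np
from laser_encoders import LaserEncoderPipeline
encoder_en = LaserEncoderPipeline(lang="eng_Latn")
encoder_fr = LaserEncoderPipeline(lang="fra_Latn")
french_sentences = ["Le chat dort sur le canapé.", "La croissance économique a ralenti l'an dernier."]
english_query = ["Economic growth slowed down last year."]
# Encode both sides and L2-normalize so that inner product equals cosine similarity
tgt = encoder_fr.encode_sentences(french_sentences).astype(np.float32)
qry = encoder_en.encode_sentences(english_query).astype(np.float32)
faiss.normalize_L2(tgt)
faiss.normalize_L2(qry)
# Exact inner-product index over the 1024-dimensional LASER vectors
index = faiss.IndexFlatIP(tgt.shape[1])
index.add(tgt)
scores, ids = index.search(qry, 1)
print(french_sentences[ids[0][0]], scores[0][0])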
License
LASER is BSD-licensed, as found in the LICENSE file in the root directory of this source tree.
Supported languages
The original LASER model was trained on the following languages:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.
We have also observed that the model seems to generalize well to other (minority) languages or dialects, e.g.
Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.
LASER3
Updated LASER models referred to as LASER3 supplement the above list with support for 147 languages. The full list of supported languages can be seen here.
References
[1] Holger Schwenk and Matthijs Douze, Learning Joint Multilingual Sentence Representations with Neural Machine Translation, ACL workshop on Representation Learning for NLP, 2017
[2] Holger Schwenk and Xian Li, A Corpus for Multilingual Document Classification in Eight Languages, LREC, pages 3548-3551, 2018.
[3] Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space, ACL, July 2018.
[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, XNLI: Cross-lingual Sentence Understanding through Inference, EMNLP, 2018.
[5] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, arXiv, Nov 3 2018.
[6] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, arXiv, Dec 26 2018.
[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia, arXiv, July 11 2019.
[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin, CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB.
[9] Paul-Ambroise Duquenne, Hongyu Gong, Holger Schwenk, Multimodal and Multilingual Embeddings for Large-Scale Speech Mining, NeurIPS 2021, pages 15748-15761.
[10] Kevin Heffernan, Onur Celebi, and Holger Schwenk, Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages.