Top Related Projects
- BERT: TensorFlow code and pre-trained models for BERT
- 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX
- fastText: Library for fast text representation and classification
- SentencePiece: Unsupervised text tokenizer for Neural Network-based text generation
- UniLM: Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
- spaCy: 💫 Industrial-strength Natural Language Processing (NLP) in Python
Quick Overview
LASER (Language-Agnostic SEntence Representations) is a library for multilingual sentence embeddings developed by Facebook Research. It allows for the encoding of sentences from 93 different languages into a single vector space, enabling cross-lingual transfer learning and similarity comparison across languages.
Pros
- Supports a wide range of languages (93), including low-resource languages
- Produces language-agnostic sentence embeddings, allowing for cross-lingual applications
- Achieves state-of-the-art performance on various multilingual tasks
- Relatively small model size compared to other multilingual models
Cons
- Requires significant computational resources for training and fine-tuning
- May not perform as well on domain-specific tasks without additional fine-tuning
- Limited documentation and examples for advanced use cases
- Dependency on specific versions of libraries, which may cause compatibility issues
Code Examples
- Encoding sentences:
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(
['Hello, world!', 'Bonjour le monde!'],
lang=['en', 'fr']
)
- Computing sentence similarity:
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
similarity = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity: {similarity}")
- Cross-lingual document retrieval:
documents = [
"This is an English document.",
"Ceci est un document français.",
"Dies ist ein deutsches Dokument."
]
query = "Find me documents about language"
doc_embeddings = laser.embed_sentences(documents, lang=['en', 'fr', 'de'])
query_embedding = laser.embed_sentences([query], lang=['en'])[0]
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
most_similar_index = np.argmax(similarities)
print(f"Most similar document: {documents[most_similar_index]}")
Getting Started
- Install LASER and its dependencies:
pip install laserembeddings
python -m laserembeddings download-models
- Use LASER in your Python code:
from laserembeddings import Laser
laser = Laser()
sentences = ["Hello, world!", "Bonjour le monde!", "Hola mundo!"]
languages = ['en', 'fr', 'es']
embeddings = laser.embed_sentences(sentences, lang=languages)
print(f"Shape of embeddings: {embeddings.shape}")
This will encode the given sentences in their respective languages and output the shape of the resulting embeddings array.
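The embeddings come back as a plain NumPy float array with one 1024-dimensional row per sentence (for the default laserembeddings model), so embeddings.shape above is (3, 1024), and the array can be saved and reloaded with standard NumPy calls, as in this small follow-up sketch:
import numpy as np
np.save("embeddings.npy", embeddings)  # persist the (3, 1024) array to disk
reloaded = np.load("embeddings.npy")
print(reloaded.shape)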
Competitor Comparisons
TensorFlow code and pre-trained models for BERT
Pros of BERT
- Pre-trained on a massive corpus of English text, allowing for excellent performance on various NLP tasks
- Supports fine-tuning for specific tasks, enabling adaptation to different domains
- Widely adopted and supported by the research community, with numerous extensions and improvements
Cons of BERT
- Primarily focused on English; the multilingual BERT variant covers about 100 languages, but its representations are not explicitly aligned across languages
- Computationally intensive, requiring significant resources for training and fine-tuning
- Limited to a fixed maximum input length (512 tokens), which makes long documents awkward to handle
Code Comparison
BERT example:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
LASER example:
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(['Hello world'], lang='en')
BERT focuses on contextual word embeddings for English, while LASER aims to provide language-agnostic sentence embeddings for multiple languages. BERT offers more flexibility for task-specific fine-tuning, whereas LASER provides a simpler API for generating multilingual sentence embeddings out-of-the-box.
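To illustrate the difference, the sketch below derives a sentence vector from BERT by mean-pooling its token embeddings; this pooling heuristic is our own addition rather than part of the BERT release, and unlike LASER embeddings the resulting vectors are not aligned across languages.
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer('Hello world', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
# Mean-pool the token embeddings (ignoring padding) to get one vector per sentence
mask = inputs['attention_mask'].unsqueeze(-1)
sentence_embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)  # torch.Size([1, 768])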
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Broader scope: Supports a wide range of NLP tasks and models
- Active community: Frequent updates and contributions
- Extensive documentation and examples
Cons of transformers
- Larger library size: May have a steeper learning curve
- Higher computational requirements for some models
Code comparison
LASER:
from laser_encoders import LaserEncoderPipeline
encoder = LaserEncoderPipeline(lang="eng_Latn")
embeddings = encoder.encode_sentences(['Hello, world!'])
transformers:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
Key differences
- LASER focuses on multilingual sentence embeddings
- transformers offers a broader range of pre-trained models and tasks
- LASER provides a simpler API for specific use cases
- transformers allows for more customization and fine-tuning
Use cases
- LASER: Cross-lingual document classification, multilingual semantic search (see the sketch after this list)
- transformers: Various NLP tasks like text classification, named entity recognition, question answering
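A minimal sketch of the first LASER use case, assuming the laserembeddings package from the earlier examples plus scikit-learn; the training texts and labels are made up. A classifier fitted on LASER embeddings of English documents can be applied directly to French documents because the embedding space is shared across languages:
from laserembeddings import Laser
from sklearn.linear_model import LogisticRegression
laser = Laser()
# Hypothetical English training documents and labels
train_texts = ["The central bank raised interest rates.", "The team won the championship."]
train_labels = ["economy", "sports"]
X_train = laser.embed_sentences(train_texts, lang='en')
clf = LogisticRegression().fit(X_train, train_labels)
# Apply the same classifier to French text without any French training data
X_test = laser.embed_sentences(["L'équipe a gagné le match."], lang='fr')
print(clf.predict(X_test))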
Community and support
- LASER: Maintained by Facebook Research
- transformers: Large community, frequent updates, extensive documentation
Library for fast text representation and classification.
Pros of fastText
- Simpler and more lightweight, focusing on efficient text classification and word representation
- Faster training and inference times, especially for large datasets
- Better performance on tasks involving rare words or out-of-vocabulary terms
Cons of fastText
- Limited to single language processing, unlike LASER's multilingual capabilities
- Less effective for complex sentence-level tasks or cross-lingual transfer learning
- Provides only simple sentence vectors (averages of word and subword vectors) rather than trained sentence representations
Code Comparison
LASER example:
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(
['Hello world', 'Bonjour le monde'],
lang=['en', 'fr']
)
fastText example:
import fasttext
# train.txt: one example per line, with labels prefixed by __label__
model = fasttext.train_supervised(input="train.txt")
result = model.predict("example sentence")
LASER focuses on multilingual sentence embeddings, while fastText is primarily used for text classification and word representations. LASER requires the input language to be specified, whereas each fastText model is trained for a single language. fastText's API is simpler, reflecting its more focused functionality compared to LASER's broader multilingual capabilities.
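As a rough illustration of that word-level focus, the sketch below trains unsupervised fastText vectors and queries a vector for a word that never appeared in training, which works because fastText composes vectors from subword n-grams (the corpus path and word are placeholders):
import fasttext
# Train skip-gram word vectors on a plain-text corpus (placeholder path)
model = fasttext.train_unsupervised("corpus.txt", model="skipgram")
# Subword n-grams allow a vector even for out-of-vocabulary words
vec = model.get_word_vector("unseenword")
print(vec.shape)  # (100,) with the default dimension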
Unsupervised text tokenizer for Neural Network-based text generation.
Pros of SentencePiece
- Lightweight and efficient tokenization algorithm
- Supports a wide range of languages without language-specific rules
- Easy integration with various deep learning frameworks
Cons of SentencePiece
- Limited to tokenization and doesn't provide multilingual sentence embeddings
- May require additional preprocessing for certain languages or tasks
Code Comparison
SentencePiece:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load('model.model')
encoded = sp.encode('Hello, world!', out_type=str)
LASER:
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(['Hello, world!'], lang='en')
Key Differences
LASER focuses on multilingual sentence embeddings, while SentencePiece specializes in subword tokenization. LASER provides a more complete solution for cross-lingual tasks such as multilingual similarity search and bitext mining. SentencePiece, on the other hand, offers a flexible tokenization approach that can be used as a preprocessing step for various NLP models.
Both projects have their strengths and can be complementary in multilingual NLP pipelines. LASER is better suited for tasks requiring semantic understanding across languages, while SentencePiece excels in efficient tokenization for a wide range of languages and models.
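For example, a SentencePiece model trained on raw text can serve as the tokenization step in front of a multilingual encoder; a minimal sketch, with the corpus path and vocabulary size as placeholders:
import sentencepiece as spm
# Train a BPE subword model on a raw text corpus (placeholder path and settings)
spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="bpe", vocab_size=8000, model_type="bpe")
# Load the trained model and split a sentence into subword pieces
sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(sp.encode("Hello, world!", out_type=str))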
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Pros of UniLM
- Supports a wider range of NLP tasks, including text generation, summarization, and question answering
- Utilizes a unified pre-training approach for both natural language understanding and generation
- Offers better performance on various downstream tasks due to its versatile architecture
Cons of UniLM
- May require more computational resources for training and fine-tuning
- Has a steeper learning curve due to its more complex architecture
- Less focused on multilingual capabilities compared to LASER
Code Comparison
LASER example:
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(["Hello world"], lang='en')
UniLM example:
# Note: there are no UniLM-specific classes in the transformers library; UniLM is
# released via the microsoft/unilm repository. A hedged sketch using the generic
# Auto classes (exact loading depends on the checkpoint):
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("microsoft/unilm-base-cased")
model = AutoModel.from_pretrained("microsoft/unilm-base-cased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
Summary
While LASER focuses primarily on multilingual sentence embeddings, UniLM offers a more versatile approach to various NLP tasks. UniLM's unified pre-training method allows it to handle both understanding and generation tasks effectively. However, this versatility comes at the cost of increased complexity and potentially higher computational requirements. LASER remains a strong choice for multilingual applications, while UniLM excels in scenarios requiring a broader range of NLP capabilities.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Pros of spaCy
- Comprehensive NLP library with a wide range of features (tokenization, POS tagging, named entity recognition, etc.)
- Optimized for production use with fast performance
- Extensive documentation and community support
Cons of spaCy
- Primarily focused on English and a limited number of other languages
- Requires more setup and configuration for specific tasks
- Larger memory footprint due to its comprehensive nature
Code Comparison
spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
for token in doc:
    print(token.text, token.pos_)
LASER:
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(
["This is a sentence."],
lang="en"
)
print(embeddings.shape)
Key Differences
- spaCy is a full-featured NLP library, while LASER focuses on multilingual sentence embeddings
- LASER supports 93+ languages, whereas spaCy has more limited language support
- spaCy provides detailed linguistic analysis, while LASER generates vector representations for cross-lingual tasks
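Because the two serve different purposes, they can also be combined; a minimal sketch, assuming the laserembeddings package from the examples above, that uses spaCy to split a text into sentences and LASER to embed each one (the text is made up):
import spacy
from laserembeddings import Laser
nlp = spacy.load("en_core_web_sm")
laser = Laser()
doc = nlp("LASER embeds whole sentences. spaCy segments and analyses them.")
sentences = [sent.text for sent in doc.sents]
# One 1024-dimensional LASER vector per spaCy-detected sentence
embeddings = laser.embed_sentences(sentences, lang="en")
print(len(sentences), embeddings.shape)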
README
LASER Language-Agnostic SEntence Representations
LASER is a library to calculate and use multilingual sentence embeddings.
NEWS
- 2023/11/30 Released P-xSIM, a dual approach extension to multilingual similarity search (xSIM)
- 2023/11/16 Released laser_encoders, a pip-installable package supporting LASER-2 and LASER-3 models
- 2023/06/26 xSIM++ evaluation pipeline and data released
- 2022/07/06 Updated LASER models with support for over 200 languages are now available
- 2022/07/06 Multilingual similarity search (xSIM) evaluation pipeline released
- 2022/05/03 Librivox S2S is available: Speech-to-Speech translations automatically mined in Librivox [9]
- 2019/11/08 CCMatrix is available: Mining billions of high-quality parallel sentences on the WEB [8]
- 2019/07/31 Gilles Bodard and Jérémy Rapin provided a Docker environment to use LASER
- 2019/07/11 WikiMatrix is available: bitext extraction for 1620 language pairs in Wikipedia [7]
- 2019/03/18 switch to BSD license
- 2019/02/13 The code to perform bitext mining is now available
CURRENT VERSION:
- We now provide updated LASER models which support over 200 languages. Please see here for more details including how to download the models and perform inference.
In our experience, the sentence encoder also supports code-switching, i.e. the same sentence can contain words in several different languages.
We also have some evidence that the encoder can generalize to languages that were not seen during training but belong to a language family that is covered by other training languages.
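As an illustration of the code-switching behaviour, here is a small sketch using the laser_encoders package described below; the mixed-language sentence is made up and the exact similarity value will vary:
import numpy as np
from laser_encoders import LaserEncoderPipeline
encoder = LaserEncoderPipeline(lang="eng_Latn")
# A code-switched sentence mixing English and French, and an all-English paraphrase
mixed = encoder.encode_sentences(["I really enjoyed le petit déjeuner this morning."])[0]
english = encoder.encode_sentences(["I really enjoyed the breakfast this morning."])[0]
# Cosine similarity between the two variants; they should end up close in the embedding space
print(np.dot(mixed, english) / (np.linalg.norm(mixed) * np.linalg.norm(english)))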
A detailed description of how the multilingual sentence embeddings are trained can be found here, together with an experimental evaluation.
The core sentence embedding package: laser_encoders
We provide a package laser_encoders with minimal dependencies.
It supports LASER-2 (a single encoder for the languages listed below) and LASER-3 (147 language-specific encoders described here).
The package can be installed simply with pip install laser_encoders and used as below:
from laser_encoders import LaserEncoderPipeline
encoder = LaserEncoderPipeline(lang="eng_Latn")
embeddings = encoder.encode_sentences(["Hi!", "This is a sentence encoder."])
print(embeddings.shape) # (2, 1024)
The laser_encoders readme file provides more examples of its installation and usage.
The full LASER kit
Apart from the laser_encoders package, we provide support for LASER-1 (the original multilingual encoder) and for various LASER applications listed below.
Dependencies
- Python >= 3.7
- PyTorch 1.0
- NumPy, tested with 1.15.4
- Cython, needed by the Python wrapper of FastBPE, tested with 0.29.6
- Faiss, for fast similarity search and bitext mining
- transliterate 1.10.2 (pip install transliterate)
- jieba 0.39, Chinese segmenter (pip install jieba)
- mecab 0.996, Japanese segmenter
- tokenization from the Moses encoder (installed automatically)
- FastBPE, fast C++ implementation of byte-pair encoding (installed automatically)
- Fairseq, sequence modeling toolkit (pip install fairseq==0.12.1)
- tabulate, pretty-print tabular data (pip install tabulate)
- pandas, data analysis toolkit (pip install pandas)
- Sentencepiece, subword tokenization (installed automatically)
Installation
- install the laser_encoders package, e.g. pip install -e . for installing it in editable mode
- set the environment variable 'LASER' to the root of the installation, e.g. export LASER="${HOME}/projects/laser"
- download encoders from Amazon S3, e.g. bash ./nllb/download_models.sh
- download third party software by bash ./install_external_tools.sh
- download the data used in the example tasks (see description for each task)
Applications
We showcase several applications of multilingual sentence embeddings with code to reproduce our results (in the directory "tasks").
- Cross-lingual document classification using the MLDoc corpus [2,6]
- WikiMatrix Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia [7]
- Bitext mining using the BUCC corpus [3,5]
- Cross-lingual NLI using the XNLI corpus [4,5,6]
- Multilingual similarity search [1,6]
- Sentence embedding of text files: an example of how to calculate sentence embeddings for arbitrary text files in any of the supported languages
For all tasks, we use exactly the same multilingual encoder, without any task specific optimization or fine-tuning.
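As a concrete, simplified version of the multilingual similarity search setup (not the tasks/ pipeline itself, which additionally applies margin-based scoring), the sketch below indexes French sentences with Faiss and retrieves the nearest neighbour of an English query in the shared LASER space; the sentences are made up:
import faiss
import numpy as np
from laser_encoders import LaserEncoderPipeline
encoder_en = LaserEncoderPipeline(lang="eng_Latn")
encoder_fr = LaserEncoderPipeline(lang="fra_Latn")
french_sentences = ["Le chat dort sur le canapé.", "La croissance économique a ralenti l'an dernier."]
english_query = ["Economic growth slowed down last year."]
# Encode both sides and L2-normalize so that inner product equals cosine similarity
tgt = encoder_fr.encode_sentences(french_sentences).astype(np.float32)
qry = encoder_en.encode_sentences(english_query).astype(np.float32)
faiss.normalize_L2(tgt)
faiss.normalize_L2(qry)
# Exact inner-product index over the 1024-dimensional LASER vectors
index = faiss.IndexFlatIP(tgt.shape[1])
index.add(tgt)
scores, ids = index.search(qry, 1)
print(french_sentences[ids[0][0]], scores[0][0])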
License
LASER is BSD-licensed, as found in the LICENSE file in the root directory of this source tree.
Supported languages
The original LASER model was trained on the following languages:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.
We have also observed that the model seems to generalize well to other (minority) languages or dialects, e.g.
Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.
LASER3
Updated LASER models referred to as LASER3 supplement the above list with support for 147 languages. The full list of supported languages can be seen here.
References
[1] Holger Schwenk and Matthijs Douze, Learning Joint Multilingual Sentence Representations with Neural Machine Translation, ACL workshop on Representation Learning for NLP, 2017
[2] Holger Schwenk and Xian Li, A Corpus for Multilingual Document Classification in Eight Languages, LREC, pages 3548-3551, 2018.
[3] Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space, ACL, July 2018.
[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, XNLI: Cross-lingual Sentence Understanding through Inference, EMNLP, 2018.
[5] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, arXiv, Nov 3 2018.
[6] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, arXiv, Dec 26 2018.
[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia, arXiv, July 11 2019.
[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin, CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB.
[9] Paul-Ambroise Duquenne, Hongyu Gong, Holger Schwenk, Multimodal and Multilingual Embeddings for Large-Scale Speech Mining, NeurIPS 2021, pages 15748-15761.
[10] Kevin Heffernan, Onur Celebi, and Holger Schwenk, Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages.