facebookresearch / LASER

Language-Agnostic SEntence Representations


Top Related Projects

  • google-research/bert: TensorFlow code and pre-trained models for BERT
  • huggingface/transformers: 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
  • facebookresearch/fastText: Library for fast text representation and classification.
  • google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.
  • microsoft/unilm: Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
  • explosion/spaCy: 💫 Industrial-strength Natural Language Processing (NLP) in Python

Quick Overview

LASER (Language-Agnostic SEntence Representations) is a library for multilingual sentence embeddings developed by Facebook Research. It allows for the encoding of sentences from 93 different languages into a single vector space, enabling cross-lingual transfer learning and similarity comparison across languages.

Pros

  • Supports a wide range of languages (93), including low-resource languages
  • Produces language-agnostic sentence embeddings, allowing for cross-lingual applications
  • Achieves state-of-the-art performance on various multilingual tasks
  • Relatively small model size compared to other multilingual models

Cons

  • Requires significant computational resources for training and fine-tuning
  • May not perform as well on domain-specific tasks without additional fine-tuning
  • Limited documentation and examples for advanced use cases
  • Dependency on specific versions of libraries, which may cause compatibility issues

Code Examples

  1. Encoding sentences:
from laserembeddings import Laser

laser = Laser()
embeddings = laser.embed_sentences(
    ['Hello, world!', 'Bonjour le monde!'],
    lang=['en', 'fr']
)
  2. Computing sentence similarity:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarity = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity: {similarity}")
  3. Cross-lingual document retrieval:
documents = [
    "This is an English document.",
    "Ceci est un document français.",
    "Dies ist ein deutsches Dokument."
]
query = "Find me documents about language"

doc_embeddings = laser.embed_sentences(documents, lang=['en', 'fr', 'de'])
query_embedding = laser.embed_sentences([query], lang=['en'])[0]

similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
most_similar_index = np.argmax(similarities)
print(f"Most similar document: {documents[most_similar_index]}")

Getting Started

  1. Install LASER and its dependencies:
pip install laserembeddings
python -m laserembeddings download-models
  2. Use LASER in your Python code:
from laserembeddings import Laser

laser = Laser()
sentences = ["Hello, world!", "Bonjour le monde!", "Hola mundo!"]
languages = ['en', 'fr', 'es']

embeddings = laser.embed_sentences(sentences, lang=languages)
print(f"Shape of embeddings: {embeddings.shape}")

This will encode the given sentences in their respective languages and output the shape of the resulting embeddings array.

Competitor Comparisons

google-research/bert: TensorFlow code and pre-trained models for BERT

Pros of BERT

  • Pre-trained on a massive corpus of English text, allowing for excellent performance on various NLP tasks
  • Supports fine-tuning for specific tasks, enabling adaptation to different domains
  • Widely adopted and supported by the research community, with numerous extensions and improvements

Cons of BERT

  • Primarily focused on English language processing, limiting its multilingual capabilities
  • Computationally intensive, requiring significant resources for training and fine-tuning
  • May struggle with long-range dependencies due to fixed-length input sequences

Code Comparison

BERT example:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

LASER example:

from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(['Hello world'], lang='en')

BERT focuses on contextual word embeddings for English, while LASER aims to provide language-agnostic sentence embeddings for multiple languages. BERT offers more flexibility for task-specific fine-tuning, whereas LASER provides a simpler API for generating multilingual sentence embeddings out-of-the-box.
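
To make the contrast concrete, here is a minimal sketch (an illustrative choice of pooling, not something the BERT repository prescribes) of deriving a single sentence vector from BERT by mean-pooling its token embeddings:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("Hello, world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# BERT returns one contextual vector per token; averaging them (masking padding)
# yields a single 768-dimensional sentence vector.
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_vector = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(sentence_vector.shape)  # torch.Size([1, 768])

LASER, by contrast, returns a 1024-dimensional sentence vector directly from embed_sentences.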

huggingface/transformers: 🤗 State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Broader scope: Supports a wide range of NLP tasks and models
  • Active community: Frequent updates and contributions
  • Extensive documentation and examples

Cons of transformers

  • Larger library size: May have a steeper learning curve
  • Higher computational requirements for some models

Code comparison

LASER:

from laser_encoders import LaserEncoderPipeline
encoder = LaserEncoderPipeline(lang="eng_Latn")
embeddings = encoder.encode_sentences(['Hello, world!'])

transformers:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)

Key differences

  • LASER focuses on multilingual sentence embeddings
  • transformers offers a broader range of pre-trained models and tasks
  • LASER provides a simpler API for specific use cases
  • transformers allows for more customization and fine-tuning

Use cases

  • LASER: Cross-lingual document classification, multilingual semantic search
  • transformers: Various NLP tasks like text classification, named entity recognition, question answering

Community and support

  • LASER: Maintained by Facebook Research
  • transformers: Large community, frequent updates, extensive documentation

facebookresearch/fastText: Library for fast text representation and classification.

Pros of fastText

  • Simpler and more lightweight, focusing on efficient text classification and word representation
  • Faster training and inference times, especially for large datasets
  • Better performance on tasks involving rare words or out-of-vocabulary terms

Cons of fastText

  • Limited to single language processing, unlike LASER's multilingual capabilities
  • Less effective for complex sentence-level tasks or cross-lingual transfer learning
  • Doesn't provide sentence embeddings out-of-the-box, focusing more on word-level representations

Code Comparison

LASER example:

from laserembeddings import Laser

laser = Laser()
embeddings = laser.embed_sentences(
    ['Hello world', 'Bonjour le monde'],
    lang=['en', 'fr']
)

fastText example:

import fasttext

model = fasttext.train_supervised("train.txt")
result = model.predict("example sentence")

LASER focuses on multilingual sentence embeddings, while fastText is primarily used for text classification and word representations. LASER requires language specification, whereas fastText operates on a single language model. fastText's API is simpler, reflecting its more focused functionality compared to LASER's broader multilingual capabilities.
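
As an illustration of that word-level focus, here is a minimal sketch using pretrained fastText vectors (it assumes the cc.en.300.bin file from the fastText website has already been downloaded):

import fasttext

# Pretrained English vectors, downloaded beforehand from fasttext.cc.
model = fasttext.load_model("cc.en.300.bin")

# fastText's primary output is a 300-dimensional vector per word;
# get_sentence_vector simply averages normalized word vectors rather than
# learning a dedicated sentence representation.
word_vector = model.get_word_vector("language")
sentence_vector = model.get_sentence_vector("Hello world")
print(word_vector.shape, sentence_vector.shape)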

google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.

Pros of SentencePiece

  • Lightweight and efficient tokenization algorithm
  • Supports a wide range of languages without language-specific rules
  • Easy integration with various deep learning frameworks

Cons of SentencePiece

  • Limited to tokenization and doesn't provide multilingual sentence embeddings
  • May require additional preprocessing for certain languages or tasks

Code Comparison

SentencePiece:

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load('model.model')
encoded = sp.encode('Hello, world!', out_type=str)

LASER:

from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(['Hello, world!'], lang='en')

Key Differences

LASER focuses on multilingual sentence embeddings, while SentencePiece specializes in subword tokenization. LASER provides a more comprehensive solution for cross-lingual NLP tasks, including language identification and similarity scoring. SentencePiece, on the other hand, offers a flexible tokenization approach that can be used as a preprocessing step for various NLP models.

Both projects have their strengths and can be complementary in multilingual NLP pipelines. LASER is better suited for tasks requiring semantic understanding across languages, while SentencePiece excels in efficient tokenization for a wide range of languages and models.
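
As a sketch of that preprocessing role, the following trains a small SentencePiece model on a plain-text corpus and tokenizes a sentence with it (corpus.txt and the vocabulary size are illustrative placeholders):

import sentencepiece as spm

# Train a small subword model on a raw corpus with one sentence per line.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="spm_demo", vocab_size=8000
)

sp = spm.SentencePieceProcessor()
sp.load("spm_demo.model")

# The resulting subword pieces can feed any downstream model,
# including multilingual encoders.
pieces = sp.encode("Hello, world!", out_type=str)
print(pieces)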

microsoft/unilm: Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Pros of UniLM

  • Supports a wider range of NLP tasks, including text generation, summarization, and question answering
  • Utilizes a unified pre-training approach for both natural language understanding and generation
  • Offers better performance on various downstream tasks due to its versatile architecture

Cons of UniLM

  • May require more computational resources for training and fine-tuning
  • Has a steeper learning curve due to its more complex architecture
  • Less focused on multilingual capabilities compared to LASER

Code Comparison

LASER example:

from laserembeddings import Laser

laser = Laser()
embeddings = laser.embed_sentences(["Hello world"], lang='en')

UniLM example:

# Note: Hugging Face transformers does not ship dedicated UniLM classes; the
# checkpoints are distributed via the microsoft/unilm repository, and the exact
# loading code depends on the checkpoint. The snippet below is illustrative.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/unilm-base-cased")
model = AutoModel.from_pretrained("microsoft/unilm-base-cased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)

Summary

While LASER focuses primarily on multilingual sentence embeddings, UniLM offers a more versatile approach to various NLP tasks. UniLM's unified pre-training method allows it to handle both understanding and generation tasks effectively. However, this versatility comes at the cost of increased complexity and potentially higher computational requirements. LASER remains a strong choice for multilingual applications, while UniLM excels in scenarios requiring a broader range of NLP capabilities.

explosion/spaCy: 💫 Industrial-strength Natural Language Processing (NLP) in Python

Pros of spaCy

  • Comprehensive NLP library with a wide range of features (tokenization, POS tagging, named entity recognition, etc.)
  • Optimized for production use with fast performance
  • Extensive documentation and community support

Cons of spaCy

  • Primarily focused on English and a limited number of other languages
  • Requires more setup and configuration for specific tasks
  • Larger memory footprint due to its comprehensive nature

Code Comparison

spaCy:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
for token in doc:
    print(token.text, token.pos_)

LASER:

from laserembeddings import Laser

laser = Laser()
embeddings = laser.embed_sentences(
    ["This is a sentence."],
    lang="en"
)
print(embeddings.shape)

Key Differences

  • spaCy is a full-featured NLP library, while LASER focuses on multilingual sentence embeddings
  • LASER supports 93+ languages, whereas spaCy has more limited language support
  • spaCy provides detailed linguistic analysis, while LASER generates vector representations for cross-lingual tasks
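
The two can also be combined. Here is a minimal sketch (assuming the en_core_web_sm pipeline and the laserembeddings package are installed) that uses spaCy to split a text into sentences and LASER to embed each one:

import spacy
from laserembeddings import Laser

nlp = spacy.load("en_core_web_sm")
laser = Laser()

text = "LASER embeds sentences. spaCy analyses them. Together they cover both needs."
sentences = [sent.text for sent in nlp(text).sents]

# One 1024-dimensional LASER vector per spaCy-detected sentence.
embeddings = laser.embed_sentences(sentences, lang="en")
print(embeddings.shape)  # (3, 1024)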

README

LASER: Language-Agnostic SEntence Representations

LASER is a library to calculate and use multilingual sentence embeddings.

NEWS

  • 2023/11/30 Released P-xSIM, a dual approach extension to multilingual similarity search (xSIM)
  • 2023/11/16 Released laser_encoders, a pip-installable package supporting LASER-2 and LASER-3 models
  • 2023/06/26 xSIM++ evaluation pipeline and data released
  • 2022/07/06 Updated LASER models with support for over 200 languages are now available
  • 2022/07/06 Multilingual similarity search (xSIM) evaluation pipeline released
  • 2022/05/03 Librivox S2S is available: Speech-to-Speech translations automatically mined in Librivox [9]
  • 2019/11/08 CCMatrix is available: Mining billions of high-quality parallel sentences on the WEB [8]
  • 2019/07/31 Gilles Bodard and Jérémy Rapin provided a Docker environment to use LASER
  • 2019/07/11 WikiMatrix is available: bitext extraction for 1620 language pairs in Wikipedia [7]
  • 2019/03/18 switch to BSD license
  • 2019/02/13 The code to perform bitext mining is now available

CURRENT VERSION:

  • We now provide updated LASER models which support over 200 languages. Please see here for more details including how to download the models and perform inference.

In our experience, the sentence encoder also supports code-switching, i.e. the same sentence can contain words in several different languages.
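
For example, a code-switched sentence can be compared against an all-English paraphrase with the laser_encoders package described below (a rough illustration; the exact similarity value depends on the model):

import numpy as np
from laser_encoders import LaserEncoderPipeline

encoder = LaserEncoderPipeline(lang="eng_Latn")

# An English/French code-switched sentence and an all-English paraphrase.
emb = encoder.encode_sentences([
    "I will meet you demain at the gare.",
    "I will meet you tomorrow at the station.",
])

cosine = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(cosine)  # expected to be high if code-switching is handled well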

We also have some evidence that the encoder can generalize to languages that were not seen during training but belong to a language family covered by the training languages.

A detailed description of how the multilingual sentence embeddings are trained can be found here, together with an experimental evaluation.

The core sentence embedding package: laser_encoders

We provide a package laser_encoders with minimal dependencies. It supports LASER-2 (a single encoder for the languages listed below) and LASER-3 (147 language-specific encoders described here).

The package can be installed simply with pip install laser_encoders and used as below:

from laser_encoders import LaserEncoderPipeline
encoder = LaserEncoderPipeline(lang="eng_Latn")
embeddings = encoder.encode_sentences(["Hi!", "This is a sentence encoder."])
print(embeddings.shape)  # (2, 1024)

The laser_encoders readme file provides more examples of its installation and usage.
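
The same pipeline also covers the LASER-3 language-specific encoders; passing a different language code selects the matching model (a brief sketch, with "fra_Latn" as an illustrative choice):

from laser_encoders import LaserEncoderPipeline

# The language code selects the appropriate encoder for that language
# (LASER-2 where it applies, otherwise a LASER-3 language-specific encoder).
encoder = LaserEncoderPipeline(lang="fra_Latn")
embeddings = encoder.encode_sentences(["Bonjour le monde !", "Ceci est un encodeur de phrases."])
print(embeddings.shape)  # (2, 1024)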

The full LASER kit

Apart from the laser_encoders, we provide support for LASER-1 (the original multilingual encoder) and for various LASER applications listed below.

Dependencies

  • Python >= 3.7
  • PyTorch 1.0
  • NumPy, tested with 1.15.4
  • Cython, needed by Python wrapper of FastBPE, tested with 0.29.6
  • Faiss, for fast similarity search and bitext mining
  • transliterate 1.10.2 (pip install transliterate)
  • jieba 0.39, Chinese segmenter (pip install jieba)
  • mecab 0.996, Japanese segmenter
  • tokenization from the Moses encoder (installed automatically)
  • FastBPE, fast C++ implementation of byte-pair encoding (installed automatically)
  • Fairseq, sequence modeling toolkit (pip install fairseq==0.12.1)
  • tabulate, pretty-print tabular data (pip install tabulate)
  • pandas, data analysis toolkit (pip install pandas)
  • Sentencepiece, subword tokenization (installed automatically)

Installation

  • install the laser_encoders package, e.g. with pip install -e . for an editable install
  • set the environment variable 'LASER' to the root of the installation, e.g. export LASER="${HOME}/projects/laser"
  • download encoders from Amazon S3, e.g. with bash ./nllb/download_models.sh
  • download third party software by bash ./install_external_tools.sh
  • download the data used in the example tasks (see description for each task)

Applications

We showcase several applications of multilingual sentence embeddings with code to reproduce our results (in the directory "tasks").

For all tasks, we use exactly the same multilingual encoder, without any task specific optimization or fine-tuning.
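
As a toy sketch of what such an application looks like (using scikit-learn, which is not a LASER dependency, and made-up example sentences), a classifier can be trained on English LASER embeddings and applied directly to French:

from laser_encoders import LaserEncoderPipeline
from sklearn.linear_model import LogisticRegression

en = LaserEncoderPipeline(lang="eng_Latn")
fr = LaserEncoderPipeline(lang="fra_Latn")

# Toy training data in English only (0 = sports, 1 = food).
train_sentences = [
    "The match went to extra time.",
    "The team won the championship.",
    "This soup needs more salt.",
    "The cake was baked for an hour.",
]
train_labels = [0, 0, 1, 1]

clf = LogisticRegression().fit(en.encode_sentences(train_sentences), train_labels)

# Zero-shot transfer: classify French sentences with the English-trained model.
test_fr = ["L'équipe a gagné le match.", "Cette recette demande beaucoup de beurre."]
print(clf.predict(fr.encode_sentences(test_fr)))  # ideally [0, 1]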

License

LASER is BSD-licensed, as found in the LICENSE file in the root directory of this source tree.

Supported languages

The original LASER model was trained on the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.

We have also observed that the model seems to generalize well to other (minority) languages or dialects, e.g.

Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.

LASER3

Updated LASER models referred to as LASER3 supplement the above list with support for 147 languages. The full list of supported languages can be seen here.

References

[1] Holger Schwenk and Matthijs Douze, Learning Joint Multilingual Sentence Representations with Neural Machine Translation, ACL workshop on Representation Learning for NLP, 2017

[2] Holger Schwenk and Xian Li, A Corpus for Multilingual Document Classification in Eight Languages, LREC, pages 3548-3551, 2018.

[3] Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space, ACL, July 2018.

[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, XNLI: Cross-lingual Sentence Understanding through Inference, EMNLP, 2018.

[5] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, arXiv, Nov 3 2018.

[6] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, arXiv, Dec 26 2018.

[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia, arXiv, July 11 2019.

[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin, CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB.

[9] Paul-Ambroise Duquenne, Hongyu Gong, Holger Schwenk, Multimodal and Multilingual Embeddings for Large-Scale Speech Mining, NeurIPS 2021, pages 15748-15761.

[10] Kevin Heffernan, Onur Celebi, and Holger Schwenk, Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages.