UKPLab / sentence-transformers

State-of-the-Art Text Embeddings

Top Related Projects

  • fastText (25,835 stars): Library for fast text representation and classification.
  • transformers: 🤗 State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
  • BERT (37,810 stars): TensorFlow code and pre-trained models for BERT.
  • flair (13,806 stars): A very simple framework for state-of-the-art Natural Language Processing (NLP).
  • gensim (15,551 stars): Topic Modelling for Humans.
  • spaCy (29,635 stars): 💫 Industrial-strength Natural Language Processing (NLP) in Python.

Quick Overview

Sentence-transformers is a Python library that provides easy-to-use methods for computing dense vector representations of sentences, paragraphs, and images. It's built on top of PyTorch and Transformers, offering pre-trained models for various NLP tasks such as semantic search, clustering, and information retrieval.

Pros

  • Easy to use with a high-level API for sentence embedding tasks
  • Supports a wide range of pre-trained models and fine-tuning capabilities
  • Efficient implementation with support for GPU acceleration
  • Integrates well with other popular NLP libraries and frameworks

Cons

  • Requires significant computational resources for large-scale tasks
  • Limited customization options for advanced users
  • Dependency on large pre-trained models may not be suitable for all use cases
  • Documentation could be more comprehensive for some advanced features

Code Examples

  1. Basic sentence embedding:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)
  2. Semantic search (an alternative using util.semantic_search is sketched after these examples):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
query = "A query sentence"
corpus = ["First document", "Second document", "Third document"]

query_embedding = model.encode(query)
corpus_embeddings = model.encode(corpus)

scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
top_results = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
for idx in top_results:
    print(f"{corpus[idx]} (Score: {scores[idx]:.4f})")
  3. Cross-encoder for sentence pair classification:
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/stsb-roberta-base')
sentences = [['This is a sentence', 'This is another sentence'],
             ['A completely different sentence', 'This is another sentence']]
scores = model.predict(sentences)
print(scores)
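
As referenced in example 2, the library also provides a util.semantic_search helper that returns the top-k hits directly instead of sorting scores by hand; a brief sketch using the same toy corpus:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
query_embedding = model.encode("A query sentence")
corpus_embeddings = model.encode(["First document", "Second document", "Third document"])

# Each query gets a list of {"corpus_id": ..., "score": ...} dicts, best match first
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(hit["corpus_id"], f"{hit['score']:.4f}")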

Getting Started

To get started with sentence-transformers, follow these steps:

  1. Install the library:
pip install sentence-transformers
  2. Import and use a pre-trained model:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["Hello, World!", "Sentence Transformers is awesome."]
embeddings = model.encode(sentences)
print(embeddings)

This will output the vector representations of the input sentences, which can be used for various downstream NLP tasks.
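
For example, these embeddings can be fed directly into a standard clustering algorithm. A minimal sketch using scikit-learn's KMeans (scikit-learn and the example sentences are assumptions for illustration, not part of the library):

from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["A man is eating food.", "A man is eating a piece of bread.",
             "The girl is carrying a baby.", "A cheetah is running behind its prey."]
embeddings = model.encode(sentences)

# Group the sentences into two clusters based on their embedding vectors
kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)
for sentence, label in zip(sentences, kmeans.labels_):
    print(label, sentence)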

Competitor Comparisons

fastText (25,835 stars): Library for fast text representation and classification.

Pros of fastText

  • Lightweight and efficient, requiring less computational resources
  • Supports unsupervised learning of word vectors
  • Capable of handling out-of-vocabulary words

Cons of fastText

  • Limited in capturing complex semantic relationships
  • Less effective for sentence-level embeddings
  • May struggle with context-dependent word meanings

Code Comparison

fastText:

import fasttext

model = fasttext.train_unsupervised('data.txt', model='skipgram')
vector = model.get_word_vector('example')

sentence-transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
embedding = model.encode('This is an example sentence')

Key Differences

  • fastText focuses on word-level embeddings, while sentence-transformers specializes in sentence-level embeddings
  • sentence-transformers leverages pre-trained transformer models, offering better performance on complex NLP tasks
  • fastText is more suitable for large-scale text classification and word similarity tasks, while sentence-transformers excels in semantic search and sentence similarity applications

Use Cases

fastText is ideal for:

  • Efficient word embeddings in resource-constrained environments
  • Multilingual text classification
  • Word similarity tasks

sentence-transformers is better for:

  • Semantic search and information retrieval
  • Sentence similarity and paraphrase detection (a paraphrase-mining sketch follows this list)
  • More advanced NLP tasks requiring contextual understanding
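
For the paraphrase-detection use case above, sentence-transformers also ships a util.paraphrase_mining helper. A short sketch (the sentences are made up for illustration):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["The cat sits outside", "A cat is sitting outdoors", "The new movie is great"]

# Returns [score, i, j] triples for sentence pairs, sorted by decreasing similarity
paraphrases = util.paraphrase_mining(model, sentences)
for score, i, j in paraphrases:
    print(f"{sentences[i]} <-> {sentences[j]}: {score:.4f}")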

transformers: 🤗 State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

Pros of transformers

  • Broader scope, covering a wide range of NLP tasks and models
  • Larger community and more frequent updates
  • Extensive documentation and examples

Cons of transformers

  • Steeper learning curve due to its comprehensive nature
  • Can be overkill for simple sentence embedding tasks
  • Potentially higher computational requirements

Code comparison

sentence-transformers:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(['Hello, world!', 'How are you?'])

transformers:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')
inputs = tokenizer(['Hello, world!', 'How are you?'], return_tensors='pt', padding=True)
outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(dim=1)
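
Note that the mean over last_hidden_state above also averages the padding positions. A hedged sketch of mask-aware mean pooling (our own addition, not taken from either library):

# Exclude padding tokens from the average using the attention mask
mask = inputs["attention_mask"].unsqueeze(-1).float()
summed = (outputs.last_hidden_state * mask).sum(dim=1)
counts = mask.sum(dim=1).clamp(min=1e-9)
embeddings = summed / counts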

Summary

sentence-transformers is more focused on sentence embeddings and similarity tasks, offering a simpler API for these specific use cases. transformers provides a comprehensive toolkit for various NLP tasks, including sentence embeddings, but requires more setup and configuration. Choose based on your project's scope and complexity.

BERT (37,810 stars): TensorFlow code and pre-trained models for BERT.

Pros of BERT

  • Original implementation of BERT, providing a foundation for many NLP tasks
  • Extensive pre-training on large datasets, offering robust language understanding
  • Highly customizable for various downstream tasks

Cons of BERT

  • Requires more effort to fine-tune for specific tasks
  • Less optimized for sentence-level embeddings out of the box
  • Heavier computational requirements for training and inference

Code Comparison

sentence-transformers:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
embeddings = model.encode(['Hello, world!', 'How are you?'])

BERT:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer(['Hello, world!', 'How are you?'], return_tensors='pt', padding=True)
outputs = model(**inputs)

Key Differences

sentence-transformers focuses on generating sentence embeddings efficiently, while BERT provides a more general-purpose language model. sentence-transformers offers easier implementation for sentence-level tasks, whereas BERT requires more setup but provides greater flexibility for various NLP applications.

flair (13,806 stars): A very simple framework for state-of-the-art Natural Language Processing (NLP).

Pros of flair

  • Broader NLP capabilities, including named entity recognition and part-of-speech tagging
  • Supports a wider range of pre-trained models and embeddings
  • More flexible for custom NLP tasks and workflows

Cons of flair

  • Less focused on sentence embeddings and similarity tasks
  • May require more setup and configuration for specific use cases
  • Potentially steeper learning curve for beginners

Code Comparison

flair:

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

embedder = TransformerDocumentEmbeddings('bert-base-uncased')
sentence = Sentence('Hello, world!')
embedder.embed(sentence)

sentence-transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
sentence = 'Hello, world!'
embedding = model.encode(sentence)

Both libraries offer powerful NLP capabilities, but flair provides a more comprehensive toolkit for various NLP tasks, while sentence-transformers specializes in generating high-quality sentence embeddings with a simpler API. The choice between them depends on the specific requirements of your project and the depth of NLP functionality needed.

gensim (15,551 stars): Topic Modelling for Humans.

Pros of gensim

  • Broader scope for various NLP tasks, including topic modeling and document similarity
  • More mature project with a larger community and extensive documentation
  • Efficient implementation for handling large text corpora

Cons of gensim

  • Less focused on sentence embeddings compared to sentence-transformers
  • May require more manual setup and preprocessing for specific tasks
  • Generally slower for generating sentence embeddings

Code comparison

sentence-transformers:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(['Hello, world!', 'How are you?'])

gensim:

from gensim.models import Word2Vec
sentences = [['Hello', 'world'], ['How', 'are', 'you']]
model = Word2Vec(sentences, min_count=1)
embeddings = [model.wv['Hello'], model.wv['How']]

sentence-transformers is more straightforward for sentence-level embeddings, while gensim requires more preprocessing and manual averaging for sentence representations. However, gensim offers more flexibility for various NLP tasks beyond just sentence embeddings.
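
For completeness, a rough sketch of how a sentence vector could be built from gensim word vectors by manual averaging (illustrative only; the helper function is our own):

import numpy as np
from gensim.models import Word2Vec

sentences = [['Hello', 'world'], ['How', 'are', 'you']]
model = Word2Vec(sentences, min_count=1)

def sentence_vector(tokens, wv):
    # Average the word vectors of all tokens present in the vocabulary
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

embedding = sentence_vector(['Hello', 'world'], model.wv)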

spaCy (29,635 stars): 💫 Industrial-strength Natural Language Processing (NLP) in Python.

Pros of spaCy

  • Comprehensive NLP toolkit with a wide range of features (tokenization, POS tagging, NER, etc.)
  • Optimized for production use with fast performance
  • Extensive documentation and community support

Cons of spaCy

  • Primarily focused on general NLP tasks, not specialized in sentence embeddings
  • Requires more setup and configuration for specific tasks
  • Larger library size, which may impact deployment in resource-constrained environments

Code Comparison

spaCy:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello, world!")
for token in doc:
    print(token.text, token.pos_)

sentence-transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["Hello, world!", "How are you?"]
embeddings = model.encode(sentences)
print(embeddings)

README

Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co.

This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in vector space such that similar texts are close together and can efficiently be found using cosine similarity.

We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases.

Further, this framework allows easy fine-tuning of custom embedding models to achieve maximal performance on your specific task.

For the full documentation, see www.SBERT.net.

Installation

We recommend Python 3.8+, PyTorch 1.11.0+, and transformers v4.34.0+.

Install with pip

pip install -U sentence-transformers

Install with conda

conda install -c conda-forge sentence-transformers

Install from sources

Alternatively, you can clone the latest version from the repository and install it directly from the source code:

pip install -e .

PyTorch with CUDA

If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA version. See PyTorch - Get Started for further details on how to install PyTorch.
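
Once a CUDA-enabled PyTorch build is installed, the model can be placed on the GPU via the device argument; a small sketch:

from sentence_transformers import SentenceTransformer

# Load the model directly onto the GPU (requires a CUDA-enabled PyTorch install)
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
embeddings = model.encode(["This sentence is encoded on the GPU."])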

Getting Started

See Quickstart in our documentation.

First download a pretrained model.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

Then provide some sentences to the model.

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# => (3, 384)

And that's it. We now have a numpy array with the embeddings, one for each text. We can use these to compute similarities.

similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])

Pre-Trained Models

We provide a large list of Pretrained Models for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: SentenceTransformer('model_name').
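
For example, a multilingual model such as paraphrase-multilingual-MiniLM-L12-v2 (one of the published pretrained models) is loaded the same way:

from sentence_transformers import SentenceTransformer

# Any model name from the pretrained model list can be passed directly
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(["Hallo Welt", "Hello world"])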

Training

This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get perfect sentence embeddings for your specific task.

See the Training Overview for an introduction to training your own embedding models. We provide various examples of how to train models on different datasets; a minimal training sketch also follows the highlights below.

Some highlights are:

  • Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
  • Multi-Lingual and multi-task learning
  • Evaluation during training to find optimal model
  • 20+ loss functions, allowing you to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss, etc.
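
As a rough illustration, a minimal training sketch using the classic fit-based loop (the toy pairs and hyperparameters are made up for illustration):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy positive pairs; in practice these come from your own dataset
train_examples = [
    InputExample(texts=["The weather is nice", "It is sunny today"]),
    InputExample(texts=["He drove to work", "He commuted by car"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss treats the other in-batch pairs as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)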

Application Examples

You can use this framework for semantic search, semantic similarity, clustering, paraphrase mining, and many more use-cases.

For all examples, see examples/applications.

Development setup

After cloning the repo (or a fork) to your machine, in a virtual environment, run:

python -m pip install -e ".[dev]"

pre-commit install

To test your changes, run:

pytest

Citing & Authors

If you find this repository helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

If you use one of the multilingual models, feel free to cite our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation:

@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}

Please have a look at Publications for our different publications that are integrated into SentenceTransformers.

Maintainer: Tom Aarsen, 🤗 Hugging Face

UKP Lab, TU Darmstadt: https://www.ukp.tu-darmstadt.de/

Don't hesitate to open an issue if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.