
MinishLab/model2vec

Fast State-of-the-Art Static Embeddings


Top Related Projects

  • spaCy: 💫 Industrial-strength Natural Language Processing (NLP) in Python
  • gensim: Topic Modelling for Humans
  • fastText: Library for fast text representation and classification.
  • transformers: 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
  • GloVe: Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings

Quick Overview

Model2Vec is a Python library that turns any Sentence Transformer into a very small static embedding model. Distillation reduces model size by a factor of up to 50 and makes inference up to 500 times faster on CPU, with only a small drop in performance. The resulting embeddings can be used for tasks such as text classification, retrieval, clustering, or building a RAG system.

Pros

  • Up to 50x smaller and up to 500x faster on CPU than the original Sentence Transformer
  • Dataset-free distillation: a vocabulary and a model are all that is needed
  • State-of-the-art performance among static embedding models

Cons

  • Small drop in performance compared to the original Sentence Transformer
  • Embeddings are static, not contextual: a given token always maps to the same vector
  • Relatively young project compared to established NLP libraries

Code Examples

Here are a few code examples demonstrating the usage of Model2Vec:

  1. Loading a pre-trained model from the HuggingFace hub:

from model2vec import StaticModel

# Load a pre-trained Model2Vec model
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

  2. Generating embeddings for text:

# Encode a batch of sentences into static embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

  3. Comparing two sentences:

import numpy as np

# Cosine similarity between the two sentence embeddings
a, b = embeddings
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Similarity between sentences: {similarity}")

Getting Started

To get started with Model2Vec, follow these steps:

  1. Install the library:

    pip install model2vec

  2. Import and load a pre-trained model:

    from model2vec import StaticModel
    model = StaticModel.from_pretrained("minishlab/potion-base-8M")

  3. Encode your text:

    embeddings = model.encode(["Your text here"])

  4. Use the generated embeddings for downstream tasks such as classification, retrieval, or clustering.

Competitor Comparisons

spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Pros of spaCy

  • Comprehensive NLP library with extensive features and pre-trained models
  • Large and active community, extensive documentation, and regular updates
  • Optimized for production use with efficient performance

Cons of spaCy

  • Steeper learning curve due to its extensive feature set
  • Larger footprint and potentially slower for simple tasks
  • May be overkill for projects requiring only basic NLP functionality

Code Comparison

model2vec:

from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-base-8M")
vector = model.encode(["Hello, world!"])[0]

spaCy:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello, world!")
vector = doc.vector

Key Differences

  • model2vec focuses specifically on fast static text embeddings, while spaCy is a full-featured NLP library
  • spaCy offers more comprehensive language processing capabilities, including tokenization, part-of-speech tagging, and named entity recognition
  • model2vec may be more lightweight and easier to integrate for projects only needing vector representations

Use Cases

  • Choose model2vec for projects that only need fast, lightweight text embeddings
  • Opt for spaCy when building complex NLP applications or requiring a wide range of language processing features

gensim

Topic Modelling for Humans

Pros of gensim

  • More comprehensive and mature library with a wider range of NLP and topic modeling functionalities
  • Larger community support and more frequent updates
  • Better documentation and extensive tutorials

Cons of gensim

  • Heavier and more complex, which may be overkill for simpler projects
  • Steeper learning curve for beginners
  • Potentially slower for specific word embedding tasks

Code comparison

model2vec:

from model2vec.distill import distill

# Distill a small static model from a Sentence Transformer (no training corpus needed)
model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)

gensim:

from gensim.models import Word2Vec

# Training runs on construction when sentences are provided
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, epochs=10)

Summary

gensim is a more comprehensive and well-established library for various NLP tasks, including training word embeddings from your own corpus. It offers a wider range of functionalities and better community support, but it is also heavier and more complex.

model2vec, on the other hand, is a focused, lightweight library for static text embeddings. Rather than training embeddings on a corpus, it distills them from an existing Sentence Transformer, which makes it quick to get started with and very fast at inference time, but it lacks the extensive features and community support of gensim.

fastText

Library for fast text representation and classification.

Pros of fastText

  • More mature and widely adopted project with extensive documentation
  • Supports multiple languages and pre-trained models
  • Offers both supervised and unsupervised learning capabilities

Cons of fastText

  • Larger codebase and potentially more complex to understand
  • May have higher memory requirements for large datasets
  • Not focused on distilling existing transformer models into static embeddings

Code Comparison

model2vec:

from model2vec.distill import distill

# Distill a static embedding model from a Sentence Transformer and save it
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)
m2v_model.save_pretrained("m2v_model")

fastText:

void FastText::loadModel(const std::string& filename) {
  std::ifstream ifs(filename, std::ifstream::binary);
  if (!ifs.is_open()) {
    throw std::invalid_argument(filename + " cannot be opened for loading!");
  }
  loadModel(ifs);
  ifs.close();
}

model2vec focuses on distilling transformer models into compact static embedding models, while fastText provides a broader set of text classification and word embedding functionalities. model2vec is more specialized and potentially easier to use for producing small, fast embedding models, whereas fastText offers a more comprehensive suite of text processing tools but may require more setup and configuration for specific use cases.

transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Extensive library with support for numerous pre-trained models and architectures
  • Large community and frequent updates, ensuring compatibility with latest research
  • Comprehensive documentation and examples for various NLP tasks

Cons of transformers

  • Larger codebase and dependencies, potentially slower for simple tasks
  • Steeper learning curve for beginners due to its extensive features
  • May be overkill for projects requiring only basic word embeddings

Code comparison

model2vec:

from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-base-8M")
vector = model.encode(["Hello, world!"])[0]

transformers:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)

Summary

transformers offers a comprehensive solution for various NLP tasks with extensive model support, while model2vec focuses specifically on fast static embeddings distilled from such models. transformers provides more flexibility and features but may be more complex for simple use cases, whereas model2vec offers a straightforward approach for producing sentence and token embeddings quickly.

GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings

Pros of GloVe

  • Well-established and widely used in NLP research and applications
  • Efficient training on large corpora with good performance on various tasks
  • Extensive documentation and community support

Cons of GloVe

  • Limited to word-level representations, not capturing subword information
  • May struggle with out-of-vocabulary words and rare terms
  • Requires retraining for new domains or languages

Code Comparison

GloVe:

for (a = 0; a < vocab_size; a++) {
    for (b = 0; b < vector_size; b++)
        fprintf(fout, "%lf ", W[a * vector_size + b]);
    fprintf(fout, "\n");
}

model2vec:

from model2vec import StaticModel

# Load a model and save it to disk in Model2Vec format
model = StaticModel.from_pretrained("minishlab/potion-base-8M")
model.save_pretrained("local_model")

Both repositories provide methods for saving embeddings to disk, but GloVe uses C for efficiency, while model2vec uses Python for readability and ease of use. GloVe's implementation is low-level and optimized for performance, whereas model2vec's approach is more Pythonic and easier to integrate into existing NLP pipelines.


README

Model2Vec logo

Fast State-of-the-Art Static Embeddings

Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor of up to 50 and making the models up to 500 times faster, with a small drop in performance. Our best model is the most performant static embedding model in the world. See our results here, or dive in to see how it works.

Quickstart

Install the lightweight base package with:

pip install model2vec

You can start using Model2Vec by loading one of our flagship models from the HuggingFace hub. These models are pre-trained and ready to use. The following code snippet shows how to load a model and make embeddings, which you can use for any task, such as text classification, retrieval, clustering, or building a RAG system:

from model2vec import StaticModel

# Load a model from the HuggingFace hub (in this case the potion-base-8M model)
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

# Make sequences of token embeddings
token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."])
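
The embeddings returned by encode are plain numpy arrays, so they can be used directly for downstream tasks. As a minimal illustration (the ranking logic below is our own sketch, not part of the model2vec API), here is how you might score documents against a query:

import numpy as np
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-base-8M")

query = model.encode(["How do I save a file?"])[0]
docs = model.encode(["Writing data to disk", "Cooking pasta", "Persisting files in Python"])

# Cosine similarity between the query and each document
scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
print(scores.argsort()[::-1])  # document indices, best match first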

Instead of using one of our models, you can also distill your own Model2Vec model from a Sentence Transformer model. First, install the distillation extras with:

pip install model2vec[distill]

Then, you can distill a model in ~30 seconds on a CPU with the following code snippet:

from model2vec.distill import distill

# Distill a Sentence Transformer model, in this case the BAAI/bge-base-en-v1.5 model
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)

# Save the model
m2v_model.save_pretrained("m2v_model")
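
Since distilled models are saved in the same format as our pre-trained ones, you can load the result back with from_pretrained to check that the round trip works:

from model2vec import StaticModel

# Load the distilled model from the local directory and embed some text
m2v_model = StaticModel.from_pretrained("m2v_model")
embeddings = m2v_model.encode(["Distillation complete."])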

After distillation, you can also fine-tune your own classification models on top of the distilled model, or on a pre-trained model. First, make sure you install the training extras with:

pip install model2vec[training]

Then, you can fine-tune a model as follows:

import numpy as np
from datasets import load_dataset
from model2vec.train import StaticModelForClassification

# Initialize a classifier from a pre-trained model
classifier = StaticModelForClassification.from_pretrained(model_name="minishlab/potion-base-32M")

# Load a dataset. Note: both single and multi-label classification datasets are supported
ds = load_dataset("setfit/subj")

# Train the classifier on text (X) and labels (y)
classifier.fit(ds["train"]["text"], ds["train"]["label"])

# Evaluate the classifier
classification_report = classifier.evaluate(ds["test"]["text"], ds["test"]["label"])
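
Once trained, the classifier can be applied to new texts. A minimal sketch, assuming the classifier exposes an sklearn-style predict method alongside the fit and evaluate methods shown above:

# predict is assumed here based on the sklearn-style fit/evaluate API
predictions = classifier.predict(["This is a new sentence.", "And another one."])
print(predictions)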

For advanced usage, please refer to our usage documentation.

Main Features

  • State-of-the-Art Performance: Model2Vec models outperform any other static embeddings (such as GloVe and BPEmb) by a large margin, as can be seen in our results.
  • Small: Model2Vec reduces the size of a Sentence Transformer model by a factor of up to 50. Our best model is just ~30 MB on disk, and our smallest model just ~8 MB (making it the smallest model on MTEB!).
  • Lightweight Dependencies: the base package's only major dependency is numpy.
  • Lightning-fast Inference: up to 500 times faster on CPU than the original model.
  • Fast, Dataset-free Distillation: distill your own model in 30 seconds on a CPU, without a dataset.
  • Fine-tuning: fine-tune your own classification models on top of Model2Vec models.
  • Integrated in many popular libraries: Model2Vec is integrated directly into popular libraries such as Sentence Transformers and LangChain; a sketch of the Sentence Transformers integration follows after this list. For more information, see our integrations documentation.
  • Tightly integrated with HuggingFace hub: easily share and load models from the HuggingFace hub, using the familiar from_pretrained and push_to_hub. Our own models can be found here.
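
As an example of the Sentence Transformers integration mentioned in the list above, a Model2Vec model can be wrapped as a static embedding module. This is a minimal sketch; the StaticEmbedding.from_model2vec helper reflects our understanding of the sentence-transformers API and should be checked against the integrations documentation:

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Wrap a Model2Vec model as a Sentence Transformers module (API assumed; see integrations docs)
static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
model = SentenceTransformer(modules=[static_embedding])

embeddings = model.encode(["It's dangerous to go alone!"])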

What is Model2Vec?

Model2Vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Like BPEmb, it can create subword embeddings, but with much better performance. Distillation doesn't need any data, just a vocabulary and a model.

The core idea is to forward pass a vocabulary through a sentence transformer model, creating static embeddings for the individual tokens. After this, there are a number of post-processing steps we do that result in our best models. An illustrative sketch of the core idea follows after the list below. For a more extensive deep dive, please refer to the following resources:

  • Our initial Model2Vec blog post. Note that, while this post gives a good overview of the core idea, we've made a number of substantial improvements since then.
  • Our Tokenlearn blog post. This post describes the Tokenlearn method we used to train our potion models.
  • Our official documentation. This document provides a high-level overview of how Model2Vec works.
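
To make the core idea concrete, here is an illustrative sketch of the forward-pass step, not the library's actual implementation: every token in the vocabulary is embedded by the sentence transformer, and the resulting matrix is compressed with PCA (mirroring the pca_dims argument of distill):

from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

# Illustrative only: the real pipeline applies further post-processing steps
st_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
vocabulary = [st_model.tokenizer.decode([i]) for i in range(st_model.tokenizer.vocab_size)]

# Forward pass each token through the sentence transformer to get a static embedding
token_embeddings = st_model.encode(vocabulary, batch_size=256)

# Compress the embeddings, as with the pca_dims argument of distill
pca = PCA(n_components=256)
static_embeddings = pca.fit_transform(token_embeddings)
print(static_embeddings.shape)  # (vocab_size, 256)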

Documentation

Our official documentation can be found here.

Model List

We provide a number of models that can be used out of the box. These models are available on the HuggingFace hub and can be loaded using the from_pretrained method. The models are listed below.

Model                    | Language     | Sentence Transformer | Params | Task
potion-base-32M          | English      | bge-base-en-v1.5     | 32.3M  | General
potion-base-8M           | English      | bge-base-en-v1.5     | 7.5M   | General
potion-base-4M           | English      | bge-base-en-v1.5     | 3.7M   | General
potion-base-2M           | English      | bge-base-en-v1.5     | 1.8M   | General
potion-retrieval-32M     | English      | bge-base-en-v1.5     | 32.3M  | Retrieval
M2V_multilingual_output  | Multilingual | LaBSE                | 471M   | General

Results

We have performed extensive experiments to evaluate the performance of Model2Vec models. The results are documented in the results folder.

License

MIT

Citing

If you use Model2Vec in your research, please cite the following:

@software{minishlab2024model2vec,
  author = {Stephan Tulkens and Thomas van Dongen},
  title = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year = {2024},
  url = {https://github.com/MinishLab/model2vec}
}