tensorflow/similarity

TensorFlow Similarity is a Python package focused on making similarity learning quick and easy.


Top Related Projects

  • BERT (37,810 stars): TensorFlow code and pre-trained models for BERT
  • 🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX
  • FAISS (30,390 stars): A library for efficient similarity search and clustering of dense vectors
  • Annoy (13,073 stars): Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
  • NMSLIB (3,373 stars): Non-Metric Space Library, an efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces
  • UMAP (7,343 stars): Uniform Manifold Approximation and Projection

Quick Overview

TensorFlow Similarity is a Python library for similarity learning and metric learning. It provides tools and utilities to train models that can learn similarity between inputs, which is useful for tasks like image retrieval, face recognition, and recommendation systems. The library is built on top of TensorFlow and Keras, making it easy to integrate with existing TensorFlow workflows.

Pros

  • Easy to use API for similarity learning tasks
  • Built on top of TensorFlow and Keras, allowing for seamless integration
  • Supports various loss functions and architectures for similarity learning
  • Includes pre-trained models and datasets for quick experimentation

Cons

  • Limited documentation and examples compared to more established libraries
  • Requires understanding of TensorFlow and Keras for advanced usage
  • May have a steeper learning curve for beginners in similarity learning
  • Fewer community contributions and support compared to larger projects

Code Examples

  1. Creating a similarity model (the backbone-plus-embedding pattern mirrors the README example further down):

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow_similarity.layers import MetricEmbedding
from tensorflow_similarity.models import SimilarityModel

# ResNet50 backbone topped with an L2-normalized metric embedding
inputs = layers.Input(shape=(224, 224, 3))
x = tf.keras.applications.ResNet50(weights=None, include_top=False)(inputs)
x = layers.GlobalAveragePooling2D()(x)
outputs = MetricEmbedding(128)(x)
model = SimilarityModel(inputs, outputs)

  2. Training the model:

from tensorflow_similarity.losses import TripletLoss

# train_dataset: a tf.data.Dataset or sampler yielding (images, labels)
model.compile(optimizer="adam", loss=TripletLoss())
model.fit(train_dataset, epochs=10)

  3. Indexing reference examples so they become searchable:

model.index(x=reference_images, y=reference_labels, data=reference_images)

  4. Performing nearest neighbor search:

neighbors = model.single_lookup(query_image, k=5)

Getting Started

To get started with TensorFlow Similarity, follow these steps:

  1. Install the library:

pip install tensorflow-similarity

  2. Import the library and create a simple model:

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow_similarity.layers import MetricEmbedding
from tensorflow_similarity.losses import TripletLoss
from tensorflow_similarity.models import SimilarityModel

# MobileNetV2 backbone topped with an L2-normalized metric embedding
inputs = layers.Input(shape=(96, 96, 3))
x = tf.keras.applications.MobileNetV2(weights=None, include_top=False)(inputs)
x = layers.GlobalAveragePooling2D()(x)
outputs = MetricEmbedding(128)(x)
model = SimilarityModel(inputs, outputs)
model.compile(optimizer="adam", loss=TripletLoss())

  3. Prepare your dataset and train the model:

# Assume train_dataset yields batches of images and integer class labels
model.fit(train_dataset, epochs=10)

  4. Use the trained model for similarity tasks:

model.index(x=reference_images, y=reference_labels, data=reference_images)
neighbors = model.single_lookup(query_image, k=5)

Competitor Comparisons

BERT

TensorFlow code and pre-trained models for BERT

Pros of BERT

  • More comprehensive and widely adopted for natural language processing tasks
  • Offers pre-trained models for various languages and domains
  • Extensive documentation and community support

Cons of BERT

  • Larger model size and higher computational requirements
  • More complex to fine-tune and adapt for specific tasks
  • Less focused on similarity-specific tasks

Code Comparison

BERT example:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

TensorFlow Similarity example:

import tensorflow as tf
from tensorflow_similarity.layers import MetricEmbedding
from tensorflow_similarity.models import SimilarityModel

# Small convolutional embedding model built from standard Keras layers
inputs = tf.keras.layers.Input((28, 28, 1))
x = tf.keras.layers.Conv2D(64, 3, activation="relu")(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = MetricEmbedding(64)(x)
model = SimilarityModel(inputs, outputs)

Both repositories offer valuable tools for different aspects of machine learning. BERT excels in general natural language processing tasks, while TensorFlow Similarity focuses on similarity-based learning and retrieval. The choice between them depends on the specific requirements of your project.

Transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX

Pros of Transformers

  • Broader scope, supporting a wide range of NLP tasks and models
  • Larger community and more frequent updates
  • Extensive documentation and examples

Cons of Transformers

  • Steeper learning curve due to its comprehensive nature
  • Potentially higher resource requirements for some tasks

Code Comparison

Transformers:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)

TensorFlow Similarity:

import tensorflow as tf
from tensorflow_similarity.layers import MetricEmbedding
from tensorflow_similarity.losses import MultiSimilarityLoss
from tensorflow_similarity.models import SimilarityModel

# Simple dense embedding model ending in an L2-normalized embedding
inputs = tf.keras.layers.Input(shape=(784,))
x = tf.keras.layers.Dense(128, activation="relu")(inputs)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = MetricEmbedding(64)(x)
model = SimilarityModel(inputs, outputs)
model.compile(optimizer="adam", loss=MultiSimilarityLoss())

Transformers offers a higher-level API for working with pre-trained models, while TensorFlow Similarity provides a more flexible approach for building custom similarity models. Transformers is better suited for general NLP tasks, whereas TensorFlow Similarity excels in similarity-based applications.

FAISS

A library for efficient similarity search and clustering of dense vectors

Pros of FAISS

  • Highly optimized for large-scale similarity search and clustering
  • Supports GPU acceleration for faster processing
  • Offers a wide range of indexing algorithms for different use cases

Cons of FAISS

  • Steeper learning curve due to its low-level C++ implementation
  • Less integrated with TensorFlow ecosystem
  • Requires separate installation and setup

Code Comparison

FAISS:

import numpy as np
import faiss

d = 64                                               # vector dimensionality
xb = np.random.random((10000, d)).astype('float32')  # database vectors
xq = np.random.random((5, d)).astype('float32')      # query vectors

index = faiss.IndexFlatL2(d)  # exact (brute-force) L2 index
index.add(xb)
D, I = index.search(xq, 5)    # distances and ids of the 5 nearest neighbors

TensorFlow Similarity:

# model is a trained SimilarityModel (see the README example below)
model.index(x=examples, y=labels, data=examples)
neighbors = model.lookup(queries, k=5)

Key Differences

  • FAISS is focused on efficient similarity search and clustering, while TensorFlow Similarity covers the full training-to-retrieval workflow within the TensorFlow ecosystem.
  • FAISS offers more advanced indexing algorithms and optimizations, making it better suited for large-scale applications.
  • TensorFlow Similarity provides a higher-level API that integrates seamlessly with TensorFlow, making it easier to use for TensorFlow developers.
  • FAISS ships both CPU and GPU implementations, while TensorFlow Similarity primarily leverages TensorFlow's built-in device management. The two can also be combined, as the sketch below shows.
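As a hedged sketch of combining the two (the variable names and the IndexFlatIP choice are illustrative assumptions, not a documented workflow of either library), embeddings produced by a trained SimilarityModel can be indexed and searched with FAISS:

import numpy as np
import faiss

# Embed reference and query images with a trained SimilarityModel
ref_emb = model.predict(reference_images).astype('float32')
query_emb = model.predict(query_images).astype('float32')

# MetricEmbedding outputs are L2-normalized, so inner product equals cosine similarity
index = faiss.IndexFlatIP(ref_emb.shape[1])
index.add(ref_emb)
scores, ids = index.search(query_emb, 5)  # top-5 neighbors per query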
Annoy

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

Pros of Annoy

  • Lightweight and efficient, with a focus on memory usage and speed
  • Supports multiple distance metrics (Euclidean, Manhattan, Cosine, Hamming)
  • Can be used from both C++ and Python

Cons of Annoy

  • Limited to approximate nearest neighbor search
  • Does not provide built-in support for GPU acceleration
  • Less integrated with machine learning workflows compared to TensorFlow Similarity

Code Comparison

Annoy:

from random import random

from annoy import AnnoyIndex

f = 40  # length of each item vector
t = AnnoyIndex(f, 'angular')
for i in range(1000):
    v = [random() for _ in range(f)]
    t.add_item(i, v)
t.build(10)  # 10 trees
neighbors = t.get_nns_by_item(0, 10)  # 10 approximate nearest neighbors of item 0

TensorFlow Similarity:

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow_similarity.layers import MetricEmbedding
from tensorflow_similarity.losses import MultiSimilarityLoss
from tensorflow_similarity.models import SimilarityModel

# ResNet50 backbone with a metric embedding head
inputs = layers.Input(shape=(224, 224, 3))
x = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)(inputs)
x = layers.GlobalAveragePooling2D()(x)
outputs = MetricEmbedding(128)(x)
model = SimilarityModel(inputs, outputs)
model.compile(optimizer='adam', loss=MultiSimilarityLoss())
model.fit(x_train, y_train, epochs=10)

The code examples highlight Annoy's focus on efficient indexing and searching, while TensorFlow Similarity integrates more closely with TensorFlow's machine learning ecosystem.

NMSLIB

Non-Metric Space Library (NMSLIB): an efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces

Pros of NMSLIB

  • Supports a wider range of distance metrics and index types
  • Generally faster for high-dimensional data and large datasets
  • More mature project with extensive documentation and benchmarks

Cons of NMSLIB

  • Less integration with TensorFlow ecosystem
  • Requires more manual tuning and parameter selection
  • Not as well-suited for deep learning-based similarity search

Code Comparison

nmslib:

import numpy as np
import nmslib

data = np.random.randn(10000, 64).astype(np.float32)  # database vectors
query = np.random.randn(64).astype(np.float32)        # query vector

index = nmslib.init(method='hnsw', space='cosinesimil')  # HNSW graph index
index.addDataPointBatch(data)
index.createIndex({'post': 2})
ids, distances = index.knnQuery(query, k=10)

TensorFlow Similarity:

# model is a trained SimilarityModel (see the README example below)
model.index(x=examples, y=labels, data=examples)
neighbors = model.lookup(queries, k=10)

Summary

NMSLIB is a more general-purpose similarity search library with broader algorithm support and better performance on large-scale datasets. TensorFlow Similarity is more tightly integrated with the TensorFlow ecosystem and is easier to use for deep learning based similarity search. The choice between the two depends on the specific use case, dataset size, and integration requirements with existing TensorFlow workflows.

UMAP

Uniform Manifold Approximation and Projection

Pros of UMAP

  • More general-purpose dimensionality reduction tool, not limited to similarity search
  • Faster runtime for large datasets compared to t-SNE and other methods
  • Preserves both local and global structure of data

Cons of UMAP

  • Requires more manual parameter tuning than TensorFlow Similarity
  • Less integrated with TensorFlow ecosystem
  • May be overkill for simple similarity search tasks

Code Comparison

UMAP:

import numpy as np
import umap

data = np.random.rand(1000, 32)          # example high-dimensional data
reducer = umap.UMAP()
embedding = reducer.fit_transform(data)  # 2-D embedding by default

TensorFlow Similarity:

# model is a SimilarityModel built and trained as in the README example below
model.index(x=reference_data, y=reference_labels, data=reference_data)
similar = model.lookup(query, k=5)

Key Differences

  • UMAP focuses on dimensionality reduction and visualization
  • TensorFlow Similarity specializes in efficient similarity search and retrieval
  • UMAP is more flexible but requires more expertise to use effectively
  • TensorFlow Similarity integrates seamlessly with TensorFlow models and workflows, and the two can be combined, as sketched below
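Because UMAP reduces any vector data, it pairs naturally with the embeddings a similarity model produces. A minimal hedged sketch (the variable names are illustrative, and it assumes a trained SimilarityModel as in the README example below):

import umap
import matplotlib.pyplot as plt

# Project the similarity embeddings down to 2-D for visual inspection
embeddings = model.predict(images)
coords = umap.UMAP().fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=3)
plt.show()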

Use Cases

  • UMAP: Exploratory data analysis, visualization, and general dimensionality reduction
  • TensorFlow Similarity: Building recommendation systems, content-based search, and similarity-based clustering

Both libraries have their strengths, and the choice depends on the specific requirements of your project and your familiarity with the TensorFlow ecosystem.


README

TensorFlow Similarity: Metric Learning for Humans

TensorFlow Similarity is a TensorFlow library for similarity learning, which includes techniques such as self-supervised learning, metric learning, and contrastive learning. TensorFlow Similarity is still in beta and we may push breaking changes.

Introduction

TensorFlow Similarity offers state-of-the-art algorithms for metric learning, along with all the components needed to research, train, evaluate, and serve similarity and contrastive learning based models: models, losses, metrics, samplers, visualizers, and indexing subsystems that make this quick and easy.

Example of nearest neighbors search performed on the embedding generated by a similarity model trained on the Oxford IIIT Pet Dataset.

With TensorFlow Similarity you can train two main types of models:

  1. Self-supervised models: Used to learn general data representations on unlabeled data to boost the accuracy of downstream tasks where you have few labels. For example, you can pre-train a model on a large number of unlabeled images using one of the contrastive methods supported by TensorFlow Similarity, and then fine-tune it on a small labeled dataset to achieve higher accuracy. To get started training your own self-supervised model, see this notebook.

  2. Similarity models: Output embeddings that allow you to find and cluster similar examples, such as images representing the same object, within a large corpus of examples. For instance, as shown above, you can train a similarity model to find and cluster similar-looking, unseen cat and dog images from the Oxford IIIT Pet Dataset while training on only a few of the dataset classes. To get started training your own similarity model, see this notebook.

What's new

  • [Mar 2023] 0.17: more losses and metrics, plus a major refactoring
    • Added VicReg loss to the contrastive losses.
    • Added metrics used in retrieval papers, such as Precision@K.
    • Native support for distributed training, e.g., SimCLR now works correctly with distributed training (see the sketch below).
    • Initial multi-modal embedding support (CLIP).
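As a hedged illustration of the distributed training support: a SimilarityModel is a standard Keras model, so the usual tf.distribute workflow should apply. In this sketch, build_model() is a hypothetical stand-in for the model construction shown in the README example below, and sampler is the data sampler from that same example:

import tensorflow as tf
from tensorflow_similarity.losses import MultiSimilarityLoss

strategy = tf.distribute.MirroredStrategy()  # data-parallel training across local GPUs
with strategy.scope():
    model = build_model()  # hypothetical helper; see the README example below
    model.compile('adam', loss=MultiSimilarityLoss())
model.fit(sampler, epochs=5)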

For more details and information on previous releases, see the changelog.

Getting Started

Installation

Use pip to install the library.

NOTE: the [tensorflow] extras key can be omitted if you already have tensorflow>=2.4 installed.

pip install --upgrade-strategy=only-if-needed tensorflow_similarity[tensorflow] 

Documentation

The detailed and narrated notebooks are a good way to get started with TensorFlow Similarity. There is likely to be one that is similar to your data or your problem (if not, let us know). You can start working with the examples immediately in Google Colab by clicking the Google Colab icon.

For more information about specific functions, check the API documentation.

For contributing to the project, please check out the contribution guidelines.

Minimal Example: MNIST similarity


Here is a bare bones example demonstrating how to train a TensorFlow Similarity model on the MNIST data. This example illustrates some of the main components provided by TensorFlow Similarity and how they fit together. Please refer to the hello_world notebook for a more detailed introduction.

Preparing data

TensorFlow Similarity provides data samplers for various dataset types that balance the batches to ensure smoother training. In this example, we use the multi-shot sampler, which loads data directly from the TensorFlow Datasets catalog.

from tensorflow_similarity.samplers import TFDatasetMultiShotMemorySampler

# Data sampler that generates balanced batches from MNIST dataset
sampler = TFDatasetMultiShotMemorySampler(dataset_name='mnist', classes_per_batch=10)

Building a Similarity model

Building a TensorFlow Similarity model is similar to building a standard Keras model, except that the output layer is usually a MetricEmbedding() layer, which enforces L2 normalization, and the model is instantiated as the specialized subclass SimilarityModel(), which supports additional similarity-specific functionality.

from tensorflow.keras import layers
from tensorflow_similarity.layers import MetricEmbedding
from tensorflow_similarity.models import SimilarityModel

# Build a Similarity model using standard Keras layers
inputs = layers.Input(shape=(28, 28, 1))
x = layers.experimental.preprocessing.Rescaling(1/255)(inputs)
x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.Flatten()(x)
x = layers.Dense(64, activation='relu')(x)
outputs = MetricEmbedding(64)(x)

# Build a specialized Similarity model
model = SimilarityModel(inputs, outputs)

Training model via contrastive learning

To output metric embeddings that are searchable via approximate nearest neighbor search, the model needs to be trained using a similarity loss. Here we use MultiSimilarityLoss(), one of the most efficient loss functions.

from tensorflow_similarity.losses import MultiSimilarityLoss

# Train Similarity model using contrastive loss
model.compile('adam', loss=MultiSimilarityLoss())
model.fit(sampler, epochs=5)

Building images index and querying it

Once the model is trained, reference examples must be indexed via the model index API to be searchable. After indexing, you can use the model lookup API to search the index for the K most similar items.

from tensorflow_similarity.visualization import viz_neigbors_imgs

# Index 100 embedded MNIST examples to make them searchable
sx, sy = sampler.get_slice(0, 100)
model.index(x=sx, y=sy, data=sx)

# Find the top 5 most similar indexed MNIST examples for a given example
qx, qy = sampler.get_slice(3713, 1)
nns = model.single_lookup(qx[0])

# Visualize the query example and its top 5 neighbors
viz_neigbors_imgs(qx[0], qy[0], nns)

Supported Algorithms

Self-Supervised Models

  • SimCLR
  • SimSiam
  • Barlow Twins

Supervised Losses

  • Triplet Loss
  • PN Loss
  • Multi Sim Loss
  • Circle Loss
  • Soft Nearest Neighbor Loss
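All of these supervised losses are drop-in Keras loss objects, so switching between them only changes the compile() call. A hedged sketch (MultiSimilarityLoss appears in the example above; the TripletLoss and CircleLoss class names are assumptions inferred from this list):

from tensorflow_similarity.losses import CircleLoss, MultiSimilarityLoss, TripletLoss

# model is a SimilarityModel as in the minimal example above;
# swap similarity losses without touching the model definition
model.compile('adam', loss=TripletLoss())
# model.compile('adam', loss=CircleLoss())
# model.compile('adam', loss=MultiSimilarityLoss())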

Metrics

TensorFlow Similarity offers many of the most common metrics used for classification and retrieval evaluation, including:

Name        | Type
----------- | --------------
Precision   | Classification
Recall      | Classification
F1 Score    | Classification
Recall@K    | Retrieval
Binary NDCG | Retrieval
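These metrics come into play when measuring a trained model's retrieval quality on held-out data. A hedged sketch (the evaluate_retrieval method and the RecallAtK import path are assumptions based on the metric names above, not verified API):

from tensorflow_similarity.retrieval_metrics import RecallAtK

# Assumes the index has already been built with model.index(...)
results = model.evaluate_retrieval(
    x_test, y_test,
    retrieval_metrics=[RecallAtK(k=1), RecallAtK(k=5)],
)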

Citing

Please cite this reference if you use any part of TensorFlow Similarity in your research:

@article{EBSIM21,
  title={TensorFlow Similarity: A Usable, High-Performance Metric Learning Library},
  author={Elie Bursztein and James Long and Shun Lin and Owen Vallis and Francois Chollet},
  journal={Fixme},
  year={2021}
}

Disclaimer

This is not an official Google product.