Top Related Projects
Qdrant - High-performance, massive-scale Vector Database for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/
A cloud-native vector database, storage for next generation AI applications
Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a cloud-native database.
Official Python client for Elasticsearch
Quick Overview
Chroma is an open-source embedding database designed for building AI applications with embeddings. It allows developers to store, search, and analyze vector embeddings efficiently, making it easier to create semantic search, recommendation systems, and other AI-powered features.
Pros
- Easy to use and integrate with existing AI/ML workflows
- Supports various embedding models and distance functions
- Offers both local and cloud-hosted options for flexibility
- Provides a simple API for querying and managing embeddings
Cons
- May have performance limitations for extremely large datasets
- Documentation could be more comprehensive for advanced use cases
- Limited built-in analytics and visualization tools
- Relatively new project, so the ecosystem is still developing
Code Examples
- Creating a collection and adding documents:
import chromadb
client = chromadb.Client()
collection = client.create_collection("my_collection")
collection.add(
documents=["This is a document", "This is another document"],
metadatas=[{"source": "my_source"}, {"source": "my_source"}],
ids=["id1", "id2"]
)
- Querying the collection:
results = collection.query(
query_texts=["This is a query document"],
n_results=2
)
print(results)
- Updating and deleting documents:
collection.update(
ids=["id1"],
documents=["This is an updated document"],
metadatas=[{"source": "updated_source"}]
)
collection.delete(ids=["id2"])
Getting Started
To get started with Chroma, follow these steps:
- Install Chroma:
pip install chromadb
- Create a simple script:
import chromadb
client = chromadb.Client()
collection = client.create_collection("quickstart")
collection.add(
documents=["Hello world", "Goodbye world"],
metadatas=[{"source": "greeting"}, {"source": "farewell"}],
ids=["1", "2"]
)
results = collection.query(
query_texts=["hello"],
n_results=1
)
print(results)
- Run the script and explore the results. You can now start building more complex applications using Chroma's embedding database capabilities.
Competitor Comparisons
Qdrant - High-performance, massive-scale Vector Database for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/
Pros of Qdrant
- Written in Rust, offering high performance and memory safety
- Supports complex vector search queries with filtering
- Provides a distributed architecture for scalability
Cons of Qdrant
- Steeper learning curve due to more advanced features
- Requires more system resources for optimal performance
Code Comparison
Chroma:
import chromadb
client = chromadb.Client()
collection = client.create_collection("my_collection")
collection.add(
documents=["This is a document", "This is another document"],
ids=["id1", "id2"]
)
Qdrant:
from qdrant_client import QdrantClient, models
client = QdrantClient("localhost", port=6333)
client.recreate_collection(
collection_name="my_collection",
vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE)
)
client.upsert(
collection_name="my_collection",
points=[
models.PointStruct(id=1, vector=[0.05, 0.61, 0.76], payload={"color": "red"}),
models.PointStruct(id=2, vector=[0.19, 0.81, 0.75], payload={"color": "blue"}),
]
)
A cloud-native vector database, storage for next generation AI applications
Pros of Milvus
- Highly scalable and distributed architecture for large-scale vector search
- Supports multiple index types and similarity metrics for diverse use cases
- Offers advanced features like data management and real-time search capabilities
Cons of Milvus
- More complex setup and configuration compared to Chroma
- Steeper learning curve due to its extensive feature set
- Requires more system resources for optimal performance
Code Comparison
Milvus (Python client):
from pymilvus import Collection, connections
connections.connect()
collection = Collection("example_collection")
results = collection.search(
data=[vector],
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 10}},
limit=5
)
Chroma:
import chromadb
client = chromadb.Client()
collection = client.create_collection("example_collection")
results = collection.query(
query_embeddings=[vector],
n_results=5
)
Both libraries offer similar functionality for vector search, but Milvus provides more advanced configuration options, while Chroma focuses on simplicity and ease of use. Milvus is better suited for large-scale, production deployments, whereas Chroma is ideal for quick prototyping and smaller-scale applications.
Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a cloud-native database.
Pros of Weaviate
- More mature and feature-rich, with a wider range of functionalities
- Better scalability for large-scale production environments
- Supports multiple vector index types (e.g., HNSW, flat)
Cons of Weaviate
- Steeper learning curve due to more complex architecture
- Requires more resources to run and maintain
- Less straightforward setup compared to Chroma
Code Comparison
Weaviate (Python client):
import weaviate
client = weaviate.Client("http://localhost:8080")
client.schema.create_class({
"class": "Article",
"vectorizer": "text2vec-transformers"
})
Chroma:
import chromadb
client = chromadb.Client()
collection = client.create_collection("articles")
collection.add(
documents=["content1", "content2"],
metadatas=[{"source": "wiki"}, {"source": "book"}],
ids=["id1", "id2"]
)
Both Weaviate and Chroma are vector databases, but they differ in complexity and use cases. Weaviate offers more advanced features and scalability, making it suitable for large-scale production environments. Chroma, on the other hand, provides a simpler interface and easier setup, which can be advantageous for smaller projects or quick prototyping. The code comparison shows that Weaviate requires more configuration, while Chroma offers a more straightforward API for basic operations.
Official Python client for Elasticsearch
Pros of elasticsearch-py
- Mature and widely adopted Elasticsearch client for Python
- Comprehensive API coverage for Elasticsearch operations
- Extensive documentation and community support
Cons of elasticsearch-py
- Focused on general-purpose search; vector similarity search is not its primary use case and requires additional Elasticsearch-side configuration
- Steeper learning curve for users new to Elasticsearch
Code Comparison
elasticsearch-py:
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
doc = {"title": "Test Document", "content": "This is a test"}
es.index(index="my_index", document=doc)
Chroma:
import chromadb
client = chromadb.Client()
collection = client.create_collection("my_collection")
collection.add(
documents=["This is a test"],
metadatas=[{"title": "Test Document"}],
ids=["1"]
)
Key Differences
- Chroma focuses on vector databases and similarity search, while elasticsearch-py is for general-purpose document indexing and search
- Chroma offers a simpler API for vector operations, making it easier for machine learning tasks
- elasticsearch-py provides more advanced querying capabilities and supports complex aggregations
Use Cases
- elasticsearch-py: Full-text search, log analysis, and complex data aggregations
- Chroma: Similarity search, recommendation systems, and AI-powered applications
README
Chroma - the open-source embedding database.
The fastest way to build Python or JavaScript LLM apps with memory!
pip install chromadb # python client
# for javascript, npm install chromadb!
# for client-server mode, chroma run --path /chroma_db_path
The core API is only 4 functions (run our 💡 Google Colab or Replit template):
import chromadb
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()
# Create collection. get_collection, get_or_create_collection, delete_collection also available!
collection = client.create_collection("all-my-documents")
# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
documents=["This is document1", "This is document2"], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
metadatas=[{"source": "notion"}, {"source": "google-docs"}], # filter on these!
ids=["doc1", "doc2"], # unique for each doc
)
# Query/search 2 most similar results. You can also .get by id
results = collection.query(
query_texts=["This is a query document"],
n_results=2,
# where={"metadata_field": "is_equal_to_this"}, # optional filter
# where_document={"$contains":"search_string"} # optional filter
)
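The commented-out `where` and `where_document` arguments above narrow results by metadata equality and by document substring match. A minimal pure-Python sketch of those filter semantics (the `matches` helper and the sample records are illustrative, not part of Chroma's API):

```python
def matches(metadata, where, document="", where_document=None):
    """Illustrate Chroma-style filtering: metadata equality checks
    plus a $contains substring check on the document text."""
    for field, expected in (where or {}).items():
        if metadata.get(field) != expected:
            return False
    if where_document and "$contains" in where_document:
        if where_document["$contains"] not in document:
            return False
    return True

# Keep only records that pass both the metadata and document filters.
records = [
    {"id": "doc1", "text": "notes from notion", "meta": {"source": "notion"}},
    {"id": "doc2", "text": "exported report", "meta": {"source": "google-docs"}},
]
kept = [r["id"] for r in records
        if matches(r["meta"], {"source": "notion"}, r["text"], {"$contains": "notes"})]
print(kept)  # ['doc1']
```

In Chroma itself these filters are applied server-side before the nearest-neighbor ranking, so they cheaply restrict the candidate set.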
Features
- Simple: Fully-typed, fully-tested, fully-documented == happiness
- Integrations:
🦜🔗 LangChain (Python and JS), 🦙 LlamaIndex, and more soon
- Dev, Test, Prod: the same API that runs in your Python notebook scales to your cluster
- Feature-rich: Queries, filtering, density estimation and more
- Free & Open Source: Apache 2.0 Licensed
Use case: ChatGPT for ______
For example, the "Chat your data" use case:
- Add documents to your database. You can pass in your own embeddings, embedding function, or let Chroma embed them for you.
- Query relevant documents with natural language.
- Compose documents into the context window of an LLM like GPT3 for additional summarization or analysis.
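The three steps above can be sketched end to end. Here the retrieval result is stubbed out in place of a real `collection.query` call, and the prompt template is a hypothetical example rather than anything Chroma prescribes:

```python
def build_prompt(question, retrieved_docs):
    """Compose retrieved documents into an LLM context window (step 3)."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved_docs))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Stand-in for the documents returned by collection.query(...) (step 2).
docs = ["Chroma stores embeddings.", "Queries return nearest neighbors."]
prompt = build_prompt("What does Chroma store?", docs)
print(prompt)
```

The resulting string would then be sent to the LLM of your choice; only the retrieval step depends on Chroma.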
Embeddings?
What are embeddings?
- Read the guide from OpenAI
- Literal: Embedding something turns it from image/text/audio into a list of numbers: 🖼️ or 📄 => [1.2, 2.1, ....]. This process makes documents "understandable" to a machine learning model.
- By analogy: An embedding represents the essence of a document. This enables documents and queries with the same essence to be "near" each other and therefore easy to find.
- Technical: An embedding is the latent-space position of a document at a layer of a deep neural network. For models trained specifically to embed data, this is the last layer.
- A small example: If you search your photos for "famous bridge in San Francisco", embedding the query and comparing it to the embeddings of your photos and their metadata should return photos of the Golden Gate Bridge.
Embeddings databases (also known as vector databases) store embeddings and allow you to search by nearest neighbors rather than by substrings like a traditional database. By default, Chroma uses Sentence Transformers to embed for you but you can also use OpenAI embeddings, Cohere (multilingual) embeddings, or your own.
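Searching by nearest neighbors, as described above, boils down to comparing vectors. A toy cosine-similarity search over hand-made 3-dimensional "embeddings" (real models produce vectors with hundreds of dimensions, and databases like Chroma use approximate indexes rather than a linear scan):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, lower when they diverge."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embedding store: id -> vector (values invented for illustration).
store = {
    "bridge_photo": [0.9, 0.1, 0.2],
    "cat_photo": [0.1, 0.8, 0.3],
}
query = [0.85, 0.15, 0.25]  # pretend embedding of "famous bridge in San Francisco"

# The nearest neighbor is the stored vector with the highest similarity.
best = max(store, key=lambda k: cosine(query, store[k]))
print(best)  # bridge_photo
```

Chroma performs this ranking for you (with a choice of distance functions) when you call `collection.query`.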
Get involved
Chroma is a rapidly developing project. We welcome PR contributors and ideas for how to improve the project.
- Join the conversation on Discord - #contributing channel
- Review the 🗣️ Roadmap and contribute your ideas
- Grab an issue with the Good first issue tag and open a PR
- Read our contributing guide
Release Cadence
We currently release new tagged versions of the pypi and npm packages on Mondays. Hotfixes go out at any time during the week.
License
Apache 2.0