VertaAI / modeldb

Open Source ML Model Versioning, Metadata, and Experiment Management

Top Related Projects

  • mlflow/mlflow (18,503 stars): Open source platform for the machine learning lifecycle
  • iterative/dvc (13,582 stars): 🦉 ML Experiments and Data Management with Git
  • allegroai/clearml (5,558 stars): ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
  • pachyderm/pachyderm: Data-Centric Pipelines and Data Versioning
  • wandb/wandb (9,007 stars): The AI developer platform. Use Weights & Biases to train and fine-tune models, and manage models from experimentation to production.

Quick Overview

ModelDB is an open-source machine learning model versioning, metadata, and experiment management system. It allows data scientists and machine learning engineers to track experiments, version models, and manage artifacts in a centralized repository, facilitating collaboration and reproducibility in ML workflows.

Pros

  • Comprehensive experiment tracking and model versioning
  • Integration with popular ML frameworks like TensorFlow, PyTorch, and scikit-learn (see the sketch after this list)
  • Support for both code-first and UI-based workflows
  • Scalable architecture suitable for individual projects and large teams
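
For example, a minimal sketch of logging a scikit-learn model (assuming the verta client's log_hyperparameters, log_metric, and log_model calls accept a fitted estimator, as in the quickstart examples below):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from verta import Client

client = Client("http://localhost:3000")
run = client.set_experiment_run()

# train a simple scikit-learn model
X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), random_state=0)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# log the configuration, the resulting metric, and the model object itself
run.log_hyperparameters({"max_iter": 200})
run.log_metric("accuracy", model.score(X_test, y_test))
run.log_model(model)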

Cons

  • Steeper learning curve compared to some simpler experiment tracking tools
  • Limited built-in visualization capabilities
  • Requires additional setup and maintenance of a separate service
  • Some users report occasional stability issues with the open-source version

Code Examples

  1. Logging an experiment:
from verta import Client

# connect to a locally running ModelDB instance
client = Client("http://localhost:3000")
run = client.set_experiment_run()

run.log_hyperparameter("learning_rate", 0.001)
run.log_metric("accuracy", 0.95)
run.log_model("model.pkl")
  2. Retrieving a logged experiment:
from verta import Client

client = Client("http://localhost:3000")
run = client.get_experiment_run(run_id="your-run-id")

learning_rate = run.get_hyperparameter("learning_rate")
accuracy = run.get_metric("accuracy")
model = run.get_model()
  3. Comparing experiments:
from verta import Client

client = Client("http://localhost:3000")
proj = client.set_project("My Project")

# iterate over the runs recorded under this project
for run in proj.expt_runs:
    print(f"Run ID: {run.id}")
    print(f"Accuracy: {run.get_metric('accuracy')}")
    print(f"Learning Rate: {run.get_hyperparameter('learning_rate')}")
    print("---")

Getting Started

  1. Install ModelDB:
pip install verta
  2. Start a ModelDB server (for local development):
docker run -p 8080:8080 vertaaiofficial/modeldb
  3. Connect to ModelDB in your Python script:
from verta import Client

client = Client("http://localhost:8080")
project = client.set_project("My First Project")
experiment = client.set_experiment("Initial Experiment")
run = client.set_experiment_run("First Run")

# Your ML code here
# ...

# placeholders below: substitute the names and values from your own training code
run.log_hyperparameter("param_name", param_value)
run.log_metric("metric_name", metric_value)
run.log_model(model_object)

Competitor Comparisons

MLflow (18,503 stars)

Open source platform for the machine learning lifecycle

Pros of MLflow

  • More active community with frequent updates and contributions
  • Broader feature set including experiment tracking, model packaging, and deployment
  • Integrates well with popular ML frameworks and cloud platforms

Cons of MLflow

  • Can be complex to set up and configure for large-scale projects
  • Less focus on team collaboration and project management features
  • May require additional tools for advanced versioning and governance

Code Comparison

MLflow:

import mlflow

mlflow.start_run()
mlflow.log_param("param1", 5)
mlflow.log_metric("accuracy", 0.95)
mlflow.end_run()

ModelDB:

from verta import Client

client = Client()
run = client.set_experiment_run()
run.log_hyperparameter("param1", 5)
run.log_metric("accuracy", 0.95)

Both libraries offer similar functionality for logging parameters and metrics, but MLflow's API is slightly more concise. ModelDB (through Verta) provides a client-based approach, which can be beneficial for team-based projects and centralized management.

MLflow's broader ecosystem and integration capabilities make it a popular choice for many data science teams. However, ModelDB's focus on versioning and collaboration can be advantageous for organizations prioritizing these aspects in their ML workflows.
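
For instance, MLflow's runs are commonly scoped with a context manager, which ends the run automatically (a minimal sketch; the values are illustrative):

import mlflow

# the context manager calls end_run() automatically, even on errors
with mlflow.start_run():
    mlflow.log_params({"param1": 5, "learning_rate": 0.01})
    mlflow.log_metric("accuracy", 0.95)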

DVC (13,582 stars)

🦉 ML Experiments and Data Management with Git

Pros of DVC

  • Lightweight and integrates seamlessly with existing Git workflows
  • Supports a wide range of storage backends, including cloud services
  • Focuses on data versioning and pipeline management, making it more specialized for ML workflows

Cons of DVC

  • Lacks built-in experiment tracking and model versioning features
  • May require additional tools for comprehensive ML lifecycle management
  • Less emphasis on collaboration and team-oriented features

Code Comparison

DVC:

import dvc.api

with dvc.api.open('data/features.csv') as f:
    # Use the file object 'f'
    data = f.read()

ModelDB:

from verta import Client

client = Client()
run = client.set_experiment_run()
run.log_dataset_version("features", features_df)

Summary

DVC excels in data versioning and pipeline management, integrating well with Git workflows. It's lightweight and supports various storage options. However, it may lack some advanced ML lifecycle features.

ModelDB offers a more comprehensive solution for ML experiment tracking and model versioning, with a focus on collaboration. It provides a centralized platform for managing ML projects but may be more complex to set up and use compared to DVC.

Choose based on your specific needs: DVC for data-centric projects with existing Git workflows, or ModelDB for team-oriented ML projects requiring extensive experiment tracking and model management.
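
To make the Git-centric workflow concrete, here is a minimal sketch using dvc.api to read a file pinned to a Git revision (the repository URL and the v1.0 tag are hypothetical):

import dvc.api

# read a DVC-tracked file exactly as it existed at a given Git revision
data = dvc.api.read(
    "data/features.csv",
    repo="https://github.com/example/my-ml-repo",  # hypothetical repository
    rev="v1.0",  # any Git tag, branch, or commit
)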

ClearML (5,558 stars)

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Pros of ClearML

  • More comprehensive feature set, including experiment tracking, data versioning, and MLOps automation
  • Offers a free self-hosted option and a managed cloud service
  • Active development with frequent updates and community support

Cons of ClearML

  • Steeper learning curve due to its extensive functionality
  • May be overkill for smaller projects or teams with simpler ML workflows

Code Comparison

ClearML:

from clearml import Task

task = Task.init(project_name="My Project", task_name="My Experiment")
task.connect({"learning_rate": 0.01, "batch_size": 32})
# ... training code runs here; ClearML captures framework output automatically
task.close()

ModelDB:

from verta import Client

client = Client("http://localhost:3000")
run = client.set_experiment_run()
run.log_hyperparameter("learning_rate", 0.01)
run.log_hyperparameter("batch_size", 32)

Both libraries offer similar functionality for logging experiment parameters, but ClearML provides a more streamlined approach with automatic experiment tracking and integration with various ML frameworks. ModelDB focuses on version control and collaboration features, making it suitable for teams prioritizing reproducibility and model governance.
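
As a sketch of explicit metric reporting in ClearML alongside its automatic tracking (the project and series names are illustrative):

from clearml import Task

task = Task.init(project_name="My Project", task_name="Scalar Logging Demo")
logger = task.get_logger()

# explicit reporting; most framework metrics are also captured automatically
for epoch, acc in enumerate([0.82, 0.88, 0.91]):
    logger.report_scalar(title="accuracy", series="validation", value=acc, iteration=epoch)

task.close()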

Pachyderm

Data-Centric Pipelines and Data Versioning

Pros of Pachyderm

  • Focuses on data versioning and lineage, providing robust data management capabilities
  • Offers built-in data parallelism and distributed processing for large-scale data pipelines
  • Integrates well with container ecosystems, allowing for flexible and scalable deployments

Cons of Pachyderm

  • Steeper learning curve due to its complex architecture and concepts
  • May be overkill for smaller projects or teams not dealing with massive datasets
  • Requires more infrastructure setup and maintenance compared to ModelDB

Code Comparison

ModelDB (Python client):

from verta import Client

client = Client("http://localhost:3000")
project = client.set_project("My Project")
experiment = client.set_experiment("My Experiment")
run = client.set_experiment_run("My Run")

Pachyderm (Go client):

import "github.com/pachyderm/pachyderm/src/client"

c, err := client.NewOnUserMachine(false, "")
if err != nil {
    return err
}

Both repositories provide client libraries for interacting with their respective systems, but Pachyderm's API is more focused on data management and pipeline operations, while ModelDB emphasizes experiment tracking and model versioning.
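
For a more direct comparison in Python, Pachyderm also ships a Python client (a minimal sketch assuming the python_pachyderm package; the repo and file names are illustrative):

import python_pachyderm

# connect to pachd (defaults to localhost:30650)
client = python_pachyderm.Client()
client.create_repo("features")

# committing a file records its version and lineage
with client.commit("features", "master") as commit:
    client.put_file_bytes(commit, "/features.csv", b"col1,col2\n1,2\n")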

wandb (9,007 stars)

The AI developer platform. Use Weights & Biases to train and fine-tune models, and manage models from experimentation to production.

Pros of wandb

  • More extensive documentation and tutorials
  • Larger community and wider adoption in industry
  • Advanced visualization features for experiment tracking

Cons of wandb

  • Steeper learning curve for beginners
  • Primarily cloud-based, which may not suit all use cases

Code Comparison

wandb:

import wandb

wandb.init(project="my-project")
wandb.config.hyperparameters = {
    "learning_rate": 0.01,
    "epochs": 100
}
wandb.log({"accuracy": 0.9, "loss": 0.1})

modeldb:

from verta import Client

client = Client()
run = client.set_experiment_run()
run.log_hyperparameters({"learning_rate": 0.01, "epochs": 100})
run.log_metric("accuracy", 0.9)
run.log_metric("loss", 0.1)

Both wandb and modeldb offer similar functionality for experiment tracking and logging. wandb provides a more streamlined API for logging multiple metrics at once, while modeldb uses separate method calls for each metric. wandb's initialization is simpler, whereas modeldb requires creating a client and setting up an experiment run explicitly.
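
The more common wandb pattern passes the config directly to init, which keeps setup to a single call (a minimal sketch; the project name is illustrative):

import wandb

# hyperparameters set at init time; metrics logged in one call
run = wandb.init(project="my-project", config={"learning_rate": 0.01, "epochs": 100})
wandb.log({"accuracy": 0.9, "loss": 0.1})
run.finish()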

README

ModelDB: An open-source system for Machine Learning model versioning, metadata, and experiment management.


Quickstart · Workflow · Examples · Contribute · Support (Slack)


ModelDB is an open-source system for versioning machine learning models, including their ingredients (code, data, config, and environment), and for tracking ML metadata across the model lifecycle.

Use ModelDB to:

  • Make your ML models reproducible
  • Manage your ML experiments, build performance dashboards, and share reports
  • Track models across their lifecycle including development, deployment, and live monitoring

Features:

  • Works on Docker, Kubernetes
  • Clients in Python and Scala
  • Beautiful dashboards for model performance and reporting
  • Git-like operations on any model
  • Flexible metadata logging, including metrics, artifacts, tags, and user information (see the sketch after this list)
  • Pluggable storage systems
  • Integration with state-of-the-art frameworks like TensorFlow and PyTorch
  • Battle-tested in production environments
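
As a sketch of the flexible metadata logging (assuming the Python client's log_tags, log_attribute, and log_artifact calls; the names and values are illustrative):

from verta import Client

client = Client("http://localhost:3000")
run = client.set_experiment_run()

# tags and attributes make runs searchable in the Web UI
run.log_tags(["baseline", "iris"])
run.log_attribute("team", "data-science")

# arbitrary artifacts can be stored alongside metrics
run.log_artifact("confusion_matrix", "confusion_matrix.png")
run.log_metric("f1", 0.88)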

If you are looking for a hosted version of ModelDB, please reach out at modeldb@verta.ai.

This version of ModelDB is built upon its predecessor from MIT CSAIL. The previous version can be found on GitHub here. The ModelDB project is now maintained by Verta.ai.


What’s In This Document

  • Up and Running in 5 minutes
  • Documentation
  • Community
  • Architecture
  • Repo Structure
  • Contributions
  • License
  • Thanks


Up and Running in 5 minutes

  1. Install Docker (and Docker Compose)

  2. Set up ModelDB via Docker Compose

docker-compose -f docker-compose-all.yaml up

Note: the modeldb-backend service needs backend/config/config.yaml to run; either clone the repo before running docker-compose or create the file manually.

  3. Install the ModelDB pip package. Note that it comes packaged in the verta package.
pip install verta
  4. Version a model or log a workflow. Alternatively, run any of the detailed examples in our repository.
from verta import Client
client = Client("http://localhost:3000")

proj = client.set_project("My first ModelDB project")
expt = client.set_experiment("Default Experiment")

# log the first run
run = client.set_experiment_run("First Run")
run.log_hyperparameters({"regularization": 0.5})
# ... model training code goes here
run.log_metric('accuracy', 0.72)

# log the second run
run = client.set_experiment_run("Second Run")
run.log_hyperparameters({"regularization": 0.8})
# ... model training code goes here
run.log_metric('accuracy', 0.83)

That's it! Navigate to http://localhost:3000 to find the ModelDB Web UI and check out the models you just logged.


For information on debugging the Docker-based ModelDB installation, check here.

Other ways to install ModelDB are:

  1. Building the source code and deploying it
  2. Deploying on Kubernetes via Helm
  3. Using a ModelDB AMI
  4. If you are looking for a hosted version of ModelDB, please reach out at modeldb@verta.ai.

Documentation

Official documentation for ModelDB can be found here.


Community

For Getting Started guides, Tutorials, and API reference check out our docs.

To report a bug, file a documentation issue, or submit a feature request, please open a GitHub issue.

For help, questions, contribution discussions and release announcements, please join us on Slack.


Architecture

At a high level, the architecture of ModelDB in a Kubernetes cluster or a Docker application is as follows:

[architecture diagram]

  • ModelDB Client: available in Python and Scala, it can be instantiated in the user's model-building code and exposes functions to store information in ModelDB.
  • ModelDB Frontend: developed in JavaScript and TypeScript, it is the visual reporting module of ModelDB and also acts as the entry point for the ModelDB cluster.
    • It receives requests from the client (1) and the browser and routes them to the appropriate container.
    • The gRPC calls (2) for creating, reading, updating, or deleting Projects, Experiments, ExperimentRuns, Datasets, and DatasetVersions or their metadata are routed to the ModelDB Proxy.
    • The HTTP calls (3) for storing and retrieving binary artifacts are forwarded directly to the backend.
  • ModelDB Backend Proxy: developed in Go, this is a lightweight gRPC-to-HTTP converter.
    • It receives gRPC requests from the frontend (2) and sends them to the backend (4). In the other direction, it converts responses from the backend and sends them to the frontend.
  • ModelDB Backend: developed in Java, this is the module that stores, retrieves, and deletes information as triggered by the user via the client or the frontend.
    • It exposes gRPC endpoints (4) for most operations, which are used by the proxy.
    • It has HTTP endpoints (3) for storing, retrieving, and deleting artifacts, used directly by the frontend.
  • Database: the ModelDB Backend stores (5) the information from the requests it receives in a relational database.
    • Out of the box, ModelDB is configured and verified to work against PostgreSQL, but since it uses Hibernate as its ORM and Liquibase for change management, it should be straightforward to configure ModelDB to run on another SQL database supported by those tools.

Volumes: the relational database and the artifact store in the backend need volumes attached to enable persistent storage.

Repo Structure

Each module in the architecture diagram has a designated folder in this repository, and each has its own README covering in-depth documentation and contribution guidelines.

  1. protos has the protobuf definitions of the objects and endpoints used across ModelDB. More details here.
  2. backend has the source code and tests for the ModelDB Backend. It also holds the proxy at backend/proxy. More details here.
  3. client has the source code and tests for the ModelDB client. More details here.
  4. webapp has the source code and tests for the ModelDB frontend. More details here.

Other supporting material for deployment and documentation is at:

  1. chart has the helm chart to deploy ModelDB onto your Kubernetes cluster. More details here.
  2. doc-resources has images for documentation.

Contributions

As seen from the architecture, ModelDB provides a full-stack solution for tracking, versioning, and auditing machine learning models. We welcome contributions to any of the modules in the form of pull requests.

The main skill sets for each module are as follows:

  1. backend: If you are interested in Java development, or in database design using technologies like Hibernate and Liquibase, please take a look at the backend README for setup and development instructions.
  2. client: If you are interested in Python or Scala development, or in building example notebooks that log data from various ML frameworks to ModelDB, please take a look at the client README.
  3. frontend: If you are interested in Node-, React-, or Redux-based development, please take a look at the webapp README.

Please reach out to us on Slack for assistance in getting started with the development setup, or with any other feedback.


License

ModelDB is licensed under Apache 2.0.


Thanks

Thanks to our many contributors and users.