
NVIDIA / deepops

Tools for building GPU clusters


Top Related Projects

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes


Production-Grade Container Scheduling and Management


⚠️(OBSOLETE) Curated applications for Kubernetes


Deploy a Production Ready Kubernetes Cluster

Install and config an OpenShift 3.x cluster

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.

Quick Overview

DeepOps is an open-source project by NVIDIA that provides a set of Ansible playbooks and scripts for deploying and managing GPU-accelerated infrastructure for deep learning and AI workloads. It simplifies the process of setting up Kubernetes clusters, Slurm clusters, and other tools essential for machine learning operations.

Pros

  • Streamlines the deployment of complex AI infrastructure
  • Supports both on-premises and cloud environments
  • Integrates well with NVIDIA GPU technologies
  • Provides flexibility in cluster management (Kubernetes or Slurm)

Cons

  • Steep learning curve for those unfamiliar with Ansible and infrastructure management
  • Limited documentation for advanced use cases
  • May require significant customization for specific enterprise needs
  • Dependency on specific hardware (NVIDIA GPUs) for optimal performance

Getting Started

To get started with DeepOps:

  1. Clone the repository:

    git clone https://github.com/NVIDIA/deepops.git
    
  2. Set up the virtual environment:

    cd deepops
    ./scripts/setup.sh
    
  3. Configure your inventory file:

    cp config/inventory.example config/inventory
    vi config/inventory
    
  4. Deploy Kubernetes or Slurm cluster:

    # For Kubernetes
    ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml
    
    # For Slurm
    ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml
    

Note: Ensure you have Ansible installed and have properly configured your target nodes before running the playbooks.
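Before running the playbooks against real hardware, it can help to sanity-check the inventory file from step 3. The sketch below is a hypothetical helper (the `check_inventory` function and group names are illustrative, modeled on the group labels used in the deploy commands above; adapt them to your actual `config/inventory`):

```shell
#!/bin/sh
# Hypothetical pre-flight check: verify the inventory file exists and
# contains the Ansible groups the playbooks will target.
check_inventory() {
  inv="$1"
  [ -f "$inv" ] || { echo "missing inventory: $inv"; return 1; }
  for group in kube-master kube-node; do
    grep -q "^\[$group\]" "$inv" || { echo "missing group: [$group]"; return 1; }
  done
  echo "inventory OK"
}

# Demo on a throwaway inventory (hostnames and IPs are placeholders):
cat > /tmp/demo-inventory <<'EOF'
[all]
mgmt01 ansible_host=10.0.0.10
gpu01  ansible_host=10.0.0.11

[kube-master]
mgmt01

[kube-node]
gpu01
EOF

check_inventory /tmp/demo-inventory   # prints "inventory OK"
```

A check like this catches a misnamed group before Ansible fails partway through a long playbook run.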

Competitor Comparisons

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes

Pros of gpu-operator

  • Focused specifically on GPU management in Kubernetes
  • Simpler setup and configuration for GPU-enabled clusters
  • Automated driver and CUDA toolkit installation

Cons of gpu-operator

  • Limited scope compared to DeepOps' broader infrastructure management
  • Less flexibility for customizing deployment options
  • Requires Kubernetes, not suitable for bare-metal or non-containerized environments

Code Comparison

gpu-operator:

apiVersion: "nvidia.com/v1"
kind: "ClusterPolicy"
metadata:
  name: "cluster-policy"
spec:
  dcgmExporter:
    enabled: true

DeepOps:

- hosts: all
  become: true
  roles:
    - nvidia.nvidia_driver
    - nvidia.nvidia_docker
    - k8s-gpu-plugin

Summary

gpu-operator is a specialized tool for managing NVIDIA GPUs in Kubernetes environments, offering simplified setup and automated driver management. DeepOps, on the other hand, provides a more comprehensive solution for deploying and managing GPU-accelerated infrastructure across various environments, including bare-metal and cloud platforms. While gpu-operator excels in Kubernetes-specific GPU management, DeepOps offers greater flexibility and broader infrastructure support.


Production-Grade Container Scheduling and Management

Pros of kubernetes

  • Widely adopted industry standard for container orchestration
  • Extensive ecosystem with numerous tools and integrations
  • Highly scalable and flexible for various deployment scenarios

Cons of kubernetes

  • Steeper learning curve and more complex setup
  • Requires more resources and overhead for small-scale deployments
  • Less focused on GPU and HPC workloads compared to DeepOps

Code comparison

kubernetes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx

DeepOps:

- hosts: all
  become: true
  tasks:
    - name: Install NVIDIA GPU driver
      include_role:
        name: nvidia.nvidia_driver

The kubernetes example shows a basic Deployment configuration, while the DeepOps example demonstrates an Ansible playbook for installing NVIDIA GPU drivers. This highlights the difference in focus between the two projects, with kubernetes being more general-purpose and DeepOps tailored for GPU-accelerated workloads.


⚠️(OBSOLETE) Curated applications for Kubernetes

Pros of Charts

  • Broader scope with charts for various applications and services
  • Larger community and more frequent updates
  • More flexible and adaptable for different Kubernetes environments

Cons of Charts

  • Less focus on AI/ML and HPC workloads
  • May require more configuration for specialized deployments
  • Not optimized for NVIDIA hardware out-of-the-box

Code Comparison

Charts example (Helm chart structure):

mychart/
  Chart.yaml
  values.yaml
  templates/
    deployment.yaml
    service.yaml

DeepOps example (Ansible playbook structure):

playbooks/
  nvidia-docker.yml
  k8s-cluster.yml
  slurm-cluster.yml

Key Differences

  • Charts focuses on Kubernetes package management
  • DeepOps emphasizes AI/ML infrastructure deployment
  • Charts uses Helm for templating and deployment
  • DeepOps utilizes Ansible for configuration management

Use Cases

Charts:

  • General-purpose Kubernetes applications
  • Cloud-native microservices

DeepOps:

  • AI/ML and HPC cluster deployments
  • NVIDIA GPU-accelerated workloads

Community and Support

Charts:

  • Large, diverse community
  • Regular contributions from various organizations

DeepOps:

  • Focused NVIDIA support
  • Specialized for GPU-accelerated computing

Deploy a Production Ready Kubernetes Cluster

Pros of kubespray

  • More flexible and adaptable for various Kubernetes deployments
  • Supports a wider range of operating systems and cloud providers
  • Larger community and more frequent updates

Cons of kubespray

  • Steeper learning curve for beginners
  • Requires more manual configuration and customization
  • Less focus on GPU and HPC-specific optimizations

Code Comparison

kubespray:

all:
  vars:
    ansible_user: ubuntu
    ansible_become: true
    kubeadm_enabled: true
    kube_network_plugin: calico

deepops:

all:
  vars:
    ansible_user: ubuntu
    ansible_become: true
    slurm_enabled: true
    k8s_gpu_plugin: nvidia

Summary

kubespray is a more general-purpose Kubernetes deployment tool, offering greater flexibility and broader support for various environments. It's ideal for users who need a customizable solution and have experience with Kubernetes.

deepops, on the other hand, is tailored for GPU and HPC workloads, providing out-of-the-box optimizations for NVIDIA hardware and integration with tools like Slurm. It's more suitable for users focusing on GPU-accelerated computing and scientific applications.

Choose kubespray for versatility and community support, or deepops for a streamlined GPU-centric deployment experience.

Install and config an OpenShift 3.x cluster

Pros of OpenShift-Ansible

  • Specifically designed for OpenShift, providing a more tailored deployment experience
  • Extensive documentation and community support for enterprise-grade Kubernetes deployments
  • Integrates well with Red Hat's ecosystem of tools and services

Cons of OpenShift-Ansible

  • Less flexible for general-purpose cluster deployments compared to DeepOps
  • May have a steeper learning curve for users not familiar with OpenShift
  • Limited focus on GPU and AI/ML workloads

Code Comparison

OpenShift-Ansible:

- name: Install OpenShift
  hosts: masters
  tasks:
    - name: Run OpenShift installer
      command: openshift-install create cluster

DeepOps:

- name: Deploy Kubernetes
  hosts: kube-master
  tasks:
    - name: Run Kubespray playbook
      include_role:
        name: kubespray-runner

Both repositories use Ansible for deployment, but DeepOps focuses on a broader range of deployment options, including GPU-accelerated clusters, while OpenShift-Ansible is tailored specifically for OpenShift deployments. DeepOps provides more flexibility for various infrastructure setups, while OpenShift-Ansible offers a more streamlined experience for OpenShift-specific deployments.

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.

Pros of MicroK8s

  • Lightweight and easy to install, ideal for edge computing and IoT devices
  • Supports single-node and multi-node clusters with minimal configuration
  • Includes add-ons for common services like DNS, dashboard, and storage

Cons of MicroK8s

  • Limited GPU support compared to DeepOps' extensive NVIDIA GPU integration
  • Less focus on high-performance computing and AI/ML workloads
  • Smaller ecosystem of pre-configured tools for data science and HPC

Code Comparison

MicroK8s installation:

sudo snap install microk8s --classic
microk8s status --wait-ready
microk8s kubectl get nodes

DeepOps GPU setup:

./scripts/k8s/deploy_gpu_operator.sh
kubectl get pods -n gpu-operator-resources
kubectl describe node | grep -i nvidia

Both projects aim to simplify Kubernetes deployment, but DeepOps focuses on GPU-accelerated clusters for AI and HPC workloads, while MicroK8s targets lightweight, general-purpose Kubernetes deployments. DeepOps provides more extensive tooling for NVIDIA GPU management and optimization, whereas MicroK8s offers a more streamlined experience for quick Kubernetes setup across various environments.


README

DeepOps

Infrastructure automation tools for Kubernetes and Slurm clusters with NVIDIA GPUs.


Overview

The DeepOps project encapsulates best practices in the deployment of GPU server clusters and sharing single powerful nodes (such as NVIDIA DGX Systems). DeepOps may also be adapted or used in a modular fashion to match site-specific cluster needs. For example:

  • An on-prem data center of NVIDIA DGX servers where DeepOps provides end-to-end capabilities to set up the entire cluster management stack
  • An existing cluster running Kubernetes where DeepOps scripts are used to deploy KubeFlow and connect NFS storage
  • An existing cluster that needs a resource manager / batch scheduler, where DeepOps is used to install Slurm or Kubernetes
  • A single machine where no scheduler is desired, only NVIDIA drivers, Docker, and the NVIDIA Container Runtime

Latest release: DeepOps 23.08 Release

It is recommended to use the latest release branch for stable code (linked above). All development takes place on the master branch, which is generally functional but may change significantly between releases.

Deployment Requirements

Provisioning System

The provisioning system is used to orchestrate the running of all playbooks, and one is needed when instantiating Kubernetes or Slurm clusters. Tested and supported operating systems include:

  • NVIDIA DGX OS 4, 5
  • Ubuntu 18.04 LTS, 20.04 LTS, 22.04 LTS
  • CentOS 7, 8

Cluster System

The cluster nodes must meet the requirements described by Slurm or Kubernetes. You may also use a cluster node as the provisioning system, but this is not required.

  • NVIDIA DGX OS 4, 5
  • Ubuntu 18.04 LTS, 20.04 LTS, 22.04 LTS
  • CentOS 7, 8

You may also install a supported operating system on all servers via a 3rd-party solution (e.g., MAAS, Foreman) or utilize the provided OS install container.
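To make the node roles concrete, an inventory in the INI format Ansible uses might look like the sketch below (hostnames, IPs, and the exact group layout are illustrative; the `k8s-cluster` and `slurm-cluster` parent groups mirror the `-l` limits used in the deploy commands elsewhere in this document):

```ini
[all]
mgmt01  ansible_host=10.0.0.10
gpu01   ansible_host=10.0.0.11
gpu02   ansible_host=10.0.0.12

[kube-master]
mgmt01

[kube-node]
gpu01
gpu02

[k8s-cluster:children]
kube-master
kube-node

[slurm-master]
mgmt01

[slurm-node]
gpu01
gpu02

[slurm-cluster:children]
slurm-master
slurm-node
```

Grouping nodes this way lets the same inventory drive either the Kubernetes or the Slurm playbooks by limiting the run to the relevant parent group.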

Kubernetes

Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications. The instantiation of a Kubernetes cluster is done by Kubespray. Kubespray runs on bare metal and most clouds, using Ansible as its substrate for provisioning and orchestration. For those already familiar with Ansible, with existing Ansible deployments, or with a need to run a Kubernetes cluster across multiple platforms, Kubespray is a good choice. Kubespray handles generic configuration-management tasks from the "OS operators" Ansible world, plus some initial K8s clustering (with networking plugins included) and control-plane bootstrapping. DeepOps provides additional playbooks for orchestration and optimization of GPU environments.

Consult the DeepOps Kubernetes Deployment Guide for instructions on building a GPU-enabled Kubernetes cluster using DeepOps.

For more information on Kubernetes in general, refer to the official Kubernetes docs.
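Once a cluster is up, a quick way to confirm that GPUs are schedulable is a throwaway pod requesting the standard `nvidia.com/gpu` extended resource exposed by the NVIDIA device plugin. This is a generic Kubernetes sketch rather than a DeepOps-specific manifest, and the image tag is a placeholder:

```yaml
# GPU smoke test: runs nvidia-smi on one allocated GPU, then exits.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # standard resource name advertised by the device plugin
```

If `kubectl logs gpu-smoke-test` shows the usual `nvidia-smi` table, the driver, container runtime, and device plugin are all wired up correctly.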

Slurm

Slurm is an open-source cluster resource management and job scheduling system that strives to be simple, scalable, portable, fault-tolerant, and interconnect-agnostic. Slurm has currently been tested only under Linux.

As a cluster resource manager, Slurm provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work. Slurm cluster instantiation is achieved through SchedMD.

Consult the DeepOps Slurm Deployment Guide for instructions on building a GPU-enabled Slurm cluster using DeepOps.

For more information on Slurm in general, refer to the official Slurm docs.
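The three scheduler functions above map directly onto a batch job script: the `#SBATCH` directives request resources for a duration, `srun` launches the work on the allocated nodes, and the queue arbitrates when it runs. A minimal GPU job sketch (job name and time limit are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=gpu-smoke-test
#SBATCH --gres=gpu:1          # request one GPU via Slurm's generic resources
#SBATCH --time=00:05:00       # placeholder walltime

# Launch on the allocated node(s); prints the visible GPU(s) to the job log.
srun nvidia-smi
```

Submitting this with `sbatch` and checking the output file is a quick end-to-end test that GPU scheduling works on the cluster.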

Hybrid clusters

DeepOps does not test or support a configuration where both Kubernetes and Slurm are deployed on the same physical cluster.

NVIDIA Bright Cluster Manager is recommended as an enterprise solution which enables managing multiple workload managers within a single cluster, including Kubernetes, Slurm, Univa Grid Engine, and PBS Pro.

DeepOps does not test or support configurations where nodes run heterogeneous operating systems. Additional modifications are needed if you plan to use unsupported operating systems such as RHEL.

Virtual

To try DeepOps before deploying it on an actual cluster, a virtualized version of DeepOps may be deployed on a single node using Vagrant. This can be used for testing, adding new features, or configuring DeepOps to meet deployment-specific needs.

Consult the Virtual DeepOps Deployment Guide to build a GPU-enabled virtual cluster with DeepOps.

Updating DeepOps

To update from a previous version of DeepOps to a newer release, please consult the DeepOps Update Guide.

Copyright and License

This project is released under the BSD 3-clause license.

Issues

NVIDIA DGX customers should file an NVES ticket via NVIDIA Enterprise Services.

Otherwise, bugs and feature requests can be made by filing a GitHub Issue.

Contributing

To contribute, please issue a signed pull request against the master branch from a local fork. See the contribution document for more information.