Top Related Projects
- Argo Workflows: Workflow Engine for Kubernetes
- Apache Airflow: A platform to programmatically author, schedule, and monitor workflows
- Prefect: A workflow orchestration framework for building resilient data pipelines in Python
- Dagster: An orchestration platform for the development, production, and observation of data assets
- Metaflow: Open Source Platform for developing, scaling, and deploying serious ML, AI, and data science systems
- MLflow: Open source platform for the machine learning lifecycle
Quick Overview
Kubeflow Pipelines is an open-source platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers. It provides a user interface for managing and tracking experiments, jobs, and runs, making it easier to compose, deploy, and manage complex ML pipelines.
Pros
- Seamless integration with Kubernetes for scalable and portable ML workflows
- Supports end-to-end orchestration of ML pipelines, from data preparation to model deployment
- Provides a user-friendly interface for visualizing and managing pipeline runs
- Enables easy sharing and reuse of components and pipelines across teams and projects
Cons
- Steep learning curve for users unfamiliar with Kubernetes and container technologies
- Complex setup and maintenance, especially for on-premises deployments
- Limited support for certain ML frameworks and libraries compared to some other platforms
- Resource-intensive, which may lead to higher costs for small-scale projects
Code Examples
- Defining a simple pipeline component:
from kfp.dsl import component

@component
def add_numbers(a: int, b: int) -> int:
    return a + b
- Creating a pipeline using components (print_result is a second, similarly defined component):
from kfp.dsl import pipeline, component

@component
def print_result(result: int):
    print(result)

@pipeline(name="Simple Addition Pipeline")
def addition_pipeline(a: int, b: int):
    # Component calls inside a pipeline body use keyword arguments
    add_op = add_numbers(a=a, b=b)
    print_op = print_result(result=add_op.output)
- Compiling and running a pipeline:
import kfp
from kfp import compiler

compiler.Compiler().compile(addition_pipeline, "addition_pipeline.yaml")

client = kfp.Client()
client.create_run_from_pipeline_func(addition_pipeline, arguments={"a": 5, "b": 7})
Getting Started
To get started with Kubeflow Pipelines:
- Install the Kubeflow Pipelines SDK:
pip install kfp
- Set up a Kubernetes cluster and install Kubeflow Pipelines:
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=1.8.5"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=1.8.5"
- Port-forward the Kubeflow Pipelines UI:
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
- Access the Kubeflow Pipelines UI at http://localhost:8080
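With the port-forward in place, you can also drive Kubeflow Pipelines from the SDK instead of the UI. The following is a minimal sketch, assuming the port-forward from the previous step and the addition_pipeline.yaml package compiled earlier; the experiment name is illustrative:

import kfp

# Point the SDK client at the port-forwarded Kubeflow Pipelines endpoint
client = kfp.Client(host="http://localhost:8080")

# Submit the previously compiled pipeline package and group the run under an experiment
run = client.create_run_from_pipeline_package(
    "addition_pipeline.yaml",
    arguments={"a": 5, "b": 7},
    experiment_name="addition-experiments",  # illustrative name
)
print(f"Started run {run.run_id}")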
Competitor Comparisons
Argo Workflows: Workflow Engine for Kubernetes
Pros of Argo Workflows
- Simpler and more lightweight, focusing solely on workflow orchestration
- More flexible and customizable, allowing for complex workflow patterns
- Better support for GitOps practices and CI/CD integration
Cons of Argo Workflows
- Less integrated with other ML-specific tools and frameworks
- Requires more manual setup and configuration for ML-specific tasks
- Steeper learning curve for data scientists without DevOps experience
Code Comparison
Argo Workflows:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
Kubeflow Pipelines:
import kfp
from kfp import dsl

@dsl.pipeline(
    name='Hello World Pipeline',
    description='A simple pipeline that prints "Hello, World!"'
)
def hello_world_pipeline():
    hello_op = dsl.ContainerOp(
        name='hello',
        image='library/bash:4.4.23',
        command=['echo', 'Hello, World!']
    )

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(hello_world_pipeline, 'hello_world_pipeline.yaml')
Both Argo Workflows and Kubeflow Pipelines are powerful tools for orchestrating workflows on Kubernetes. Argo Workflows offers more flexibility and is better suited for general-purpose workflow orchestration, while Kubeflow Pipelines is more tailored for machine learning workflows with integrated ML-specific features.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Pros of Airflow
- More mature and widely adopted in the industry
- Extensive ecosystem with a large number of integrations and plugins
- Flexible scheduling capabilities with cron-like expressions
Cons of Airflow
- Steeper learning curve, especially for complex workflows
- Less native support for machine learning and data science workflows
- Can be resource-intensive for large-scale deployments
Code Comparison
Airflow DAG definition:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def my_function():
    print("Hello from Airflow!")

dag = DAG('example_dag', start_date=datetime(2023, 1, 1), schedule_interval='@daily')
task = PythonOperator(task_id='example_task', python_callable=my_function, dag=dag)
Kubeflow Pipelines component definition:
from kfp import dsl

@dsl.component
def my_component():
    print("Hello from Kubeflow Pipelines!")

@dsl.pipeline
def my_pipeline():
    my_component()
Both Airflow and Kubeflow Pipelines offer workflow orchestration capabilities, but Kubeflow Pipelines is more focused on machine learning workflows and Kubernetes integration. Airflow provides a more general-purpose solution for data pipeline orchestration across various environments.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
Pros of Prefect
- More lightweight and flexible, easier to set up and use for smaller projects
- Better support for local development and testing
- More intuitive Python-based workflow definition
Cons of Prefect
- Less integrated with Kubernetes and cloud-native ecosystems
- Fewer built-in components for machine learning workflows
- Smaller community and ecosystem compared to Kubeflow Pipelines
Code Comparison
Prefect workflow definition:
from prefect import flow, task

@task
def process_data(data):
    return data.upper()

@flow
def my_flow(input_data):
    result = process_data(input_data)
    return result
Kubeflow Pipelines workflow definition:
from kfp import dsl

@dsl.pipeline(
    name='My pipeline',
    description='A simple pipeline'
)
def my_pipeline(input_data: str):
    process_op = dsl.ContainerOp(
        name='process-data',
        image='my-image:latest',
        command=['python', 'process.py'],
        arguments=[input_data]
    )
Both Prefect and Kubeflow Pipelines are powerful workflow orchestration tools, but they cater to different use cases. Prefect is more suitable for general-purpose data workflows and easier to get started with, while Kubeflow Pipelines is better integrated with Kubernetes and machine learning ecosystems, offering more robust features for large-scale, production ML pipelines.
Dagster: An orchestration platform for the development, production, and observation of data assets
Pros of Dagster
- More flexible and lightweight, suitable for various environments (local, cloud, etc.)
- Better support for testing and local development
- Stronger focus on data quality and observability
Cons of Dagster
- Less integrated with Kubernetes and cloud-native ecosystems
- Smaller community and ecosystem compared to Kubeflow Pipelines
- Steeper learning curve for complex workflows
Code Comparison
Dagster:
# Legacy Dagster API (solid/pipeline); newer Dagster releases use op/job instead
from dagster import pipeline, solid

@solid
def process_data(context, data):
    return data.upper()

@pipeline
def my_pipeline():
    process_data()
Kubeflow Pipelines:
from kfp import dsl

def process_data_op():
    return dsl.ContainerOp(
        name='Process Data',
        image='my-image:latest',
        command=['python', 'process.py']
    )

@dsl.pipeline(name='My Pipeline')
def my_pipeline():
    process_data_op()
Both Dagster and Kubeflow Pipelines are powerful tools for building data pipelines, but they have different strengths and use cases. Dagster is more flexible and focuses on data quality, while Kubeflow Pipelines is better integrated with Kubernetes and cloud-native environments. The choice between them depends on your specific requirements and infrastructure preferences.
Metaflow: Open Source Platform for developing, scaling, and deploying serious ML, AI, and data science systems
Pros of Metaflow
- Simpler setup and easier to get started
- More flexible and language-agnostic (supports Python, R, and more)
- Better suited for data scientists with less DevOps experience
Cons of Metaflow
- Less comprehensive ecosystem and integrations
- Not as scalable for large, complex workflows
- Limited built-in support for distributed training
Code Comparison
Metaflow:
from metaflow import FlowSpec, step

class MyFlow(FlowSpec):

    @step
    def start(self):
        self.data = 'Hello, World!'
        self.next(self.end)

    @step
    def end(self):
        print(self.data)

if __name__ == '__main__':
    MyFlow()
Kubeflow Pipelines:
import kfp
from kfp import dsl

@dsl.pipeline(name='My Pipeline')
def my_pipeline():
    op1 = dsl.ContainerOp(
        name='Print Data',
        image='python:3.7',
        command=['python', '-c'],
        arguments=['print("Hello, World!")']
    )

kfp.compiler.Compiler().compile(my_pipeline, 'pipeline.yaml')
Both Metaflow and Kubeflow Pipelines are powerful tools for building and managing machine learning workflows. Metaflow offers a more user-friendly approach, making it easier for data scientists to get started quickly. Kubeflow Pipelines, on the other hand, provides a more comprehensive ecosystem and better scalability for complex, production-grade workflows.
MLflow: Open source platform for the machine learning lifecycle
Pros of MLflow
- Lightweight and easy to set up, with minimal dependencies
- Language-agnostic, supporting Python, R, Java, and more
- Flexible deployment options (local, cloud, or on-premise)
Cons of MLflow
- Less comprehensive end-to-end ML workflow management
- Limited native support for distributed training and hyperparameter tuning
- Fewer built-in integrations with cloud services and ML frameworks
Code Comparison
MLflow:
import mlflow
mlflow.start_run()
mlflow.log_param("param1", 5)
mlflow.log_metric("accuracy", 0.85)
mlflow.end_run()
Kubeflow Pipelines:
from kfp import dsl

@dsl.pipeline(name='My pipeline')
def my_pipeline():
    task1 = dsl.ContainerOp(name='Task 1', image='image1')
    task2 = dsl.ContainerOp(name='Task 2', image='image2')
    task2.after(task1)
MLflow focuses on experiment tracking and model management, while Kubeflow Pipelines emphasizes defining and orchestrating complex ML workflows. MLflow's code is simpler for logging experiments, while Kubeflow Pipelines requires more setup but offers greater control over pipeline structure and execution.
README
Overview of the Kubeflow pipelines service
Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable.
Kubeflow pipelines are reusable end-to-end ML workflows built using the Kubeflow Pipelines SDK.
The Kubeflow pipelines service has the following goals:
- End-to-end orchestration: enabling and simplifying the orchestration of end-to-end machine learning pipelines
- Easy experimentation: making it easy for you to try numerous ideas and techniques, and manage your various trials/experiments.
- Easy re-use: enabling you to re-use components and pipelines to quickly cobble together end to end solutions, without having to re-build each time.
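As an illustration of the re-use goal, a component definition shared by another team can be loaded from its YAML specification and wired into a new pipeline without rebuilding it. This is a minimal sketch; add_numbers.yaml is a hypothetical path to a shared component spec:

from kfp import components, dsl

# Load a shared, reusable component from its YAML specification (hypothetical file)
add_numbers = components.load_component_from_file("add_numbers.yaml")

@dsl.pipeline(name="reuse-example")
def reuse_pipeline(a: int, b: int):
    # The loaded component is used like any locally defined one
    add_numbers(a=a, b=b)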
Installation
- Kubeflow Pipelines can be installed as part of the Kubeflow Platform. Alternatively, you can deploy Kubeflow Pipelines as a standalone service.
- The Docker container runtime has been deprecated on Kubernetes 1.20+. Kubeflow Pipelines has used the Emissary executor by default since Kubeflow Pipelines 1.8. The Emissary executor is container-runtime agnostic, meaning you can run Kubeflow Pipelines on a Kubernetes cluster with any container runtime.
Documentation
Get started with your first pipeline and read further information in the Kubeflow Pipelines overview.
See the various ways you can use the Kubeflow Pipelines SDK.
See the Kubeflow Pipelines API doc for API specification.
Consult the Python SDK reference docs when writing pipelines using the Python SDK.
Contributing to Kubeflow Pipelines
Before you start contributing to Kubeflow Pipelines, read the guidelines in How to Contribute. To learn how to build and deploy Kubeflow Pipelines from source code, read the developer guide.
Kubeflow Pipelines Community Meeting
The community meeting happens every other Wednesday, 10-11 AM (PST): Calendar Invite or Join Meeting Directly.
Kubeflow Pipelines Slack Channel
Blog posts
- Getting started with Kubeflow Pipelines (By Amy Unruh)
- How to create and deploy a Kubeflow Machine Learning Pipeline (By Lak Lakshmanan)
- Tekton optimizations for Kubeflow Pipelines 2.0 (By Tommy Li)
Acknowledgments
Kubeflow Pipelines uses Argo Workflows by default under the hood to orchestrate Kubernetes resources. The Argo community has been very supportive and we are very grateful. A Tekton backend is also available; to use it, refer to the Kubeflow Pipelines with Tekton repository.