
dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.

Top Related Projects

  • PrefectHQ/prefect: Prefect is a workflow orchestration framework for building resilient data pipelines in Python.

  • apache/airflow: Apache Airflow - A platform to programmatically author, schedule, and monitor workflows.

  • great-expectations/great_expectations: Always know what to expect from your data.

  • meltano/meltano: The declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

  • kedro-org/kedro: Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

  • ploomber/ploomber: The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

Quick Overview

Dagster is an open-source data orchestration platform for machine learning, analytics, and ETL. It provides a unified view of data pipelines and assets across the entire organization, allowing data practitioners to design, develop, and manage data workflows with ease.

Pros

  • Flexible and extensible architecture that supports various data processing frameworks
  • Strong emphasis on testing, observability, and maintainability of data pipelines
  • Powerful asset-based paradigm for modeling data dependencies and lineage
  • Intuitive UI for monitoring and debugging data workflows

Cons

  • Steeper learning curve compared to some simpler workflow management tools
  • Limited native integrations with certain cloud services and data platforms
  • Resource-intensive for smaller projects or organizations
  • Documentation can be overwhelming for beginners due to the platform's extensive features

Code Examples

  1. Defining a simple asset:
from dagster import asset

@asset
def my_data():
    return [1, 2, 3, 4, 5]
  2. Creating a job with multiple assets:
from dagster import asset, define_asset_job

@asset
def raw_data():
    return [1, 2, 3, 4, 5]

@asset
def processed_data(raw_data):
    return [x * 2 for x in raw_data]

job = define_asset_job("my_job", selection=[raw_data, processed_data])
  3. Configuring a schedule for a job:
from dagster import schedule

@schedule(cron_schedule="0 0 * * *", job=job)
def daily_job_schedule(context):
    return {}
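
To make these pieces loadable by Dagster's tooling, they are typically collected into a Definitions object. A minimal sketch, assuming the asset, job, and schedule names from the snippets above:

from dagster import Definitions

defs = Definitions(
    assets=[raw_data, processed_data],
    jobs=[job],
    schedules=[daily_job_schedule],
)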

Getting Started

To get started with Dagster, follow these steps:

  1. Install Dagster and its dependencies:
pip install dagster dagster-webserver
  2. Create a new Python file (e.g., hello_dagster.py) with a simple asset:
from dagster import asset

@asset
def hello_asset():
    return "Hello, Dagster!"
  3. Run the Dagster UI:
dagster dev -f hello_dagster.py
  4. Open your browser and navigate to http://localhost:3000 to see the Dagster UI and interact with your asset.
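
Recent Dagster versions can also materialize an asset straight from the command line, without opening the UI; a sketch using the dagster CLI:

dagster asset materialize --select hello_asset -f hello_dagster.py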

Competitor Comparisons

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.

Pros of Prefect

  • More flexible and lightweight, allowing for easier integration with existing workflows
  • Offers a more intuitive API and user-friendly interface
  • Provides better support for dynamic workflows and real-time task execution

Cons of Prefect

  • Less robust built-in versioning and lineage tracking compared to Dagster
  • Fewer out-of-the-box integrations with data warehouses and analytics tools
  • Less emphasis on data quality and testing features

Code Comparison

Prefect example:

from prefect import flow, task

@task
def say_hello(name):
    print(f"Hello, {name}!")

@flow
def my_flow():
    say_hello("World")

Dagster example:

from dagster import job, op

@op
def get_name() -> str:
    # ops receive inputs from other ops, so the literal is wrapped in its own op
    return "World"

@op
def say_hello(name: str):
    print(f"Hello, {name}!")

@job
def my_job():
    say_hello(get_name())

Both Prefect and Dagster offer powerful workflow orchestration capabilities, but they differ in their approach and focus. Prefect emphasizes flexibility and ease of use, while Dagster provides more robust data engineering features and integrations. The choice between the two depends on specific project requirements and team preferences.

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Pros of Airflow

  • Mature ecosystem with extensive community support and a wide range of integrations
  • Rich UI for monitoring and managing workflows
  • Flexible scheduling options with cron-like syntax

Cons of Airflow

  • Steeper learning curve, especially for complex workflows
  • Less emphasis on data-aware pipelines and testing
  • Configuration can be verbose and repetitive

Code Comparison

Airflow DAG definition:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def my_task():
    print("Hello, Airflow!")

dag = DAG('example_dag', start_date=datetime(2023, 1, 1))
task = PythonOperator(task_id='my_task', python_callable=my_task, dag=dag)

Dagster job definition:

from dagster import job, op

@op
def my_op():
    print("Hello, Dagster!")

@job
def example_job():
    my_op()

Dagster offers a more concise and type-safe approach to defining workflows, with built-in support for testing and data-aware pipelines. Airflow, on the other hand, provides a more traditional approach to workflow orchestration with a focus on scheduling and monitoring.

Great Expectations: Always know what to expect from your data.

Pros of Great Expectations

  • Focused specifically on data quality and validation
  • Extensive library of built-in expectations for common data quality checks
  • Generates detailed data quality reports and documentation

Cons of Great Expectations

  • Limited to data validation and quality checks
  • Requires integration with other tools for full data pipeline orchestration
  • Steeper learning curve for complex data quality scenarios

Code Comparison

Great Expectations:

import great_expectations as ge

df = ge.read_csv("my_data.csv")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
df.expect_column_values_to_not_be_null("name")

Dagster:

import pandas as pd
from dagster import asset

@asset
def process_data(context):
    df = pd.read_csv("my_data.csv")
    context.log.info(f"Processed {len(df)} rows")
    return df

Great Expectations focuses on data validation, while Dagster provides a broader framework for data pipeline orchestration. Great Expectations excels in detailed data quality checks, whereas Dagster offers more flexibility in defining and managing entire data workflows.
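
The two can also complement each other: recent Dagster releases include asset checks, which can express Great Expectations-style validations next to the asset they guard. A minimal sketch, assuming the hypothetical my_data.csv from the snippets above:

import pandas as pd
from dagster import AssetCheckResult, asset, asset_check

@asset
def users() -> pd.DataFrame:
    # hypothetical input file, mirroring the Great Expectations snippet
    return pd.read_csv("my_data.csv")

@asset_check(asset=users)
def age_in_range(users: pd.DataFrame) -> AssetCheckResult:
    # rough equivalent of expect_column_values_to_be_between("age", 0, 120)
    return AssetCheckResult(passed=bool(users["age"].between(0, 120).all()))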

Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

Pros of Meltano

  • Simpler setup and configuration, especially for ETL/ELT workflows
  • Strong focus on Singer taps and targets, providing a wide range of pre-built connectors
  • Built-in CLI for easy management and execution of data pipelines

Cons of Meltano

  • Less flexible for complex data orchestration scenarios
  • Smaller community and ecosystem compared to Dagster
  • Limited support for advanced features like data lineage and observability

Code Comparison

Meltano pipeline configuration (in meltano.yml):

plugins:
  extractors:
    - name: tap-github
      pip_url: git+https://github.com/meltano/tap-github.git
  loaders:
    - name: target-postgres
      pip_url: git+https://github.com/meltano/target-postgres.git

Dagster job definition (extract_github_data, transform_data, and load_to_postgres would be @op-decorated functions):

from dagster import job

@job
def my_job():
    raw_data = extract_github_data()
    transformed_data = transform_data(raw_data)
    load_to_postgres(transformed_data)

Both Meltano and Dagster offer powerful data pipeline management capabilities, but they cater to different use cases and levels of complexity. Meltano excels in simplicity and quick setup for ETL/ELT workflows, while Dagster provides more flexibility and advanced features for complex data orchestration scenarios.
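
For a feel of the built-in CLI mentioned above, a typical Meltano session might look like this (plugin names follow the Singer tap-/target- convention):

meltano add extractor tap-github
meltano add loader target-postgres
meltano run tap-github target-postgres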

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

Pros of Kedro

  • Simpler learning curve and easier setup for data science projects
  • Strong focus on data engineering best practices and modular code structure
  • Built-in support for data catalogs and versioning

Cons of Kedro

  • Less robust scheduling and orchestration capabilities
  • More limited ecosystem and integrations compared to Dagster
  • Fewer advanced features for complex workflows and monitoring

Code Comparison

Kedro pipeline definition:

from kedro.pipeline import Pipeline, node

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(preprocess, "raw_data", "preprocessed_data"),
            node(train_model, "preprocessed_data", "model"),
        ]
    )

Dagster job definition (raw_data, preprocess, and train_model would be @op-decorated functions):

from dagster import job

@job
def my_pipeline():
    preprocessed_data = preprocess(raw_data())
    model = train_model(preprocessed_data)

Both Kedro and Dagster are open-source data orchestration frameworks, but they have different strengths. Kedro excels in providing a structured approach to data science projects, while Dagster offers more advanced features for complex data workflows and integrations with other tools. The choice between them depends on the specific needs of your project and team.

Ploomber: The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

Pros of Ploomber

  • Lightweight and easy to set up, with minimal configuration required
  • Supports multiple execution environments (local, cloud, Kubernetes) out of the box
  • Integrates well with existing Python workflows and Jupyter notebooks

Cons of Ploomber

  • Smaller community and ecosystem compared to Dagster
  • Less extensive documentation and fewer learning resources available
  • Limited built-in integrations with external tools and services

Code Comparison

Ploomber pipeline definition (a declarative pipeline.yaml spec; upstream dependencies are declared inside each script):

tasks:
  - source: extract.py
    product: output/extract.ipynb
  - source: process.py
    product: output/process.ipynb
  - source: train.py
    product: output/train.ipynb

Dagster pipeline definition:

from dagster import job, op

@op
def data_extraction():
    ...  # implementation elided

@op
def data_processing(data):
    ...  # implementation elided

@op
def model_training(processed_data):
    ...  # implementation elided

@job
def ml_pipeline():
    data = data_extraction()
    processed_data = data_processing(data)
    model_training(processed_data)

Both frameworks offer declarative pipeline definitions, but Ploomber's approach is more file-centric, while Dagster uses Python functions with decorators.


README

Dagster is a cloud-native data pipeline orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.

It is designed for developing and maintaining data assets, such as tables, data sets, machine learning models, and reports.

With Dagster, you declare—as Python functions—the data assets that you want to build. Dagster then helps you run your functions at the right time and keep your assets up-to-date.

Here is an example of a graph of three assets defined in Python:

from dagster import asset
from pandas import DataFrame, read_html, get_dummies
from sklearn.linear_model import LinearRegression

@asset
def country_populations() -> DataFrame:
    df = read_html("https://tinyurl.com/mry64ebh")[0]
    df.columns = ["country", "pop2022", "pop2023", "change", "continent", "region"]
    df["change"] = df["change"].str.rstrip("%").str.replace("−", "-").astype("float")
    return df

@asset
def continent_change_model(country_populations: DataFrame) -> LinearRegression:
    data = country_populations.dropna(subset=["change"])
    return LinearRegression().fit(get_dummies(data[["continent"]]), data["change"])

@asset
def continent_stats(country_populations: DataFrame, continent_change_model: LinearRegression) -> DataFrame:
    result = country_populations.groupby("continent").sum()
    result["pop_change_factor"] = continent_change_model.coef_
    return result

The graph loaded into Dagster's web UI:

An example asset graph as rendered in the Dagster UI

Dagster is built to be used at every stage of the data development lifecycle - local development, unit tests, integration tests, staging environments, all the way up to production.
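
Because assets are plain Python functions, they can be exercised directly in unit tests. A minimal sketch using Dagster's materialize helper against the assets defined above:

from dagster import materialize

def test_population_assets():
    # note: country_populations fetches live data; a real test would stub the source
    result = materialize([country_populations, continent_change_model, continent_stats])
    assert result.success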

Quick Start

If you're new to Dagster, we recommend reading about its core concepts or learning with the hands-on tutorial.

Dagster is available on PyPI and officially supports Python 3.8 through Python 3.12.

pip install dagster dagster-webserver

This installs two packages:

  • dagster: The core programming model.
  • dagster-webserver: The server that hosts Dagster's web UI for developing and operating Dagster jobs and assets.

Running on a Mac with an Apple silicon chip? Check the install details here.

Documentation

You can find the full Dagster documentation here, including the 'getting started' guide.


Key Features


Dagster as a productivity platform

Identify the key assets you need to create using a declarative approach, or simply focus on running tasks. Embrace CI/CD best practices from the get-go: build reusable components, spot data quality issues, and flag bugs early.

Dagster as a robust orchestration engine

Put your pipelines into production with a robust multi-tenant, multi-tool engine that scales technically and organizationally.

Dagster as a unified control plane

Maintain control over your data as the complexity scales. Centralize your metadata in one tool with built-in observability, diagnostics, cataloging, and lineage. Spot any issues and identify performance improvement opportunities.
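
As a concrete example of centralized metadata, an asset can attach metadata to each materialization, which then surfaces in the UI, catalog, and lineage views. A minimal sketch (the orders asset and its values are illustrative):

import pandas as pd
from dagster import MetadataValue, Output, asset

@asset
def orders() -> Output[pd.DataFrame]:
    df = pd.DataFrame({"order_id": [1, 2, 3]})  # stand-in for a real load
    return Output(
        df,
        metadata={
            "row_count": MetadataValue.int(len(df)),
            "columns": MetadataValue.json(list(df.columns)),
        },
    )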


Master the Modern Data Stack with integrations

Dagster provides a growing library of integrations for today’s most popular data tools. Integrate with the tools you already use, and deploy to your infrastructure.
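
Integrations ship as separate PyPI packages that sit alongside the core library; for example, to pull in the dbt, Snowflake, and AWS integrations:

pip install dagster-dbt dagster-snowflake dagster-aws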



Community

Connect with thousands of other data practitioners building with Dagster. Share knowledge, get help, and contribute to the open-source project. To see featured material and upcoming events, check out our Dagster Community page.


Contributing

For details on contributing or running the project for development, check out our contributing guide.

License

Dagster is Apache 2.0 licensed.