dolphinscheduler
Apache DolphinScheduler is a modern data orchestration platform, built for creating high-performance workflows with low code.
Top Related Projects
DataX is the open-source version of the data integration engine behind Alibaba Cloud DataWorks.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Apache Flink
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
An orchestration platform for the development, production, and observation of data assets.
Apache NiFi
Quick Overview
Apache DolphinScheduler is an open-source distributed and extensible workflow scheduler platform. It provides a visual interface for creating, scheduling, and managing complex workflows across multiple data systems. DolphinScheduler is designed to handle big data processing and cloud-native workflows efficiently.
Pros
- User-friendly web UI for workflow design and management
- Supports multiple task types and integrations with various data processing systems
- Highly scalable and distributed architecture
- Strong fault-tolerance and high availability features
Cons
- Steep learning curve for advanced features and configurations
- Documentation can be inconsistent or outdated in some areas
- Limited built-in support for certain cloud providers
- Some users report occasional stability issues in complex environments
Getting Started
To get started with Apache DolphinScheduler:
- Ensure you have Java 8+ and MySQL 5.7+ installed.
- Download the latest release from the official website.
- Extract the package and configure conf/dolphinscheduler_env.sh.
- Initialize the database: sh script/create-dolphinscheduler.sh
- Start the services: sh bin/start-all.sh
- Access the web UI at http://localhost:12345/dolphinscheduler (default credentials: admin/dolphinscheduler123)
For a more detailed setup, refer to the official documentation.
Competitor Comparisons
DataX is the open-source version of the data integration engine behind Alibaba Cloud DataWorks.
Pros of DataX
- Specialized in efficient data synchronization between various data sources
- Supports a wide range of data sources, including relational databases, NoSQL databases, and file systems
- Lightweight and easy to deploy, with minimal dependencies
Cons of DataX
- Limited workflow management capabilities compared to DolphinScheduler
- Lacks built-in scheduling and task dependency features
- Primarily focused on data transfer, with fewer features for complex data processing pipelines
Code Comparison
DataX (JSON configuration):
{
  "job": {
    "content": [{
      "reader": {"name": "mysqlreader", "parameter": {...}},
      "writer": {"name": "hdfswriter", "parameter": {...}}
    }]
  }
}
DolphinScheduler (illustrative XML workflow definition):
<process name="example_workflow">
  <tasks>
    <task name="mysql_to_hdfs" type="SHELL">
      <command>datax job.json</command>
    </task>
  </tasks>
</process>
While DataX focuses on data synchronization configuration, DolphinScheduler provides a more comprehensive workflow definition structure, allowing for complex task dependencies and scheduling.
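To make "complex task dependencies" concrete: a scheduler must resolve a dependency graph into an execution order before running anything. The minimal Python sketch below (illustrative only; task names are invented, and this is not DolphinScheduler code) shows the idea using the standard-library topological sorter:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each task maps to the tasks it depends on
deps = {
    "extract_mysql": [],
    "extract_logs": [],
    "transform": ["extract_mysql", "extract_logs"],
    "load_hdfs": ["transform"],
}

# Resolve the graph into a valid execution order:
# both extract tasks come before transform, and load_hdfs runs last
order = list(TopologicalSorter(deps).static_order())
```

A real scheduler layers retries, timeouts, and parallel dispatch on top of this ordering, but dependency resolution is the core of it.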
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Pros of Airflow
- Larger community and ecosystem with more plugins and integrations
- More mature project with extensive documentation and resources
- Flexible and powerful scheduling capabilities
Cons of Airflow
- Steeper learning curve, especially for complex workflows
- Can be resource-intensive for large-scale deployments
- Potential performance issues with very large DAGs
Code Comparison
Airflow DAG definition:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def my_function():
    print("Hello, Airflow!")

dag = DAG('example_dag', start_date=datetime(2023, 1, 1))
task = PythonOperator(
    task_id='example_task',
    python_callable=my_function,
    dag=dag
)
DolphinScheduler task definition (illustrative Java, simplified API):
TaskNode shellTask = new ShellTask();
shellTask.setName("Shell Task");
shellTask.setCommand("echo 'Hello, DolphinScheduler!'");
shellTask.setTimeoutFlag(true);
shellTask.setTimeout("30");
Both Airflow and DolphinScheduler are powerful workflow scheduling and management tools. Airflow offers a more extensive ecosystem and flexibility, while DolphinScheduler provides a user-friendly interface and simpler setup. The choice between them depends on specific project requirements and team expertise.
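Both tools share the same underlying scheduling model: a schedule plus an interval expands into a series of concrete run times. The plain-Python sketch below illustrates that model only; it is not Airflow or DolphinScheduler internals, and a daily interval is assumed:

```python
from datetime import datetime, timedelta

def runs_between(start, end, interval=timedelta(days=1)):
    """Yield every scheduled run time from start to end, inclusive."""
    t = start
    while t <= end:
        yield t
        t += interval

# Expand a daily schedule over three days: Jan 1, Jan 2, Jan 3
runs = list(runs_between(datetime(2023, 1, 1), datetime(2023, 1, 3)))
```

Real schedulers add cron expressions, catchup/backfill semantics, and timezone handling on top of this expansion.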
Apache Flink
Pros of Flink
- More mature and widely adopted for large-scale data processing
- Supports both batch and stream processing with a unified API
- Offers advanced features like exactly-once processing semantics and stateful computations
Cons of Flink
- Steeper learning curve due to its complexity and extensive feature set
- Requires more resources and configuration for optimal performance
- Less focused on workflow management compared to DolphinScheduler
Code Comparison
Flink (Java):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())
    .keyBy(value -> value.f0)
    .sum(1);
counts.print();
DolphinScheduler (illustrative Java, simplified API):
ProcessDefinition processDefinition = new ProcessDefinition();
processDefinition.setName("MyWorkflow");
processDefinition.setTenantId(1);
TaskDefinition task1 = new TaskDefinition();
task1.setName("Task1");
task1.setType("SHELL");
Both Apache Flink and DolphinScheduler are open-source projects, but they serve different purposes. Flink is primarily a distributed data processing engine, while DolphinScheduler focuses on workflow scheduling and management. Flink excels in real-time data processing and analytics, whereas DolphinScheduler is better suited for orchestrating complex workflows and task dependencies.
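For readers unfamiliar with the Flink API, the flatMap/keyBy/sum pipeline above is just a word count. The equivalent batch logic in plain Python (illustrative only, not Flink code; the input lines are invented) is:

```python
from collections import Counter

lines = ["to be or not to be", "that is the question"]

# flatMap(Tokenizer) ~ split each line into words;
# keyBy + sum ~ group by word and count occurrences
counts = Counter(word for line in lines for word in line.lower().split())
# counts["to"] == 2 and counts["be"] == 2
```

What Flink adds over this sketch is distribution, incremental updates as the stream arrives, and fault-tolerant state.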
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
Pros of Prefect
- More flexible and Pythonic workflow definition
- Better support for dynamic and conditional workflows
- Stronger focus on local development and testing
Cons of Prefect
- Less mature ecosystem compared to DolphinScheduler
- Fewer built-in integrations with big data tools
- Steeper learning curve for non-Python users
Code Comparison
Prefect workflow definition:
from prefect import task, flow

@task
def process_data(data):
    return data * 2

@flow
def my_flow(x):
    result = process_data(x)
    return result
DolphinScheduler workflow definition (pseudo-code):
public class MyWorkflow extends AbstractTask {
    @Override
    public void handle() throws Exception {
        int result = processData(getInputData());
        setOutputData(result);
    }
}
The Prefect code showcases its Pythonic approach, allowing for more intuitive and flexible workflow definitions. DolphinScheduler, being Java-based, has a more structured and verbose syntax, which may be familiar to enterprise Java developers but less accessible to data scientists and Python-oriented teams.
Prefect's design emphasizes ease of use for Python developers, while DolphinScheduler provides a more comprehensive platform for large-scale, distributed workflow management. The choice between the two depends on the specific needs of the project, team expertise, and existing technology stack.
An orchestration platform for the development, production, and observation of data assets.
Pros of Dagster
- More flexible and extensible architecture, allowing for easier integration with various data tools and platforms
- Strong focus on data quality and testing, with built-in features for data validation and error handling
- Better support for local development and debugging of data pipelines
Cons of Dagster
- Steeper learning curve due to its more complex architecture and concepts
- Less mature ecosystem compared to DolphinScheduler, with fewer out-of-the-box integrations
Code Comparison
DolphinScheduler (XML-based workflow definition):
<process name="example_workflow">
  <tasks>
    <task name="task1" taskType="SHELL">
      <shellParameters>
        <rawScript><![CDATA[echo "Hello, DolphinScheduler!"]]></rawScript>
      </shellParameters>
    </task>
  </tasks>
</process>
Dagster (Python-based workflow definition):
from dagster import op, job

@op
def hello_dagster():
    print("Hello, Dagster!")

@job
def example_pipeline():
    hello_dagster()
The code comparison shows that Dagster uses a more Pythonic approach to defining workflows, which may be more intuitive for data engineers and scientists familiar with Python. DolphinScheduler, on the other hand, uses XML-based configurations, which can be more verbose but potentially easier to integrate with existing enterprise systems.
Apache NiFi
Pros of NiFi
- More mature project with a larger community and extensive documentation
- Supports a wider range of data sources and protocols out-of-the-box
- Offers a user-friendly drag-and-drop interface for designing data flows
Cons of NiFi
- Can be resource-intensive, especially for large-scale deployments
- Steeper learning curve due to its extensive feature set
- Less focused on workflow scheduling compared to DolphinScheduler
Code Comparison
NiFi uses a Java-based approach for custom processors:
public class MyProcessor extends AbstractProcessor {
    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        // Custom logic here
    }
}
DolphinScheduler offers a Python SDK (pydolphinscheduler) for task definitions:
from pydolphinscheduler.tasks import Shell

shell_task = Shell(
    name="shell_task",
    command="echo hello dolphinscheduler",
    task_priority=2
)
Both projects are Apache Software Foundation projects focused on data processing and workflow management. NiFi is more oriented towards real-time data flow and processing, while DolphinScheduler is primarily designed for workflow scheduling and orchestration. The choice between them depends on specific use cases and requirements.
README
Apache DolphinScheduler
About
Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code. It is also provided powerful user interface, dedicated to solving complex task dependencies in the data pipeline and providing various types of jobs available out of the box
The key features of DolphinScheduler are as follows:
- Easy to deploy: four deployment modes, including Standalone, Cluster, Docker, and Kubernetes.
- Easy to use: workflows can be created and managed via the Web UI, Python SDK, and Open API.
- Highly reliable and highly available: decentralized multi-master/multi-worker architecture with native support for horizontal scaling.
- High performance: significantly faster than comparable orchestration platforms, supporting tens of millions of tasks per day.
- Cloud native: orchestrates workflows across multiple clouds and data centers, and supports custom task types.
- Versioning of both workflows and workflow instances (including tasks).
- Fine-grained state control of workflows and tasks: pause, stop, or recover them at any time.
- Multi-tenancy support.
- Other features such as native backfill support in the Web UI, and permission control covering projects and data sources.
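The pause/stop/recover state controls listed above can be pictured as a small state machine. The sketch below is illustrative only; the state and action names are assumptions, not DolphinScheduler's actual implementation:

```python
# Illustrative workflow-instance state transitions (hypothetical, simplified)
TRANSITIONS = {
    ("RUNNING", "pause"): "PAUSED",
    ("RUNNING", "stop"): "STOPPED",
    ("PAUSED", "recover"): "RUNNING",
    ("STOPPED", "recover"): "RUNNING",
}

def apply_action(state: str, action: str) -> str:
    """Return the next state, or raise if the transition is not allowed."""
    key = (state, action)
    if key not in TRANSITIONS:
        raise ValueError(f"cannot {action} a workflow in state {state}")
    return TRANSITIONS[key]
```

The real platform tracks many more states (submitted, dispatched, failed, and so on) and applies these controls per task as well as per workflow instance.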
QuickStart
- For quick experience
- Want to start with standalone
- Want to start with Docker
- For Kubernetes
- For Terraform
User Interface Screenshots
- Homepage: Project and workflow overview, including the latest workflow instance and task instance status statistics.
- Workflow Definition: Create and manage workflows by drag and drop; complex workflows are easy to build and maintain, with many task types supported out of the box.
- Workflow Tree View: An abstract tree structure gives a clearer view of the relationships between tasks.
- Data Source: Manage multiple external data sources, providing unified data access for MySQL, PostgreSQL, Hive, Trino, etc.
- Monitor: View the status of masters, workers, and the database in real time, including server resource usage and load, for quick health checks without logging in to the server.
Suggestions & Bug Reports
Follow this guide to report your suggestions or bugs.
Contributing
The community welcomes everyone to contribute; please refer to this page to find out more: How to contribute. If you are new to DolphinScheduler, look for a good first issue here.
Community
Welcome to join the Apache DolphinScheduler community by:
- Join the DolphinScheduler Slack to keep in touch with the community
- Follow DolphinScheduler on Twitter for the latest news
- Subscribe to the DolphinScheduler mailing lists: users@dolphinscheduler.apache.org for users and dev@dolphinscheduler.apache.org for developers
Landscapes
DolphinScheduler enriches the CNCF CLOUD NATIVE Landscape.