
apache/dolphinscheduler

Apache DolphinScheduler is a modern data orchestration platform for agile creation of high-performance, low-code workflows.


Top Related Projects

  • DataX (15,740 stars): the open-source version of Alibaba Cloud DataWorks data integration.
  • Apache Airflow (36,173 stars): a platform to programmatically author, schedule, and monitor workflows.
  • Apache Flink (23,783 stars)
  • Prefect (15,793 stars): a workflow orchestration framework for building resilient data pipelines in Python.
  • Dagster (11,125 stars): an orchestration platform for the development, production, and observation of data assets.
  • Apache NiFi (4,733 stars)

Quick Overview

Apache DolphinScheduler is an open-source distributed and extensible workflow scheduler platform. It provides a visual interface for creating, scheduling, and managing complex workflows across multiple data systems. DolphinScheduler is designed to handle big data processing and cloud-native workflows efficiently.

Pros

  • User-friendly web UI for workflow design and management
  • Supports multiple task types and integrations with various data processing systems
  • Highly scalable and distributed architecture
  • Strong fault-tolerance and high availability features

Cons

  • Steep learning curve for advanced features and configurations
  • Documentation can be inconsistent or outdated in some areas
  • Limited built-in support for certain cloud providers
  • Some users report occasional stability issues in complex environments

Getting Started

To get started with Apache DolphinScheduler:

  1. Ensure you have Java 8+ and MySQL 5.7+ installed.
  2. Download the latest release from the official website.
  3. Extract the package and configure conf/dolphinscheduler_env.sh.
  4. Initialize the database:
    sh script/create-dolphinscheduler.sh
    
  5. Start the services:
    sh bin/start-all.sh
    
  6. Access the web UI at http://localhost:12345/dolphinscheduler (default credentials: admin/dolphinscheduler123)

For a more detailed setup, refer to the official documentation.

Competitor Comparisons

DataX (15,740 stars): the open-source version of Alibaba Cloud DataWorks data integration.

Pros of DataX

  • Specialized in efficient data synchronization between various data sources
  • Supports a wide range of data sources, including relational databases, NoSQL databases, and file systems
  • Lightweight and easy to deploy, with minimal dependencies

Cons of DataX

  • Limited workflow management capabilities compared to DolphinScheduler
  • Lacks built-in scheduling and task dependency features
  • Primarily focused on data transfer, with fewer features for complex data processing pipelines

Code Comparison

DataX (JSON configuration):

{
    "job": {
        "content": [{
            "reader": {"name": "mysqlreader", "parameter": {...}},
            "writer": {"name": "hdfswriter", "parameter": {...}}
        }]
    }
}

DolphinScheduler (XML workflow definition):

<process name="example_workflow">
    <tasks>
        <task name="mysql_to_hdfs" type="SHELL">
            <command>datax job.json</command>
        </task>
    </tasks>
</process>

While DataX focuses on data synchronization configuration, DolphinScheduler provides a more comprehensive workflow definition structure, allowing for complex task dependencies and scheduling.
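
To make the dependency-scheduling point concrete, here is a plain-Python sketch (standard library only, not DolphinScheduler's actual API) of how a scheduler orders tasks so that each runs only after its dependencies complete. The task names are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: run the DataX transfer, then validate, then report.
# Keys are tasks; values are the tasks they depend on.
dependencies = {
    "mysql_to_hdfs": set(),               # the DataX transfer job
    "validate_counts": {"mysql_to_hdfs"},
    "publish_report": {"validate_counts"},
}

# A scheduler must execute tasks in an order that respects dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # mysql_to_hdfs first, publish_report last
```

This kind of dependency resolution is exactly what DataX leaves to external tooling and what DolphinScheduler provides out of the box.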

Apache Airflow (36,173 stars): a platform to programmatically author, schedule, and monitor workflows.

Pros of Airflow

  • Larger community and ecosystem with more plugins and integrations
  • More mature project with extensive documentation and resources
  • Flexible and powerful scheduling capabilities

Cons of Airflow

  • Steeper learning curve, especially for complex workflows
  • Can be resource-intensive for large-scale deployments
  • Potential performance issues with very large DAGs

Code Comparison

Airflow DAG definition:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def my_function():
    print("Hello from the example task")

dag = DAG('example_dag', start_date=datetime(2023, 1, 1))

task = PythonOperator(
    task_id='example_task',
    python_callable=my_function,
    dag=dag
)

DolphinScheduler task definition:

ShellTask shellTask = new ShellTask();
shellTask.setName("Shell Task");
shellTask.setCommand("echo 'Hello, DolphinScheduler!'");
shellTask.setTimeoutFlag(true);
shellTask.setTimeout("30");

Both Airflow and DolphinScheduler are powerful workflow scheduling and management tools. Airflow offers a more extensive ecosystem and flexibility, while DolphinScheduler provides a user-friendly interface and simpler setup. The choice between them depends on specific project requirements and team expertise.
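
As a rough illustration of what any scheduler's timetable logic computes, the sketch below models a fixed-interval schedule in plain Python. It is a toy, not Airflow's or DolphinScheduler's real implementation:

```python
from datetime import datetime, timedelta

def next_run(start: datetime, interval: timedelta, now: datetime) -> datetime:
    """Return the first scheduled run at or after `now` for a
    fixed-interval schedule anchored at `start`."""
    if now <= start:
        return start
    elapsed = now - start
    periods = -(-elapsed // interval)  # ceiling division on timedeltas
    return start + periods * interval

start = datetime(2023, 1, 1)  # matches the start_date in the DAG above
print(next_run(start, timedelta(days=1), datetime(2023, 1, 15, 6, 30)))
```

Real schedulers layer cron expressions, catchup, and timezone handling on top of this basic "next run at or after now" computation.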

Apache Flink (23,783 stars)

Pros of Flink

  • More mature and widely adopted for large-scale data processing
  • Supports both batch and stream processing with a unified API
  • Offers advanced features like exactly-once processing semantics and stateful computations

Cons of Flink

  • Steeper learning curve due to its complexity and extensive feature set
  • Requires more resources and configuration for optimal performance
  • Less focused on workflow management compared to DolphinScheduler

Code Comparison

Flink (Java):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())
    .keyBy(value -> value.f0)
    .sum(1);
counts.print();

DolphinScheduler (Java):

ProcessDefinition processDefinition = new ProcessDefinition();
processDefinition.setName("MyWorkflow");
processDefinition.setTenantId(1);
TaskDefinition task1 = new TaskDefinition();
task1.setName("Task1");
task1.setType("SHELL");

Both Apache Flink and DolphinScheduler are open-source projects, but they serve different purposes. Flink is primarily a distributed data processing engine, while DolphinScheduler focuses on workflow scheduling and management. Flink excels in real-time data processing and analytics, whereas DolphinScheduler is better suited for orchestrating complex workflows and task dependencies.
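
For readers unfamiliar with Flink's API, the flatMap → keyBy → sum pipeline above amounts to the following word count in plain Python. This sketch is batch-only; Flink's value is performing the same computation continuously over unbounded streams with exactly-once guarantees:

```python
from collections import Counter

# Stand-in for env.readTextFile("input.txt")
lines = ["to be or not to be", "that is the question"]

# flatMap(Tokenizer): split each line into words
words = [w for line in lines for w in line.split()]

# keyBy + sum(1): count occurrences per word
counts = Counter(words)
print(counts["to"], counts["be"])  # 2 2
```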

Prefect (15,793 stars): a workflow orchestration framework for building resilient data pipelines in Python.

Pros of Prefect

  • More flexible and Pythonic workflow definition
  • Better support for dynamic and conditional workflows
  • Stronger focus on local development and testing

Cons of Prefect

  • Less mature ecosystem compared to DolphinScheduler
  • Fewer built-in integrations with big data tools
  • Steeper learning curve for non-Python users

Code Comparison

Prefect workflow definition:

@task
def process_data(data):
    return data * 2

@flow
def my_flow(x):
    result = process_data(x)
    return result

DolphinScheduler workflow definition (pseudo-code):

public class MyWorkflow extends AbstractTask {
    @Override
    public void handle() throws Exception {
        int result = processData(getInputData());
        setOutputData(result);
    }
}

The Prefect code showcases its Pythonic approach, allowing for more intuitive and flexible workflow definitions. DolphinScheduler, being Java-based, has a more structured and verbose syntax, which may be familiar to enterprise Java developers but less accessible to data scientists and Python-oriented teams.

Prefect's design emphasizes ease of use for Python developers, while DolphinScheduler provides a more comprehensive platform for large-scale, distributed workflow management. The choice between the two depends on the specific needs of the project, team expertise, and existing technology stack.
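
The decorator pattern that makes Prefect feel Pythonic can be sketched in a few lines of plain Python. This is a toy model, not Prefect's implementation: the decorator wraps the function and records each execution, much as an orchestrator tracks task runs:

```python
import functools

run_log = []  # records task executions, mimicking an orchestrator's bookkeeping

def task(fn):
    """Toy stand-in for @task: run the function, then log its result."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        run_log.append((fn.__name__, result))
        return result
    return wrapper

@task
def process_data(data):
    return data * 2

def my_flow(x):
    return process_data(x)

print(my_flow(21))  # 42
print(run_log)      # [('process_data', 42)]
```

Because tasks stay ordinary callables, they can be unit-tested locally, which is the local-development strength noted above.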

Dagster (11,125 stars): an orchestration platform for the development, production, and observation of data assets.

Pros of Dagster

  • More flexible and extensible architecture, allowing for easier integration with various data tools and platforms
  • Strong focus on data quality and testing, with built-in features for data validation and error handling
  • Better support for local development and debugging of data pipelines

Cons of Dagster

  • Steeper learning curve due to its more complex architecture and concepts
  • Less mature ecosystem compared to DolphinScheduler, with fewer out-of-the-box integrations

Code Comparison

DolphinScheduler (XML-based workflow definition):

<process name="example_workflow">
  <tasks>
    <task name="task1" taskType="SHELL">
      <shellParameters>
        <rawScript><![CDATA[echo "Hello, DolphinScheduler!"]]></rawScript>
      </shellParameters>
    </task>
  </tasks>
</process>

Dagster (Python-based workflow definition):

from dagster import op, job

@op
def hello_dagster():
    print("Hello, Dagster!")

@job
def example_pipeline():
    hello_dagster()

The code comparison shows that Dagster uses a more Pythonic approach to defining workflows, which may be more intuitive for data engineers and scientists familiar with Python. DolphinScheduler, on the other hand, uses XML-based configurations, which can be more verbose but potentially easier to integrate with existing enterprise systems.

Apache NiFi (4,733 stars)

Pros of NiFi

  • More mature project with a larger community and extensive documentation
  • Supports a wider range of data sources and protocols out-of-the-box
  • Offers a user-friendly drag-and-drop interface for designing data flows

Cons of NiFi

  • Can be resource-intensive, especially for large-scale deployments
  • Steeper learning curve due to its extensive feature set
  • Less focused on workflow scheduling compared to DolphinScheduler

Code Comparison

NiFi uses a Java-based approach for custom processors:

public class MyProcessor extends AbstractProcessor {
    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        // Custom logic here
    }
}

DolphinScheduler uses a Python-based approach for task definitions:

from pydolphinscheduler.tasks import Shell

shell_task = Shell(
    name="shell_task",
    command="echo hello dolphinscheduler",
    task_priority=2
)

Both projects are Apache Software Foundation projects focused on data processing and workflow management. NiFi is more oriented towards real-time data flow and processing, while DolphinScheduler is primarily designed for workflow scheduling and orchestration. The choice between them depends on specific use cases and requirements.


README

Apache DolphinScheduler


About

Apache DolphinScheduler is a modern data orchestration platform that makes it agile to create high-performance workflows with low code. It also provides a powerful user interface, is dedicated to solving complex task dependencies in the data pipeline, and offers various types of jobs out of the box.

The key features for DolphinScheduler are as follows:

  • Easy to deploy: four deployment modes are provided, including Standalone, Cluster, Docker, and Kubernetes.
  • Easy to use: workflows can be created and managed through the Web UI, Python SDK, and Open API.
  • Highly reliable and highly available: a decentralized multi-master, multi-worker architecture with native support for horizontal scaling.
  • High performance: runs N times faster than other orchestration platforms and can support tens of millions of tasks per day.
  • Cloud native: supports orchestrating multi-cloud/data-center workflows as well as custom task types.
  • Versioning of both workflows and workflow instances (including tasks).
  • Fine-grained state control of workflows and tasks: pause, stop, or recover them at any time.
  • Multi-tenancy support.
  • Others, such as backfill support (native in the Web UI) and permission control covering projects and data sources.
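
To illustrate what backfill means in practice, the following standard-library sketch (illustrative only, not DolphinScheduler code; in DolphinScheduler this is driven from the Web UI) expands a date range into the per-day runs a backfill would re-execute:

```python
from datetime import date, timedelta

def backfill_dates(start: date, end: date) -> list[date]:
    """Every daily schedule date from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]

dates = backfill_dates(date(2023, 1, 1), date(2023, 1, 5))
print(len(dates))  # 5 runs, one per missed day
```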

QuickStart

User Interface Screenshots

  • Homepage: project and workflow overview, including the latest workflow-instance and task-instance status statistics.

  • Workflow Definition: create and manage workflows by drag and drop; complex workflows are easy to build and maintain, with a wide range of task types supported out of the box.

  • Workflow Tree View: an abstract tree structure gives a clearer view of the relationships between tasks.

  • Data Source: manages multiple external data sources, providing unified data access for MySQL, PostgreSQL, Hive, Trino, and more.

  • Monitor: view the status of masters, workers, and the database in real time, including server resource usage and load, for a quick health check without logging in to the server.

Suggestions & Bug Reports

Follow this guide to report your suggestions or bugs.

Contributing

The community welcomes everyone to contribute, please refer to this page to find out more: How to contribute, find the good first issue in here if you are new to DolphinScheduler.

Community

Everyone is welcome to join the Apache DolphinScheduler community.

Landscapes




DolphinScheduler enriches the CNCF CLOUD NATIVE Landscape.