dolphinscheduler
Apache DolphinScheduler is a modern data orchestration platform, built for creating high-performance workflows with low code.
Top Related Projects
DataX is the open-source version of the data integration engine behind Alibaba Cloud DataWorks.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Apache Flink
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
An orchestration platform for the development, production, and observation of data assets.
Apache NiFi
Quick Overview
Apache DolphinScheduler is an open-source distributed and extensible workflow scheduler platform. It provides a visual interface for creating, scheduling, and managing complex workflows across multiple data systems. DolphinScheduler is designed to handle big data processing and cloud-native workflows efficiently.
Pros
- User-friendly web UI for workflow design and management
- Supports multiple task types and integrations with various data processing systems
- Highly scalable and distributed architecture
- Strong fault-tolerance and high availability features
Cons
- Steep learning curve for advanced features and configurations
- Documentation can be inconsistent or outdated in some areas
- Limited built-in support for certain cloud providers
- Some users report occasional stability issues in complex environments
Getting Started
To get started with Apache DolphinScheduler:
- Ensure you have Java 8+ and MySQL 5.7+ installed.
- Download the latest release from the official website.
- Extract the package and configure conf/dolphinscheduler_env.sh.
- Initialize the database: sh script/create-dolphinscheduler.sh
- Start the services: sh bin/start-all.sh
- Access the web UI at http://localhost:12345/dolphinscheduler (default credentials: admin/dolphinscheduler123)
For a more detailed setup, refer to the official documentation.
Competitor Comparisons
DataX is the open-source version of the data integration engine behind Alibaba Cloud DataWorks.
Pros of DataX
- Specialized in efficient data synchronization between various data sources
- Supports a wide range of data sources, including relational databases, NoSQL databases, and file systems
- Lightweight and easy to deploy, with minimal dependencies
Cons of DataX
- Limited workflow management capabilities compared to DolphinScheduler
- Lacks built-in scheduling and task dependency features
- Primarily focused on data transfer, with fewer features for complex data processing pipelines
Code Comparison
DataX (JSON configuration):
{
  "job": {
    "content": [{
      "reader": {"name": "mysqlreader", "parameter": {...}},
      "writer": {"name": "hdfswriter", "parameter": {...}}
    }]
  }
}
DolphinScheduler (illustrative XML workflow definition):
<process name="example_workflow">
  <tasks>
    <task name="mysql_to_hdfs" type="SHELL">
      <command>datax job.json</command>
    </task>
  </tasks>
</process>
While DataX focuses on data synchronization configuration, DolphinScheduler provides a more comprehensive workflow definition structure, allowing for complex task dependencies and scheduling.
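To make "complex task dependencies" concrete: a scheduler must resolve a dependency graph into an execution order before running anything. The minimal Python sketch below (illustrative only; task names are invented, and this is not DolphinScheduler code) shows the idea using the standard-library topological sorter:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each task maps to the tasks it depends on
deps = {
    "extract_mysql": [],
    "extract_logs": [],
    "transform": ["extract_mysql", "extract_logs"],
    "load_hdfs": ["transform"],
}

# Resolve the graph into a valid execution order:
# both extract tasks come before transform, and load_hdfs runs last
order = list(TopologicalSorter(deps).static_order())
```

A real scheduler layers retries, timeouts, and parallel dispatch on top of this ordering, but dependency resolution is the core of it.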
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Pros of Airflow
- Larger community and ecosystem with more plugins and integrations
- More mature project with extensive documentation and resources
- Flexible and powerful scheduling capabilities
Cons of Airflow
- Steeper learning curve, especially for complex workflows
- Can be resource-intensive for large-scale deployments
- Potential performance issues with very large DAGs
Code Comparison
Airflow DAG definition:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def my_function():
    print("Hello, Airflow!")

dag = DAG('example_dag', start_date=datetime(2023, 1, 1))
task = PythonOperator(
    task_id='example_task',
    python_callable=my_function,
    dag=dag
)
DolphinScheduler task definition (illustrative Java, simplified API):
TaskNode shellTask = new ShellTask();
shellTask.setName("Shell Task");
shellTask.setCommand("echo 'Hello, DolphinScheduler!'");
shellTask.setTimeoutFlag(true);
shellTask.setTimeout("30");
Both Airflow and DolphinScheduler are powerful workflow scheduling and management tools. Airflow offers a more extensive ecosystem and flexibility, while DolphinScheduler provides a user-friendly interface and simpler setup. The choice between them depends on specific project requirements and team expertise.
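Both tools share the same underlying scheduling model: a schedule plus an interval expands into a series of concrete run times. The plain-Python sketch below illustrates that model only; it is not Airflow or DolphinScheduler internals, and a daily interval is assumed:

```python
from datetime import datetime, timedelta

def runs_between(start, end, interval=timedelta(days=1)):
    """Yield every scheduled run time from start to end, inclusive."""
    t = start
    while t <= end:
        yield t
        t += interval

# Expand a daily schedule over three days: Jan 1, Jan 2, Jan 3
runs = list(runs_between(datetime(2023, 1, 1), datetime(2023, 1, 3)))
```

Real schedulers add cron expressions, catchup/backfill semantics, and timezone handling on top of this expansion.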
Apache Flink
Pros of Flink
- More mature and widely adopted for large-scale data processing
- Supports both batch and stream processing with a unified API
- Offers advanced features like exactly-once processing semantics and stateful computations
Cons of Flink
- Steeper learning curve due to its complexity and extensive feature set
- Requires more resources and configuration for optimal performance
- Less focused on workflow management compared to DolphinScheduler
Code Comparison
Flink (Java):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())
    .keyBy(value -> value.f0)
    .sum(1);
counts.print();
DolphinScheduler (illustrative Java, simplified API):
ProcessDefinition processDefinition = new ProcessDefinition();
processDefinition.setName("MyWorkflow");
processDefinition.setTenantId(1);
TaskDefinition task1 = new TaskDefinition();
task1.setName("Task1");
task1.setType("SHELL");
Both Apache Flink and DolphinScheduler are open-source projects, but they serve different purposes. Flink is primarily a distributed data processing engine, while DolphinScheduler focuses on workflow scheduling and management. Flink excels in real-time data processing and analytics, whereas DolphinScheduler is better suited for orchestrating complex workflows and task dependencies.
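For readers unfamiliar with the Flink API, the flatMap/keyBy/sum pipeline above is just a word count. The equivalent batch logic in plain Python (illustrative only, not Flink code; the input lines are invented) is:

```python
from collections import Counter

lines = ["to be or not to be", "that is the question"]

# flatMap(Tokenizer) ~ split each line into words;
# keyBy + sum ~ group by word and count occurrences
counts = Counter(word for line in lines for word in line.lower().split())
# counts["to"] == 2 and counts["be"] == 2
```

What Flink adds over this sketch is distribution, incremental updates as the stream arrives, and fault-tolerant state.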
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
Pros of Prefect
- More flexible and Pythonic workflow definition
- Better support for dynamic and conditional workflows
- Stronger focus on local development and testing
Cons of Prefect
- Less mature ecosystem compared to DolphinScheduler
- Fewer built-in integrations with big data tools
- Steeper learning curve for non-Python users
Code Comparison
Prefect workflow definition:
from prefect import task, flow

@task
def process_data(data):
    return data * 2

@flow
def my_flow(x):
    result = process_data(x)
    return result
DolphinScheduler workflow definition (pseudo-code):
public class MyWorkflow extends AbstractTask {
    @Override
    public void handle() throws Exception {
        int result = processData(getInputData());
        setOutputData(result);
    }
}
The Prefect code showcases its Pythonic approach, allowing for more intuitive and flexible workflow definitions. DolphinScheduler, being Java-based, has a more structured and verbose syntax, which may be familiar to enterprise Java developers but less accessible to data scientists and Python-oriented teams.
Prefect's design emphasizes ease of use for Python developers, while DolphinScheduler provides a more comprehensive platform for large-scale, distributed workflow management. The choice between the two depends on the specific needs of the project, team expertise, and existing technology stack.
An orchestration platform for the development, production, and observation of data assets.
Pros of Dagster
- More flexible and extensible architecture, allowing for easier integration with various data tools and platforms
- Strong focus on data quality and testing, with built-in features for data validation and error handling
- Better support for local development and debugging of data pipelines
Cons of Dagster
- Steeper learning curve due to its more complex architecture and concepts
- Less mature ecosystem compared to DolphinScheduler, with fewer out-of-the-box integrations
Code Comparison
DolphinScheduler (XML-based workflow definition):
<process name="example_workflow">
  <tasks>
    <task name="task1" taskType="SHELL">
      <shellParameters>
        <rawScript><![CDATA[echo "Hello, DolphinScheduler!"]]></rawScript>
      </shellParameters>
    </task>
  </tasks>
</process>
Dagster (Python-based workflow definition):
from dagster import op, job

@op
def hello_dagster():
    print("Hello, Dagster!")

@job
def example_pipeline():
    hello_dagster()
The code comparison shows that Dagster uses a more Pythonic approach to defining workflows, which may be more intuitive for data engineers and scientists familiar with Python. DolphinScheduler, on the other hand, uses XML-based configurations, which can be more verbose but potentially easier to integrate with existing enterprise systems.
Apache NiFi
Pros of NiFi
- More mature project with a larger community and extensive documentation
- Supports a wider range of data sources and protocols out-of-the-box
- Offers a user-friendly drag-and-drop interface for designing data flows
Cons of NiFi
- Can be resource-intensive, especially for large-scale deployments
- Steeper learning curve due to its extensive feature set
- Less focused on workflow scheduling compared to DolphinScheduler
Code Comparison
NiFi uses a Java-based approach for custom processors:
public class MyProcessor extends AbstractProcessor {
    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        // Custom logic here
    }
}
DolphinScheduler offers a Python SDK (pydolphinscheduler) for task definitions:
from pydolphinscheduler.tasks import Shell

shell_task = Shell(
    name="shell_task",
    command="echo hello dolphinscheduler",
    task_priority=2
)
Both projects are Apache Software Foundation projects focused on data processing and workflow management. NiFi is more oriented towards real-time data flow and processing, while DolphinScheduler is primarily designed for workflow scheduling and orchestration. The choice between them depends on specific use cases and requirements.
README
Apache DolphinScheduler
About
Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code. It is also provided powerful user interface, dedicated to solving complex task dependencies in the data pipeline and providing various types of jobs available out of the box
The key features of DolphinScheduler are as follows:
- Easy to deploy: four deployment modes, including Standalone, Cluster, Docker, and Kubernetes.
- Easy to use: workflows can be created and managed via the Web UI, Python SDK, and Open API.
- Highly reliable and highly available: decentralized multi-master/multi-worker architecture with native support for horizontal scaling.
- High performance: significantly faster than comparable orchestration platforms, supporting tens of millions of tasks per day.
- Cloud native: orchestrates workflows across multiple clouds and data centers, and supports custom task types.
- Versioning of both workflows and workflow instances (including tasks).
- Fine-grained state control of workflows and tasks: pause, stop, or recover them at any time.
- Multi-tenancy support.
- Other features such as native backfill support in the Web UI, and permission control covering projects and data sources.
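The pause/stop/recover state controls listed above can be pictured as a small state machine. The sketch below is illustrative only; the state and action names are assumptions, not DolphinScheduler's actual implementation:

```python
# Illustrative workflow-instance state transitions (hypothetical, simplified)
TRANSITIONS = {
    ("RUNNING", "pause"): "PAUSED",
    ("RUNNING", "stop"): "STOPPED",
    ("PAUSED", "recover"): "RUNNING",
    ("STOPPED", "recover"): "RUNNING",
}

def apply_action(state: str, action: str) -> str:
    """Return the next state, or raise if the transition is not allowed."""
    key = (state, action)
    if key not in TRANSITIONS:
        raise ValueError(f"cannot {action} a workflow in state {state}")
    return TRANSITIONS[key]
```

The real platform tracks many more states (submitted, dispatched, failed, and so on) and applies these controls per task as well as per workflow instance.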
QuickStart
- For quick experience
- Want to start with standalone
- Want to start with Docker
- For Kubernetes
- For Terraform
User Interface Screenshots
- Homepage: Project and workflow overview, including the latest workflow instance and task instance status statistics.
- Workflow Definition: Create and manage workflows by drag and drop; complex workflows are easy to build and maintain, with many task types supported out of the box.
- Workflow Tree View: An abstract tree structure gives a clearer view of the relationships between tasks.
- Data Source: Manage multiple external data sources, providing unified data access for MySQL, PostgreSQL, Hive, Trino, etc.
- Monitor: View the status of masters, workers, and the database in real time, including server resource usage and load, for quick health checks without logging in to the server.
Suggestions & Bug Reports
Follow this guide to report your suggestions or bugs.
Contributing
The community welcomes everyone to contribute; please refer to this page to find out more: How to contribute. If you are new to DolphinScheduler, look for a good first issue here.
Community
Welcome to join the Apache DolphinScheduler community by:
- Join the DolphinScheduler Slack to keep in touch with the community
- Follow DolphinScheduler on Twitter for the latest news
- Subscribe to the DolphinScheduler mailing lists: users@dolphinscheduler.apache.org for users and dev@dolphinscheduler.apache.org for developers
Landscapes
DolphinScheduler enriches the CNCF CLOUD NATIVE Landscape.