
alibaba/DataX

DataX is the open-source version of Alibaba Cloud DataWorks Data Integration.


Top Related Projects

  • Apache NiFi
  • Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
  • Apache Flink
  • Apache Spark - A unified analytics engine for large-scale data processing
  • Airbyte - The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Quick Overview

DataX is an open-source, high-performance data synchronization tool developed by Alibaba. It efficiently transfers data between various heterogeneous data sources, including relational databases, NoSQL databases, and big data platforms. DataX supports more than 20 different data sources and can be easily extended to support new ones.

Pros

  • High performance and scalability, capable of handling large-scale data synchronization tasks
  • Supports a wide range of data sources out of the box
  • Easily extensible architecture for adding new data source plugins
  • Provides a simple and user-friendly configuration interface

Cons

  • Documentation is primarily in Chinese, which may be challenging for non-Chinese speakers
  • Some reported issues with stability and error handling in certain scenarios
  • Limited community support compared to some other data integration tools
  • Requires Java environment, which may not be ideal for all use cases

Getting Started

  1. Clone the DataX repository:

    git clone https://github.com/alibaba/DataX.git
    
  2. Build DataX:

    cd DataX
    mvn -U clean package assembly:assembly -Dmaven.test.skip=true
    
  3. Create a JSON configuration file (e.g., job.json) defining your data synchronization task:

    {
      "job": {
        "content": [
          {
            "reader": {
              "name": "mysqlreader",
              "parameter": {
                "username": "root",
                "password": "root",
                "column": ["*"],
                "connection": [
                  {
                    "table": ["user"],
                    "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/database"]
                  }
                ]
              }
            },
            "writer": {
              "name": "hdfswriter",
              "parameter": {
                "defaultFS": "hdfs://localhost:9000",
                "fileType": "text",
                "path": "/user/hive/warehouse/result",
                "fileName": "user",
                "column": [
                  {
                    "name": "col1",
                    "type": "STRING"
                  },
                  {
                    "name": "col2",
                    "type": "LONG"
                  }
                ],
                "writeMode": "append",
                "fieldDelimiter": ","
              }
            }
          }
        ],
        "setting": {
          "speed": {
            "channel": 3
          }
        }
      }
    }
    
  4. Run DataX with your configuration (from the built package directory, typically target/datax/datax after a source build; see the note after this list):

    python bin/datax.py job.json
    
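The setting.speed.channel value in step 3 controls how many parallel transfer channels DataX allocates for the job. Under the hood, datax.py simply assembles a JVM command line and hands the job file to DataX's core engine (the same Engine.entry shown in the comparisons below). The sketch below is a rough programmatic equivalent; the flag names are assumptions drawn from the launcher script and may differ across versions:

import com.alibaba.datax.core.Engine;

public class RunDataXJob {
    public static void main(String[] args) throws Throwable {
        // Roughly what `python bin/datax.py job.json` does after building its
        // JVM command line: pass the job description to the core engine.
        // The flag names here are assumptions based on the launcher script.
        Engine.entry(new String[] {
                "-mode", "standalone", // run the whole job in a single process
                "-jobid", "-1",        // standalone runs use the placeholder id -1
                "-job", "job.json"     // path to the job configuration file
        });
    }
}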

Competitor Comparisons


Apache NiFi

Pros of NiFi

  • More comprehensive data flow management with a web-based UI for designing, controlling, and monitoring data flows
  • Supports a wider range of data sources and destinations out-of-the-box
  • Offers real-time data processing and transformation capabilities

Cons of NiFi

  • Steeper learning curve due to its more complex architecture and extensive feature set
  • Requires more system resources for deployment and operation
  • May be overkill for simpler data integration tasks

Code Comparison

DataX configuration (JSON):

{
  "job": {
    "content": [
      {
        "reader": { "name": "mysqlreader", "parameter": { ... } },
        "writer": { "name": "hdfswriter", "parameter": { ... } }
      }
    ]
  }
}

NiFi flow configuration (XML):

<processor>
  <id>abc123</id>
  <name>GetFile</name>
  <config>
    <property name="Input Directory">/path/to/input</property>
  </config>
</processor>

Both projects define data flows declaratively, but NiFi's configuration is usually authored and managed visually through its web UI, while DataX jobs are written by hand as JSON files.


Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Pros of Airflow

  • More comprehensive workflow management system with a rich ecosystem of plugins and integrations
  • Provides a web-based UI for monitoring and managing workflows
  • Supports complex dependencies and scheduling options

Cons of Airflow

  • Steeper learning curve due to its extensive features and configuration options
  • Requires more resources to set up and maintain compared to DataX
  • Can be overkill for simple data transfer tasks

Code Comparison

DataX (JSON configuration):

{
  "job": {
    "content": [
      {
        "reader": { "name": "mysqlreader", "parameter": { ... } },
        "writer": { "name": "hdfswriter", "parameter": { ... } }
      }
    ]
  }
}

Airflow (Python DAG):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Minimal defaults so the DAG definition is runnable.
default_args = {'start_date': datetime(2024, 1, 1)}

def transfer_data():
    # Data transfer logic here
    pass

dag = DAG('data_transfer', default_args=default_args, schedule_interval='@daily')
transfer_task = PythonOperator(task_id='transfer_data', python_callable=transfer_data, dag=dag)

Apache Flink

Pros of Flink

  • Powerful stream processing capabilities with support for both batch and real-time data processing
  • Extensive ecosystem with built-in libraries for complex event processing, machine learning, and graph processing
  • High scalability and fault tolerance with exactly-once processing semantics

Cons of Flink

  • Steeper learning curve due to its complex architecture and concepts
  • Higher resource requirements, especially for smaller-scale data processing tasks
  • Less focus on data integration across diverse sources compared to DataX

Code Comparison

Flink (Java):

// Classic Flink word count; Tokenizer is a user-supplied FlatMapFunction
// that splits each input line into (word, 1) tuples.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())
    .keyBy(0)   // positional keying from the older DataStream API
    .sum(1);
counts.print();

DataX (JSON configuration):

{
    "job": {
        "content": [{
            "reader": {"name": "txtfilereader", "parameter": {"path": ["input.txt"]}},
            "writer": {"name": "streamwriter", "parameter": {}}
        }]
    }
}

The snippets showcase Flink's programming model for stream processing, while DataX takes a configuration-driven approach to moving data between sources and destinations; here streamwriter simply prints the records it receives, which makes it handy for smoke-testing a reader.


Apache Spark - A unified analytics engine for large-scale data processing

Pros of Spark

  • Powerful distributed computing engine for large-scale data processing
  • Supports multiple programming languages (Scala, Java, Python, R)
  • Rich ecosystem with libraries for machine learning, graph processing, and streaming

Cons of Spark

  • Steeper learning curve and more complex setup compared to DataX
  • Higher resource requirements, especially for smaller datasets
  • Less focused on data integration and ETL tasks specifically

Code Comparison

DataX (Java):

public static void main(String[] args) throws Exception {
    Engine.entry(args);
}

Spark (Scala):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Example").getOrCreate()
val df = spark.read.json("example.json")
df.show()

Key Differences

  • DataX is primarily designed for efficient data transfer between various data sources and targets
  • Spark is a more general-purpose distributed computing framework with broader applications
  • DataX has a simpler architecture focused on ETL tasks, while Spark offers a wider range of data processing capabilities
  • Spark provides more advanced analytics features, including machine learning and graph processing
  • DataX may be easier to set up and use for straightforward data integration tasks

Airbyte - The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Pros of Airbyte

  • More extensive connector library with over 300+ pre-built connectors
  • Active open-source community with frequent updates and contributions
  • User-friendly web interface for configuration and monitoring

Cons of Airbyte

  • Potentially higher resource consumption due to containerized architecture
  • Steeper learning curve for custom connector development
  • Less mature compared to DataX, which has been in development longer

Code Comparison

DataX (Java):

public static void main(String[] args) throws Exception {
    // Same static entry point shown in the Spark comparison above:
    // the engine parses the job JSON and schedules reader/writer tasks.
    Engine.entry(args);
}

Airbyte (Python):

# Illustrative pseudocode: in practice Airbyte connectors run as containers
# coordinated by the platform rather than through a direct Python API like this.
def run():
    source = SourcePostgres()
    destination = DestinationBigQuery()
    orchestrator = Orchestrator(source, destination)
    orchestrator.run()

Both projects aim to facilitate data integration, but their implementations differ. DataX uses a Java-based engine for data synchronization, while Airbyte employs a modular Python-based approach with containerized connectors. Airbyte's code structure emphasizes flexibility and ease of adding new connectors, whereas DataX focuses on performance and stability for large-scale data transfer scenarios.


README


DataX

Leaderboard

DataX is the open-source version of Alibaba Cloud DataWorks Data Integration, an offline data synchronization tool/platform widely used within Alibaba Group. DataX implements efficient data synchronization between heterogeneous data sources including MySQL, Oracle, OceanBase, SqlServer, Postgre, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), Hologres, DRDS, databend, and more.

DataX Commercial Edition

Alibaba Cloud DataWorks Data Integration is the DataX team's commercial product on Alibaba Cloud, dedicated to fast, stable data movement between a rich set of heterogeneous data sources under complex network conditions, and to data synchronization solutions for complex business scenarios. It currently serves nearly 3,000 cloud customers and synchronizes more than 3 trillion records per day. DataWorks Data Integration supports 50+ data sources for offline synchronization and covers whole-database migration, batch cloud migration, incremental synchronization, and sharded database/table synchronization. Real-time synchronization was added in 2020, supporting arbitrary read/write combinations across 10+ data sources and providing one-click full-plus-incremental synchronization from sources such as MySQL and Oracle to Alibaba Cloud big data engines such as MaxCompute and Hologres.

For the commercial edition, see: https://www.aliyun.com/product/bigdata/ide

Features

As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plugins, which read data from a source, and Writer plugins, which write data to a target; in principle the framework can synchronize between any pair of data source types. The plugin system also forms an ecosystem: each newly added data source immediately becomes interoperable with every existing one.
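To make the Reader/Writer abstraction concrete, below is a minimal sketch of a Reader plugin skeleton. It assumes the plugin SPI used throughout the DataX codebase (com.alibaba.datax.common.spi.Reader with nested Job and Task classes); treat the exact signatures as assumptions and follow the plugin development guide referenced later in this README for authoritative details.

import java.util.ArrayList;
import java.util.List;

import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.element.StringColumn;
import com.alibaba.datax.common.plugin.RecordSender;
import com.alibaba.datax.common.spi.Reader;
import com.alibaba.datax.common.util.Configuration;

public class DemoReader extends Reader {

    public static class Job extends Reader.Job {
        private Configuration jobConf;

        @Override
        public void init() {
            // The "parameter" block of this reader's entry in job.json.
            this.jobConf = super.getPluginJobConf();
        }

        @Override
        public List<Configuration> split(int adviceNumber) {
            // Split the job into per-task configurations; the framework runs
            // one Task per configuration, spread across parallel channels.
            List<Configuration> splits = new ArrayList<>();
            for (int i = 0; i < adviceNumber; i++) {
                splits.add(this.jobConf.clone());
            }
            return splits;
        }

        @Override
        public void destroy() {
        }
    }

    public static class Task extends Reader.Task {

        @Override
        public void init() {
        }

        @Override
        public void startRead(RecordSender recordSender) {
            // Pull rows from the source and hand them to the framework,
            // which delivers them to the paired Writer plugin.
            Record record = recordSender.createRecord();
            record.addColumn(new StringColumn("hello, datax"));
            recordSender.sendToWriter(record);
        }

        @Override
        public void destroy() {
        }
    }
}

In the real codebase each plugin also ships a small plugin.json descriptor mapping the plugin name used in job files (e.g. "mysqlreader") to its implementation class, which is how the "name" fields in job configurations are resolved at runtime.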

Detailed Introduction to DataX

See: DataX-Introduction

Quick Start

Download: DataX download link
See: Quick Start

Supported Data Channels

DataX now has a fairly complete plugin ecosystem: mainstream RDBMS databases, NoSQL stores, and big data computing systems are all supported, as listed in the table below; for details see the DataX data source reference guide. Every entry plugs into the same job skeleton shown in Getting Started above; only the reader/writer "name" and its "parameter" block change.

| Type | Data Source | Reader | Writer | Docs |
|------|-------------|--------|--------|------|
| RDBMS (relational databases) | MySQL | √ | √ | read, write |
| | Oracle | √ | √ | read, write |
| | OceanBase | √ | √ | read, write |
| | SQLServer | √ | √ | read, write |
| | PostgreSQL | √ | √ | read, write |
| | DRDS | √ | √ | read, write |
| | Kingbase | √ | √ | read, write |
| | Generic RDBMS (supports all relational databases) | √ | √ | read, write |
| Alibaba Cloud data warehouse storage | ODPS | √ | √ | read, write |
| | ADB | | √ | write |
| | ADS | | √ | write |
| | OSS | √ | √ | read, write |
| | OCS | | √ | write |
| | Hologres | | √ | write |
| | AnalyticDB For PostgreSQL | | √ | write |
| Alibaba Cloud middleware | datahub | √ | √ | read, write |
| | SLS | √ | √ | read, write |
| Graph databases | Alibaba Cloud GDB | √ | √ | read, write |
| | Neo4j | | √ | write |
| NoSQL data stores | OTS | √ | √ | read, write |
| | Hbase0.94 | √ | √ | read, write |
| | Hbase1.1 | √ | √ | read, write |
| | Phoenix4.x | √ | √ | read, write |
| | Phoenix5.x | √ | √ | read, write |
| | MongoDB | √ | √ | read, write |
| | Cassandra | √ | √ | read, write |
| Data warehouse storage | StarRocks | √ | √ | read, write |
| | ApacheDoris | | √ | write |
| | ClickHouse | √ | √ | read, write |
| | Databend | | √ | write |
| | Hive | √ | √ | read, write |
| | kudu | | √ | write |
| | selectdb | | √ | write |
| Unstructured data storage | TxtFile | √ | √ | read, write |
| | FTP | √ | √ | read, write |
| | HDFS | √ | √ | read, write |
| | Elasticsearch | | √ | write |
| Time-series databases | OpenTSDB | √ | | read |
| | TSDB | √ | √ | read, write |
| | TDengine | √ | √ | read, write |

Alibaba Cloud DataWorks Data Integration

All of DataX's existing capabilities have been fully merged into Alibaba Cloud Data Integration, which is more efficient and secure than DataX and adds advanced features that DataX lacks. Data Integration can be understood as a comprehensively upgraded commercial version of DataX that provides enterprises with stable, reliable, and secure data transfer services. Compared with DataX, Data Integration stands out in the following ways:

  • Real-time synchronization is supported.
  • The range of data sources for offline synchronization is greatly expanded.

Developing a New Plugin

See: the DataX Plugin Development Guide

Important Release Notes

DataX plans monthly iterative updates going forward, and pull requests from interested contributors are welcome. Monthly updates are listed below.

  • [datax_v202309](https://github.com/alibaba/DataX/releases/tag/datax_v202309)

    • Phoenix synchronization now supports adding WHERE conditions
    • Added read/write plugins for Huawei GaussDB
    • Fixed the ClickReader plugin runtime error "Can't find bundle for base name"
    • Added a DataX debugging module
    • Fixed an error on empty ORC files
    • Improved obwriter performance
    • txtfilewriter can now export data as INSERT statements
    • HdfsReader/HdfsWriter now support reading and writing Parquet
  • [datax_v202308](https://github.com/alibaba/DataX/releases/tag/datax_v202308)

    • OTS plugin updates
    • databend plugin updates
    • OceanBase driver fix
  • [datax_v202306](https://github.com/alibaba/DataX/releases/tag/datax_v202306)

    • Code cleanup
    • New plugins (neo4jwriter, clickhousewriter)
    • Plugin improvements and bug fixes (oceanbase, hdfs, databend, txtfile)
  • [datax_v202303](https://github.com/alibaba/DataX/releases/tag/datax_v202303)

    • Code cleanup
    • New plugins (adbmysqlwriter, databendwriter, selectdbwriter)
    • Plugin improvements and bug fixes (sqlserver, hdfs, cassandra, kudu, oss)
    • Upgraded fastjson to fastjson2
  • [datax_v202210](https://github.com/alibaba/DataX/releases/tag/datax_v202210)

    • Channel capability updates (OceanBase, Tdengine, Doris, etc.)
  • [datax_v202209](https://github.com/alibaba/DataX/releases/tag/datax_v202209)

    • Channel capability updates (MaxCompute, Datahub, SLS, etc.), security vulnerability fixes, and general packaging updates
  • [datax_v202205](https://github.com/alibaba/DataX/releases/tag/datax_v202205)

    • Channel capability updates (MaxCompute, Hologres, OSS, Tdengine, etc.), security vulnerability fixes, and general packaging updates

Project Members

Core contributors: 言柏, 枕水, 秋奇, 青砾, 一斅, 云时

Thanks to 天烬, 光戈, 祁然, 巴真, and 静行 for their contributions to DataX.

License

This software is free to use under the Apache License.

Please report issues to us promptly at: DataxIssue

Enterprise Users of the Open-Source DataX


We are hiring on an ongoing basis. Contact: datax@alibabacloud.com
[Java Developer Position]
Title: Senior Java Developer / Expert / Senior Expert
Experience: 2+ years
Education: Bachelor's degree (negotiable for strong candidates)
Target level: P6/P7/P8

Responsibilities:
    1. Design and develop the Alibaba Cloud big data platform (数加).
    2. Develop big data products for government and enterprise customers.
    3. Use large-scale machine learning algorithms to mine relationships in data and explore product applications of data mining in real-world scenarios.
    4. One-stop big data development platform
    5. Big data task scheduling engine
    6. Task execution engine
    7. Task monitoring and alerting
    8. Synchronization of massive heterogeneous data

Requirements:
    1. 3+ years of Java web development experience.
    2. Solid command of core Java technologies, including the JVM, class loading, threading, concurrency, I/O resource management, and networking.
    3. Proficiency with common Java frameworks and a keen sense for new ones; deep understanding of object orientation, design principles, encapsulation, and abstraction.
    4. Familiarity with HTML/HTML5, JavaScript, and SQL.
    5. Strong execution, teamwork, and professionalism.
    6. Deep understanding of design patterns and where to apply them is a plus.
    7. Strong analytical and hands-on problem-solving skills and a real passion for technology are preferred.
    8. Hands-on project or product experience with high concurrency, high availability, high performance, or big data processing is preferred.
    9. Experience with big data products, cloud products, or middleware solutions is preferred.

User Support:

Our DingTalk group is currently affected by some moderation policies, so we recommend submitting questions here as GitHub Issues first. The DataX developers and community answer Issues regularly, and a growing knowledge base will also help future users.