DataX

DataX is the open-source version of Alibaba Cloud DataWorks Data Integration.


Top Related Projects

  • Apache NiFi
  • Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
  • Apache Flink
  • Apache Spark - A unified analytics engine for large-scale data processing
  • Airbyte - The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Quick Overview

DataX is an open-source, high-performance data synchronization tool developed by Alibaba. It efficiently transfers data between various heterogeneous data sources, including relational databases, NoSQL databases, and big data platforms. DataX supports more than 20 different data sources and can be easily extended to support new ones.

Pros

  • High performance and scalability, capable of handling large-scale data synchronization tasks
  • Supports a wide range of data sources out of the box
  • Easily extensible architecture for adding new data source plugins
  • Provides a simple and user-friendly configuration interface

Cons

  • Documentation is primarily in Chinese, which may be challenging for non-Chinese speakers
  • Some reported issues with stability and error handling in certain scenarios
  • Limited community support compared to some other data integration tools
  • Requires Java environment, which may not be ideal for all use cases

Getting Started

  1. Clone the DataX repository:

    git clone https://github.com/alibaba/DataX.git
    
  2. Build DataX:

    cd DataX
    mvn -U clean package assembly:assembly -Dmaven.test.skip=true
    
  3. Create a JSON configuration file (e.g., job.json) defining your data synchronization task:

    {
      "job": {
        "content": [
          {
            "reader": {
              "name": "mysqlreader",
              "parameter": {
                "username": "root",
                "password": "root",
                "column": ["*"],
                "connection": [
                  {
                    "table": ["user"],
                    "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/database"]
                  }
                ]
              }
            },
            "writer": {
              "name": "hdfswriter",
              "parameter": {
                "defaultFS": "hdfs://localhost:9000",
                "fileType": "text",
                "path": "/user/hive/warehouse/result",
                "fileName": "user",
                "column": [
                  {
                    "name": "col1",
                    "type": "STRING"
                  },
                  {
                    "name": "col2",
                    "type": "LONG"
                  }
                ],
                "writeMode": "append",
                "fieldDelimiter": ","
              }
            }
          }
        ],
        "setting": {
          "speed": {
            "channel": 3
          }
        }
      }
    }
    
  4. Run DataX with your configuration (from the generated target/datax/datax directory):

    python bin/datax.py job.json
    

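Before pointing DataX at real databases, it can be handy to smoke-test a fresh build with the bundled streamreader and streamwriter plugins, which fabricate and print rows without any external systems. The sketch below generates such a job file with Python's standard library; the plugin names and parameter keys follow the streamreader/streamwriter plugin docs, while the column values and counts here are arbitrary examples.

```python
import json


def build_stream_job(rows: int = 5, channels: int = 1) -> dict:
    """Build a self-contained DataX smoke-test job:
    streamreader fabricates rows, streamwriter prints them."""
    return {
        "job": {
            "content": [{
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        # Each slice emits these columns, rows times.
                        "column": [
                            {"type": "string", "value": "hello"},
                            {"type": "long", "value": "42"},
                        ],
                        "sliceRecordCount": rows,
                    },
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"print": True},
                },
            }],
            # channel = number of parallel transfer threads
            "setting": {"speed": {"channel": channels}},
        }
    }


if __name__ == "__main__":
    with open("stream_test.json", "w") as f:
        json.dump(build_stream_job(), f, ensure_ascii=False, indent=2)
```

Running `python bin/datax.py stream_test.json` against the generated file should print the fabricated rows followed by DataX's job statistics summary.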
Competitor Comparisons

Apache NiFi

Pros of NiFi

  • More comprehensive data flow management with a web-based UI for designing, controlling, and monitoring data flows
  • Supports a wider range of data sources and destinations out-of-the-box
  • Offers real-time data processing and transformation capabilities

Cons of NiFi

  • Steeper learning curve due to its more complex architecture and extensive feature set
  • Requires more system resources for deployment and operation
  • May be overkill for simpler data integration tasks

Code Comparison

DataX configuration (JSON):

{
  "job": {
    "content": [
      {
        "reader": { "name": "mysqlreader", "parameter": { ... } },
        "writer": { "name": "hdfswriter", "parameter": { ... } }
      }
    ]
  }
}

NiFi flow configuration (XML):

<processor>
  <id>abc123</id>
  <name>GetFile</name>
  <config>
    <property name="Input Directory">/path/to/input</property>
  </config>
</processor>

Both projects use configuration files to define data flows, but NiFi's approach is more visual and interactive through its web UI, while DataX relies on JSON configuration files for job definitions.

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Pros of Airflow

  • More comprehensive workflow management system with a rich ecosystem of plugins and integrations
  • Provides a web-based UI for monitoring and managing workflows
  • Supports complex dependencies and scheduling options

Cons of Airflow

  • Steeper learning curve due to its extensive features and configuration options
  • Requires more resources to set up and maintain compared to DataX
  • Can be overkill for simple data transfer tasks

Code Comparison

DataX (JSON configuration):

{
  "job": {
    "content": [
      {
        "reader": { "name": "mysqlreader", "parameter": { ... } },
        "writer": { "name": "hdfswriter", "parameter": { ... } }
      }
    ]
  }
}

Airflow (Python DAG):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def transfer_data():
    # Data transfer logic here
    pass

default_args = {"start_date": datetime(2023, 1, 1)}
dag = DAG('data_transfer', default_args=default_args, schedule_interval='@daily')
transfer_task = PythonOperator(task_id='transfer_data', python_callable=transfer_data, dag=dag)

Apache Flink

Pros of Flink

  • Powerful stream processing capabilities with support for both batch and real-time data processing
  • Extensive ecosystem with built-in libraries for complex event processing, machine learning, and graph processing
  • High scalability and fault tolerance with exactly-once processing semantics

Cons of Flink

  • Steeper learning curve due to its complex architecture and concepts
  • Higher resource requirements, especially for smaller-scale data processing tasks
  • Less focus on data integration across diverse sources compared to DataX

Code Comparison

Flink (Java):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())
    .keyBy(0)
    .sum(1);
counts.print();

DataX (JSON configuration):

{
    "job": {
        "content": [{
            "reader": {"name": "txtfilereader", "parameter": {"path": ["input.txt"]}},
            "writer": {"name": "streamwriter", "parameter": {}}
        }]
    }
}

The code snippets showcase Flink's programming model for stream processing, while DataX uses a configuration-based approach for data transfer between different sources and destinations.

Apache Spark - A unified analytics engine for large-scale data processing

Pros of Spark

  • Powerful distributed computing engine for large-scale data processing
  • Supports multiple programming languages (Scala, Java, Python, R)
  • Rich ecosystem with libraries for machine learning, graph processing, and streaming

Cons of Spark

  • Steeper learning curve and more complex setup compared to DataX
  • Higher resource requirements, especially for smaller datasets
  • Less focused on data integration and ETL tasks specifically

Code Comparison

DataX (Java):

public static void main(String[] args) throws Exception {
    Engine.entry(args);
}

Spark (Scala):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Example").getOrCreate()
val df = spark.read.json("example.json")
df.show()

Key Differences

  • DataX is primarily designed for efficient data transfer between various data sources and targets
  • Spark is a more general-purpose distributed computing framework with broader applications
  • DataX has a simpler architecture focused on ETL tasks, while Spark offers a wider range of data processing capabilities
  • Spark provides more advanced analytics features, including machine learning and graph processing
  • DataX may be easier to set up and use for straightforward data integration tasks
Airbyte - The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Pros of Airbyte

  • More extensive connector library with 300+ pre-built connectors
  • Active open-source community with frequent updates and contributions
  • User-friendly web interface for configuration and monitoring

Cons of Airbyte

  • Potentially higher resource consumption due to containerized architecture
  • Steeper learning curve for custom connector development
  • Less mature compared to DataX, which has been in development longer

Code Comparison

DataX (Java):

public static void main(String[] args) throws Exception {
    Engine.entry(args);
}

Airbyte (Python, illustrative sketch):

def run():
    source = SourcePostgres()
    destination = DestinationBigQuery()
    orchestrator = Orchestrator(source, destination)
    orchestrator.run()

Both projects aim to facilitate data integration, but their implementations differ. DataX uses a Java-based engine for data synchronization, while Airbyte employs a modular Python-based approach with containerized connectors. Airbyte's code structure emphasizes flexibility and ease of adding new connectors, whereas DataX focuses on performance and stability for large-scale data transfer scenarios.

README

DataX

Leaderboard

DataX is the open-source version of Alibaba Cloud DataWorks Data Integration and an offline data synchronization tool/platform used widely within Alibaba Group. DataX implements efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, OceanBase, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), Hologres, DRDS, Databend, and more.

DataX Commercial Edition

Alibaba Cloud DataWorks Data Integration is the DataX team's commercial product on Alibaba Cloud. It is dedicated to providing high-speed, stable data movement between a rich set of heterogeneous data sources in complex network environments, along with data synchronization solutions for complex business scenarios. It currently serves nearly 3,000 cloud customers and synchronizes more than 3 trillion records per day. DataWorks Data Integration supports 50+ data sources for offline synchronization and covers solutions such as whole-database migration, batch migration to the cloud, incremental synchronization, and sharded databases/tables. Real-time synchronization capability was added in 2020, supporting arbitrary read/write combinations across 10+ data sources and offering one-click full-plus-incremental synchronization from sources such as MySQL and Oracle to Alibaba Cloud big data engines such as MaxCompute and Hologres.

For the commercial edition, see: https://www.aliyun.com/product/bigdata/ide

Features

As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plugins, which read data from a source, and Writer plugins, which write data to a target; in principle, the DataX framework can synchronize data between data sources of any type. The plugin system also forms an ecosystem: each newly added data source immediately becomes interoperable with every existing one.
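DataX itself is written in Java, but the composition idea can be sketched in a few lines of Python (a hypothetical miniature, not the real plugin API): any object that can yield records pairs with any object that can consume them, so one new Reader or Writer immediately connects to every existing counterpart.

```python
from typing import Iterable, Iterator, List

Record = List[object]


class Reader:
    """Source-side plugin: yields records from some data source."""
    def read(self) -> Iterator[Record]:
        raise NotImplementedError


class Writer:
    """Target-side plugin: consumes records into some data sink."""
    def write(self, records: Iterable[Record]) -> int:
        raise NotImplementedError


class ListReader(Reader):
    """Toy reader backed by an in-memory list."""
    def __init__(self, rows: List[Record]):
        self.rows = rows

    def read(self) -> Iterator[Record]:
        yield from self.rows


class ListWriter(Writer):
    """Toy writer that collects records into a list."""
    def __init__(self):
        self.sink: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        for r in records:
            self.sink.append(r)
        return len(self.sink)


def run_job(reader: Reader, writer: Writer) -> int:
    """The framework moves records from any Reader to any Writer,
    so N readers and M writers yield N*M possible channels."""
    return writer.write(reader.read())
```

In real DataX the equivalent roles are filled by Reader/Writer plugin JARs discovered by name from the job JSON, with records flowing through the framework's channel rather than a direct call.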

Detailed Introduction to DataX

See: DataX-Introduction

Quick Start

Download: DataX download link
See: Quick Start

Supported Data Channels

DataX now has a fairly comprehensive plugin ecosystem: mainstream RDBMS databases, NoSQL stores, and big data computing systems are all supported. The currently supported data sources are listed in the table below; for details, see: DataX Data Source Reference Guide

| Type | Data Source | Reader | Writer | Docs |
|------|-------------|:------:|:------:|------|
| RDBMS (relational databases) | MySQL | √ | √ | read, write |
| | Oracle | √ | √ | read, write |
| | OceanBase | √ | √ | read, write |
| | SQLServer | √ | √ | read, write |
| | PostgreSQL | √ | √ | read, write |
| | DRDS | √ | √ | read, write |
| | Kingbase | √ | √ | read, write |
| | Generic RDBMS (supports all relational databases) | √ | √ | read, write |
| Alibaba Cloud data warehouse storage | ODPS | √ | √ | read, write |
| | ADB | | √ | write |
| | ADS | | √ | write |
| | OSS | √ | √ | read, write |
| | OCS | | √ | write |
| | Hologres | | √ | write |
| | AnalyticDB For PostgreSQL | | √ | write |
| Alibaba Cloud middleware | datahub | √ | √ | read, write |
| | SLS | √ | √ | read, write |
| Graph databases | Alibaba Cloud GDB | √ | √ | read, write |
| | Neo4j | | √ | write |
| NoSQL data storage | OTS | √ | √ | read, write |
| | Hbase0.94 | √ | √ | read, write |
| | Hbase1.1 | √ | √ | read, write |
| | Phoenix4.x | √ | √ | read, write |
| | Phoenix5.x | √ | √ | read, write |
| | MongoDB | √ | √ | read, write |
| | Cassandra | √ | √ | read, write |
| Data warehouse storage | StarRocks | √ | √ | read, write |
| | ApacheDoris | | √ | write |
| | ClickHouse | √ | √ | read, write |
| | Databend | | √ | write |
| | Hive | √ | √ | read, write |
| | kudu | | √ | write |
| | selectdb | | √ | write |
| Unstructured data storage | TxtFile | √ | √ | read, write |
| | FTP | √ | √ | read, write |
| | HDFS | √ | √ | read, write |
| | Elasticsearch | | √ | write |
| Time series databases | OpenTSDB | √ | | read |
| | TSDB | √ | √ | read, write |
| | TDengine | √ | √ | read, write |
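The reader and writer names used in a job file are generally derived from the source names in the table above: lowercase the name and append reader or writer (mysqlreader, hdfswriter, and so on). Below is a small illustrative helper, with the caveat that a few plugins deviate from this convention (for example, the generic RDBMS plugin is named rdbmsreader/rdbmswriter), so always check the plugin directory of your DataX build:

```python
def plugin_name(source: str, role: str) -> str:
    """Derive the conventional DataX plugin name for a data source.

    role must be 'reader' or 'writer'. Caveat: a few plugins deviate
    from this naming convention; verify against the plugin/ directory
    of your DataX build before relying on the result.
    """
    if role not in ("reader", "writer"):
        raise ValueError("role must be 'reader' or 'writer'")
    return source.lower().replace(" ", "") + role
```

For example, `plugin_name("MySQL", "reader")` yields `mysqlreader`, the name used in the Getting Started job above.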

Alibaba Cloud DataWorks Data Integration

All of DataX's existing capabilities have been fully merged into Alibaba Cloud Data Integration, which is more efficient and secure than DataX and additionally offers advanced features that DataX lacks. Data Integration can be understood as a comprehensively upgraded commercial version of DataX that provides enterprises with stable, reliable, and secure data transfer services. Compared with DataX, Data Integration stands out in the following ways:

Real-time synchronization support:

A greatly expanded range of data sources for offline synchronization:

Developing a New Plugin

See: DataX Plugin Development Guide

Important Release Notes

DataX is planned to iterate with monthly releases, and pull requests from interested contributors are welcome. Monthly updates are listed below.

  • [datax_v202309](https://github.com/alibaba/DataX/releases/tag/datax_v202309)

    • Support WHERE conditions when synchronizing data with Phoenix
    • Support Huawei GaussDB reader/writer plugins
    • Fix the ClickReader plugin runtime error "Can't find bundle for base name"
    • Add a DataX debugging module
    • Fix an error on empty ORC files
    • Optimize obwriter performance
    • Add export-as-INSERT-statement support to txtfilewriter
    • Add Parquet read/write support to HdfsReader/HdfsWriter
  • [datax_v202308](https://github.com/alibaba/DataX/releases/tag/datax_v202308)

    • OTS plugin updates
    • databend plugin updates
    • OceanBase driver fixes
  • [datax_v202306](https://github.com/alibaba/DataX/releases/tag/datax_v202306)

    • Code cleanup
    • New plugins (neo4jwriter, clickhousewriter)
    • Plugin optimizations and bug fixes (oceanbase, hdfs, databend, txtfile)
  • [datax_v202303](https://github.com/alibaba/DataX/releases/tag/datax_v202303)

    • Code cleanup
    • New plugins (adbmysqlwriter, databendwriter, selectdbwriter)
    • Plugin optimizations and bug fixes (sqlserver, hdfs, cassandra, kudu, oss)
    • Upgrade fastjson to fastjson2
  • [datax_v202210](https://github.com/alibaba/DataX/releases/tag/datax_v202210)

    • Channel capability updates (OceanBase, TDengine, Doris, etc.)
  • [datax_v202209](https://github.com/alibaba/DataX/releases/tag/datax_v202209)

    • Channel capability updates (MaxCompute, DataHub, SLS, etc.), security vulnerability fixes, general packaging updates, etc.
  • [datax_v202205](https://github.com/alibaba/DataX/releases/tag/datax_v202205)

    • Channel capability updates (MaxCompute, Hologres, OSS, TDengine, etc.), security vulnerability fixes, general packaging updates, etc.

Project Members

Core contributors: 言柏, 枕水, 秋奇, 青砾, 一斅, 云时

Thanks to 天烬, 光戈, 祁然, 巴真, and 静行 for their contributions to DataX.

License

This software is free to use under the Apache License.

Please report issues to us promptly at: DataxIssue

Enterprise Users of Open-Source DataX


We are hiring on an ongoing basis. Contact: datax@alibabacloud.com
[Java Developer Positions]
Title: Senior Java Developer / Expert / Senior Expert
Experience: 2+ years
Education: Bachelor's degree (none of these are hard requirements for strong candidates)
Expected level: P6/P7/P8

Responsibilities:
    1. Design and develop the Alibaba Cloud big data platform (数加).
    2. Develop big data products for government and enterprise customers;
    3. Apply large-scale machine learning algorithms to mine relationships in data and explore product applications of data mining in real-world scenarios;
    4. One-stop big data development platform
    5. Big data task scheduling engine
    6. Task execution engine
    7. Task monitoring and alerting
    8. Large-scale heterogeneous data synchronization

Requirements:
    1. 3+ years of Java web development experience;
    2. Familiarity with the core Java technology stack, including the JVM, class loading, threading, concurrency, I/O resource management, and networking;
    3. Proficiency with common Java frameworks and a keen sense for new ones; deep understanding of object orientation, design principles, encapsulation, and abstraction;
    4. Familiarity with HTML/HTML5 and JavaScript; familiarity with SQL;
    5. Strong execution, excellent teamwork, and professionalism;
    6. Deep understanding of design patterns and their use cases is a plus;
    7. Strong analytical and hands-on problem-solving skills and a strong drive for technology are preferred;
    8. Hands-on project or product experience with high concurrency, high availability, high performance, or big data processing is preferred;
    9. Experience with big data products, cloud products, or middleware solutions is preferred.

User support:

The DingTalk group is currently affected by some management policies, so please submit questions here as Issues first. The DataX developers and community answer Issues regularly, and as the knowledge base grows it will also help future users.