Top Related Projects
Apache NiFi
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Apache Flink
Apache Spark - A unified analytics engine for large-scale data processing
Airbyte - The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Quick Overview
DataX is an open-source, high-performance data synchronization tool developed by Alibaba. It efficiently transfers data between various heterogeneous data sources, including relational databases, NoSQL databases, and big data platforms. DataX supports more than 20 different data sources and can be easily extended to support new ones.
Pros
- High performance and scalability, capable of handling large-scale data synchronization tasks
- Supports a wide range of data sources out of the box
- Easily extensible architecture for adding new data source plugins
- Jobs are defined in plain JSON configuration files, so simple transfers need no custom code
Cons
- Documentation is primarily in Chinese, which may be challenging for non-Chinese speakers
- Some reported issues with stability and error handling in certain scenarios
- Limited community support compared to some other data integration tools
- Requires Java environment, which may not be ideal for all use cases
Getting Started
- Clone the DataX repository:

    git clone https://github.com/alibaba/DataX.git

- Build DataX:

    cd DataX
    mvn -U clean package assembly:assembly -Dmaven.test.skip=true

- Create a JSON configuration file (e.g., job.json) defining your data synchronization task:

    {
      "job": {
        "content": [
          {
            "reader": {
              "name": "mysqlreader",
              "parameter": {
                "username": "root",
                "password": "root",
                "column": ["*"],
                "connection": [
                  {
                    "table": ["user"],
                    "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/database"]
                  }
                ]
              }
            },
            "writer": {
              "name": "hdfswriter",
              "parameter": {
                "defaultFS": "hdfs://localhost:9000",
                "fileType": "text",
                "path": "/user/hive/warehouse/result",
                "fileName": "user",
                "column": [
                  { "name": "col1", "type": "STRING" },
                  { "name": "col2", "type": "LONG" }
                ],
                "writeMode": "append",
                "fieldDelimiter": ","
              }
            }
          }
        ],
        "setting": {
          "speed": { "channel": 3 }
        }
      }
    }

- Run DataX with your configuration:

    python bin/datax.py job.json
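When several jobs differ only in table names or paths, the configuration can be generated by a script instead of being edited by hand. The following is a minimal sketch in Python, assuming the same `mysqlreader` and `hdfswriter` parameters as the example job; the helper name `build_job` and its arguments are illustrative and not part of DataX.

```python
import json


def build_job(table, jdbc_url, hdfs_path, columns, channels=3):
    """Return a DataX job dict that mirrors the example configuration.

    `build_job` is a hypothetical helper; only the JSON keys come from the example.
    """
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": "root",
            "password": "root",
            # Read the same columns the writer declares, in the same order.
            "column": [c["name"] for c in columns],
            "connection": [{"table": [table], "jdbcUrl": [jdbc_url]}],
        },
    }
    writer = {
        "name": "hdfswriter",
        "parameter": {
            "defaultFS": "hdfs://localhost:9000",
            "fileType": "text",
            "path": hdfs_path,
            "fileName": table,
            "column": columns,
            "writeMode": "append",
            "fieldDelimiter": ",",
        },
    }
    return {
        "job": {
            "content": [{"reader": reader, "writer": writer}],
            "setting": {"speed": {"channel": channels}},
        }
    }


job = build_job(
    table="user",
    jdbc_url="jdbc:mysql://127.0.0.1:3306/database",
    hdfs_path="/user/hive/warehouse/result",
    columns=[{"name": "col1", "type": "STRING"}, {"name": "col2", "type": "LONG"}],
)
job_json = json.dumps(job, indent=2)
print(job_json)
```

Listing the reader columns explicitly, rather than `*`, is a design choice of this sketch: it keeps the reader and writer column lists visibly aligned. The printed JSON can be saved as job.json and passed to the launcher.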
Competitor Comparisons
Apache NiFi
Pros of NiFi
- More comprehensive data flow management with a web-based UI for designing, controlling, and monitoring data flows
- Supports a wider range of data sources and destinations out-of-the-box
- Offers real-time data processing and transformation capabilities
Cons of NiFi
- Steeper learning curve due to its more complex architecture and extensive feature set
- Requires more system resources for deployment and operation
- May be overkill for simpler data integration tasks
Code Comparison
DataX configuration (JSON):
{
  "job": {
    "content": [
      {
        "reader": { "name": "mysqlreader", "parameter": { ... } },
        "writer": { "name": "hdfswriter", "parameter": { ... } }
      }
    ]
  }
}
NiFi flow configuration (XML):
<processor>
  <id>abc123</id>
  <name>GetFile</name>
  <config>
    <property name="Input Directory">/path/to/input</property>
  </config>
</processor>
Both projects use configuration files to define data flows, but NiFi's approach is more visual and interactive through its web UI, while DataX relies on JSON configuration files for job definitions.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Pros of Airflow
- More comprehensive workflow management system with a rich ecosystem of plugins and integrations
- Provides a web-based UI for monitoring and managing workflows
- Supports complex dependencies and scheduling options
Cons of Airflow
- Steeper learning curve due to its extensive features and configuration options
- Requires more resources to set up and maintain compared to DataX
- Can be overkill for simple data transfer tasks
Code Comparison
DataX (JSON configuration):
{
  "job": {
    "content": [
      {
        "reader": { "name": "mysqlreader", "parameter": { ... } },
        "writer": { "name": "hdfswriter", "parameter": { ... } }
      }
    ]
  }
}
Airflow (Python DAG):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path

def transfer_data():
    # Data transfer logic here
    pass

# Example start date; adjust to your schedule
default_args = {'start_date': datetime(2024, 1, 1)}
dag = DAG('data_transfer', default_args=default_args, schedule_interval='@daily')
transfer_task = PythonOperator(task_id='transfer_data', python_callable=transfer_data, dag=dag)
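Because DataX is started from the command line, a scheduler such as Airflow can wrap it in a task. Below is a minimal sketch of what `transfer_data` might call, assuming DataX lives under a hypothetical `/opt/datax` directory and is launched with `python bin/datax.py job.json` as in Getting Started; `build_datax_command` and `run_datax` are illustrative names, not part of DataX or Airflow.

```python
import subprocess
from pathlib import Path


def build_datax_command(datax_home, job_file):
    """Return the argument list for launching one DataX job.

    Mirrors `python bin/datax.py job.json`; the paths passed in are assumptions.
    """
    launcher = Path(datax_home) / "bin" / "datax.py"
    return ["python", str(launcher), str(job_file)]


def run_datax(datax_home, job_file):
    """Run the job and raise CalledProcessError if DataX exits non-zero."""
    command = build_datax_command(datax_home, job_file)
    subprocess.run(command, check=True)


# Build (but do not execute) the command for the example job.
command = build_datax_command("/opt/datax", "job.json")
print(command)
```

Calling `run_datax(...)` from inside `transfer_data` would make a failed synchronization raise an exception, which in turn marks the Airflow task as failed.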
Apache Flink
Pros of Flink
- Powerful stream processing capabilities with support for both batch and real-time data processing
- Extensive ecosystem with built-in libraries for complex event processing, machine learning, and graph processing
- High scalability and fault tolerance with exactly-once processing semantics
Cons of Flink
- Steeper learning curve due to its complex architecture and concepts
- Higher resource requirements, especially for smaller-scale data processing tasks
- Less focus on data integration across diverse sources compared to DataX
Code Comparison
Flink (Java):
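StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())    // Tokenizer: user-defined FlatMapFunction emitting (word, 1)
    .keyBy(value -> value.f0)    // group by the word field
    .sum(1);                     // add up the counts per word
counts.print();
env.execute("WordCount");        // the job only runs once execute() is called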
DataX (JSON configuration):
{
  "job": {
    "content": [{
      "reader": {"name": "txtfilereader", "parameter": {"path": ["input.txt"]}},
      "writer": {"name": "streamwriter", "parameter": {}}
    }]
  }
}
The code snippets showcase Flink's programming model for stream processing, while DataX uses a configuration-based approach for data transfer between different sources and destinations.
Apache Spark - A unified analytics engine for large-scale data processing
Pros of Spark
- Powerful distributed computing engine for large-scale data processing
- Supports multiple programming languages (Scala, Java, Python, R)
- Rich ecosystem with libraries for machine learning, graph processing, and streaming
Cons of Spark
- Steeper learning curve and more complex setup compared to DataX
- Higher resource requirements, especially for smaller datasets
- Less focused on data integration and ETL tasks specifically
Code Comparison
DataX (Java):
public static void main(String[] args) throws Exception {
    Engine.entry(args);
}
Spark (Scala):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Example").getOrCreate()
val df = spark.read.json("example.json")
df.show()
Key Differences
- DataX is primarily designed for efficient data transfer between various data sources and targets
- Spark is a more general-purpose distributed computing framework with broader applications
- DataX has a simpler architecture focused on ETL tasks, while Spark offers a wider range of data processing capabilities
- Spark provides more advanced analytics features, including machine learning and graph processing
- DataX may be easier to set up and use for straightforward data integration tasks
Airbyte - The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Pros of Airbyte
- More extensive connector library with more than 300 pre-built connectors
- Active open-source community with frequent updates and contributions
- User-friendly web interface for configuration and monitoring
Cons of Airbyte
- Potentially higher resource consumption due to containerized architecture
- Steeper learning curve for custom connector development
- Less mature compared to DataX, which has been in development longer
Code Comparison
DataX (Java):
public static void main(String[] args) throws Exception {
    Engine.entry(args);
}
Airbyte (illustrative Python pseudocode, not Airbyte's actual API):
def run():
    source = SourcePostgres()
    destination = DestinationBigQuery()
    orchestrator = Orchestrator(source, destination)
    orchestrator.run()
Both projects aim to facilitate data integration, but their implementations differ. DataX uses a Java-based engine for data synchronization, while Airbyte employs a modular approach with containerized connectors. Airbyte's structure emphasizes flexibility and ease of adding new connectors, whereas DataX focuses on performance and stability for large-scale data transfer scenarios.
README
DataX
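DataX is the open-source version of Alibaba Cloud DataWorks Data Integration, an offline data synchronization tool/platform widely used within Alibaba Group. DataX provides efficient data synchronization between heterogeneous data sources including MySQL, Oracle, OceanBase, SqlServer, Postgre, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), Hologres, DRDS, and databend.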
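DataX Commercial Version
Alibaba Cloud DataWorks Data Integration is the commercial product that the DataX team offers on Alibaba Cloud. It aims to provide fast, stable data movement between a rich set of heterogeneous data sources in complex network environments, as well as data synchronization solutions for complex business scenarios. It currently serves nearly 3,000 cloud customers and synchronizes more than 3 trillion records per day. DataWorks Data Integration currently supports 50+ offline data sources and offers synchronization solutions such as whole-database migration, batch migration to the cloud, incremental synchronization, and sharded databases and tables. Real-time synchronization was added in 2020 and supports arbitrary read/write combinations across 10+ data sources. It also provides one-click full and incremental synchronization from data sources such as MySQL and Oracle to big data engines such as Alibaba Cloud MaxCompute and Hologres.
For the commercial version, see: https://www.aliyun.com/product/bigdata/ide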
Features
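As a data synchronization framework, DataX abstracts synchronization between different data sources into Reader plugins, which read data from the source, and Writer plugins, which write data to the target. In theory, the DataX framework can support synchronization for any type of data source. The DataX plugin system also works as an ecosystem: each newly integrated data source can immediately interoperate with all existing data sources.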
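The Reader/Writer split can be pictured with a small sketch. This is an illustration of the idea in Python, not DataX's actual Java plugin API: the class names `ListReader` and `ListWriter` and the `sync` function are hypothetical, and an in-memory queue stands in for the channel between the two plugins.

```python
from queue import Queue

_END = object()  # sentinel marking the end of the record stream


class ListReader:
    """Hypothetical Reader: pulls records from a source (here, a list)."""

    def __init__(self, rows):
        self.rows = rows

    def read(self, channel):
        for row in self.rows:
            channel.put(row)
        channel.put(_END)


class ListWriter:
    """Hypothetical Writer: pushes records to a target (here, a list)."""

    def __init__(self):
        self.written = []

    def write(self, channel):
        while True:
            row = channel.get()
            if row is _END:
                break
            self.written.append(row)


def sync(reader, writer):
    """Move every record from reader to writer through a shared channel."""
    channel = Queue()      # unbounded, so put() never blocks in this sketch
    reader.read(channel)   # single-threaded: fill the channel first
    writer.write(channel)  # then drain it
    return len(writer.written)


writer = ListWriter()
count = sync(ListReader([("alice", 1), ("bob", 2)]), writer)
print(count, writer.written)
```

Since `sync` only relies on the `read` and `write` methods, any reader can be paired with any writer, which is the property that lets a newly added data source interoperate with the existing ones.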
DataX Detailed Introduction
See: DataX-Introduction
Quick Start
Download DataX (download link)
See: Quick Start
Support Data Channels
DataX now has a fairly comprehensive plugin system covering mainstream RDBMS databases, NoSQL stores, and big data computing systems. The currently supported data sources are listed in the table below; for details, see the DataX Data Source Reference Guide.
| Type | Data source | Reader | Writer | Docs |
|---|---|---|---|---|
| RDBMS (relational databases) | MySQL | √ | √ | Read, Write |
| | Oracle | √ | √ | Read, Write |
| | OceanBase | √ | √ | Read, Write |
| | SQLServer | √ | √ | Read, Write |
| | PostgreSQL | √ | √ | Read, Write |
| | DRDS | √ | √ | Read, Write |
| | Kingbase | √ | √ | Read, Write |
| | Generic RDBMS (supports all relational databases) | √ | √ | Read, Write |
| Alibaba Cloud data warehouse storage | ODPS | √ | √ | Read, Write |
| | ADB | | √ | Write |
| | ADS | | √ | Write |
| | OSS | √ | √ | Read, Write |
| | OCS | | √ | Write |
| | Hologres | | √ | Write |
| | AnalyticDB For PostgreSQL | | √ | Write |
| Alibaba Cloud middleware | datahub | √ | √ | Read, Write |
| | SLS | √ | √ | Read, Write |
| Graph databases | Alibaba Cloud GDB | √ | √ | Read, Write |
| | Neo4j | | √ | Write |
| NoSQL data storage | OTS | √ | √ | Read, Write |
| | Hbase0.94 | √ | √ | Read, Write |
| | Hbase1.1 | √ | √ | Read, Write |
| | Phoenix4.x | √ | √ | Read, Write |
| | Phoenix5.x | √ | √ | Read, Write |
| | MongoDB | √ | √ | Read, Write |
| | Cassandra | √ | √ | Read, Write |
| Data warehouse storage | StarRocks | √ | √ | Read, Write |
| | ApacheDoris | | √ | Write |
| | ClickHouse | √ | √ | Read, Write |
| | Databend | | √ | Write |
| | Hive | √ | √ | Read, Write |
| | kudu | | √ | Write |
| | selectdb | | √ | Write |
| Unstructured data storage | TxtFile | √ | √ | Read, Write |
| | FTP | √ | √ | Read, Write |
| | HDFS | √ | √ | Read, Write |
| | Elasticsearch | | √ | Write |
| Time-series databases | OpenTSDB | √ | | Read |
| | TSDB | √ | √ | Read, Write |
| | TDengine | √ | √ | Read, Write |
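The table can be read as a lookup: a job is possible when the source has a Reader plugin and the target has a Writer plugin. Here is a minimal sketch of that check in Python, using a handful of rows copied from the table; the `SUPPORT` dictionary and the `can_sync` function are illustrative and not part of DataX.

```python
# (has_reader, has_writer) for a few rows of the support table.
SUPPORT = {
    "MySQL": (True, True),
    "HDFS": (True, True),
    "Elasticsearch": (False, True),
    "OpenTSDB": (True, False),
    "Neo4j": (False, True),
}


def can_sync(source, target):
    """Return True if `source` can be read and `target` can be written."""
    has_reader, _ = SUPPORT[source]
    _, has_writer = SUPPORT[target]
    return has_reader and has_writer


print(can_sync("MySQL", "Elasticsearch"))   # MySQL has a Reader, Elasticsearch a Writer
print(can_sync("Elasticsearch", "MySQL"))   # Elasticsearch has no Reader
print(can_sync("OpenTSDB", "HDFS"))         # OpenTSDB has a Reader, HDFS a Writer
```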
Alibaba Cloud DataWorks Data Integration
All existing DataX capabilities have been merged into Alibaba Cloud Data Integration, which is more efficient and secure than DataX and also offers advanced features that DataX lacks. Data Integration can be understood as the fully upgraded commercial version of DataX, providing enterprises with a stable, reliable, and secure data transfer service. Compared with DataX, Data Integration has the following main strengths:
Real-time synchronization support:
- Feature overview: https://help.aliyun.com/document_detail/181912.html
- Supported data sources: https://help.aliyun.com/document_detail/146778.html
- Supported data processing: https://help.aliyun.com/document_detail/146777.html
Greatly expanded range of offline synchronization data sources:
- New additions include DB2, Kafka, Hologres, MetaQ, SAPHANA, Dameng, and more, with further expansion ongoing
- Data sources supported by offline synchronization: https://help.aliyun.com/document_detail/137670.html
- Built-in synchronization solutions:
  - Solution system: https://help.aliyun.com/document_detail/171765.html
  - One-click full and incremental synchronization: https://help.aliyun.com/document_detail/175676.html
  - Whole-database migration: https://help.aliyun.com/document_detail/137809.html
  - Batch migration to the cloud: https://help.aliyun.com/document_detail/146671.html
  - For more capabilities and updates, see: https://help.aliyun.com/document_detail/137663.html
I Want to Develop a New Plugin
See: DataX Plugin Development Guide
Important Release Notes
DataX plans to ship monthly iterative updates going forward, and pull requests from interested contributors are welcome. The monthly updates are listed below.
- datax_v202309 (https://github.com/alibaba/DataX/releases/tag/datax_v202309)
  - Support adding a where condition when synchronizing Phoenix data
  - Support Huawei GaussDB reader and writer plugins
  - Fix the ClickReader plugin runtime error "Can't find bundle for base name"
  - Add a DataX debugging module
  - Fix an error on empty ORC files
  - Improve obwriter performance
  - txtfilewriter: add support for exporting data as insert statements
  - HdfsReader/HdfsWriter: support reading and writing Parquet
- datax_v202308 (https://github.com/alibaba/DataX/releases/tag/datax_v202308)
  - OTS plugin update
  - databend plugin update
  - Oceanbase driver fix
- datax_v202306 (https://github.com/alibaba/DataX/releases/tag/datax_v202306)
  - Code cleanup
  - New plugins (neo4jwriter, clickhousewriter)
  - Plugin improvements and bug fixes (oceanbase, hdfs, databend, txtfile)
- datax_v202303 (https://github.com/alibaba/DataX/releases/tag/datax_v202303)
  - Code cleanup
  - New plugins (adbmysqlwriter, databendwriter, selectdbwriter)
  - Plugin improvements and bug fixes (sqlserver, hdfs, cassandra, kudu, oss)
  - fastjson upgraded to fastjson2
- datax_v202210 (https://github.com/alibaba/DataX/releases/tag/datax_v202210)
  - Channel capability updates (OceanBase, Tdengine, Doris, etc.)
- datax_v202209 (https://github.com/alibaba/DataX/releases/tag/datax_v202209)
  - Channel capability updates (MaxCompute, Datahub, SLS, etc.), security vulnerability fixes, general packaging updates, etc.
- datax_v202205 (https://github.com/alibaba/DataX/releases/tag/datax_v202205)
  - Channel capability updates (MaxCompute, Hologres, OSS, Tdengine, etc.), security vulnerability fixes, general packaging updates, etc.
Project Members
Core contributors: 言柏, 枕水, 秋奇, 青砾, 一斅, 云时
Thanks to 天烬, 光戈, 祁然, 巴真, and 静行 for their contributions to DataX.
License
This software is free to use under the Apache License.
Please report issues to us promptly. See: DataxIssue
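Enterprise Users of Open-Source DataX
Long-term hiring. Contact email: datax@alibabacloud.com
[Java Developer Positions]
Title: Senior Java Development Engineer / Expert / Senior Expert
Experience: 2+ years
Education: Bachelor's degree (not a hard requirement if your skills are solid)
Target level: P6/P7/P8
Job description:
1. Design and develop the Alibaba Cloud big data platform (数加).
2. Develop big data products for government and enterprise customers.
3. Use large-scale machine learning algorithms to mine relationships in data, and explore product applications of data mining in real-world scenarios.
4. One-stop big data development platform
5. Big data task scheduling engine
6. Task execution engine
7. Task monitoring and alerting
8. Synchronization of massive heterogeneous data
Requirements:
1. At least 3 years of Java web development experience.
2. Familiarity with core Java technology, including the JVM, class loading, threads, concurrency, IO resource management, and networking.
3. Proficiency with common Java frameworks and a keen awareness of new ones; a deep understanding of object orientation, design principles, encapsulation, and abstraction.
4. Familiarity with HTML/HTML5 and JavaScript, and with SQL.
5. Strong execution, with excellent teamwork and professional dedication.
6. A deep understanding of design patterns and where they apply is a plus.
7. Strong problem analysis and resolution skills and strong hands-on ability; candidates with a strong drive for technology are preferred.
8. Hands-on project and product experience with high concurrency, high stability and availability, high performance, or big data processing is preferred.
9. Experience with big data products, cloud products, or middleware solutions is preferred.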
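User support:
The DingTalk group is currently affected by some access-control policies, so we recommend submitting questions as Issues here first. The DataX developers and community will answer Issues regularly, and a richer knowledge base will also help later users.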