
apache/doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

12,278
3,199
12,278
2,794

Top Related Projects

13,453

Apache Druid: a high performance real-time analytics database.

5,385

Apache Pinot - A realtime distributed OLAP datastore

ClickHouse® is a real-time analytics DBMS

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.

3,634

Apache Kylin

10,324

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Quick Overview

Apache Doris is an open-source, high-performance analytical database system that provides real-time data warehousing and analytics capabilities. It is designed to handle large-scale data processing and querying with low latency, making it suitable for various business intelligence and data analysis scenarios.

Pros

  • High performance and low latency for analytical queries
  • Scalable and distributed architecture for handling large datasets
  • SQL-compatible, making it easy for users familiar with SQL to adopt
  • Supports real-time data ingestion and updates

Cons

  • Relatively new compared to some established data warehousing solutions
  • Limited ecosystem and third-party tool integration compared to more mature platforms
  • Steeper learning curve for advanced features and optimizations
  • Documentation can be incomplete or outdated in some areas

Getting Started

To get started with Apache Doris, follow these steps:

  1. Download the latest release from the official Apache Doris website.
  2. Extract the downloaded package:
    tar -xzf apache-doris-x.x.x.tar.gz
    
  3. Set up the configuration files in the conf directory.
  4. Start the Doris cluster:
    ./bin/start_fe.sh
    ./bin/start_be.sh
    
  5. Connect to Doris using a MySQL client:
    mysql -h <fe_host> -P <query_port> -u root
    
  6. Create a database and table:
    CREATE DATABASE example_db;
    USE example_db;
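    CREATE TABLE example_table (
      id INT,
      name VARCHAR(50),
      value DOUBLE
    ) ENGINE=OLAP
    DISTRIBUTED BY HASH(id) BUCKETS 10
    -- Assumption: a single-BE test cluster. Drop this clause to keep the default replica count.
    PROPERTIES ("replication_num" = "1");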
    
  7. Load data into the table:
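    LOAD LABEL example_db.label1
    (
      DATA INFILE("hdfs://your-hdfs-path/example_data.csv")
      INTO TABLE example_table
      COLUMNS TERMINATED BY ","
    )
    -- Placeholder connection properties; replace them with your HDFS settings.
    WITH HDFS
    (
      "fs.defaultFS" = "hdfs://your-namenode:port",
      "hadoop.username" = "your-user"
    );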
    

For more detailed instructions and advanced usage, refer to the official Apache Doris documentation.
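The `DISTRIBUTED BY HASH(id) BUCKETS 10` clause in step 6 spreads rows across buckets by hashing the distribution column. The following Python sketch illustrates the idea only. It uses `zlib.crc32` as a stand-in hash, which is an assumption for illustration and not necessarily the function Doris uses internally; `bucket_for` is a hypothetical helper name.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 10  # mirrors BUCKETS 10 in the CREATE TABLE statement


def bucket_for(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a distribution-column value to a bucket index (illustrative hash)."""
    digest = zlib.crc32(str(key).encode("utf-8"))
    return digest % num_buckets


# Rows with the same id always land in the same bucket.
rows = [(i, f"name_{i}", float(i)) for i in range(1000)]
histogram = Counter(bucket_for(row[0]) for row in rows)

# Print how many rows each bucket received.
for bucket in sorted(histogram):
    print(bucket, histogram[bucket])
```

Because the bucket depends only on the key, a lookup or join on `id` needs to touch only the bucket that can contain that key.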

Competitor Comparisons

13,453

Apache Druid: a high performance real-time analytics database.

Pros of Druid

  • Highly scalable and designed for real-time analytics on large datasets
  • Supports sub-second query latency for complex aggregations
  • Offers a rich set of built-in aggregators and post-aggregators

Cons of Druid

  • Steeper learning curve and more complex setup compared to Doris
  • Native queries use a specialized JSON-based language, with SQL support provided through a separate Druid SQL layer
  • Higher resource requirements for optimal performance

Code Comparison

Druid query example:

{
  "queryType": "timeseries",
  "dataSource": "sample_data",
  "granularity": "day",
  "aggregations": [
    { "type": "longSum", "name": "total_sales", "fieldName": "sales" }
  ],
  "intervals": ["2023-01-01/2023-12-31"]
}

Doris query example:

SELECT DATE_TRUNC(event_time, 'day') AS day,
       SUM(sales) AS total_sales
FROM sample_data
WHERE event_time BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY 1
ORDER BY 1;

Both Druid and Doris are powerful analytics databases, but they cater to different use cases. Druid excels in real-time analytics and complex aggregations, while Doris offers a more familiar SQL interface and easier setup. The choice between them depends on specific project requirements and team expertise.

5,385

Apache Pinot - A realtime distributed OLAP datastore

Pros of Pinot

  • Better suited for real-time analytics with low-latency queries
  • Supports a wider range of data ingestion methods, including streaming
  • More flexible schema design and data modeling capabilities

Cons of Pinot

  • Higher complexity in setup and maintenance
  • Less mature ecosystem compared to Doris
  • Steeper learning curve for new users

Code Comparison

Pinot query example:

SELECT COUNT(*) FROM myTable
WHERE timeColumn BETWEEN 1589980800000 AND 1589984400000
GROUP BY dimension1, dimension2
LIMIT 10

Doris query example:

SELECT COUNT(*) FROM myTable
WHERE timeColumn BETWEEN '2020-05-20 13:20:00' AND '2020-05-20 14:20:00'
GROUP BY dimension1, dimension2
LIMIT 10

Both systems use SQL-like syntax for querying, but Pinot often uses millisecond timestamps, while Doris typically uses datetime strings. Pinot's query language also includes some additional features for real-time analytics that may not be present in Doris.
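To move between the two timestamp styles, epoch milliseconds can be converted to datetime strings. A minimal Python sketch, assuming the values are interpreted in UTC (`millis_to_datetime_string` is a hypothetical helper name):

```python
from datetime import datetime, timezone


def millis_to_datetime_string(epoch_millis: int) -> str:
    """Render epoch milliseconds as a 'YYYY-MM-DD HH:MM:SS' string in UTC."""
    moment = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# The bounds used in the Pinot query above.
print(millis_to_datetime_string(1589980800000))  # 2020-05-20 13:20:00
print(millis_to_datetime_string(1589984400000))  # 2020-05-20 14:20:00
```

If the database session uses a different time zone, the string bounds need to be shifted accordingly.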

Overall, Pinot excels in real-time analytics scenarios with its focus on low-latency queries and streaming data ingestion. Doris, on the other hand, offers a more straightforward setup and may be easier for teams already familiar with traditional data warehousing concepts. The choice between the two depends on specific use cases and team expertise.

ClickHouse® is a real-time analytics DBMS

Pros of ClickHouse

  • Extremely fast query performance, especially for analytical workloads
  • Highly scalable and can handle petabytes of data efficiently
  • Rich set of built-in functions and features for data analysis

Cons of ClickHouse

  • Steeper learning curve due to its unique query language and architecture
  • Limited support for real-time updates and deletes
  • Less mature ecosystem compared to some other database systems

Code Comparison

ClickHouse query example:

SELECT
    toYear(date) AS year,
    sum(revenue) AS total_revenue
FROM sales
GROUP BY year
ORDER BY year

Doris query example:

SELECT
    year(date) AS year,
    sum(revenue) AS total_revenue
FROM sales
GROUP BY year
ORDER BY year

Both ClickHouse and Doris are columnar databases designed for analytical workloads, but they have different query syntaxes and optimizations. ClickHouse often uses custom functions like toYear(), while Doris tends to use more standard SQL functions. Both systems aim to provide fast query performance and scalability for large datasets, but ClickHouse is generally considered to have an edge in raw performance for certain types of analytical queries.

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.

Pros of StarRocks

  • Higher query performance, especially for complex analytical queries
  • More advanced vectorized execution engine
  • Better support for real-time analytics and streaming data ingestion

Cons of StarRocks

  • Smaller community and ecosystem compared to Doris
  • Less mature documentation and learning resources
  • Fewer integrations with other data tools and platforms

Code Comparison

StarRocks query example:

SELECT user_id, COUNT(*) as order_count
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY user_id
HAVING order_count > 5
ORDER BY order_count DESC
LIMIT 10;

Doris query example:

SELECT user_id, COUNT(*) as order_count
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY user_id
HAVING order_count > 5
ORDER BY order_count DESC
LIMIT 10;

Both StarRocks and Doris use similar SQL syntax for querying data. The main differences lie in their internal processing and optimization techniques rather than the query language itself. StarRocks may execute this query faster due to its advanced vectorized execution engine, especially for large datasets or complex joins.

3,634

Apache Kylin

Pros of Kylin

  • Mature OLAP solution with a longer history and wider adoption
  • Strong support for multidimensional analysis and cube-based queries
  • Integration with various Hadoop ecosystem components

Cons of Kylin

  • Higher complexity in setup and maintenance
  • Longer query latency for non-cube queries
  • Limited support for real-time data ingestion

Code Comparison

Kylin query example:

SELECT year, SUM(price) AS total_price
FROM sales_fact
JOIN date_dim ON sales_fact.date_key = date_dim.date_key
GROUP BY year

Doris query example:

SELECT year, SUM(price) AS total_price
FROM sales_fact
JOIN date_dim ON sales_fact.date_key = date_dim.date_key
GROUP BY year

Both systems support SQL queries, but Kylin often requires pre-built cubes for optimal performance, while Doris can handle ad-hoc queries more efficiently.

Kylin excels in scenarios with predefined, complex OLAP queries on large datasets, particularly when integrated with Hadoop ecosystems. Doris, on the other hand, offers more flexibility for real-time analytics and ad-hoc queries, with a simpler architecture that's easier to deploy and maintain.

While Kylin provides powerful multidimensional analysis capabilities, Doris focuses on delivering low-latency queries and real-time data ingestion, making it more suitable for scenarios requiring frequent updates and fast query responses.

10,324

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Pros of Trino

  • Broader ecosystem support with connectors for various data sources
  • More mature project with a larger community and extensive documentation
  • Better performance for complex queries across multiple data sources

Cons of Trino

  • Higher resource requirements, especially for memory
  • Steeper learning curve due to more complex architecture
  • Less optimized for OLAP workloads compared to Doris

Code Comparison

Trino SQL query:

SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
HAVING SUM(order_total) > 1000

Doris SQL query:

SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
HAVING SUM(order_total) > 1000

The SQL syntax for both Trino and Doris is similar in this example, as they both support standard SQL. However, the underlying execution and optimization strategies differ, with Doris being more optimized for OLAP workloads and Trino excelling in federated queries across multiple data sources.


README

Apache Doris


Apache Doris is an easy-to-use, high-performance, real-time analytical database based on MPP architecture, known for its speed and ease of use. It returns query results on massive datasets with sub-second response times and supports both high-concurrency point queries and high-throughput complex analysis.

All this makes Apache Doris an ideal tool for scenarios including report analysis, ad-hoc query, unified data warehouse, and data lake query acceleration. On Apache Doris, users can build various applications, such as user behavior analysis, AB test platform, log retrieval analysis, user portrait analysis, and order analysis.

🎉 Version 2.1.4 is now released. Check out the 🔗Release Notes here. The 2.1 version delivers 100% higher out-of-the-box query performance, as shown by TPC-DS 1TB tests; data lake analytics that are 4-6 times faster than Trino and Spark; solid support for semi-structured data analysis with the new Variant type and a suite of analytical functions; asynchronous materialized views for query acceleration; optimized real-time writing at scale; and better workload management with improved stability and runtime SQL resource tracking.

🎉 Version 2.0.12 is now released! This fully evolved and stable release is ready for all users to upgrade. Check out the 🔗Release Notes here.

👀 Have a look at the 🔗Official Website for a comprehensive list of Apache Doris's core features, blogs and user cases.

📈 Usage Scenarios

As shown in the figure below, after various data integration and processing, the data sources are usually stored in the real-time data warehouse Apache Doris and the offline data lake or data warehouse (in Apache Hive, Apache Iceberg or Apache Hudi).



Apache Doris is widely used in the following scenarios:

  • Reporting Analysis

    • Real-time dashboards
    • Reports for in-house analysts and managers
    • Highly concurrent user-oriented or customer-oriented report analysis, such as website analysis and ad reporting, which usually require thousands of QPS and response times measured in milliseconds. For example, the Chinese e-commerce giant JD.com uses Doris for ad reporting, where it ingests 10 billion rows of data per day, handles over 10,000 QPS, and delivers a 99th-percentile query latency of 150 ms.
  • Ad-Hoc Query. Analyst-oriented self-service analytics with irregular query patterns and high throughput requirements. Xiaomi has built a growth analytics platform (Growth Analytics, GA) based on Doris that uses user behavior data for business growth analysis. It serves tens of thousands of SQL queries per day, with an average query latency of 10 seconds and a 95th-percentile query latency of 30 seconds or less.

  • Unified Data Warehouse Construction. Apache Doris allows users to build a unified data warehouse via one single platform and save the trouble of handling complicated software stacks. Chinese hot pot chain Haidilao has built a unified data warehouse with Doris to replace its old complex architecture consisting of Apache Spark, Apache Hive, Apache Kudu, Apache HBase, and Apache Phoenix.

  • Data Lake Query. Apache Doris avoids data copying by federating the data in Apache Hive, Apache Iceberg, and Apache Hudi using external tables, and thus achieves outstanding query performance.

🖥️ Core Concepts

📂 Architecture of Apache Doris

The overall architecture of Apache Doris is shown in the following figure. The Doris architecture is very simple, with only two types of processes.

  • Frontend (FE): user request access, query parsing and planning, metadata management, node management, etc.

  • Backend (BE): data storage and query plan execution

Both types of processes are horizontally scalable, and a single cluster can support up to hundreds of machines and tens of petabytes of storage capacity. Both also use consistency protocols to guarantee high availability of services and high reliability of data. This highly integrated architecture greatly reduces the operation and maintenance cost of a distributed system.


The overall architecture of Apache Doris


In terms of interfaces, Apache Doris adopts the MySQL protocol, supports standard SQL, and is highly compatible with the MySQL dialect. Users can access Doris through various client tools, and it connects seamlessly with BI tools.

💾 Storage Engine

Doris uses a columnar storage engine, which encodes, compresses, and reads data by column. This enables a very high compression ratio and greatly reduces irrelevant data scans, thus making more efficient use of IO and CPU resources. Doris supports various index structures to minimize data scans:

  • Sorted Compound Key Index: Users can specify three columns at most to form a compound sort key. This can effectively prune data to better support highly concurrent reporting scenarios.
  • MIN/MAX Indexing: This enables effective filtering of equivalence and range queries for numeric types.
  • Bloom Filter: This is very effective for equivalence filtering and pruning of high-cardinality columns.
  • Inverted Index: This enables fast search for any field.
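To illustrate how MIN/MAX indexing prunes scans, here is a minimal Python sketch. It is a conceptual model only: the `Block` class, the block size, and the function names are hypothetical and do not reflect the on-disk format Doris uses.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A chunk of one column plus its precomputed min/max summary."""
    values: List[int]
    min_value: int
    max_value: int


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record min/max per block."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def range_scan(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return matching values and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        # Skip the block when its [min, max] range cannot overlap the predicate.
        if block.max_value < low or block.min_value > high:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read


column = list(range(1000))           # sorted data makes pruning most effective
blocks = build_blocks(column, 100)   # 10 blocks of 100 values each
matches, blocks_read = range_scan(blocks, 250, 260)
print(len(matches), blocks_read)     # 11 1
```

Only one of the ten blocks is read for this predicate, which is the effect the index is meant to achieve on sorted or clustered data.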

💿 Storage Models

Doris supports a variety of storage models and has optimized them for different scenarios:

  • Aggregate Key Model: Merges the value columns of rows with the same keys, which significantly improves performance.

  • Unique Key Model: Keys are unique in this model and data with the same key will be overwritten to achieve row-level data updates.

  • Duplicate Key Model: Stores detailed rows as they arrive, without merging, which suits the detailed storage of fact tables.
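The difference between the three models can be sketched by loading the same rows under each of them. This Python sketch is illustrative only: the function names are hypothetical, and SUM is chosen here as the example aggregation for the Aggregate Key Model.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (key, value)


def load_duplicate_key(rows: List[Row]) -> List[Row]:
    """Duplicate Key Model: every row is kept as-is."""
    return list(rows)


def load_unique_key(rows: List[Row]) -> Dict[str, int]:
    """Unique Key Model: a later row overwrites an earlier row with the same key."""
    table: Dict[str, int] = {}
    for key, value in rows:
        table[key] = value
    return table


def load_aggregate_key(rows: List[Row]) -> Dict[str, int]:
    """Aggregate Key Model: values of rows sharing a key are merged (SUM here)."""
    table: Dict[str, int] = {}
    for key, value in rows:
        table[key] = table.get(key, 0) + value
    return table


rows = [("a", 1), ("b", 2), ("a", 5)]
print(load_duplicate_key(rows))  # [('a', 1), ('b', 2), ('a', 5)]
print(load_unique_key(rows))     # {'a': 5, 'b': 2}
print(load_aggregate_key(rows))  # {'a': 6, 'b': 2}
```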

Doris also supports strongly consistent materialized views. Materialized views are automatically selected and updated, which greatly reduces maintenance costs for users.

🔍 Query Engine

Doris adopts the MPP model in its query engine to realize parallel execution between and within nodes. It also supports distributed shuffle join for multiple large tables so as to handle complex queries.


Query Engine


The Doris query engine is vectorized, with all memory structures laid out in a columnar format. This can largely reduce virtual function calls, improve cache hit rates, and make efficient use of SIMD instructions. Doris delivers a 5–10 times higher performance in wide table aggregation scenarios than non-vectorized engines.


Doris query engine


Apache Doris uses Adaptive Query Execution technology to dynamically adjust the execution plan based on runtime statistics. For example, it can generate a runtime filter from the build side of a join, push it to the probe side, and automatically push it down to the Scan node at the bottom of the plan. This drastically reduces the amount of data on the probe side and improves join performance. The runtime filter in Doris supports In, Min/Max, and Bloom filter types.
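The runtime filter idea can be sketched in a few lines of Python. This is a simplified, single-process model, not Doris code: `join_with_runtime_filter` is a hypothetical name, and the filter here combines a min/max range with an IN set built from the join keys of the build side.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (join key, payload)


def join_with_runtime_filter(build_rows: List[Row], probe_rows: List[Row]):
    """Hash join that filters the probe-side scan with build-side keys."""
    # Build phase: hash the (usually smaller) build side.
    hash_table: Dict[int, List[str]] = {}
    for key, payload in build_rows:
        hash_table.setdefault(key, []).append(payload)
    if not hash_table:
        return [], 0

    # Runtime filter derived from the build side: a min/max range plus an IN set.
    low, high = min(hash_table), max(hash_table)
    in_set = set(hash_table)

    # "Scan" the probe side, applying the filter before rows reach the join.
    scanned = [(k, p) for k, p in probe_rows if low <= k <= high and k in in_set]

    # The probe phase only sees the rows that survived the filter.
    joined = [(k, b, p) for k, p in scanned for b in hash_table[k]]
    return joined, len(scanned)


build_rows = [(3, "x"), (7, "y")]
probe_rows = [(i, f"p{i}") for i in range(10)]
joined, rows_scanned = join_with_runtime_filter(build_rows, probe_rows)
print(joined)        # [(3, 'x', 'p3'), (7, 'y', 'p7')]
print(rows_scanned)  # 2
```

Only 2 of the 10 probe-side rows pass the scan, so the join itself does far less work than it would without the filter.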

🚅 Query Optimizer

In terms of optimizers, Doris combines cost-based optimization (CBO) with rule-based optimization (RBO). RBO supports constant folding, subquery rewriting, and predicate pushdown, while CBO supports join reordering. The Doris CBO is under continuous optimization for more accurate statistics collection and derivation, and more accurate cost model prediction.
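As an illustration of one rule-based rewrite, the sketch below applies constant folding to a tiny expression tree. The tuple-based tree format and the `fold` function are assumptions made for this example and are unrelated to how Doris represents plans internally.

```python
import operator

# Supported binary operators for this toy expression language.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(expr):
    """Fold constant subexpressions.

    An expression is an int literal, a column name (str),
    or a tuple of the form (op, left, right).
    """
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold(left), fold(right)
    # Both operands are literals, so evaluate once at planning time.
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)
    return (op, left, right)


# price * (60 * 60 * 24) is rewritten to price * 86400.
expr = ("*", "price", ("*", ("*", 60, 60), 24))
print(fold(expr))  # ('*', 'price', 86400)
```

The rewritten expression computes the constant once during planning instead of once per row during execution.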

Technical Overview: 🔗Introduction to Apache Doris

🎆 Why choose Apache Doris?

  • 🎯 Easy to Use: Two processes, no other dependencies; online cluster scaling, automatic replica recovery; compatible with MySQL protocol, and using standard SQL.

  • 🚀 High Performance: Extremely fast performance for low-latency and high-throughput queries with columnar storage engine, modern MPP architecture, vectorized query engine, pre-aggregated materialized view and data index.

  • 🖥️ Single Unified: A single system can support real-time data serving, interactive data analysis and offline data processing scenarios.

  • ⚛️ Federated Querying: Supports federated querying of data lakes such as Hive, Iceberg, Hudi, and databases such as MySQL and Elasticsearch.

  • ⏩ Various Data Import Methods: Supports batch import from HDFS/S3 and stream import from MySQL Binlog/Kafka; supports micro-batch writing through HTTP interface and real-time writing using Insert in JDBC.

  • 🚙 Rich Ecosystem: Spark uses Spark-Doris-Connector to read and write Doris; Flink-Doris-Connector enables Flink CDC to implement exactly-once data writing to Doris; the DBT Doris Adapter is provided to transform data in Doris with DBT.

🙌 Contributors

Apache Doris graduated from the Apache Incubator and became a Top-Level Project in June 2022.

Currently, the Apache Doris community has gathered more than 400 contributors from nearly 200 companies in different industries, and the number of active contributors is close to 100 per month.

Monthly Active Contributors

Contributor over time

We deeply appreciate 🔗community contributors for their contribution to Apache Doris.

👨‍👩‍👧‍👦 Users

Apache Doris now has a wide user base in China and around the world, and as of today, Apache Doris is used in production environments in thousands of companies worldwide. More than 80% of the top 50 Internet companies in China in terms of market capitalization or valuation have been using Apache Doris for a long time, including Baidu, Meituan, Xiaomi, Jingdong, Bytedance, Tencent, NetEase, Kwai, Sina, 360, Mihoyo, and Ke Holdings. It is also widely used in some traditional industries such as finance, energy, manufacturing, and telecommunications.

The users of Apache Doris: 🔗Users

Add your company logo at Apache Doris Website: 🔗Add Your Company

👣 Get Started

📚 Docs

All Documentation 🔗Docs

⬇️ Download

All releases and binary versions 🔗Download

🗄️ Compile

See how to compile 🔗Compilation

📮 Install

See how to install and deploy 🔗Installation and deployment

🧩 Components

📝 Doris Connector

Doris provides connectors that let Spark and Flink read data stored in Doris and write data to Doris.

🔗apache/doris-flink-connector

🔗apache/doris-spark-connector

🌈 Community and Support

📤 Subscribe Mailing Lists

Mailing lists are the most recognized form of communication in the Apache community. See how to 🔗Subscribe Mailing Lists

🙋 Report Issues or Submit Pull Request

If you have any questions, feel free to file a 🔗GitHub Issue or post in 🔗GitHub Discussion, and fix problems by submitting a 🔗Pull Request

🍻 How to Contribute

We welcome your suggestions, comments (including criticisms), and contributions. See 🔗How to Contribute and 🔗Code Submission Guide

⌨️ Doris Improvement Proposals (DSIP)

🔗Doris Improvement Proposal (DSIP) can be thought of as a collection of design documents for all major feature updates or improvements.

🔑 Backend C++ Coding Specification

🔗 Backend C++ Coding Specification should be strictly followed, which will help us achieve better code quality.

💬 Contact Us

Contact us through the following mailing list.

Name | Scope
dev@doris.apache.org | Development-related discussions (Subscribe, Unsubscribe, Archives)

🧰 Links

📜 License

Apache License, Version 2.0

Note: Some licenses of the third-party dependencies are not compatible with the Apache 2.0 License, so you need to disable some Doris features to comply with it. For details, refer to thirdparty/LICENSE.txt