doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

13,866

3,478

3,213

Top Related Projects

druid

13,782

Apache Druid: a high performance real-time analytics database.

pinot

5,817

Apache Pinot - A realtime distributed OLAP datastore

ClickHouse

42,076

ClickHouse® is a real-time analytics database management system

starrocks

10,221

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

kylin

3,718

Apache Kylin

trino

11,671

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Quick Overview

Apache Doris is an open-source, high-performance analytical database system that provides real-time data warehousing and analytics capabilities. It is designed to handle large-scale data processing and querying with low latency, making it suitable for various business intelligence and data analysis scenarios.

Pros

High performance and low latency for analytical queries
Scalable and distributed architecture for handling large datasets
SQL-compatible, making it easy for users familiar with SQL to adopt
Supports real-time data ingestion and updates

Cons

Relatively new compared to some established data warehousing solutions
Limited ecosystem and third-party tool integration compared to more mature platforms
Steeper learning curve for advanced features and optimizations
Documentation can be incomplete or outdated in some areas

Getting Started

To get started with Apache Doris, follow these steps:

Download the latest release from the official Apache Doris website.
Extract the downloaded package:
```
tar -xzf apache-doris-x.x.x.tar.gz
```
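Set up the configuration files: fe/conf/fe.conf for the Frontend and be/conf/be.conf for the Backend.
Start the Doris cluster (Frontend first, then Backend):
```
./fe/bin/start_fe.sh --daemon
./be/bin/start_be.sh --daemon
```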

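Connect to Doris using a MySQL client (the FE query port defaults to 9030):

```
mysql -h <fe_host> -P <query_port> -u root
```

On a fresh cluster, register each BE with the FE once before creating tables by running ALTER SYSTEM ADD BACKEND "<be_host>:<heartbeat_service_port>"; (the heartbeat port defaults to 9050).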

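Create a database and table:

```
CREATE DATABASE example_db;
USE example_db;
-- replication_num = 1 suits a single-BE test cluster; the default is 3.
CREATE TABLE example_table (
  id INT,
  name VARCHAR(50),
  value DOUBLE
) ENGINE=OLAP
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES ("replication_num" = "1");
```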

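Load data into the table (a Broker Load job; the HDFS settings are placeholders to replace with your own):

```
LOAD LABEL example_db.label1
(
  DATA INFILE("hdfs://your-hdfs-path/example_data.csv")
  INTO TABLE example_table
  COLUMNS TERMINATED BY ","
)
WITH HDFS
(
  "fs.defaultFS" = "hdfs://<namenode_host>:<port>",
  "hadoop.username" = "<hdfs_user>"
);
```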

For more detailed instructions and advanced usage, refer to the official Apache Doris documentation.

Competitor Comparisons

druid

13,782

Apache Druid: a high performance real-time analytics database.

Pros of Druid

Highly scalable and designed for real-time analytics on large datasets
Supports sub-second query latency for complex aggregations
Offers a rich set of built-in aggregators and post-aggregators

Cons of Druid

Steeper learning curve and more complex setup compared to Doris
SQL (Druid SQL) is translated to a native JSON query language, and join support is more limited than in Doris
Higher resource requirements for optimal performance

Code Comparison

Druid query example:

```
{
  "queryType": "timeseries",
  "dataSource": "sample_data",
  "granularity": "day",
  "aggregations": [
    { "type": "longSum", "name": "total_sales", "fieldName": "sales" }
  ],
  "intervals": ["2023-01-01/2023-12-31"]
}
```

Doris query example:

```
-- Half-open range to match Druid's end-exclusive interval
SELECT DATE_TRUNC(event_time, 'day') AS day,
       SUM(sales) AS total_sales
FROM sample_data
WHERE event_time >= '2023-01-01' AND event_time < '2023-12-31'
GROUP BY 1
ORDER BY 1;
```

Both Druid and Doris are powerful analytics databases, but they cater to different use cases. Druid excels in real-time analytics and complex aggregations, while Doris offers a more familiar SQL interface and easier setup. The choice between them depends on specific project requirements and team expertise.

pinot

5,817

Apache Pinot - A realtime distributed OLAP datastore

Pros of Pinot

Better suited for real-time analytics with low-latency queries
Supports a wider range of data ingestion methods, including streaming
More flexible schema design and data modeling capabilities

Cons of Pinot

Higher complexity in setup and maintenance
Less mature ecosystem compared to Doris
Steeper learning curve for new users

Code Comparison

Pinot query example:

```
SELECT COUNT(*) FROM myTable
WHERE timeColumn BETWEEN 1589980800000 AND 1589984400000
GROUP BY dimension1, dimension2
LIMIT 10
```

Doris query example:

```
-- Same one-hour window as the Pinot example, written as UTC datetimes
SELECT COUNT(*) FROM myTable
WHERE timeColumn BETWEEN '2020-05-20 13:20:00' AND '2020-05-20 14:20:00'
GROUP BY dimension1, dimension2
LIMIT 10
```

Both systems use SQL-like syntax for querying, but Pinot often uses millisecond timestamps, while Doris typically uses datetime strings. Pinot's query language also includes some additional features for real-time analytics that may not be present in Doris.

Overall, Pinot excels in real-time analytics scenarios with its focus on low-latency queries and streaming data ingestion. Doris, on the other hand, offers a more straightforward setup and may be easier for teams already familiar with traditional data warehousing concepts. The choice between the two depends on specific use cases and team expertise.

ClickHouse

42,076

ClickHouse® is a real-time analytics database management system

Pros of ClickHouse

Extremely fast query performance, especially for analytical workloads
Highly scalable and can handle petabytes of data efficiently
Rich set of built-in functions and features for data analysis

Cons of ClickHouse

Steeper learning curve due to its own SQL dialect and architecture
Limited support for real-time updates and deletes
Less mature ecosystem compared to some other database systems

Code Comparison

ClickHouse query example:

```
SELECT
    toYear(date) AS year,
    sum(revenue) AS total_revenue
FROM sales
GROUP BY year
ORDER BY year
```

Doris query example:

```
SELECT
    year(date) AS year,
    sum(revenue) AS total_revenue
FROM sales
GROUP BY year
ORDER BY year
```

Both ClickHouse and Doris are columnar databases designed for analytical workloads, but they have different query syntaxes and optimizations. ClickHouse often uses custom functions like toYear(), while Doris tends to use more standard SQL functions. Both systems aim to provide fast query performance and scalability for large datasets, but ClickHouse is generally considered to have an edge in raw performance for certain types of analytical queries.

starrocks

10,221

Pros of StarRocks

Higher query performance, especially for complex analytical queries
More advanced vectorized execution engine
Better support for real-time analytics and streaming data ingestion

Cons of StarRocks

Smaller community and ecosystem compared to Doris
Less mature documentation and learning resources
Fewer integrations with other data tools and platforms

Code Comparison

StarRocks query example:

```
SELECT user_id, COUNT(*) as order_count
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY user_id
HAVING order_count > 5
ORDER BY order_count DESC
LIMIT 10;
```

Doris query example:

```
SELECT user_id, COUNT(*) as order_count
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY user_id
HAVING order_count > 5
ORDER BY order_count DESC
LIMIT 10;
```

Both StarRocks and Doris use similar SQL syntax for querying data. The main differences lie in their internal processing and optimization techniques rather than the query language itself. StarRocks may execute this query faster due to its advanced vectorized execution engine, especially for large datasets or complex joins.

kylin

3,718

Apache Kylin

Pros of Kylin

Mature OLAP solution with a longer history and wider adoption
Strong support for multidimensional analysis and cube-based queries
Integration with various Hadoop ecosystem components

Cons of Kylin

Higher complexity in setup and maintenance
Longer query latency for non-cube queries
Limited support for real-time data ingestion

Code Comparison

Kylin query example:

```
SELECT year, SUM(price) AS total_price
FROM sales_fact
JOIN date_dim ON sales_fact.date_key = date_dim.date_key
GROUP BY year
```

Doris query example:

```
SELECT year, SUM(price) AS total_price
FROM sales_fact
JOIN date_dim ON sales_fact.date_key = date_dim.date_key
GROUP BY year
```

Both systems support SQL queries, but Kylin often requires pre-built cubes for optimal performance, while Doris can handle ad-hoc queries more efficiently.

Kylin excels in scenarios with predefined, complex OLAP queries on large datasets, particularly when integrated with Hadoop ecosystems. Doris, on the other hand, offers more flexibility for real-time analytics and ad-hoc queries, with a simpler architecture that's easier to deploy and maintain.

While Kylin provides powerful multidimensional analysis capabilities, Doris focuses on delivering low-latency queries and real-time data ingestion, making it more suitable for scenarios requiring frequent updates and fast query responses.

trino

11,671

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Pros of Trino

Broader ecosystem support with connectors for various data sources
More mature project with a larger community and extensive documentation
Better performance for complex queries across multiple data sources

Cons of Trino

Higher resource requirements, especially for memory
Steeper learning curve due to more complex architecture
Less optimized for OLAP workloads compared to Doris

Code Comparison

Trino SQL query:

```
SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
HAVING SUM(order_total) > 1000
```

Doris SQL query:

```
SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
HAVING SUM(order_total) > 1000
```

The SQL syntax for both Trino and Doris is similar in this example, as they both support standard SQL. However, the underlying execution and optimization strategies differ, with Doris being more optimized for OLAP workloads and Trino excelling in federated queries across multiple data sources.


README

Apache Doris

Apache Doris is an easy-to-use, high-performance, real-time analytical database based on MPP architecture, known for its speed and ease of use. It returns query results over massive datasets with sub-second response times, and it supports both high-concurrency point queries and high-throughput complex analysis.

All this makes Apache Doris an ideal tool for scenarios including report analysis, ad-hoc queries, unified data warehousing, and data lake query acceleration. On Apache Doris, users can build various applications, such as user behavior analysis, A/B testing platforms, log retrieval analysis, user profile analysis, and order analysis.

Check out All releases, where you'll find a chronological summary of Apache Doris versions released over the past year.

Explore the Official Website to discover Apache Doris's core features, blogs, and user cases in detail.

Usage Scenarios

As shown in the figure below, after various data integration and processing, the data sources are usually stored in the real-time data warehouse Apache Doris and the offline data lake or data warehouse (in Apache Hive, Apache Iceberg or Apache Hudi).

Apache Doris is widely used in the following scenarios:

Real-time Data Analysis:
- Real-time Reporting and Decision-making: Doris provides real-time updated reports and dashboards for both internal and external enterprise use, supporting real-time decision-making in automated processes.
- Ad Hoc Analysis: Doris offers multidimensional data analysis capabilities, enabling rapid business intelligence analysis and ad hoc queries to help users quickly uncover insights from complex data.
- User Profiling and Behavior Analysis: Doris can analyze user behaviors such as participation, retention, and conversion, while also supporting scenarios like population insights and crowd selection for behavior analysis.
Lakehouse Analytics:
- Lakehouse Query Acceleration: Doris accelerates lakehouse data queries with its efficient query engine.
- Federated Analytics: Doris supports federated queries across multiple data sources, simplifying architecture and eliminating data silos.
- Real-time Data Processing: Doris combines real-time data streams and batch data processing capabilities to meet the needs of high concurrency and low-latency complex business requirements.
SQL-based Observability:
- Log and Event Analysis: Doris enables real-time or batch analysis of logs and events in distributed systems, helping to identify issues and optimize performance.

Overall Architecture

Apache Doris uses the MySQL protocol, is highly compatible with MySQL syntax, and supports standard SQL. Users can access Apache Doris through various client tools, and it seamlessly integrates with BI tools.

Storage-Compute Integrated Architecture

The storage-compute integrated architecture of Apache Doris is streamlined and easy to maintain. As shown in the figure below, it consists of only two types of processes:

Frontend (FE): Primarily responsible for handling user requests, query parsing and planning, metadata management, and node management tasks.
Backend (BE): Primarily responsible for data storage and query execution. Data is partitioned into shards and stored with multiple replicas across BE nodes.

The overall architecture of Apache Doris

In a production environment, multiple FE nodes can be deployed for disaster recovery. Each FE node maintains a full copy of the metadata. The FE nodes are divided into three roles:

| Role | Function |
|---|---|
| Master | The FE Master node is responsible for metadata read and write operations. When metadata changes occur in the Master, they are synchronized to Follower or Observer nodes via the BDB JE protocol. |
| Follower | The Follower node is responsible for reading metadata. If the Master node fails, a Follower node can be selected as the new Master. |
| Observer | The Observer node is responsible for reading metadata and is mainly used to increase query concurrency. It does not participate in cluster leadership elections. |

Both FE and BE processes are horizontally scalable, enabling a single cluster to support hundreds of machines and tens of petabytes of storage capacity. The FE and BE processes use a consistency protocol to ensure high availability of services and high reliability of data. The storage-compute integrated architecture is highly integrated, significantly reducing the operational complexity of distributed systems.

Core Features of Apache Doris

High Availability: In Apache Doris, both metadata and data are stored with multiple replicas, synchronizing data logs via the quorum protocol. Data write is considered successful once a majority of replicas have completed the write, ensuring that the cluster remains available even if a few nodes fail. Apache Doris supports both same-city and cross-region disaster recovery, enabling dual-cluster master-slave modes. When some nodes experience failures, the cluster can automatically isolate the faulty nodes, preventing the overall cluster availability from being affected.
High Compatibility: Apache Doris is highly compatible with the MySQL protocol and supports standard SQL syntax, covering most MySQL and Hive functions. This high compatibility allows users to seamlessly migrate and integrate existing applications and tools. Apache Doris supports the MySQL ecosystem, enabling users to connect Doris using MySQL Client tools for more convenient operations and maintenance. It also supports MySQL protocol compatibility for BI reporting tools and data transmission tools, ensuring efficiency and stability in data analysis and data transmission processes.
Real-Time Data Warehouse: Based on Apache Doris, a real-time data warehouse service can be built. Apache Doris offers second-level data ingestion capabilities, capturing incremental changes from upstream online transactional databases into Doris within seconds. Leveraging vectorized engines, MPP architecture, and Pipeline execution engines, Doris provides sub-second data query capabilities, thereby constructing a high-performance, low-latency real-time data warehouse platform.
Unified Lakehouse: Apache Doris can build a unified lakehouse architecture based on external data sources such as data lakes or relational databases. The Doris unified lakehouse solution enables seamless integration and free data flow between data lakes and data warehouses, helping users directly utilize data warehouse capabilities to solve data analysis problems in data lakes while fully leveraging data lake data management capabilities to enhance data value.
Flexible Modeling: Apache Doris offers various modeling approaches, such as wide table models, pre-aggregation models, star/snowflake schemas, etc. During data import, data can be flattened into wide tables and written into Doris through compute engines like Flink or Spark, or data can be directly imported into Doris, performing data modeling operations through views, materialized views, or real-time multi-table joins.
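
The majority-write rule described under High Availability can be illustrated with a toy model. This is a minimal sketch, not Doris code: `quorum_write`, `healthy`, and `failed` are hypothetical names, and each replica is modeled as a plain callable rather than a real BE node.

```python
from typing import Callable, Sequence


def quorum_write(replicas: Sequence[Callable[[bytes], bool]], payload: bytes) -> bool:
    """Return True once a strict majority of replicas acknowledge the write.

    Each replica is modeled as a callable that returns True on success and
    False (or raises) on failure. Hypothetical helper, for illustration only.
    """
    needed = len(replicas) // 2 + 1  # strict majority, e.g. 2 of 3
    acks = 0
    for write in replicas:
        try:
            if write(payload):
                acks += 1
        except Exception:
            continue  # a failed replica does not count toward the quorum
        if acks >= needed:
            return True
    return False


def healthy(_: bytes) -> bool:
    return True


def failed(_: bytes) -> bool:
    raise ConnectionError("replica unreachable")


# With three replicas, one failure is tolerated but two are not.
one_down = quorum_write([healthy, failed, healthy], b"row")
two_down = quorum_write([failed, failed, healthy], b"row")
print(one_down, two_down)
```

With three replicas the write survives a single failed node, which is the availability property the paragraph describes.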

Technical overview

Doris provides an efficient SQL interface and is fully compatible with the MySQL protocol. Its query engine is based on an MPP (Massively Parallel Processing) architecture, capable of efficiently executing complex analytical queries and achieving low-latency real-time queries. Through columnar storage technology for data encoding and compression, it significantly optimizes query performance and storage compression ratio.

Interface

Apache Doris adopts the MySQL protocol, supports standard SQL, and is highly compatible with MySQL syntax. Users can access Apache Doris through various client tools and seamlessly integrate it with BI tools, including but not limited to Smartbi, DataEase, FineBI, Tableau, Power BI, and Apache Superset. Apache Doris can work as the data source for any BI tools that support the MySQL protocol.

Storage engine

Apache Doris has a columnar storage engine, which encodes, compresses, and reads data by column. This enables a very high data compression ratio and largely reduces unnecessary data scanning, thus making more efficient use of IO and CPU resources.

Apache Doris supports various index structures to minimize data scans:

Sorted Compound Key Index: Users can specify three columns at most to form a compound sort key. This can effectively prune data to better support highly concurrent reporting scenarios.
Min/Max Index: This enables effective data filtering in equivalence and range queries of numeric types.
BloomFilter Index: This is very effective in equivalence filtering and pruning of high-cardinality columns.
Inverted Index: This enables fast searching for any field.
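
To make the Min/Max index idea concrete, the sketch below keeps a minimum and maximum per block of a column and skips blocks whose range cannot satisfy a range predicate. It is an illustrative model under assumed names (`Block`, `build_blocks`, `range_scan`) and an assumed block size, not the actual Doris storage format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A chunk of one column plus the min/max summary kept alongside it."""
    values: List[int]
    lo: int
    hi: int


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def range_scan(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return matching values and the number of blocks actually read."""
    hits: List[int] = []
    blocks_read = 0
    for block in blocks:
        if block.hi < low or block.lo > high:
            continue  # the summary proves no value can match; skip the block
        blocks_read += 1
        hits.extend(v for v in block.values if low <= v <= high)
    return hits, blocks_read


# Sorted data makes the summaries selective: only one of four blocks is read.
blocks = build_blocks(list(range(1, 17)), block_size=4)
hits, blocks_read = range_scan(blocks, low=6, high=7)
print(hits, blocks_read)
```

The same pruning is far less effective on unsorted data, because every block's range tends to overlap the predicate. That is one reason a sort key and a min/max summary complement each other.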

Apache Doris supports a variety of data models and has optimized them for different scenarios:

Detail Model (Duplicate Key Model): A detail data model designed to meet the detailed storage requirements of fact tables.
Primary Key Model (Unique Key Model): Ensures unique keys; data with the same key is overwritten, enabling row-level data updates.
Aggregate Model (Aggregate Key Model): Merges value columns with the same key, significantly improving performance through pre-aggregation.
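
The difference between the three models comes down to what happens when two rows share a key. The following sketch mimics that behavior on in-memory rows. The function names are hypothetical, and SUM stands in for whichever aggregation a real Aggregate Key table would declare.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (key, value)


def duplicate_model(rows: List[Row]) -> List[Row]:
    """Detail model: every ingested row is kept as-is."""
    return list(rows)


def unique_model(rows: List[Row]) -> Dict[str, int]:
    """Primary key model: a later row with the same key overwrites the earlier one."""
    table: Dict[str, int] = {}
    for key, value in rows:
        table[key] = value
    return table


def aggregate_model(rows: List[Row]) -> Dict[str, int]:
    """Aggregate model: values sharing a key are merged, here with SUM."""
    table: Dict[str, int] = {}
    for key, value in rows:
        table[key] = table.get(key, 0) + value
    return table


rows = [("a", 1), ("b", 5), ("a", 10)]
print(duplicate_model(rows))   # all three rows survive
print(unique_model(rows))      # key "a" holds the latest value
print(aggregate_model(rows))   # key "a" holds the summed value
```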

Apache Doris also supports strongly consistent single-table materialized views and asynchronously refreshed multi-table materialized views. Single-table materialized views are automatically refreshed and maintained by the system, requiring no manual intervention from users. Multi-table materialized views can be refreshed periodically using in-cluster scheduling or external scheduling tools, reducing the complexity of data modeling.

Query Engine

Apache Doris has an MPP-based query engine for parallel execution between and within nodes. It supports distributed shuffle join for large tables to better handle complicated queries.

The query engine of Apache Doris is fully vectorized, with all memory structures laid out in a columnar format. This largely reduces virtual function calls, increases cache hit rates, and makes efficient use of SIMD instructions. In wide-table aggregation scenarios, Apache Doris delivers 5 to 10 times the performance of non-vectorized engines.

Doris query engine

Apache Doris uses adaptive query execution technology to dynamically adjust the execution plan based on runtime statistics. For example, it can generate a runtime filter and push it to the probe side. Specifically, it pushes the filters to the lowest-level scan node on the probe side, which largely reduces the data amount to be processed and increases join performance. The runtime filter of Apache Doris supports In/Min/Max/Bloom Filter.
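
The runtime filter described above can be sketched as a hash join in which the build side produces an IN-set and min/max bounds that are applied while scanning the probe side. This is a simplified single-process illustration with a hypothetical function name. A real engine would ship the filter across nodes and may use a Bloom filter when the key set is large.

```python
from typing import Dict, Iterable, List, Set, Tuple

BuildRow = Tuple[int, str]    # (join_key, dimension attribute)
ProbeRow = Tuple[int, float]  # (join_key, measure)


def hash_join_with_runtime_filter(
    build_rows: Iterable[BuildRow], probe_rows: Iterable[ProbeRow]
) -> Tuple[List[Tuple[int, str, float]], int]:
    """Join two inputs, filtering the probe scan with keys seen on the build side.

    Returns the joined rows and how many probe rows survived the scan filter.
    """
    # Build phase: hash table plus an IN-set runtime filter and its min/max bounds.
    hash_table: Dict[int, List[str]] = {}
    for key, attr in build_rows:
        hash_table.setdefault(key, []).append(attr)
    in_filter: Set[int] = set(hash_table)
    lo, hi = (min(in_filter), max(in_filter)) if in_filter else (0, -1)

    # Probe-side scan: the filter is applied before rows reach the join operator.
    scanned = [
        (key, measure)
        for key, measure in probe_rows
        if lo <= key <= hi and key in in_filter
    ]

    joined = [
        (key, attr, measure)
        for key, measure in scanned
        for attr in hash_table[key]
    ]
    return joined, len(scanned)


build = [(2, "eu"), (3, "us")]
probe = [(1, 9.5), (2, 1.0), (3, 2.5), (3, 4.0), (7, 8.0)]
joined, rows_after_filter = hash_join_with_runtime_filter(build, probe)
print(joined, rows_after_filter)
```

Of the five probe rows, only the three whose keys exist on the build side leave the scan, so the join operator never sees keys 1 and 7.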

Apache Doris uses a Pipeline execution engine that breaks down queries into multiple sub-tasks for parallel execution, fully leveraging multi-core CPU capabilities. It simultaneously addresses the thread explosion problem by limiting the number of query threads. The Pipeline execution engine reduces data copying and sharing, optimizes sorting and aggregation operations, thereby significantly improving query efficiency and throughput.

In terms of the optimizer, Apache Doris employs a combined optimization strategy of CBO (Cost-Based Optimizer), RBO (Rule-Based Optimizer), and HBO (History-Based Optimizer). RBO supports constant folding, subquery rewriting, predicate pushdown, and more. CBO supports join reordering and other optimizations. HBO recommends the optimal execution plan based on historical query information. These multiple optimization measures ensure that Doris can enumerate high-performance query plans across various types of queries.
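
As a small illustration of one rule-based rewrite named above, the sketch below performs constant folding on a tiny expression tree. The `Const`, `Column`, and `BinOp` classes and the `fold` function are hypothetical and unrelated to the real Doris planner classes. Only `+` and `*` on integers are handled.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str  # "+" or "*"
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Column, BinOp]


def fold(expr: Expr) -> Expr:
    """Rewrite rule: evaluate sub-expressions whose operands are all constants."""
    if isinstance(expr, BinOp):
        left, right = fold(expr.left), fold(expr.right)
        if isinstance(left, Const) and isinstance(right, Const):
            if expr.op == "+":
                return Const(left.value + right.value)
            if expr.op == "*":
                return Const(left.value * right.value)
        return BinOp(expr.op, left, right)
    return expr


# price * (2 + 3)  becomes  price * 5
before = BinOp("*", Column("price"), BinOp("+", Const(2), Const(3)))
after = fold(before)
print(after)
```

The rewrite is applied once at planning time, so the constant sub-expression is not re-evaluated for every row.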

Why choose Apache Doris?

Easy to Use: Two processes, no other dependencies; online cluster scaling and automatic replica recovery; compatible with the MySQL protocol and standard SQL.
High Performance: Extremely fast performance for low-latency and high-throughput queries, with a columnar storage engine, modern MPP architecture, vectorized query engine, pre-aggregated materialized views, and data indexes.
Single Unified: A single system can support real-time data serving, interactive data analysis, and offline data processing scenarios.
Federated Querying: Supports federated querying of data lakes such as Hive, Iceberg, and Hudi, and of databases such as MySQL and Elasticsearch.
Various Data Import Methods: Supports batch import from HDFS/S3 and stream import from MySQL Binlog/Kafka; supports micro-batch writing through the HTTP interface and real-time writing using INSERT over JDBC.
Rich Ecosystem: Spark uses Spark-Doris-Connector to read and write Doris; Flink-Doris-Connector enables Flink CDC to implement exactly-once data writing to Doris; the dbt Doris Adapter is provided to transform data in Doris with dbt.

Contributors

Apache Doris graduated from the Apache Incubator and became a Top-Level Project in June 2022.

We deeply appreciate community contributors for their contributions to Apache Doris.

Apache Doris now has a wide user base in China and around the world, and as of today, Apache Doris is used in production environments in thousands of companies worldwide. More than 80% of the top 50 Internet companies in China in terms of market capitalization or valuation have been using Apache Doris for a long time, including Baidu, Meituan, Xiaomi, Jingdong, Bytedance, Tencent, NetEase, Kwai, Sina, 360, Mihoyo, and Ke Holdings. It is also widely used in some traditional industries such as finance, energy, manufacturing, and telecommunications.

The users of Apache Doris: Users

Add your company logo to the Apache Doris website: Add Your Company

Get Started

Docs

All documentation: Docs

Download

All releases and binary versions: Download

Compile

See how to compile: Compilation

Install

See how to install and deploy: Installation and deployment

Components

Doris Connector

Doris provides connectors that let Spark and Flink read data stored in Doris and write data to Doris.

apache/doris-flink-connector

apache/doris-spark-connector

Community and Support

Subscribe Mailing Lists

The mailing list is the most recognized form of communication in the Apache community. See how to Subscribe Mailing Lists.

Report Issues or Submit Pull Request

If you have any questions, feel free to file a GitHub Issue or post in GitHub Discussion, and fix problems by submitting a Pull Request.

How to Contribute

We welcome your suggestions, comments (including criticisms), and contributions. See How to Contribute and the Code Submission Guide.

Doris Improvement Proposals (DSIP)

Doris Improvement Proposals (DSIP) can be thought of as a collection of design documents for all major feature updates or improvements.

Backend C++ Coding Specification

The Backend C++ Coding Specification should be strictly followed, which helps achieve better code quality.

Contact Us

| Name | Scope | | | |
|---|---|---|---|---|
| dev@doris.apache.org | Development-related discussions | Subscribe | Unsubscribe | Archives |

Links

Apache Doris Official Website - Site
Developer mailing list - dev@doris.apache.org. Send an email to dev-subscribe@doris.apache.org and follow the reply to subscribe to the list.
Slack channel - Join the Slack
Twitter - Follow @doris_apache

License

Apache License, Version 2.0

Note: Some licenses of the third-party dependencies are not compatible with the Apache 2.0 License, so you need to disable some Doris features to comply with it. For details, refer to thirdparty/LICENSE.txt.
