Top Related Projects
- ClickHouse - ClickHouse® is a real-time analytics DBMS
- Apache Pinot - A realtime distributed OLAP datastore
- Apache Kylin
- Apache Doris - An easy-to-use, high-performance, unified analytics database
- TimescaleDB - An open-source time-series SQL database optimized for fast ingest and complex queries, packaged as a PostgreSQL extension
- Presto - The official home of the Presto distributed SQL query engine for big data
Quick Overview
Apache Druid is a high-performance, real-time analytics database designed for large-scale data ingestion and fast querying. It's particularly well-suited for OLAP workflows on event data and provides low-latency data exploration, aggregation, and ingestion for big data sets.
Pros
- Excellent performance for real-time analytics and data exploration
- Highly scalable and distributed architecture
- Supports both batch and streaming data ingestion
- Flexible query language with SQL support
Cons
- Steep learning curve for setup and configuration
- Resource-intensive, requiring significant hardware for large deployments
- Limited support for complex joins and some advanced SQL features
- Can be challenging to manage and maintain for smaller teams
Code Examples
- Querying data using Druid SQL:
SELECT
__time AS timestamp,
COUNT(*) AS event_count
FROM my_datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
ORDER BY 1 ASC
This query counts events over the last day, grouped by timestamp.
- Ingesting data using Kafka Indexing Service:
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "my_kafka_datasource",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": ["dimension1", "dimension2"]
        }
      }
    },
    "metricsSpec": [
      { "type": "count", "name": "count" },
      { "type": "doubleSum", "name": "value", "fieldName": "value" }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE"
    }
  },
  "tuningConfig": {
    "type": "kafka"
  },
  "ioConfig": {
    "topic": "my_kafka_topic",
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092"
    }
  }
}
This configuration sets up a Kafka ingestion task for streaming data into Druid.
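A spec like this is submitted to the Overlord's supervisor API (POST /druid/indexer/v1/supervisor). Here is a minimal Python sketch using only the standard library; the host/port assume a local quickstart deployment, and the network call is shown but not invoked:

```python
import json
from urllib import request

def submit_supervisor(spec: dict, overlord: str = "http://localhost:8081") -> str:
    """POST an ingestion supervisor spec to the Druid Overlord."""
    req = request.Request(
        f"{overlord}/druid/indexer/v1/supervisor",
        data=json.dumps(spec).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # Druid responds with {"id": "..."} on success
        return resp.read().decode()

# Skeleton of the configuration above; fill in dataSchema as shown earlier.
spec = {
    "type": "kafka",
    "dataSchema": {"dataSource": "my_kafka_datasource"},
    "tuningConfig": {"type": "kafka"},
    "ioConfig": {
        "topic": "my_kafka_topic",
        "consumerProperties": {"bootstrap.servers": "localhost:9092"},
    },
}
# submit_supervisor(spec)  # requires a running Druid cluster
```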
- Creating a rollup query:
SELECT
TIME_FLOOR(__time, 'PT1H') AS hour,
dimension1,
SUM(metric1) AS total_metric1,
COUNT(DISTINCT user_id) AS unique_users
FROM my_datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY 1, 2
This query performs a rollup aggregation, summarizing data by hour and dimension1 over the last 7 days.
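The effect of such a rollup can be illustrated in plain Python: floor each event's timestamp to the hour (the equivalent of TIME_FLOOR(__time, 'PT1H')), then aggregate per (hour, dimension) group. The sample rows are invented for illustration:

```python
from collections import defaultdict
from datetime import datetime, timezone

def floor_to_hour(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

# Invented sample events: (timestamp, dimension1, metric1, user_id)
events = [
    (datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc), "a", 3, "u1"),
    (datetime(2024, 1, 1, 10, 40, tzinfo=timezone.utc), "a", 2, "u2"),
    (datetime(2024, 1, 1, 11, 15, tzinfo=timezone.utc), "a", 5, "u1"),
]

# Group by (hour bucket, dimension1); sum metric1 and collect distinct users.
groups = defaultdict(lambda: {"total_metric1": 0, "users": set()})
for ts, dim, metric, user in events:
    g = groups[(floor_to_hour(ts), dim)]
    g["total_metric1"] += metric
    g["users"].add(user)

rollup = {key: (g["total_metric1"], len(g["users"])) for key, g in groups.items()}
# Two hourly buckets for dimension "a": 10:00 -> (5, 2 users), 11:00 -> (5, 1 user)
```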
Getting Started
- Download and extract Apache Druid:
wget https://downloads.apache.org/druid/24.0.0/apache-druid-24.0.0-bin.tar.gz
tar -xzf apache-druid-24.0.0-bin.tar.gz
cd apache-druid-24.0.0
- Start Druid in single-server mode:
./bin/start-micro-quickstart
- Access the Druid console at http://localhost:8888
- Load sample data using the web console or API, then start querying with Druid SQL or native queries.
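Once data is loaded, queries can also be issued programmatically through Druid's SQL HTTP API (POST /druid/v2/sql). A minimal stdlib-only sketch; the URL assumes the quickstart router on localhost:8888, and the datasource name is illustrative:

```python
import json
from urllib import request

def druid_sql(query: str, druid_url: str = "http://localhost:8888") -> list:
    """Run a Druid SQL query via the HTTP API and return rows as dicts."""
    req = request.Request(
        f"{druid_url}/druid/v2/sql",
        data=json.dumps({"query": query, "resultFormat": "object"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode())

sql = """
SELECT __time AS "timestamp", COUNT(*) AS event_count
FROM my_datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
ORDER BY 1
"""
# rows = druid_sql(sql)  # requires a running Druid instance with data loaded
```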
Competitor Comparisons
ClickHouse® is a real-time analytics DBMS
Pros of ClickHouse
- Faster query performance for large-scale data analytics
- More flexible data model, supporting a wider range of data types
- Better support for real-time data ingestion and updates
Cons of ClickHouse
- Less mature ecosystem and community support
- More complex setup and configuration process
- Limited support for some advanced analytics features
Code Comparison
ClickHouse SQL query:
SELECT EventDate, count() AS count
FROM events
GROUP BY EventDate
ORDER BY EventDate
Druid SQL query:
SELECT __time AS EventDate, COUNT(*) AS count
FROM events
GROUP BY 1
ORDER BY 1
Both ClickHouse and Druid are powerful columnar databases designed for analytics workloads. ClickHouse excels in raw query performance and flexibility, while Druid offers a more mature ecosystem and easier setup. ClickHouse's SQL syntax is more standard, whereas Druid has some unique features like time-based partitioning built into its query language. The choice between the two often depends on specific use cases and performance requirements.
Apache Pinot - A realtime distributed OLAP datastore
Pros of Pinot
- Better suited for real-time analytics with low-latency queries
- More flexible schema design, allowing for dynamic schema changes
- Supports multi-tenancy out of the box
Cons of Pinot
- Steeper learning curve due to more complex architecture
- Requires more resources for optimal performance
- Less mature ecosystem compared to Druid
Code Comparison
Pinot query example:
SELECT COUNT(*) FROM myTable
WHERE timeColumn BETWEEN 1589674800000 AND 1589761200000
GROUP BY dimension1, dimension2
LIMIT 10
Druid query example:
{
  "queryType": "groupBy",
  "dataSource": "myTable",
  "intervals": ["2020-05-17T00:00:00/2020-05-18T00:00:00"],
  "granularity": "all",
  "dimensions": ["dimension1", "dimension2"],
  "aggregations": [{ "type": "count", "name": "count" }],
  "limitSpec": { "type": "default", "limit": 10 }
}
Both Pinot and Druid are powerful analytics databases designed for real-time data ingestion and fast queries. Pinot excels in real-time analytics and offers more flexibility in schema design, while Druid has a more mature ecosystem and may be easier to set up for simpler use cases. The code comparison shows that Pinot uses SQL-like syntax for queries, whereas Druid uses JSON-based query specifications.
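Pinot's epoch-millisecond range and Druid's ISO-8601 interval encode the same time window. A small conversion helper, using the millisecond values from the Pinot example above (note they fall 20 minutes past midnight UTC, so the day-boundary interval in the Druid example is an approximation):

```python
from datetime import datetime, timezone

def millis_range_to_interval(start_ms: int, end_ms: int) -> str:
    """Convert an epoch-millisecond range to a Druid ISO-8601 interval string."""
    def fmt(ms: int) -> str:
        return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )
    return f"{fmt(start_ms)}/{fmt(end_ms)}"

interval = millis_range_to_interval(1589674800000, 1589761200000)
# -> "2020-05-17T00:20:00Z/2020-05-18T00:20:00Z"
```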
Apache Kylin
Pros of Kylin
- Excellent performance for OLAP queries on large datasets
- Strong support for multidimensional analysis and cube-based computations
- Integration with various data sources and BI tools
Cons of Kylin
- Requires pre-computation of cubes, which can be time-consuming
- Less flexible for ad-hoc queries compared to Druid
- Steeper learning curve for cube design and optimization
Code Comparison
Kylin query example:
SELECT year, SUM(price) AS total_price
FROM sales_cube
WHERE country = 'USA'
GROUP BY year
Druid query example:
{
  "queryType": "groupBy",
  "dataSource": "sales",
  "intervals": ["2000-01-01/2020-01-01"],
  "filter": { "type": "selector", "dimension": "country", "value": "USA" },
  "aggregations": [{ "type": "longSum", "name": "total_price", "fieldName": "price" }],
  "granularity": "year"
}
Both Kylin and Druid are powerful analytics engines, but they have different approaches. Kylin excels in OLAP scenarios with pre-computed cubes, while Druid offers more flexibility for real-time analytics and ad-hoc queries. The code examples show that Kylin uses SQL-like syntax, whereas Druid uses JSON-based query specifications.
Apache Doris is an easy-to-use, high performance and unified analytics database.
Pros of Doris
- Better performance for real-time analytics on large-scale datasets
- Simpler architecture and easier deployment compared to Druid
- Native support for SQL, making it more accessible to analysts
Cons of Doris
- Less mature ecosystem and community support than Druid
- Limited support for streaming data ingestion
- Fewer built-in data rollup and aggregation features
Code Comparison
Doris query example:
SELECT user_id, SUM(order_amount) AS total_amount
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY user_id
HAVING total_amount > 1000;
Druid query example:
{
  "queryType": "groupBy",
  "dataSource": "orders",
  "intervals": ["2023-01-01/2023-12-31"],
  "granularity": "all",
  "dimensions": ["user_id"],
  "aggregations": [
    { "type": "longSum", "name": "total_amount", "fieldName": "order_amount" }
  ],
  "having": {
    "type": "greaterThan",
    "aggregation": "total_amount",
    "value": 1000
  }
}
Both Doris and Druid are powerful analytics databases, but they have different strengths. Doris offers better performance for large-scale analytics and a simpler SQL interface, while Druid provides more advanced features for streaming data and complex aggregations. The choice between them depends on specific use cases and requirements.
An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
Pros of TimescaleDB
- Built on PostgreSQL, leveraging its robust ecosystem and SQL compatibility
- Automatic partitioning and scalability for time-series data
- Supports both time-series and relational data in a single database
Cons of TimescaleDB
- Limited support for real-time ingestion compared to Druid's streaming capabilities
- May require more manual optimization for complex queries on large datasets
- Less flexible for handling high-cardinality data
Code Comparison
TimescaleDB query example:
SELECT time_bucket('1 hour', time) AS hour,
avg(temperature)
FROM sensor_data
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY hour;
Druid query example:
{
  "queryType": "timeseries",
  "dataSource": "sensor_data",
  "granularity": "hour",
  "intervals": ["2023-05-01/2023-05-02"],
  "aggregations": [{ "type": "doubleMean", "fieldName": "temperature", "name": "avg_temp" }]
}
Both TimescaleDB and Druid are powerful tools for time-series data analysis, but they have different strengths. TimescaleDB excels in scenarios requiring SQL compatibility and mixed relational/time-series workloads, while Druid shines in real-time analytics and high-cardinality data scenarios.
The official home of the Presto distributed SQL query engine for big data
Pros of Presto
- Supports a wider range of data sources, including relational databases, NoSQL databases, and file systems
- Better suited for complex, multi-table joins and analytical queries
- More flexible query language with full SQL support
Cons of Presto
- Generally slower for real-time data ingestion and querying compared to Druid
- Higher memory requirements, especially for large-scale queries
- Less optimized for time-series data and aggregations
Code Comparison
Presto query example:
SELECT date_trunc('day', timestamp) AS day,
COUNT(*) AS count
FROM events
WHERE timestamp >= DATE '2023-01-01'
GROUP BY 1
ORDER BY 1
Druid query example:
{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "day",
  "intervals": ["2023-01-01/2023-12-31"],
  "aggregations": [{ "type": "count", "name": "count" }]
}
The code examples highlight the different query styles: Presto uses standard SQL, while Druid uses a JSON-based query language optimized for time-series data and aggregations.
README
Apache Druid
Druid is a high performance real-time analytics database. Druid's main value add is to reduce time to insight and action.
Druid is designed for workflows where fast queries and ingest really matter. Druid excels at powering UIs, running operational (ad-hoc) queries, or handling high concurrency. Consider Druid as an open source alternative to data warehouses for a variety of use cases. The design documentation explains the key concepts.
Getting started
You can get started with Druid with our local or Docker quickstart.
Druid provides a rich set of APIs (via HTTP and JDBC) for loading, managing, and querying your data. You can also interact with Druid via the built-in web console (shown below).
Load data
Load streaming and batch data using a point-and-click wizard that guides you through ingestion setup. Monitor one-off tasks and ingestion supervisors.
Manage the cluster
Manage your cluster with ease. Get a view of your datasources, segments, ingestion tasks, and services from one convenient location. All powered by SQL systems tables, allowing you to see the underlying query for each view.
Issue queries
Use the built-in query workbench to prototype Druid SQL and native queries, or connect one of the many tools that help you make the most of Druid.
Documentation
See the latest documentation for the current official release. If you need information on a previous release, you can browse the documentation for previous releases.
Make documentation and tutorial updates in /docs using Markdown or extended Markdown (MDX), then open a pull request.
To build the site locally, you need Node 16.14 or higher. Install Docusaurus 2 with npm|yarn install in the website directory, then run npm|yarn start to launch a local build of the docs.
If you're looking to update non-doc pages like Use Cases, those files are in the druid-website-src repo.
Community
Visit the official project community page to read about getting involved in contributing to Apache Druid, and how we help one another use and operate Druid.
- Druid users can find help in the druid-user mailing list on Google Groups, and have more technical conversations in #troubleshooting on Slack.
- Druid development discussions take place in the druid-dev mailing list (dev@druid.apache.org). Subscribe by emailing dev-subscribe@druid.apache.org. For live conversations, join the #dev channel on Slack.
Check out the official community page for details of how to join the community Slack channels.
Find articles written by community members and a calendar of upcoming events on the project site. Contribute your own events and articles by submitting a PR in the apache/druid-website-src repository.
Building from source
Please note that JDK 8 or JDK 11 is required to build Druid.
See the latest build guide for instructions on building Apache Druid from source.
Contributing
Please follow the community guidelines for contributing.
For instructions on setting up IntelliJ, see dev/intellij-setup.md.
License