Apache Druid: a high performance real-time analytics database.


Top Related Projects

  • ClickHouse® is a real-time analytics DBMS
  • Apache Pinot - A realtime distributed OLAP datastore
  • Apache Kylin
  • Apache Doris is an easy-to-use, high performance and unified analytics database.
  • TimescaleDB - An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
  • Presto - The official home of the Presto distributed SQL query engine for big data

Quick Overview

Apache Druid is a high-performance, real-time analytics database designed for large-scale data ingestion and fast querying. It's particularly well-suited for OLAP workflows on event data and provides low-latency data exploration, aggregation, and ingestion for big data sets.

Pros

  • Excellent performance for real-time analytics and data exploration
  • Highly scalable and distributed architecture
  • Supports both batch and streaming data ingestion
  • Flexible query language with SQL support

Cons

  • Steep learning curve for setup and configuration
  • Resource-intensive, requiring significant hardware for large deployments
  • Limited support for complex joins and some advanced SQL features
  • Can be challenging to manage and maintain for smaller teams

Code Examples

  1. Querying data using Druid SQL:
SELECT
  __time AS timestamp,
  COUNT(*) AS event_count
FROM my_datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
ORDER BY 1 ASC

This query counts events over the last day, grouped by timestamp.

  2. Ingesting data using Kafka Indexing Service:
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "my_kafka_datasource",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": ["dimension1", "dimension2"]
        }
      }
    },
    "metricsSpec": [
      { "type": "count", "name": "count" },
      { "type": "doubleSum", "name": "value", "fieldName": "value" }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE"
    }
  },
  "tuningConfig": {
    "type": "kafka"
  },
  "ioConfig": {
    "topic": "my_kafka_topic",
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092"
    }
  }
}

This configuration sets up a Kafka ingestion task for streaming data into Druid.
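A spec like this is typically submitted to the Overlord's supervisor endpoint over HTTP. Here is a minimal sketch in Python, assuming a local Overlord on the default port 8081; the helper name is illustrative, but the endpoint path is Druid's standard supervisor API:

```python
import json
import urllib.request

# Hypothetical helper: build the POST request that submits a supervisor
# spec to the Druid Overlord (POST /druid/indexer/v1/supervisor).
def build_supervisor_request(spec, overlord_url="http://localhost:8081"):
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        overlord_url + "/druid/indexer/v1/supervisor",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With a running cluster you would send it like this:
# with urllib.request.urlopen(build_supervisor_request(spec)) as resp:
#     print(resp.read())
```

The same request can of course be made with curl or any HTTP client; the only requirements are the JSON body and the Content-Type header.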

  3. Creating a rollup query:
SELECT
  TIME_FLOOR(__time, 'PT1H') AS hour,
  dimension1,
  SUM(metric1) AS total_metric1,
  COUNT(DISTINCT user_id) AS unique_users
FROM my_datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY 1, 2

This query performs a rollup aggregation, summarizing data by hour and dimension1 over the last 7 days.
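Rollup can be simulated on raw rows to see why it shrinks data: events sharing the same truncated timestamp and dimension values collapse into one aggregated row. A toy sketch with hypothetical events (Druid applies the same idea at ingest or query time):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events: (timestamp, dimension1, metric1, user_id)
events = [
    (datetime(2024, 1, 1, 10, 5),  "web",    3.0, "u1"),
    (datetime(2024, 1, 1, 10, 40), "web",    2.0, "u2"),
    (datetime(2024, 1, 1, 11, 15), "web",    1.0, "u1"),
    (datetime(2024, 1, 1, 10, 20), "mobile", 4.0, "u3"),
]

rollup = defaultdict(lambda: {"total_metric1": 0.0, "users": set()})
for ts, dim, metric, user in events:
    # Equivalent of TIME_FLOOR(__time, 'PT1H'): truncate to the hour.
    hour = ts.replace(minute=0, second=0, microsecond=0)
    rollup[(hour, dim)]["total_metric1"] += metric
    rollup[(hour, dim)]["users"].add(user)

# Four raw events collapse into three aggregated rows.
for (hour, dim), agg in sorted(rollup.items()):
    print(hour.isoformat(), dim, agg["total_metric1"], len(agg["users"]))
```

Note that the distinct-user count cannot be summed further across rows, which is why Druid offers approximate sketches for high-cardinality COUNT(DISTINCT) at scale.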

Getting Started

  1. Download and extract Apache Druid:

    wget https://downloads.apache.org/druid/24.0.0/apache-druid-24.0.0-bin.tar.gz
    tar -xzf apache-druid-24.0.0-bin.tar.gz
    cd apache-druid-24.0.0
    
  2. Start Druid in single-server mode:

    ./bin/start-micro-quickstart
    
  3. Access the Druid console at http://localhost:8888

  4. Load sample data using the web console or API, then start querying using Druid SQL or native queries.
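Druid SQL is also exposed over HTTP: with the quickstart running, queries can be POSTed as JSON to the router's /druid/v2/sql endpoint. A minimal sketch, assuming the default quickstart port 8888 (the helper name and example datasource are illustrative):

```python
import json
import urllib.request

# Hypothetical helper: build a POST request for the Druid SQL HTTP API.
def sql_query_request(sql, router_url="http://localhost:8888"):
    payload = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        router_url + "/druid/v2/sql",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With Druid running and the sample data loaded, you could execute:
# with urllib.request.urlopen(sql_query_request("SELECT COUNT(*) FROM wikipedia")) as resp:
#     print(json.load(resp))
```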

Competitor Comparisons

ClickHouse® is a real-time analytics DBMS

Pros of ClickHouse

  • Faster query performance for large-scale data analytics
  • More flexible data model, supporting a wider range of data types
  • Better support for real-time data ingestion and updates

Cons of ClickHouse

  • Less mature ecosystem and community support
  • More complex setup and configuration process
  • Limited support for some advanced analytics features

Code Comparison

ClickHouse SQL query:

SELECT EventDate, count() AS count
FROM events
GROUP BY EventDate
ORDER BY EventDate

Druid SQL query:

SELECT TIME_FLOOR(__time, 'P1D') AS EventDate, COUNT(*) AS count
FROM events
GROUP BY 1
ORDER BY 1

Both ClickHouse and Druid are powerful columnar databases designed for analytics workloads. ClickHouse excels in raw query performance and flexibility, while Druid offers a more mature ecosystem and easier setup. ClickHouse's SQL syntax is more standard, whereas Druid has some unique features like time-based partitioning built into its query language. The choice between the two often depends on specific use cases and performance requirements.

Apache Pinot - A realtime distributed OLAP datastore

Pros of Pinot

  • Better suited for real-time analytics with low-latency queries
  • More flexible schema design, allowing for dynamic schema changes
  • Supports multi-tenancy out of the box

Cons of Pinot

  • Steeper learning curve due to more complex architecture
  • Requires more resources for optimal performance
  • Less mature ecosystem compared to Druid

Code Comparison

Pinot query example:

SELECT COUNT(*) FROM myTable
WHERE timeColumn BETWEEN 1589674800000 AND 1589761200000
GROUP BY dimension1, dimension2
LIMIT 10

Druid query example:

{
  "queryType": "groupBy",
  "dataSource": "myTable",
  "intervals": ["2020-05-17T00:00:00/2020-05-18T00:00:00"],
  "granularity": "all",
  "dimensions": ["dimension1", "dimension2"],
  "aggregations": [{"type": "count", "name": "count"}],
  "limit": 10
}

Both Pinot and Druid are powerful analytics databases designed for real-time data ingestion and fast queries. Pinot excels in real-time analytics and offers more flexibility in schema design, while Druid has a more mature ecosystem and may be easier to set up for simpler use cases. The code comparison shows that Pinot uses SQL-like syntax for queries, whereas Druid uses JSON-based query specifications.
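One practical difference the examples surface: Pinot's filter uses epoch milliseconds, while Druid's intervals use ISO-8601. Converting the Pinot bounds shows they span exactly one day, starting shortly after midnight UTC on 2020-05-17, so the two ranges are close but not identical:

```python
from datetime import datetime, timezone

# Bounds taken from the Pinot query above.
start_ms, end_ms = 1589674800000, 1589761200000

start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
end = datetime.fromtimestamp(end_ms / 1000, tz=timezone.utc)

print(start.isoformat(), "→", end.isoformat())
print("span in hours:", (end_ms - start_ms) / 3_600_000)  # exactly 24.0
```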

Apache Kylin

Pros of Kylin

  • Excellent performance for OLAP queries on large datasets
  • Strong support for multidimensional analysis and cube-based computations
  • Integration with various data sources and BI tools

Cons of Kylin

  • Requires pre-computation of cubes, which can be time-consuming
  • Less flexible for ad-hoc queries compared to Druid
  • Steeper learning curve for cube design and optimization

Code Comparison

Kylin query example:

SELECT year, SUM(price) AS total_price
FROM sales_cube
WHERE country = 'USA'
GROUP BY year

Druid query example:

{
  "queryType": "groupBy",
  "dataSource": "sales",
  "intervals": ["2000-01-01/2020-01-01"],
  "filter": { "type": "selector", "dimension": "country", "value": "USA" },
  "aggregations": [{ "type": "longSum", "name": "total_price", "fieldName": "price" }],
  "granularity": "year"
}

Both Kylin and Druid are powerful analytics engines, but they have different approaches. Kylin excels in OLAP scenarios with pre-computed cubes, while Druid offers more flexibility for real-time analytics and ad-hoc queries. The code examples show that Kylin uses SQL-like syntax, whereas Druid uses JSON-based query specifications.
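The cube trade-off is easy to see in miniature: aggregates are materialized per dimension combination up front, so a query like the one above becomes a lookup rather than a scan. A toy sketch with hypothetical sales rows:

```python
from collections import defaultdict

# Hypothetical raw fact rows: (year, country, price)
sales = [
    (2019, "USA", 10.0), (2019, "USA", 5.0),
    (2019, "UK", 7.0), (2020, "USA", 8.0),
]

# "Build the cube": pre-aggregate SUM(price) by (year, country) once, at load time.
cube = defaultdict(float)
for year, country, price in sales:
    cube[(year, country)] += price

# The Kylin-style query is now a cheap filtered lookup over pre-aggregated
# cells, not a scan over raw rows:
usa_by_year = {year: total for (year, country), total in cube.items() if country == "USA"}
print(usa_by_year)  # {2019: 15.0, 2020: 8.0}
```

The cost is paid up front: any dimension combination not materialized in the cube cannot be answered from it, which is the flexibility gap relative to Druid's ad-hoc queries.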

Apache Doris is an easy-to-use, high performance and unified analytics database.

Pros of Doris

  • Better performance for real-time analytics on large-scale datasets
  • Simpler architecture and easier deployment compared to Druid
  • Native support for SQL, making it more accessible to analysts

Cons of Doris

  • Less mature ecosystem and community support than Druid
  • Limited support for streaming data ingestion
  • Fewer built-in data rollup and aggregation features

Code Comparison

Doris query example:

SELECT user_id, SUM(order_amount) AS total_amount
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY user_id
HAVING total_amount > 1000;

Druid query example:

{
  "queryType": "groupBy",
  "dataSource": "orders",
  "intervals": ["2023-01-01/2023-12-31"],
  "granularity": "all",
  "dimensions": ["user_id"],
  "aggregations": [
    { "type": "longSum", "name": "total_amount", "fieldName": "order_amount" }
  ],
  "having": {
    "type": "greaterThan",
    "aggregation": "total_amount",
    "value": 1000
  }
}

Both Doris and Druid are powerful analytics databases, but they have different strengths. Doris offers better performance for large-scale analytics and a simpler SQL interface, while Druid provides more advanced features for streaming data and complex aggregations. The choice between them depends on specific use cases and requirements.

TimescaleDB - An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.

Pros of TimescaleDB

  • Built on PostgreSQL, leveraging its robust ecosystem and SQL compatibility
  • Automatic partitioning and scalability for time-series data
  • Supports both time-series and relational data in a single database

Cons of TimescaleDB

  • Limited support for real-time ingestion compared to Druid's streaming capabilities
  • May require more manual optimization for complex queries on large datasets
  • Less flexible for handling high-cardinality data

Code Comparison

TimescaleDB query example:

SELECT time_bucket('1 hour', time) AS hour,
       avg(temperature)
FROM sensor_data
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY hour;

Druid query example:

{
  "queryType": "timeseries",
  "dataSource": "sensor_data",
  "granularity": "hour",
  "intervals": ["2023-05-01/2023-05-02"],
  "aggregations": [{"type": "doubleMean", "fieldName": "temperature", "name": "avg_temp"}]
}

Both TimescaleDB and Druid are powerful tools for time-series data analysis, but they have different strengths. TimescaleDB excels in scenarios requiring SQL compatibility and mixed relational/time-series workloads, while Druid shines in real-time analytics and high-cardinality data scenarios.

The official home of the Presto distributed SQL query engine for big data

Pros of Presto

  • Supports a wider range of data sources, including relational databases, NoSQL databases, and file systems
  • Better suited for complex, multi-table joins and analytical queries
  • More flexible query language with full SQL support

Cons of Presto

  • Generally slower for real-time data ingestion and querying compared to Druid
  • Higher memory requirements, especially for large-scale queries
  • Less optimized for time-series data and aggregations

Code Comparison

Presto query example:

SELECT date_trunc('day', timestamp) AS day,
       COUNT(*) AS count
FROM events
WHERE timestamp >= DATE '2023-01-01'
GROUP BY 1
ORDER BY 1

Druid query example:

{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "day",
  "intervals": ["2023-01-01/2023-12-31"],
  "aggregations": [{ "type": "count", "name": "count" }]
}

The code examples highlight the different query styles: Presto uses standard SQL, while Druid uses a JSON-based query language optimized for time-series data and aggregations.


README



Apache Druid

Druid is a high performance real-time analytics database. Druid's main value add is to reduce time to insight and action.

Druid is designed for workflows where fast queries and ingest really matter. Druid excels at powering UIs, running operational (ad-hoc) queries, or handling high concurrency. Consider Druid as an open source alternative to data warehouses for a variety of use cases. The design documentation explains the key concepts.

Getting started

You can get started with Druid with our local or Docker quickstart.

Druid provides a rich set of APIs (via HTTP and JDBC) for loading, managing, and querying your data. You can also interact with Druid via the built-in web console (shown below).

Load data

data loader Kafka

Load streaming and batch data using a point-and-click wizard to guide you through ingestion setup. Monitor one-off tasks and ingestion supervisors.

Manage the cluster

management

Manage your cluster with ease. Get a view of your datasources, segments, ingestion tasks, and services from one convenient location. All powered by SQL systems tables, allowing you to see the underlying query for each view.

Issue queries

query view combo

Use the built-in query workbench to prototype DruidSQL and native queries or connect one of the many tools that help you make the most out of Druid.

Documentation

See the latest documentation for the current official release. If you need information on a previous release, you can browse the documentation for previous releases.

Make documentation and tutorials updates in /docs using Markdown or extended Markdown (MDX). Then, open a pull request.

To build the site locally, you need Node 16.14 or higher. Install the Docusaurus 2 dependencies with npm install or yarn install in the website directory, then run npm start or yarn start to launch a local build of the docs.

If you're looking to update non-doc pages like Use Cases, those files are in the druid-website-src repo.

Community

Visit the official project community page to read about getting involved in contributing to Apache Druid, and how we help one another use and operate Druid.

  • Druid users can find help in the druid-user mailing list on Google Groups, and have more technical conversations in #troubleshooting on Slack.
  • Druid development discussions take place in the druid-dev mailing list (dev@druid.apache.org). Subscribe by emailing dev-subscribe@druid.apache.org. For live conversations, join the #dev channel on Slack.

Check out the official community page for details of how to join the community Slack channels.

Find articles written by community members and a calendar of upcoming events on the project site - contribute your own events and articles by submitting a PR in the apache/druid-website-src repository.

Building from source

Please note that JDK 8 or JDK 11 is required to build Druid.

See the latest build guide for instructions on building Apache Druid from source.

Contributing

Please follow the community guidelines for contributing.

For instructions on setting up IntelliJ, see dev/intellij-setup.md.

License

Apache License, Version 2.0