apache/druid

Apache Druid: a high performance real-time analytics database.

Top Related Projects

ClickHouse® is a real-time analytics DBMS

Apache Pinot - A realtime distributed OLAP datastore

Apache Kylin

Apache Doris is an easy-to-use, high performance and unified analytics database.

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.

The official home of the Presto distributed SQL query engine for big data

Quick Overview

Apache Druid is a high-performance, real-time analytics database designed for large-scale data ingestion and fast querying. It's particularly well-suited for OLAP workflows on event data and provides low-latency data exploration, aggregation, and ingestion for big data sets.

Pros

  • Excellent performance for real-time analytics and data exploration
  • Highly scalable and distributed architecture
  • Supports both batch and streaming data ingestion
  • Flexible query language with SQL support

Cons

  • Steep learning curve for setup and configuration
  • Resource-intensive, requiring significant hardware for large deployments
  • Limited support for complex joins and some advanced SQL features
  • Can be challenging to manage and maintain for smaller teams

Code Examples

  1. Querying data using Druid SQL:
SELECT
  __time AS "timestamp",
  COUNT(*) AS event_count
FROM my_datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
ORDER BY 1 ASC

This query counts events over the last day, grouped by timestamp.

  2. Ingesting data using Kafka Indexing Service:
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "my_kafka_datasource",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": ["dimension1", "dimension2"]
        }
      }
    },
    "metricsSpec": [
      { "type": "count", "name": "count" },
      { "type": "doubleSum", "name": "value", "fieldName": "value" }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE"
    }
  },
  "tuningConfig": {
    "type": "kafka"
  },
  "ioConfig": {
    "topic": "my_kafka_topic",
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092"
    }
  }
}

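This supervisor spec sets up Kafka ingestion for streaming data into Druid. It uses the older parser-based layout; recent Druid releases expect timestampSpec and dimensionsSpec directly under dataSchema and an inputFormat in ioConfig, so check the ingestion documentation for the version you run.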
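
As a rough sketch of how a parser-based spec like the one above might be reshaped for a recent release, the Python below moves timestampSpec and dimensionsSpec up into dataSchema, records the data format as an inputFormat in ioConfig, and wraps everything in a "spec" object. The function name modernize_kafka_spec is hypothetical, and the target layout reflects the current supervisor spec format as best understood here; verify it against the documentation for your Druid version.

```python
import copy
import json


def modernize_kafka_spec(legacy: dict) -> dict:
    """Reshape a parser-based Kafka spec into the newer layout (hypothetical helper)."""
    schema = copy.deepcopy(legacy["dataSchema"])
    parse_spec = schema.pop("parser")["parseSpec"]

    # timestampSpec and dimensionsSpec become direct children of dataSchema.
    schema["timestampSpec"] = parse_spec["timestampSpec"]
    schema["dimensionsSpec"] = parse_spec["dimensionsSpec"]

    # The data format is expressed as an inputFormat object inside ioConfig.
    io_config = copy.deepcopy(legacy["ioConfig"])
    io_config["inputFormat"] = {"type": parse_spec["format"]}

    return {
        "type": legacy["type"],
        "spec": {
            "dataSchema": schema,
            "ioConfig": io_config,
            "tuningConfig": copy.deepcopy(legacy["tuningConfig"]),
        },
    }


# A trimmed copy of the spec shown above.
legacy_spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "my_kafka_datasource",
        "parser": {
            "type": "string",
            "parseSpec": {
                "format": "json",
                "timestampSpec": {"column": "timestamp", "format": "auto"},
                "dimensionsSpec": {"dimensions": ["dimension1", "dimension2"]},
            },
        },
        "metricsSpec": [{"type": "count", "name": "count"}],
    },
    "tuningConfig": {"type": "kafka"},
    "ioConfig": {
        "topic": "my_kafka_topic",
        "consumerProperties": {"bootstrap.servers": "localhost:9092"},
    },
}

modern_spec = modernize_kafka_spec(legacy_spec)
print(json.dumps(modern_spec, indent=2))
```

The helper copies its input, so the original spec stays untouched and the two layouts can be compared side by side.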

  3. Creating a rollup query:
SELECT
  TIME_FLOOR(__time, 'PT1H') AS hour,
  dimension1,
  SUM(metric1) AS total_metric1,
  COUNT(DISTINCT user_id) AS unique_users
FROM my_datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY 1, 2

This query performs a rollup aggregation, summarizing data by hour and dimension1 over the last 7 days.
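
To make the grouping concrete, here is a minimal plain-Python sketch of what the query above computes: rows are bucketed by the hour their timestamp falls in, then by dimension1, with a sum and a distinct-user count per group. The sample rows and the function name rollup_by_hour are invented for illustration. The distinct count here is exact, whereas Druid's COUNT(DISTINCT ...) is approximate by default.

```python
from collections import defaultdict
from datetime import datetime

# Invented sample rows: (timestamp, dimension1, metric1, user_id).
events = [
    ("2024-01-01T10:05:00", "a", 3, "u1"),
    ("2024-01-01T10:40:00", "a", 4, "u2"),
    ("2024-01-01T10:59:59", "a", 1, "u1"),
    ("2024-01-01T11:00:00", "a", 5, "u3"),
    ("2024-01-01T10:15:00", "b", 2, "u1"),
]


def rollup_by_hour(rows):
    """Group rows by (hour, dimension1); return {(hour, dim): (sum, unique users)}."""
    groups = defaultdict(lambda: {"total": 0, "users": set()})
    for ts, dim, metric, user in rows:
        # Truncate the timestamp to the start of its hour, like TIME_FLOOR(__time, 'PT1H').
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        group = groups[(hour.isoformat(), dim)]
        group["total"] += metric
        group["users"].add(user)
    return {
        key: (value["total"], len(value["users"]))
        for key, value in sorted(groups.items())
    }


result = rollup_by_hour(events)
for (hour, dim), (total, unique_users) in result.items():
    print(hour, dim, total, unique_users)
# 2024-01-01T10:00:00 a 8 2
# 2024-01-01T10:00:00 b 2 1
# 2024-01-01T11:00:00 a 5 1
```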

Getting Started

  1. Download and extract Apache Druid:

    wget https://archive.apache.org/dist/druid/24.0.0/apache-druid-24.0.0-bin.tar.gz
    tar -xzf apache-druid-24.0.0-bin.tar.gz
    cd apache-druid-24.0.0
    
  2. Start Druid in single-server mode:

    ./bin/start-micro-quickstart
    
  3. Access the Druid console at http://localhost:8888

  4. Load sample data using the web console or API, then start querying using Druid SQL or native queries.
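
Once data is loaded, a query can also be sent over HTTP rather than through the console. The sketch below builds a POST request for Druid's SQL endpoint (/druid/v2/sql) using only the Python standard library. The address assumes the quickstart from step 3 is listening on localhost:8888, and my_datasource is a placeholder datasource name; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumes the local quickstart from the steps above is running on port 8888.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"


def build_sql_request(sql: str, url: str = DRUID_SQL_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a Druid SQL query."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(sql: str, url: str = DRUID_SQL_URL):
    """Send the query and decode the JSON response; needs a running Druid."""
    with urllib.request.urlopen(build_sql_request(sql, url)) as response:
        return json.load(response)


request = build_sql_request("SELECT COUNT(*) AS row_count FROM my_datasource")
print(request.get_method(), request.full_url)
# rows = run_sql("SELECT COUNT(*) AS row_count FROM my_datasource")
```

Building the request is kept separate from sending it so the payload can be inspected without a cluster.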

Competitor Comparisons

ClickHouse® is a real-time analytics DBMS

Pros of ClickHouse

  • Faster query performance for large-scale data analytics
  • More flexible data model, supporting a wider range of data types
  • Better support for real-time data ingestion and updates

Cons of ClickHouse

  • Less mature ecosystem and community support
  • More complex setup and configuration process
  • Limited support for some advanced analytics features

Code Comparison

ClickHouse SQL query:

SELECT EventDate, count() AS count
FROM events
GROUP BY EventDate
ORDER BY EventDate

Druid SQL query:

SELECT TIME_FLOOR(__time, 'P1D') AS EventDate, COUNT(*) AS "count"
FROM events
GROUP BY 1
ORDER BY 1

Both ClickHouse and Druid are powerful columnar databases designed for analytics workloads. ClickHouse excels in raw query performance and flexibility, while Druid offers a more mature ecosystem and easier setup. ClickHouse's SQL syntax is more standard, whereas Druid has some unique features like time-based partitioning built into its query language. The choice between the two often depends on specific use cases and performance requirements.

Apache Pinot - A realtime distributed OLAP datastore

Pros of Pinot

  • Better suited for real-time analytics with low-latency queries
  • More flexible schema design, allowing for dynamic schema changes
  • Supports multi-tenancy out of the box

Cons of Pinot

  • Steeper learning curve due to more complex architecture
  • Requires more resources for optimal performance
  • Less mature ecosystem compared to Druid

Code Comparison

Pinot query example:

SELECT COUNT(*) FROM myTable
WHERE timeColumn BETWEEN 1589674800000 AND 1589761200000
GROUP BY dimension1, dimension2
LIMIT 10

Druid query example:

{
  "queryType": "groupBy",
  "dataSource": "myTable",
  "intervals": ["2020-05-17T00:00:00/2020-05-18T00:00:00"],
  "granularity": "all",
  "dimensions": ["dimension1", "dimension2"],
  "aggregations": [{"type": "count", "name": "count"}],
  "limitSpec": { "type": "default", "limit": 10 }
}

Both Pinot and Druid are analytics databases designed for real-time data ingestion and fast queries. Pinot excels in real-time analytics and offers more flexibility in schema design, while Druid has a more mature ecosystem and may be easier to set up for simpler use cases. The code comparison puts a Pinot SQL query next to Druid's native JSON query format; Druid also accepts SQL, so the same query could be written in Druid SQL.
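
The two examples express the time filter differently: the Pinot query compares against epoch milliseconds, while the Druid native query takes an ISO 8601 interval. A small sketch of converting one into the other is shown below; millis_to_interval is a hypothetical helper and it assumes the epoch values are meant as UTC.

```python
from datetime import datetime, timezone


def millis_to_interval(start_ms: int, end_ms: int) -> str:
    """Format two epoch-millisecond bounds as an ISO 8601 'start/end' interval in UTC."""

    def iso(ms: int) -> str:
        moment = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
        return moment.strftime("%Y-%m-%dT%H:%M:%SZ")

    return f"{iso(start_ms)}/{iso(end_ms)}"


interval = millis_to_interval(1589674800000, 1589761200000)
print(interval)
# 2020-05-17T00:20:00Z/2020-05-18T00:20:00Z
```

Note that a SQL BETWEEN includes both bounds, while a Druid interval excludes its end instant, so the two filters can differ by the rows that fall exactly on the upper bound.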

Apache Kylin

Pros of Kylin

  • Excellent performance for OLAP queries on large datasets
  • Strong support for multidimensional analysis and cube-based computations
  • Integration with various data sources and BI tools

Cons of Kylin

  • Requires pre-computation of cubes, which can be time-consuming
  • Less flexible for ad-hoc queries compared to Druid
  • Steeper learning curve for cube design and optimization

Code Comparison

Kylin query example:

SELECT year, SUM(price) AS total_price
FROM sales_cube
WHERE country = 'USA'
GROUP BY year

Druid query example:

{
  "queryType": "groupBy",
  "dataSource": "sales",
  "intervals": ["2000-01-01/2020-01-01"],
  "filter": { "type": "selector", "dimension": "country", "value": "USA" },
  "aggregations": [{ "type": "longSum", "name": "total_price", "fieldName": "price" }],
  "granularity": "year"
}

Both Kylin and Druid are analytics engines, but they take different approaches. Kylin excels in OLAP scenarios with pre-computed cubes, while Druid offers more flexibility for real-time analytics and ad-hoc queries. The code examples put a Kylin SQL query next to Druid's native JSON query format; Druid also accepts SQL.

Apache Doris is an easy-to-use, high performance and unified analytics database.

Pros of Doris

  • Better performance for real-time analytics on large-scale datasets
  • Simpler architecture and easier deployment compared to Druid
  • Native support for SQL, making it more accessible to analysts

Cons of Doris

  • Less mature ecosystem and community support than Druid
  • Limited support for streaming data ingestion
  • Fewer built-in data rollup and aggregation features

Code Comparison

Doris query example:

SELECT user_id, SUM(order_amount) AS total_amount
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY user_id
HAVING total_amount > 1000;

Druid query example:

{
  "queryType": "groupBy",
  "dataSource": "orders",
  "intervals": ["2023-01-01/2023-12-31"],
  "granularity": "all",
  "dimensions": ["user_id"],
  "aggregations": [
    { "type": "longSum", "name": "total_amount", "fieldName": "order_amount" }
  ],
  "having": {
    "type": "greaterThan",
    "aggregation": "total_amount",
    "value": 1000
  }
}

Both Doris and Druid are powerful analytics databases, but they have different strengths. Doris offers better performance for large-scale analytics and a simpler SQL interface, while Druid provides more advanced features for streaming data and complex aggregations. The choice between them depends on specific use cases and requirements.

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.

Pros of TimescaleDB

  • Built on PostgreSQL, leveraging its robust ecosystem and SQL compatibility
  • Automatic partitioning and scalability for time-series data
  • Supports both time-series and relational data in a single database

Cons of TimescaleDB

  • Limited support for real-time ingestion compared to Druid's streaming capabilities
  • May require more manual optimization for complex queries on large datasets
  • Less flexible for handling high-cardinality data

Code Comparison

TimescaleDB query example:

SELECT time_bucket('1 hour', time) AS hour,
       avg(temperature)
FROM sensor_data
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY hour;

Druid query example:

{
  "queryType": "timeseries",
  "dataSource": "sensor_data",
  "granularity": "hour",
  "intervals": ["2023-05-01/2023-05-02"],
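  "aggregations": [
    {"type": "doubleSum", "name": "temp_sum", "fieldName": "temperature"},
    {"type": "count", "name": "row_count"}
  ],
  "postAggregations": [{"type": "arithmetic", "name": "avg_temp", "fn": "/", "fields": [
    {"type": "fieldAccess", "fieldName": "temp_sum"},
    {"type": "fieldAccess", "fieldName": "row_count"}
  ]}]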
}

Both TimescaleDB and Druid are powerful tools for time-series data analysis, but they have different strengths. TimescaleDB excels in scenarios requiring SQL compatibility and mixed relational/time-series workloads, while Druid shines in real-time analytics and high-cardinality data scenarios.

The official home of the Presto distributed SQL query engine for big data

Pros of Presto

  • Supports a wider range of data sources, including relational databases, NoSQL databases, and file systems
  • Better suited for complex, multi-table joins and analytical queries
  • More flexible query language with full SQL support

Cons of Presto

  • Generally slower for real-time data ingestion and querying compared to Druid
  • Higher memory requirements, especially for large-scale queries
  • Less optimized for time-series data and aggregations

Code Comparison

Presto query example:

SELECT date_trunc('day', timestamp) AS day,
       COUNT(*) AS count
FROM events
WHERE timestamp >= DATE '2023-01-01'
GROUP BY 1
ORDER BY 1

Druid query example:

{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "day",
  "intervals": ["2023-01-01/2023-12-31"],
  "aggregations": [{ "type": "count", "name": "count" }]
}

The code examples highlight two query styles: Presto uses standard SQL, while the Druid example uses the native JSON query format, which is oriented toward time-series data and aggregations. Druid also accepts SQL queries.

README

Apache Druid

Druid is a high performance real-time analytics database. Druid's main value add is to reduce time to insight and action.

Druid is designed for workflows where fast queries and ingest really matter. Druid excels at powering UIs, running operational (ad-hoc) queries, or handling high concurrency. Consider Druid as an open source alternative to data warehouses for a variety of use cases. The design documentation explains the key concepts.

Getting started

You can get started with Druid with our local or Docker quickstart.

Druid provides a rich set of APIs (via HTTP and JDBC) for loading, managing, and querying your data. You can also interact with Druid via the built-in web console (shown below).
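
As one illustration of loading data through the HTTP API, the sketch below prepares a POST request that submits a streaming supervisor spec to the supervisor endpoint (/druid/indexer/v1/supervisor). The address assumes a local quickstart whose Router on port 8888 proxies management requests, the spec shown is a bare placeholder rather than a working one, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed address of a local quickstart Router that forwards management calls.
SUPERVISOR_URL = "http://localhost:8888/druid/indexer/v1/supervisor"


def build_supervisor_request(spec: dict, url: str = SUPERVISOR_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits a supervisor spec."""
    return urllib.request.Request(
        url,
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_supervisor(spec: dict, url: str = SUPERVISOR_URL):
    """Send the spec and decode the JSON reply; needs a running Druid."""
    with urllib.request.urlopen(build_supervisor_request(spec, url)) as response:
        return json.load(response)


# Placeholder only: a real spec also needs ioConfig, tuningConfig, and a full dataSchema.
example_spec = {
    "type": "kafka",
    "spec": {"dataSchema": {"dataSource": "my_kafka_datasource"}},
}
supervisor_request = build_supervisor_request(example_spec)
print(supervisor_request.get_method(), supervisor_request.full_url)
```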

Load data

Load streaming and batch data using a point-and-click wizard to guide you through ingestion setup. Monitor one-off tasks and ingestion supervisors.

Manage the cluster

Manage your cluster with ease. Get a view of your datasources, segments, ingestion tasks, and services from one convenient location. All views are powered by SQL system tables, allowing you to see the underlying query for each view.

Issue queries

Use the built-in query workbench to prototype Druid SQL and native queries, or connect one of the many tools that help you make the most out of Druid.

Documentation

See the latest documentation for the current official release. If you need information on a previous release, you can browse the documentation for previous releases.

Make documentation and tutorials updates in /docs using Markdown or extended Markdown (MDX). Then, open a pull request.

To build the site locally, you need Node 16.14 or higher and to install Docusaurus 2 with npm|yarn install in the website directory. Then you can run npm|yarn start to launch a local build of the docs.

If you're looking to update non-doc pages like Use Cases, those files are in the druid-website-src repo.

Community

Visit the official project community page to read about getting involved in contributing to Apache Druid, and how we help one another use and operate Druid.

  • Druid users can find help in the druid-user mailing list on Google Groups, and have more technical conversations in #troubleshooting on Slack.
  • Druid development discussions take place in the druid-dev mailing list (dev@druid.apache.org). Subscribe by emailing dev-subscribe@druid.apache.org. For live conversations, join the #dev channel on Slack.

Check out the official community page for details of how to join the community Slack channels.

Find articles written by community members and a calendar of upcoming events on the project site - contribute your own events and articles by submitting a PR in the apache/druid-website-src repository.

Building from source

Please note that JDK 11 or JDK 17 is required to build Druid.

See the latest build guide for instructions on building Apache Druid from source.

Contributing

Please follow the community guidelines for contributing.

For instructions on setting up IntelliJ, see dev/intellij-setup.md.

License

Apache License, Version 2.0