Top Related Projects
Apache Pinot - A realtime distributed OLAP datastore
Apache Druid: a high performance real-time analytics database.
Apache Doris is an easy-to-use, high performance and unified analytics database.
The official home of the Presto distributed SQL query engine for big data
Apache Calcite
Apache Hive
Quick Overview
Apache Kylin is an open-source distributed analytics engine that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spark for extremely large datasets. It aims to bridge the gap between Big Data and traditional OLAP tools, enabling interactive analytics on massive datasets.
Pros
- Extreme OLAP Engine: Kylin enables sub-second query latency on datasets with trillions of rows
- SQL Interface: Provides a standard SQL interface for querying data, making it accessible to business users
- Seamless integration: Works well with various Big Data and visualization tools in the ecosystem
- Scalability: Designed to handle petabyte-scale datasets efficiently
Cons
- Complex setup: Initial configuration and cube design can be challenging for beginners
- Resource intensive: Building and maintaining cubes can be computationally expensive
- Limited ad-hoc analysis: Requires pre-built cubes, which can limit flexibility for unexpected queries
- Learning curve: Understanding cube design and optimization requires significant effort
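The pre-built cubes mentioned in the cons above can be sketched in a few lines. This is a toy illustration in Python, not Kylin's storage format or build algorithm: it precomputes SUM(sales) for every combination of dimensions (each combination is one cuboid) over a hypothetical in-memory fact table. The rows, the dimension names, and the build_cube helper are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (country, category, year, sales)
ROWS = [
    ("US", "books", 2023, 10.0),
    ("US", "toys", 2023, 5.0),
    ("DE", "books", 2023, 7.0),
    ("US", "books", 2024, 3.0),
]
DIMENSIONS = ("country", "category", "year")


def build_cube(rows, dimensions):
    """Precompute SUM(sales) for every subset of dimensions (every cuboid)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), size):
            cuboid = defaultdict(float)
            for row in rows:
                key = tuple(row[i] for i in dims)
                cuboid[key] += row[-1]  # last field is the measure
            cube[tuple(dimensions[i] for i in dims)] = dict(cuboid)
    return cube


CUBE = build_cube(ROWS, DIMENSIONS)

# "SELECT country, SUM(sales) ... GROUP BY country" becomes a lookup.
by_country = CUBE[("country",)]
print(len(CUBE))
print(by_country)
```

With n dimensions a full cube holds 2^n cuboids, which is one reason cube builds are expensive and why a query on a dimension that was never modeled cannot be answered from the cube.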
Getting Started
To get started with Apache Kylin:
- Download and install Kylin from the official website
- Set up Hadoop and other dependencies
- Configure Kylin by editing the conf/kylin.properties file
- Start the Kylin server: ${KYLIN_HOME}/bin/kylin.sh start
- Access the web interface at http://localhost:7070/kylin
- Create a project, define a data model, and build a cube
- Query your data using SQL through the web interface or JDBC
For detailed instructions, refer to the official Apache Kylin documentation.
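Besides the web interface and JDBC, Kylin releases up to 4.x also document a REST query endpoint. The sketch below only builds the HTTP request and does not send it, so it runs without a server. The /api/query path, the payload fields, and the ADMIN/KYLIN default credentials are taken from the older Kylin documentation and may not apply to Kylin 5; the project name, table name, and build_query_request helper are hypothetical.

```python
import base64
import json
import urllib.request


def build_query_request(sql, project,
                        base_url="http://localhost:7070/kylin",
                        user="ADMIN", password="KYLIN"):
    """Build (but do not send) a POST request for the REST query endpoint."""
    payload = {
        "sql": sql,
        "project": project,  # assumed project name
        "offset": 0,
        "limit": 100,
        "acceptPartial": False,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_query_request("SELECT COUNT(*) FROM my_table",
                              project="my_project")

# To run it against a live server:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```

Check the documentation for your Kylin version before relying on the endpoint or the payload shape.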
Competitor Comparisons
Apache Pinot - A realtime distributed OLAP datastore
Pros of Pinot
- Better real-time analytics capabilities, especially for streaming data
- More flexible schema design and support for nested data structures
- Higher query performance for large-scale datasets
Cons of Pinot
- Steeper learning curve and more complex setup compared to Kylin
- Less mature OLAP cube functionality
- Requires more manual tuning for optimal performance
Code Comparison
Pinot query example:
SELECT dimension1, dimension2, COUNT(*) FROM myTable
WHERE timeColumn BETWEEN ? AND ?
GROUP BY dimension1, dimension2
LIMIT 100
Kylin query example:
SELECT dimension1, dimension2, SUM(metric1)
FROM my_cube
WHERE time_column BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY dimension1, dimension2
Key Differences
- Pinot uses a columnar storage format optimized for real-time ingestion and querying
- Kylin pre-builds OLAP cubes for faster query performance on predefined dimensions
- Pinot supports a wider range of data types and more flexible schema evolution
- Kylin integrates more tightly with the Hadoop ecosystem
Both projects are Apache Software Foundation top-level projects and offer robust solutions for big data analytics, but they cater to slightly different use cases and architectural preferences.
Apache Druid: a high performance real-time analytics database.
Pros of Druid
- Designed for real-time analytics and sub-second query performance
- Highly scalable and can handle massive datasets efficiently
- Supports streaming ingestion and real-time data updates
Cons of Druid
- Steeper learning curve and more complex setup compared to Kylin
- Less optimized for OLAP cube-based analytics
- May require more hardware resources for optimal performance
Code Comparison
Druid query example:
SELECT COUNT(*) AS cnt
FROM my_datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY TIME_FLOOR(__time, 'PT1H')
Kylin query example:
SELECT COUNT(*) AS cnt
FROM my_cube
WHERE part_dt >= CURRENT_DATE - INTERVAL '1' DAY
GROUP BY FLOOR(part_dt TO HOUR)
Both systems use SQL-like syntax, but Druid focuses on time-series data and real-time aggregations, while Kylin is optimized for pre-calculated OLAP cubes.
Summary
Druid excels in real-time analytics and scalability, making it suitable for large-scale streaming data scenarios. Kylin, on the other hand, is better suited for traditional OLAP workloads with pre-calculated cubes. The choice between the two depends on specific use cases, data volumes, and query patterns.
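Both queries in this comparison do the same two things: keep only the trailing day of data, then count rows per hour bucket. A minimal Python sketch of that logic follows. It illustrates the semantics only and says nothing about how either engine executes the query; the event timestamps and helper names are made up.

```python
from collections import Counter
from datetime import datetime, timedelta


def floor_to_hour(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def hourly_counts(events, now, window=timedelta(days=1)):
    """Count events per hour bucket within the trailing window."""
    cutoff = now - window
    return Counter(floor_to_hour(ts) for ts in events if ts >= cutoff)


now = datetime(2023, 6, 2, 12, 0)
events = [
    datetime(2023, 6, 2, 10, 5),
    datetime(2023, 6, 2, 10, 55),
    datetime(2023, 6, 2, 11, 30),
    datetime(2023, 5, 30, 9, 0),  # older than one day, filtered out
]
counts = hourly_counts(events, now)
```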
Apache Doris is an easy-to-use, high performance and unified analytics database.
Pros of Doris
- Better real-time analytics performance, especially for large-scale datasets
- More flexible and scalable architecture, supporting both MPP and vectorized execution
- Easier to deploy and maintain, with a simpler system architecture
Cons of Doris
- Less mature ecosystem compared to Kylin
- Limited support for complex pre-aggregation scenarios
- Steeper learning curve for users familiar with traditional OLAP systems
Code Comparison
Doris query example:
SELECT user_id, SUM(order_amount)
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY user_id;
Kylin query example:
SELECT user_id, SUM(order_amount)
FROM orders_cube
WHERE order_date >= '2023-01-01'
GROUP BY user_id;
The main difference is that Kylin typically uses pre-built cubes (e.g., orders_cube) for faster query performance, while Doris can efficiently query the base table directly. Doris's MPP architecture allows for fast ad-hoc queries without the need for extensive pre-aggregation, making it more flexible for real-time analytics scenarios.
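The MPP idea can be sketched as partial aggregation: split the rows across workers, let each worker compute SUM(order_amount) per user_id on its own partition, then merge the partial results. This is a conceptual Python sketch with hypothetical data, not Doris's execution engine. Python threads are used only to show the shape of the computation and do not provide real CPU parallelism.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical (user_id, order_amount) rows.
ORDERS = [("u1", 20.0), ("u2", 5.0), ("u1", 7.5), ("u3", 1.0), ("u2", 2.5)]


def partial_sum(partition):
    """Aggregate one partition: SUM(order_amount) GROUP BY user_id."""
    totals = Counter()
    for user_id, amount in partition:
        totals[user_id] += amount
    return totals


def mpp_sum(rows, workers=2):
    """Partition the rows, aggregate each partition, merge the partials."""
    partitions = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds values key by key
    return dict(merged)


totals = mpp_sum(ORDERS)
```

Because SUM can be merged from partial sums, the result does not depend on how the rows are partitioned.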
The official home of the Presto distributed SQL query engine for big data
Pros of Presto
- Faster query execution for large-scale data processing
- More flexible architecture supporting various data sources
- Wider adoption and larger community support
Cons of Presto
- Higher memory consumption
- Steeper learning curve for configuration and optimization
- Less optimized for OLAP-specific workloads
Code Comparison
Kylin query example:
SELECT product_category, SUM(price) AS total_price
FROM sales
WHERE country = 'USA'
GROUP BY product_category
Presto query example:
SELECT product_category, SUM(price) AS total_price
FROM hive.default.sales
WHERE country = 'USA'
GROUP BY product_category
Both examples show similar SQL syntax, but Presto's table reference is qualified with the catalog (hive) and schema (default, assumed here) that identify the data source. Kylin typically uses pre-built cubes for faster OLAP queries, while Presto can query various data sources directly.
Key Differences
- Kylin focuses on OLAP workloads with pre-built cubes, while Presto is a more general-purpose SQL query engine
- Presto supports a wider range of data sources out-of-the-box
- Kylin offers better performance for specific OLAP scenarios, while Presto provides more flexibility for diverse query types
Apache Calcite
Pros of Calcite
- More versatile and adaptable to various data processing systems
- Stronger focus on SQL optimization and query planning
- Wider adoption and integration with other Apache projects
Cons of Calcite
- Steeper learning curve due to its complexity
- May require more configuration and setup for specific use cases
- Less out-of-the-box functionality for OLAP-specific operations
Code Comparison
Calcite (SQL parsing):
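import org.apache.calcite.avatica.util.Casing;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;

// Case-insensitive matching; unquoted identifiers are upper-cased.
SqlParser.Config parserConfig = SqlParser.config()
    .withCaseSensitive(false)
    .withQuotedCasing(Casing.UNCHANGED)
    .withUnquotedCasing(Casing.TO_UPPER);
SqlParser parser = SqlParser.create(sql, parserConfig);
SqlNode sqlNode = parser.parseQuery(); // throws SqlParseException on invalid SQL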
Kylin (cube building, simplified pseudocode):
CubeInstance cube = cubeManager.getCube(cubeName);
CubeSegment newSeg = cube.getNextSegment();
CubeBuilder cubeBuilder = new CubeBuilder(cube, newSeg);
cubeBuilder.buildCube(jobId, buildType);
While Calcite focuses on SQL parsing and optimization, Kylin specializes in OLAP cube operations. Calcite provides a more flexible foundation for various data processing tasks, whereas Kylin offers more specific functionality for multidimensional analysis and cube management out of the box.
Apache Hive
Pros of Hive
- Mature and widely adopted data warehousing solution with extensive ecosystem support
- Supports a wide range of data formats and storage systems
- Provides SQL-like query language (HiveQL) for easy data manipulation
Cons of Hive
- Can be slower for real-time queries compared to Kylin's OLAP cube approach
- Requires more manual optimization for complex queries
- Less efficient for high-concurrency scenarios
Code Comparison
Hive query example:
SELECT customer_id, SUM(order_total)
FROM orders
GROUP BY customer_id
HAVING SUM(order_total) > 1000;
Kylin query example:
SELECT customer_id, SUM(order_total)
FROM orders_cube
GROUP BY customer_id
HAVING SUM(order_total) > 1000;
The main difference is that Kylin queries are executed on pre-built OLAP cubes, which can provide faster query performance for complex aggregations and high-concurrency scenarios. Hive queries are executed directly on the raw data, which may require more processing time but offers more flexibility for ad-hoc queries.
Both systems use SQL-like syntax, making it easier for users familiar with traditional databases to work with big data. However, Kylin's approach is more suited for scenarios where query patterns are known in advance and can be optimized through cube design, while Hive is more flexible for exploratory data analysis and ad-hoc querying.
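The point about query patterns being known in advance can be made concrete with a small routing sketch: a query is served from a cube only if some pre-built cuboid contains every dimension the query groups or filters on. The Python below is an illustration, not Kylin's actual cuboid selection logic. The cuboid definitions are hypothetical, and dimension count is used as a stand-in for cuboid size, whereas a real engine would weigh row counts.

```python
def pick_cuboid(cuboids, query_dims):
    """Return the smallest cuboid covering query_dims, or None on a miss."""
    needed = frozenset(query_dims)
    candidates = [c for c in cuboids if needed <= c]
    return min(candidates, key=len, default=None)


# Hypothetical pre-built cuboids for an orders cube.
CUBOIDS = [
    frozenset({"customer_id", "region", "order_month"}),
    frozenset({"customer_id", "region"}),
    frozenset({"customer_id"}),
]

hit = pick_cuboid(CUBOIDS, ["customer_id"])
miss = pick_cuboid(CUBOIDS, ["customer_id", "product_id"])
```

A miss such as the second call is the case where an engine that scans raw data, like Hive, remains the more flexible choice.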
README
Intro to Kylin 5
Comparison with Kylin 4.0
- New metadata design (see the new metadata design article, Chinese version, and the new metadata definition)
- Support Table Index
- Support schema change
- Support computed column
- New CuboidScheduler
- New job engine, among other changes
For more detail, please check our roadmap.
Quick Start
- Build the Maven artifact with the following command:
mvn clean package -DskipTests
- Run the unit tests with the following command:
sh dev-support/unit_testing.sh
- Build a Kylin 5 binary:
./build/release/release.sh