drill

Apache Drill is a distributed MPP query layer for self-describing data


Top Related Projects

presto

16,420

The official home of the Presto distributed SQL query engine for big data

spark

41,366

Apache Spark - A unified analytics engine for large-scale data processing

dremio-oss

1,435

Dremio - the missing link in modern data

Quick Overview

Apache Drill is an open-source, schema-free SQL query engine for big data exploration. It is designed to scale to petabytes of data and thousands of machines, and lets users query data in various file formats and data stores (such as JSON, Parquet, and HBase) without defining schemas up front.

Pros

Schema-free querying: Allows querying of complex, nested data without predefined schemas
Support for multiple data sources: Can query data from various sources including Hadoop, NoSQL databases, and cloud storage
ANSI SQL compatibility: Supports standard SQL syntax, making it easy for SQL users to adapt
Low latency: Designed for interactive queries and analytics on large-scale datasets
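
To make "schema-free querying" concrete, here is a minimal sketch in plain Python of the general idea of discovering columns while reading records, rather than declaring them up front. It is only an illustration of the concept and not how Drill is implemented; the sample records and the helper names (flatten, read_schema_free) are made up for this example.

```python
import json

# Hypothetical sample: newline-delimited JSON records with differing shapes.
RAW = """\
{"name": "Ann", "address": {"city": "Oslo"}}
{"name": "Bo", "age": 41}
"""


def flatten(record, prefix=""):
    """Flatten nested objects into dotted column names."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def read_schema_free(text):
    """Parse records and discover the column set while reading."""
    rows = [flatten(json.loads(line)) for line in text.splitlines() if line.strip()]
    columns = sorted({column for row in rows for column in row})
    # A column missing from a record is returned as None, much like SQL NULL.
    return columns, [[row.get(column) for column in columns] for row in rows]


columns, rows = read_schema_free(RAW)
print(columns)
for row in rows:
    print(row)
```

The two records share no declared schema, yet both end up in one table whose columns are the union of the fields seen.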

Cons

Steep learning curve: Can be complex to set up and optimize for specific use cases
Resource intensive: Requires significant memory and CPU resources for optimal performance
No real ACID transaction support: Drill is a query engine rather than a transactional database, so it is not suited to transactional workloads
Smaller community compared to some other big data tools: May result in fewer resources and slower issue resolution

Getting Started

Download and install Apache Drill:

wget https://downloads.apache.org/drill/drill-1.20.0/apache-drill-1.20.0.tar.gz
tar -xvzf apache-drill-1.20.0.tar.gz
cd apache-drill-1.20.0

Start Drill in embedded mode:
```
bin/drill-embedded
```

Run a simple query:

SELECT * FROM cp.`employee.json` LIMIT 5;

This will start Drill and run a query on the sample employee.json file included with Drill. For more advanced usage, refer to the official Apache Drill documentation.
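
The same query can also be submitted from a program. The sketch below assumes Drill is running on the local machine with its web server on the default port 8047, and posts to Drill's REST endpoint /query.json. The helper names (build_query_request, run_query) are made up for illustration, and the address should be adjusted if your setup differs.

```python
import json
import urllib.request

# Assumed default address of Drill's web server on a local install.
DRILL_URL = "http://localhost:8047/query.json"


def build_query_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request carrying one SQL statement."""
    payload = {"queryType": "SQL", "query": sql}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query to Drill and return the decoded JSON response."""
    with urllib.request.urlopen(build_query_request(sql, url)) as response:
        return json.load(response)


# The trailing semicolon from the shell example is left out here.
request = build_query_request("SELECT * FROM cp.`employee.json` LIMIT 5")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Nothing is sent until run_query is called. With Drill running, run_query("SELECT * FROM cp.`employee.json` LIMIT 5") should return a dictionary whose result rows are expected under the "rows" key.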

Competitor Comparisons

calcite

4,874

Apache Calcite

Pros of Calcite

More flexible and extensible SQL parser and optimizer framework
Supports a wider range of data sources and query languages
Easier integration with existing data processing systems

Cons of Calcite

Steeper learning curve due to its more abstract nature
May require more configuration and customization for specific use cases
Less out-of-the-box functionality compared to Drill

Code Comparison

Calcite query optimization example:

RelNode logicalPlan = planner.rel(sqlNode).project();
HepProgram program = new HepProgramBuilder()
    .addRuleInstance(FilterJoinRule.FILTER_ON_JOIN)
    .build();
HepPlanner hepPlanner = new HepPlanner(program);
hepPlanner.setRoot(logicalPlan);
RelNode optimizedPlan = hepPlanner.findBestExp();

Drill query execution example (a simplified illustration; internal class and method names may differ between Drill versions):

QueryWorkUnit workUnit = queryContext.getCurrentWorkUnit();
PhysicalOperator rootOperator = workUnit.getRootOperator();
RootExec rootExec = ImplCreator.getExec(context, rootOperator);
rootExec.setup();
while (rootExec.next()) {
    // Process results
}

Both Calcite and Drill are Apache projects focused on SQL processing and optimization. Calcite provides a more flexible foundation for building query engines, while Drill offers a more complete out-of-the-box solution for distributed query execution. The choice between them depends on specific project requirements and the level of customization needed.

impala

1,229

Apache Impala

Pros of Impala

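Often faster query execution for interactive analytics on Hadoop data, helped by its native C++ execution engine (Drill is also an MPP engine, so MPP alone is not the differentiator)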
Better performance on complex joins and aggregations
Native support for Hadoop file formats and metadata

Cons of Impala

Limited support for user-defined functions compared to Drill
Less flexibility in querying diverse data sources
Requires more memory resources for optimal performance

Code Comparison

Impala SQL query:

SELECT customer_id, SUM(order_total) AS total_sales
FROM orders
WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY customer_id
HAVING total_sales > 1000;

Drill SQL query:

SELECT customer_id, SUM(order_total) AS total_sales
FROM dfs.`/path/to/orders`
WHERE order_date BETWEEN CAST('2023-01-01' AS DATE) AND CAST('2023-12-31' AS DATE)
GROUP BY customer_id
HAVING SUM(order_total) > 1000;

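The main difference between these queries is how the data source is referenced. Impala names a table registered in its metastore, while the Drill query addresses files directly through a storage plugin (dfs) and a path. The Drill version also casts the date literals to DATE explicitly.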
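
The storage plugin plus path form can be sketched as a small helper. This is a hypothetical convenience function written for this page, not part of any Drill client library; it only assembles the reference string used in the Drill examples, with an optional workspace between the plugin name and the path.

```python
def drill_table_ref(path, plugin="dfs", workspace=None):
    """Return a table reference such as dfs.`/path/to/orders`."""
    if "`" in path:
        # Escaping is out of scope for this sketch, so such paths are rejected.
        raise ValueError("backticks in the path are not handled by this sketch")
    parts = [plugin]
    if workspace:
        parts.append(workspace)
    parts.append(f"`{path}`")
    return ".".join(parts)


table = drill_table_ref("/path/to/orders")
print(f"SELECT customer_id FROM {table} LIMIT 10")
```

Rejecting backticks instead of trying to escape them keeps the helper small and avoids guessing at quoting rules.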

Both Drill and Impala are powerful SQL query engines for big data, but they have different strengths. Impala excels in performance for interactive queries on Hadoop data, while Drill offers more flexibility in querying diverse data sources. The choice between them depends on specific use cases and infrastructure requirements.

presto

16,420

The official home of the Presto distributed SQL query engine for big data

Pros of Presto

Better performance for large-scale data processing and analytics
More extensive ecosystem with broader community support
Wider range of connectors and integrations with various data sources

Cons of Presto

Steeper learning curve and more complex setup compared to Drill
Higher resource requirements, especially for memory-intensive operations
Less flexibility for ad-hoc querying of local files and directories

Code Comparison

Presto query example:

SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
HAVING SUM(order_total) > 1000;

Drill query example:

SELECT customer_name, SUM(order_total)
FROM dfs.`/path/to/orders.parquet` orders
JOIN dfs.`/path/to/customers.parquet` customers
  ON orders.customer_id = customers.id
GROUP BY customer_name
HAVING SUM(order_total) > 1000;

Both Presto and Drill use SQL-like syntax for querying data, but Drill's ability to directly query files in various formats (like Parquet in this example) is more apparent in its syntax. Presto, on the other hand, typically works with predefined tables and schemas, which can be more efficient for large-scale data processing but may require additional setup steps.

hive

5,749

Apache Hive

Pros of Hive

Better support for complex data types and nested structures
More mature ecosystem with extensive tooling and integration options
Stronger compatibility with traditional SQL syntax and semantics

Cons of Hive

Generally slower query performance, especially for large-scale data processing
Less flexible for real-time or interactive querying scenarios
More complex setup and configuration process

Code Comparison

Hive query example:

SELECT customer_id, SUM(order_total) AS total_spent
FROM orders
WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY customer_id
HAVING total_spent > 1000;

Drill query example:

SELECT customer_id, SUM(order_total) AS total_spent
FROM dfs.`/path/to/orders`
WHERE order_date BETWEEN CAST('2023-01-01' AS DATE) AND CAST('2023-12-31' AS DATE)
GROUP BY customer_id
HAVING SUM(order_total) > 1000;

Both Hive and Drill support SQL-like syntax, but Drill offers more flexibility in querying various data sources without predefined schemas. Hive requires table definitions and schema management, while Drill can query data directly from files or other sources. Drill's syntax is more similar to standard SQL, whereas Hive has some unique constructs and functions.

spark

41,366

Apache Spark - A unified analytics engine for large-scale data processing

Pros of Spark

Faster processing speed for large-scale data analytics
More extensive ecosystem with libraries for machine learning, graph processing, and streaming
Better support for iterative algorithms and in-memory computing

Cons of Spark

Steeper learning curve, especially for complex use cases
Higher memory requirements, which can be challenging for resource-constrained environments
Less efficient for small datasets or simple queries compared to Drill

Code Comparison

Spark (Scala):

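import spark.implicits._  // enables the $"column" syntax; assumes an existing SparkSession named spark
val df = spark.read.json("data.json")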
df.filter($"age" > 21).groupBy("city").count().show()

Drill (SQL):

SELECT city, COUNT(*) AS city_count
FROM dfs.`data.json`
WHERE age > 21
GROUP BY city
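
Both snippets above compute the same result: the number of people older than 21 in each city. For readers unfamiliar with either engine, the logic can be sketched in plain Python; the sample records and the function name are invented for this illustration.

```python
from collections import Counter

# Hypothetical records standing in for the contents of data.json.
people = [
    {"name": "Ann", "age": 34, "city": "Oslo"},
    {"name": "Bo", "age": 19, "city": "Oslo"},
    {"name": "Cy", "age": 52, "city": "Rome"},
    {"name": "Di", "age": 28, "city": "Oslo"},
]


def count_by_city(records, min_age=21):
    """Filter on age, then group by city and count the rows in each group."""
    return Counter(r["city"] for r in records if r["age"] > min_age)


counts = count_by_city(people)
for city, city_count in sorted(counts.items()):
    print(city, city_count)
```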

Summary

Spark excels in large-scale data processing and advanced analytics, offering a rich ecosystem and faster performance for complex tasks. Drill, on the other hand, provides a more SQL-like experience and can be more efficient for simpler queries or smaller datasets. The choice between the two depends on specific use cases, data sizes, and team expertise.

dremio-oss

1,435

Dremio - the missing link in modern data

Pros of Dremio OSS

More modern architecture with a focus on cloud-native deployments
Advanced data catalog and metadata management capabilities
Better support for data virtualization and data lake analytics

Cons of Dremio OSS

Smaller community and ecosystem compared to Apache Drill
Less mature and potentially less stable than Apache Drill
More complex setup and configuration process

Code Comparison

Apache Drill query example:

SELECT * FROM dfs.`/path/to/data/file.json` WHERE age > 30;

Dremio OSS query example:

SELECT * FROM "MySource"."path"."to"."data"."file.json" WHERE age > 30;

Both projects use SQL-like syntax for querying data, but Dremio OSS typically requires more specific path information due to its data catalog structure.
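
The difference in addressing can be sketched with a small helper that turns a file path into the dotted, double-quoted form used in the Dremio example above. The function is hypothetical and follows only the pattern shown in that example; it is not taken from Dremio's documentation or client libraries.

```python
def dremio_table_ref(source, path):
    """Return a dotted reference such as "MySource"."path"."to"."file.json"."""
    # Each path component becomes its own quoted identifier.
    parts = [source] + [part for part in path.split("/") if part]
    # Embedded double quotes are doubled, as in standard SQL identifiers.
    return ".".join('"' + part.replace('"', '""') + '"' for part in parts)


print(dremio_table_ref("MySource", "/path/to/data/file.json"))
```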

Apache Drill is known for its simplicity and ease of use, especially for ad-hoc queries on various data sources. It excels in querying nested data formats like JSON and Parquet.

Dremio OSS, on the other hand, offers more advanced features for data lake management and optimization, including data reflections for improved query performance and a collaborative workspace for data curation.

While Apache Drill has a larger and more established community, Dremio OSS is gaining traction due to its modern architecture and focus on cloud-native deployments. The choice between the two depends on specific use cases and requirements.


README

Apache Drill

Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. It was inspired in part by Google's Dremel.

Developers

Please read Environment.md for setting up and running Apache Drill. For complete developer documentation see DevDocs.md.

More Information

Please see the Apache Drill Website or the Apache Drill Documentation for more information including:

Remote Execution Installation Instructions
Running Drill on Docker instructions
Information about how to submit logical and distributed physical plans
More example queries and sample data
Find out ways to be involved or discuss Drill

Join the community!

Apache Drill is an Apache Foundation project and is seeking all types of users and contributions. Please say hello on the Apache Drill mailing list. You can also join our Google Hangouts or our Slack channel if you need help with using or developing Apache Drill (more information can be found on the Apache Drill website).

Export Control

This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See http://www.wassenaar.org/ for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Java SE Security packages are used to provide support for authentication, authorization and secure sockets communication.
The Jetty Web Server is used to provide communication via HTTPS.
The Cyrus SASL libraries, Kerberos Libraries and OpenSSL Libraries are used to provide SASL based authentication and SSL communication.
