Top Related Projects
Apache Drill is a distributed MPP query layer for self describing data
The official home of the Presto distributed SQL query engine for big data
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Apache Spark - A unified analytics engine for large-scale data processing
Apache Hive
Apache Impala
Quick Overview
Dremio is an open-source data lake engine that provides fast, self-service data access to data lakes and other data sources. It enables data analysts and data scientists to query data directly from various sources without the need for complex ETL processes or data warehousing.
Pros
- Supports multiple data sources, including S3, HDFS, and relational databases
- Provides a SQL interface for querying data across different sources
- Offers data acceleration and caching capabilities for improved performance
- Includes a user-friendly interface for data exploration and visualization
Cons
- Can be complex to set up and configure for optimal performance
- May require significant resources for large-scale deployments
- Limited support for real-time data processing
- Learning curve for users unfamiliar with data lake concepts
Getting Started
To get started with Dremio OSS:
- Download the latest Dremio OSS release from the GitHub repository.
- Extract the downloaded archive to a directory of your choice.
- Navigate to the extracted directory and run the following command:
./bin/dremio start
- Open a web browser and go to http://localhost:9047 to access the Dremio UI.
- Follow the setup wizard to configure your first data source and start exploring data.
For more detailed instructions, refer to the official Dremio documentation.
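The steps above can also be driven programmatically. The sketch below builds requests for Dremio's REST API (login at POST /apiv2/login, query submission at POST /api/v3/sql, with the session token passed as _dremio{token} in the Authorization header). It assumes a server on the default port and the endpoints of recent Dremio versions, so verify the paths against your server's documentation before relying on them.

```python
import json
import urllib.request

BASE = "http://localhost:9047"  # default Dremio UI/API port

def login_request(user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the login request: POST /apiv2/login."""
    body = json.dumps({"userName": user, "password": password}).encode()
    return urllib.request.Request(
        f"{BASE}/apiv2/login", data=body,
        headers={"Content-Type": "application/json"}, method="POST")

def sql_request(token: str, sql: str) -> urllib.request.Request:
    """Build (but do not send) a query submission: POST /api/v3/sql."""
    body = json.dumps({"sql": sql}).encode()
    return urllib.request.Request(
        f"{BASE}/api/v3/sql", data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"_dremio{token}"}, method="POST")

# Against a live server (assumes the default first-user credentials):
# token = json.load(urllib.request.urlopen(login_request("dremio", "dremio123")))["token"]
# job = json.load(urllib.request.urlopen(sql_request(token, "SELECT 1")))
```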
Competitor Comparisons
Apache Drill is a distributed MPP query layer for self describing data
Pros of Drill
- Fully open-source Apache project with a large community
- Supports a wider range of data sources out-of-the-box
- More flexible query execution model for complex analytics
Cons of Drill
- Less user-friendly interface compared to Dremio
- Slower query performance on certain workloads
- Lacks built-in data catalog and governance features
Code Comparison
Drill query example:
SELECT * FROM dfs.`/path/to/data/file.json` WHERE age > 30;
Dremio query example:
SELECT * FROM "My Source"."file.json" WHERE age > 30;
Both Drill and Dremio use SQL syntax for querying data, but Dremio's approach is more intuitive with its virtual dataset concept. Drill requires specifying the storage plugin (e.g., dfs) and the full file path, while Dremio uses a more familiar database-like structure.
Drill and Dremio are both powerful query engines for distributed data analysis. Drill offers more flexibility and a wider range of data sources, making it suitable for complex analytics scenarios. Dremio, on the other hand, provides a more user-friendly experience with better performance for certain workloads and additional features for data management and governance.
The official home of the Presto distributed SQL query engine for big data
Pros of Presto
- More mature and widely adopted in the industry
- Supports a broader range of data sources out-of-the-box
- Highly scalable for large-scale data processing
Cons of Presto
- Requires more setup and configuration
- Less user-friendly for non-technical users
- Limited built-in data visualization capabilities
Code Comparison
Presto SQL query:
SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
HAVING SUM(order_total) > 1000;
Dremio SQL query:
SELECT customer_name, SUM(order_total)
FROM @"Sales"."Orders" orders
JOIN @"Sales"."Customers" customers ON orders.customer_id = customers.id
GROUP BY customer_name
HAVING SUM(order_total) > 1000;
The main difference in the code is Dremio's use of virtual datasets, referenced here with the @ prefix (which in Dremio denotes a user's home space). Presto uses traditional catalog and schema table references, while Dremio's approach allows for easier data virtualization and management.
Both Presto and Dremio-OSS are powerful SQL query engines for big data analytics. Presto excels in performance and scalability for large-scale data processing, while Dremio-OSS offers a more user-friendly interface and built-in data curation features. The choice between the two depends on specific use cases, technical expertise, and data management requirements.
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Pros of Arrow
- Broader scope and applicability across various data processing systems
- More active community with frequent contributions and updates
- Extensive language support including C++, Python, R, and more
Cons of Arrow
- Steeper learning curve for newcomers due to its low-level nature
- Less out-of-the-box functionality compared to Dremio's complete data lake engine
Code Comparison
Arrow (C++ example):
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Write a table to a file in the Arrow IPC file format (error handling elided).
std::shared_ptr<arrow::Table> table = /* ... a previously built table ... */;
auto output = arrow::io::FileOutputStream::Open("data.arrow").ValueOrDie();
auto writer = arrow::ipc::MakeFileWriter(output, table->schema()).ValueOrDie();
writer->WriteTable(*table).ok();
writer->Close().ok();
Dremio (Java example):
// Illustrative use of Dremio's internal storage plugin API;
// exact constructor signatures vary between Dremio versions.
import com.dremio.exec.store.dfs.FileSystemPlugin;

FileSystemPlugin plugin = new FileSystemPlugin(config, context, "dfs");
plugin.start();
Summary
Arrow is a more versatile and widely-adopted project for in-memory data representation, while Dremio OSS provides a complete data lake engine built on top of Arrow. Arrow offers greater flexibility and language support, but Dremio OSS provides more immediate functionality for data lake management. The choice between them depends on specific project requirements and the level of customization needed.
Apache Spark - A unified analytics engine for large-scale data processing
Pros of Spark
- Mature ecosystem with extensive libraries and integrations
- Powerful distributed computing capabilities for big data processing
- Strong community support and active development
Cons of Spark
- Steeper learning curve, especially for complex use cases
- Resource-intensive, requiring significant cluster resources
- Can be overkill for smaller datasets or simpler analytics tasks
Code Comparison
Spark (Scala):
import org.apache.spark.sql.functions.{avg, desc}

val df = spark.read.json("data.json")
df.groupBy("category").agg(avg("price").alias("avg_price"))
  .orderBy(desc("avg_price"))
  .show()
Dremio (SQL):
SELECT category, AVG(price) AS avg_price
FROM "data.json"
GROUP BY category
ORDER BY avg_price DESC
Key Differences
- Spark offers a more programmatic approach with support for multiple languages
- Dremio provides a SQL-first experience, making it more accessible for SQL users
- Spark excels in complex data processing and machine learning tasks
- Dremio focuses on data virtualization and query acceleration
Both projects have their strengths, with Spark being more suitable for advanced big data processing and Dremio offering easier data access and management for business intelligence use cases.
Apache Hive
Pros of Hive
- Mature and widely adopted in the Hadoop ecosystem
- Strong support for SQL-like queries on large datasets
- Integrates well with other Apache big data tools
Cons of Hive
- Can be slower for real-time queries compared to Dremio
- Less user-friendly interface and setup process
- Limited support for modern data formats and cloud-native architectures
Code Comparison
Hive query example:
SELECT customer_id, SUM(order_total)
FROM orders
GROUP BY customer_id
HAVING SUM(order_total) > 1000;
Dremio query example:
SELECT customer_id, SUM(order_total)
FROM "Sales"."Orders"
GROUP BY customer_id
HAVING SUM(order_total) > 1000;
Both Hive and Dremio support SQL-like syntax, but Dremio offers a more modern approach with its data lake engine and support for various data sources. Hive is tightly integrated with Hadoop, while Dremio provides a more flexible architecture for working with diverse data ecosystems. Dremio also offers features like data curation and acceleration that are not natively available in Hive.
Apache Impala
Pros of Impala
- Mature and battle-tested in production environments
- Tightly integrated with the Hadoop ecosystem
- Supports a wide range of file formats and storage systems
Cons of Impala
- Limited support for complex data types and nested structures
- Requires Hadoop infrastructure, which can be complex to set up and maintain
- Less flexible in terms of data source connectivity compared to Dremio
Code Comparison
Impala SQL query:
SELECT customer_id, SUM(order_total) AS total_sales
FROM orders
WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY customer_id
HAVING total_sales > 1000;
Dremio SQL query:
SELECT customer_id, SUM(order_total) AS total_sales
FROM "Sales"."Orders"
WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY customer_id
HAVING total_sales > 1000;
Both Impala and Dremio use SQL syntax for querying data. The main difference in these examples is the table reference format: Dremio uses a more flexible naming convention with double-quoted identifiers and dot notation.
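Dremio's quoted, dot-separated references can be assembled mechanically. The helper below is illustrative (not part of any Dremio API): it wraps each path component in double quotes, doubling any embedded quotes, following the usual SQL identifier-escaping convention.

```python
def dremio_table_ref(*parts: str) -> str:
    """Build a Dremio-style table reference such as "Sales"."Orders"."""
    # Double any embedded double quotes, then wrap each component in quotes.
    return ".".join('"' + p.replace('"', '""') + '"' for p in parts)

print(dremio_table_ref("Sales", "Orders"))  # "Sales"."Orders"
```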
README
Dremio
Dremio enables organizations to unlock the value of their data.
Documentation
Documentation is available at https://docs.dremio.com.
Quickstart: How to build and run Dremio
(a) Prerequisites
- JDK 11 (OpenJDK or Oracle) as the default JDK (JAVA_HOME set to it)
- JDK 17 (OpenJDK or Oracle) in the Maven toolchain, required to run certain integration tests
- (Optional) Maven 3.9.3 or later (using Homebrew: brew install maven)
Run the following commands to verify that you have the correct versions of Maven and JDK installed:
java -version
mvn --version
Add JDK 17 to the Maven toolchain; the easiest approach is a ${HOME}/.m2/toolchains.xml file. Example:
<?xml version="1.0" encoding="UTF-8"?>
<toolchains>
<toolchain>
<type>jdk</type>
<provides>
<version>11</version>
<vendor>sun</vendor>
</provides>
<configuration>
<jdkHome>FULL_PATH_TO_YOUR_JAVA_11_HOME</jdkHome>
</configuration>
</toolchain>
<toolchain>
<type>jdk</type>
<provides>
<version>17</version>
<vendor>sun</vendor>
</provides>
<configuration>
<jdkHome>FULL_PATH_TO_YOUR_JAVA_17_HOME</jdkHome>
</configuration>
</toolchain>
</toolchains>
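As a quick sanity check, you can parse a toolchains.xml and confirm that entries for both JDK versions are present. The snippet below is a sketch using only the standard library; pass it the contents of your ${HOME}/.m2/toolchains.xml (it assumes the un-namespaced layout shown above).

```python
import xml.etree.ElementTree as ET

def jdk_versions(toolchains_xml: str) -> list:
    """Return the JDK versions declared in a Maven toolchains.xml string."""
    root = ET.fromstring(toolchains_xml)
    return [t.findtext("provides/version")
            for t in root.findall("toolchain")
            if t.findtext("type") == "jdk"]

# Example usage against the real file:
# import os
# with open(os.path.expanduser("~/.m2/toolchains.xml")) as f:
#     versions = jdk_versions(f.read())
#     assert "11" in versions and "17" in versions
```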
(b) Clone the Repository
git clone https://github.com/dremio/dremio-oss.git dremio
(c) Build the Code
cd dremio
mvn clean install -DskipTests (or ./mvnw clean install -DskipTests if Maven is not installed on the machine)
The "-DskipTests" option skips most of the tests. Running all tests takes a long time.
(d) Run/Install
Run
distribution/server/target/dremio-community-{DREMIO_VERSION}/dremio-community-{DREMIO_VERSION}/bin/dremio start
OR to start a server with a default user (dremio/dremio123)
mvn compile exec:exec -pl dac/daemon
Once run, the UI is accessible at:
http://localhost:9047
Production Install
(1) Unpack the tarball to install.
mkdir /opt/dremio
tar xvzf distribution/server/target/*.tar.gz --strip=1 -C /opt/dremio
(2) Start Dremio Embedded Mode
cd /opt/dremio
bin/dremio
OSS Only
To provide the best possible experience with Dremio, we include a number of dependencies when building Dremio that are distributed under non-OSS, free (as in beer) licenses. Examples include drivers for major databases such as Oracle Database, Microsoft SQL Server, and MySQL, as well as enhancements to improve source pushdowns and thread scheduling. If you'd like to include only dependencies with OSS licenses, Dremio will continue to work, but some features will be unavailable (such as connecting to databases that rely on these drivers).
To build Dremio with only OSS dependencies, add the following option to your Maven command line: -Ddremio.oss-only=true
The distribution directory will then be distribution/server/target/dremio-oss-{DREMIO_VERSION}/dremio-oss-{DREMIO_VERSION}.
Codebase Structure
Directory | Details |
---|---|
dac | Dremio Analyst Center - The Dremio management component. |
common | Dremio Common |
distribution | Dremio Distribution |
plugins | Dremio Plugins |
Contributing
If you want to contribute to Dremio, please see Contributing to Dremio.
Questions?
If you have questions, please post them on https://community.dremio.com.