dremio-oss

Dremio - the missing link in modern data


Top Related Projects

drill

1,985

Apache Drill is a distributed MPP query layer for self describing data

presto

16,420

The official home of the Presto distributed SQL query engine for big data

arrow

15,787

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

spark

42,015

Apache Spark - A unified analytics engine for large-scale data processing

Quick Overview

Dremio is an open-source data lake engine that provides fast, self-service data access to data lakes and other data sources. It enables data analysts and data scientists to query data directly from various sources without the need for complex ETL processes or data warehousing.

Pros

Supports multiple data sources, including S3, HDFS, and relational databases
Provides a SQL interface for querying data across different sources
Offers data acceleration and caching capabilities for improved performance
Includes a user-friendly interface for data exploration and visualization

Cons

Can be complex to set up and configure for optimal performance
May require significant resources for large-scale deployments
Limited support for real-time data processing
Learning curve for users unfamiliar with data lake concepts

Getting Started

To get started with Dremio OSS:

Download the latest Dremio OSS release from the GitHub repository.
Extract the downloaded archive to a directory of your choice.
Navigate to the extracted directory and run the following command:

./bin/dremio start

Open a web browser and go to http://localhost:9047 to access the Dremio UI.
Follow the setup wizard to configure your first data source and start exploring data.

For more detailed instructions, refer to the official Dremio documentation.
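Once the server has started, a short script can confirm that something is answering on the UI port before you open a browser. The sketch below assumes the default address http://localhost:9047 from the steps above; the function name `dremio_ui_reachable` and the timeout value are illustrative choices, not part of Dremio.

```python
import urllib.error
import urllib.request


def dremio_ui_reachable(url="http://localhost:9047", timeout=3.0):
    """Return True if an HTTP server answers at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, or malformed URL.
        return False
```

Calling `dremio_ui_reachable()` in a retry loop is one way to wait for startup to finish. It only shows that an HTTP server responds, not that the setup wizard has been completed.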

Competitor Comparisons

drill

1,985

Apache Drill is a distributed MPP query layer for self describing data

Pros of Drill

Fully open-source Apache project with a large community
Supports a wider range of data sources out-of-the-box
More flexible query execution model for complex analytics

Cons of Drill

Less user-friendly interface compared to Dremio
Slower query performance on certain workloads
Lacks built-in data catalog and governance features

Code Comparison

Drill query example:

SELECT * FROM dfs.`/path/to/data/file.json` WHERE age > 30;

Dremio query example:

SELECT * FROM "My Source"."file.json" WHERE age > 30;

Both Drill and Dremio query files with SQL, but they address them differently. Drill names the storage plugin (e.g., dfs) followed by the full file path in backticks, while Dremio names a configured source and the dataset's path within it as double-quoted identifiers, which reads more like a conventional database and table reference.

Drill and Dremio are both powerful query engines for distributed data analysis. Drill offers more flexibility and a wider range of data sources, making it suitable for complex analytics scenarios. Dremio, on the other hand, provides a more user-friendly experience with better performance for certain workloads and additional features for data management and governance.
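When Dremio queries are generated from a script, the quoted path has to be assembled from its components. The helper below is a minimal sketch that assumes standard SQL identifier quoting (double quotes, with embedded double quotes doubled); `quote_identifier` and `dataset_path` are hypothetical names, not a Dremio API.

```python
def quote_identifier(name):
    """Wrap one path component in double quotes, doubling embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def dataset_path(*components):
    """Join quoted components with dots to form a dataset reference."""
    if not components:
        raise ValueError("at least one path component is required")
    return ".".join(quote_identifier(c) for c in components)


# Rebuild the Dremio query from the example above.
query = f"SELECT * FROM {dataset_path('My Source', 'file.json')} WHERE age > 30"
print(query)
```

Quoting every component avoids ambiguity for names that contain spaces or dots, such as file.json, which would otherwise be read as two separate identifiers.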

presto

16,420

The official home of the Presto distributed SQL query engine for big data

Pros of Presto

More mature and widely adopted in the industry
Supports a broader range of data sources out-of-the-box
Highly scalable for large-scale data processing

Cons of Presto

Requires more setup and configuration
Less user-friendly for non-technical users
Limited built-in data visualization capabilities

Code Comparison

Presto SQL query:

SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
HAVING SUM(order_total) > 1000;

Dremio SQL query:

SELECT customer_name, SUM(order_total)
FROM "Sales"."Orders" orders
JOIN "Sales"."Customers" customers ON orders.customer_id = customers.id
GROUP BY customer_name
HAVING SUM(order_total) > 1000;

The main difference in the code is how tables are referenced. Presto resolves unqualified names such as orders against the session's current catalog and schema, while the Dremio query addresses each dataset by its quoted path (here, datasets inside a space or source named Sales).

Both Presto and Dremio-OSS are powerful SQL query engines for big data analytics. Presto excels in performance and scalability for large-scale data processing, while Dremio-OSS offers a more user-friendly interface and built-in data curation features. The choice between the two depends on specific use cases, technical expertise, and data management requirements.
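To see what the join, grouping, and HAVING filter in these queries compute, the same statement can be run against a small in-memory table. This sketch uses Python's built-in sqlite3 module purely as a stand-in engine, and the sample rows are invented for illustration; it says nothing about Presto or Dremio performance or dialect details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (customer_id INTEGER, order_total INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 600), (1, 700), (2, 200);
""")

# Same shape as the queries above: join, aggregate per customer, filter groups.
rows = conn.execute("""
    SELECT customer_name, SUM(order_total)
    FROM orders
    JOIN customers ON orders.customer_id = customers.id
    GROUP BY customer_name
    HAVING SUM(order_total) > 1000
""").fetchall()
conn.close()
print(rows)
```

Only Alice's orders sum to more than 1000, so hers is the single group that passes the HAVING filter.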

arrow

15,787

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

Pros of Arrow

Broader scope and applicability across various data processing systems
More active community with frequent contributions and updates
Extensive language support including C++, Python, R, and more

Cons of Arrow

Steeper learning curve for newcomers due to its low-level nature
Less out-of-the-box functionality compared to Dremio's complete data lake engine

Code Comparison

Arrow (C++ example):

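#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Write an existing table to an Arrow IPC file named data.arrow.
arrow::Status WriteTable(const std::shared_ptr<arrow::Table>& table) {
  ARROW_ASSIGN_OR_RAISE(auto output, arrow::io::FileOutputStream::Open("data.arrow"));
  ARROW_ASSIGN_OR_RAISE(auto writer, arrow::ipc::MakeFileWriter(output, table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  return writer->Close();
}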

Dremio (Java example):

import com.dremio.exec.store.dfs.FileSystemPlugin;
import com.dremio.exec.store.dfs.SchemaMutability;

FileSystemPlugin plugin = new FileSystemPlugin(config, context, "dfs");
plugin.start();

Summary

Arrow is a more versatile and widely-adopted project for in-memory data representation, while Dremio OSS provides a complete data lake engine built on top of Arrow. Arrow offers greater flexibility and language support, but Dremio OSS provides more immediate functionality for data lake management. The choice between them depends on specific project requirements and the level of customization needed.

spark

42,015

Apache Spark - A unified analytics engine for large-scale data processing

Pros of Spark

Mature ecosystem with extensive libraries and integrations
Powerful distributed computing capabilities for big data processing
Strong community support and active development

Cons of Spark

Steeper learning curve, especially for complex use cases
Resource-intensive, requiring significant cluster resources
Can be overkill for smaller datasets or simpler analytics tasks

Code Comparison

Spark (Scala):

val df = spark.read.json("data.json")
df.groupBy("category").agg(avg("price").alias("avg_price"))
  .orderBy(desc("avg_price"))
  .show()

Dremio (SQL):

SELECT category, AVG(price) AS avg_price
FROM "My Source"."data.json"
GROUP BY category
ORDER BY avg_price DESC

Key Differences

Spark offers a more programmatic approach with support for multiple languages
Dremio provides a SQL-first experience, making it more accessible for SQL users
Spark excels in complex data processing and machine learning tasks
Dremio focuses on data virtualization and query acceleration

Both projects have their strengths, with Spark being more suitable for advanced big data processing and Dremio offering easier data access and management for business intelligence use cases.

hive

5,749

Apache Hive

Pros of Hive

Mature and widely adopted in the Hadoop ecosystem
Strong support for SQL-like queries on large datasets
Integrates well with other Apache big data tools

Cons of Hive

Can be slower for real-time queries compared to Dremio
Less user-friendly interface and setup process
Limited support for modern data formats and cloud-native architectures

Code Comparison

Hive query example:

SELECT customer_id, SUM(order_total)
FROM orders
GROUP BY customer_id
HAVING SUM(order_total) > 1000;

Dremio query example:

SELECT customer_id, SUM(order_total)
FROM "Sales"."Orders"
GROUP BY customer_id
HAVING SUM(order_total) > 1000;

Both Hive and Dremio support SQL-like syntax, but Dremio offers a more modern approach with its data lake engine and support for various data sources. Hive is tightly integrated with Hadoop, while Dremio provides a more flexible architecture for working with diverse data ecosystems. Dremio also offers features like data curation and acceleration that are not natively available in Hive.

impala

1,229

Apache Impala

Pros of Impala

Mature and battle-tested in production environments
Tightly integrated with the Hadoop ecosystem
Supports a wide range of file formats and storage systems

Cons of Impala

Limited support for complex data types and nested structures
Requires Hadoop infrastructure, which can be complex to set up and maintain
Less flexible in terms of data source connectivity compared to Dremio

Code Comparison

Impala SQL query:

SELECT customer_id, SUM(order_total) AS total_sales
FROM orders
WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY customer_id
HAVING SUM(order_total) > 1000;

Dremio SQL query:

SELECT customer_id, SUM(order_total) AS total_sales
FROM "Sales"."Orders"
WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY customer_id
HAVING SUM(order_total) > 1000;

Both Impala and Dremio use SQL-like syntax for querying data. The main difference in these examples is the table reference format, where Dremio uses a more flexible naming convention with quotes and dot notation.


README

Dremio

Dremio enables organizations to unlock the value of their data.

Documentation
Quickstart
Codebase Structure
Contributing
Questions

Documentation

Documentation is available at https://docs.dremio.com.

Quickstart: How to build and run Dremio

(a) Prerequisites

JDK 21 (OpenJDK or Oracle) as the default JDK (JAVA_HOME set to it)
JDK 17 (OpenJDK or Oracle) in Maven toolchain, required to run certain integration tests
JDK 11 (OpenJDK or Oracle) in Maven toolchain, required to run unit and integration tests
(Optional) Maven 3.9.3 or later (using Homebrew: brew install maven)

Run the following commands to verify that you have the correct versions of Maven and JDK installed:

java -version
mvn --version
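If the version check needs to run in a script, the major version can be parsed from the text that java -version prints. The function below is a sketch under the assumption that the output contains a quoted version string such as "21.0.2" or the older "1.8.0_292" style; `java_major_version` is an illustrative name.

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    first = int(match.group(1))
    # Releases before Java 9 report themselves as 1.x (for example 1.8.0_292).
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first
```

Note that java -version writes to standard error, so a caller using subprocess.run would read the stderr attribute of the result rather than stdout.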

Add the required JDKs to the Maven toolchain. The easiest way is to use ${HOME}/.m2/toolchains.xml. Example:

<?xml version="1.0" encoding="UTF-8"?>
<toolchains>
  <toolchain>
    <type>jdk</type>
    <provides>
      <version>11</version>
      <vendor>sun</vendor>
    </provides>
    <configuration>
      <jdkHome>FULL_PATH_TO_YOUR_JAVA_11_HOME</jdkHome>
    </configuration>
  </toolchain>
  <toolchain>
    <type>jdk</type>
    <provides>
      <version>17</version>
      <vendor>sun</vendor>
    </provides>
    <configuration>
      <jdkHome>FULL_PATH_TO_YOUR_JAVA_17_HOME</jdkHome>
    </configuration>
  </toolchain>
  <toolchain>
    <type>jdk</type>
    <provides>
      <version>21</version>
      <vendor>sun</vendor>
    </provides>
    <configuration>
      <jdkHome>FULL_PATH_TO_YOUR_JAVA_21_HOME</jdkHome>
    </configuration>
  </toolchain>
</toolchains>
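Rather than editing the file by hand, the same structure can be generated from a mapping of JDK versions to installation paths. This is a minimal sketch using Python's standard xml.etree.ElementTree module; `build_toolchains` is a hypothetical helper, the vendor value is copied from the example above, and the paths are the same placeholders.

```python
import xml.etree.ElementTree as ET


def build_toolchains(jdk_homes):
    """Build a toolchains.xml document from a {version: jdk_home} mapping."""
    root = ET.Element("toolchains")
    for version, home in sorted(jdk_homes.items(), key=lambda kv: int(kv[0])):
        toolchain = ET.SubElement(root, "toolchain")
        ET.SubElement(toolchain, "type").text = "jdk"
        provides = ET.SubElement(toolchain, "provides")
        ET.SubElement(provides, "version").text = str(version)
        ET.SubElement(provides, "vendor").text = "sun"
        configuration = ET.SubElement(toolchain, "configuration")
        ET.SubElement(configuration, "jdkHome").text = home
    return ET.tostring(root, encoding="unicode")


# Placeholder paths, as in the example above; replace with real JDK homes.
xml_text = build_toolchains({
    "11": "FULL_PATH_TO_YOUR_JAVA_11_HOME",
    "17": "FULL_PATH_TO_YOUR_JAVA_17_HOME",
    "21": "FULL_PATH_TO_YOUR_JAVA_21_HOME",
})
print(xml_text)
```

The returned string has no XML declaration or indentation; add the declaration line from the example when writing it to ${HOME}/.m2/toolchains.xml if you want the file to match exactly.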

(b) Clone the Repository

git clone https://github.com/dremio/dremio-oss.git dremio

(c) Build the Code

cd dremio
mvn clean install -DskipTests

If Maven is not installed on the machine, use the bundled wrapper instead:

./mvnw clean install -DskipTests

The "-DskipTests" option skips most of the tests. Running all tests takes a long time.
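For automation, the build command and its variants can be assembled as an argument list. The sketch below only combines options that appear in this README (the Maven wrapper, -DskipTests, and the -Ddremio.oss-only=true property from the OSS Only section); `maven_build_command` is an illustrative helper, not part of the Dremio build.

```python
def maven_build_command(skip_tests=True, oss_only=False, use_wrapper=False):
    """Assemble the build command as an argument list for subprocess.run."""
    command = ["./mvnw" if use_wrapper else "mvn", "clean", "install"]
    if skip_tests:
        command.append("-DskipTests")
    if oss_only:
        # Property named in the README's "OSS Only" section.
        command.append("-Ddremio.oss-only=true")
    return command


print(" ".join(maven_build_command()))
```

The list form can be passed directly to subprocess.run with the repository root as the working directory, which avoids shell quoting issues.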

(d) Run/Install

Run

distribution/server/target/dremio-community-{DREMIO_VERSION}/dremio-community-{DREMIO_VERSION}/bin/dremio start

Or, to start a server with a default user (dremio/dremio123):

mvn compile exec:exec -pl dac/daemon

Once run, the UI is accessible at:

http://localhost:9047

Production Install

(1) Unpack the tarball to install.

mkdir /opt/dremio
tar xvzf distribution/server/target/*.tar.gz --strip-components=1 -C /opt/dremio
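The strip option in the tar command drops the leading directory of every archive entry, so the contents land directly in /opt/dremio. The sketch below reproduces that behavior with Python's tarfile module for environments where a script is preferred; `strip_components` and `extract_stripped` are hypothetical helper names, and the code does not rewrite hard-link targets inside the archive.

```python
import tarfile
from pathlib import PurePosixPath


def strip_components(name, count=1):
    """Drop the first `count` path components of an archive entry name."""
    parts = PurePosixPath(name).parts
    if len(parts) <= count:
        return None  # nothing is left once the prefix is removed
    return str(PurePosixPath(*parts[count:]))


def extract_stripped(archive_path, dest_dir, count=1):
    """Extract a .tar.gz into dest_dir without its leading directories."""
    with tarfile.open(archive_path, "r:gz") as tar:
        members = []
        for member in tar.getmembers():
            new_name = strip_components(member.name, count)
            if new_name is not None:
                member.name = new_name
                members.append(member)
        tar.extractall(dest_dir, members=members)
```

Entries that consist only of the stripped prefix, such as the top-level directory itself, are skipped.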

(2) Start Dremio Embedded Mode

cd /opt/dremio
bin/dremio

OSS Only

To have the best possible experience with Dremio, we include a number of dependencies when building Dremio that are distributed under non-oss free (as in beer) licenses. Examples include drivers for major databases such as Oracle Database, Microsoft SQL Server, MySQL as well as enhancements to improve source pushdowns and thread scheduling. If you'd like to only include dependencies with OSS licenses, Dremio will continue to work but some features will be unavailable (such as connecting to databases that rely on these drivers).

To build Dremio with only OSS dependencies, add the following option to your Maven command line: -Ddremio.oss-only=true

The distribution directory will be distribution/server/target/dremio-oss-{DREMIO_VERSION}/dremio-oss-{DREMIO_VERSION}
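Because the output directory name depends on both the version and the build flavor, scripts that locate the build output need to account for both. The helper below is a sketch based on the two directory patterns given in this README; the function names are illustrative, and the assumption that the OSS-only build has the same bin/dremio launcher layout as the community build is not confirmed by the text.

```python
from pathlib import PurePosixPath


def distribution_dir(version, oss_only=False):
    """Return the build output directory relative to the repository root."""
    name = f"dremio-{'oss' if oss_only else 'community'}-{version}"
    return PurePosixPath("distribution/server/target") / name / name


def launcher_path(version, oss_only=False):
    """Return the bin/dremio launcher path (layout assumed for OSS builds)."""
    return distribution_dir(version, oss_only) / "bin" / "dremio"
```

For a placeholder version such as 1.2.3, `distribution_dir("1.2.3", oss_only=True)` yields distribution/server/target/dremio-oss-1.2.3/dremio-oss-1.2.3.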

Codebase Structure

Directory	Details
dac	Dremio Analyst Center - The Dremio management component.
common	Dremio Common
distribution	Dremio Distribution
plugins	Dremio Plugins

Contributing

If you want to contribute to Dremio, please see Contributing to Dremio.

Questions?

If you have questions, please post them on https://community.dremio.com.
