dremio-oss

Dremio - the missing link in modern data

1,353
440
1,353
52

Top Related Projects

1,931

Apache Drill is a distributed MPP query layer for self describing data

15,889

The official home of the Presto distributed SQL query engine for big data

14,246

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

39,274

Apache Spark - A unified analytics engine for large-scale data processing

5,487

Apache Hive

1,122

Apache Impala

Quick Overview

Dremio is an open-source data lake engine that provides fast, self-service data access to data lakes and other data sources. It enables data analysts and data scientists to query data directly from various sources without the need for complex ETL processes or data warehousing.

Pros

  • Supports multiple data sources, including S3, HDFS, and relational databases
  • Provides a SQL interface for querying data across different sources
  • Offers data acceleration and caching capabilities for improved performance
  • Includes a user-friendly interface for data exploration and visualization

Cons

  • Can be complex to set up and configure for optimal performance
  • May require significant resources for large-scale deployments
  • Limited support for real-time data processing
  • Learning curve for users unfamiliar with data lake concepts

Getting Started

To get started with Dremio OSS:

  1. Download the latest Dremio OSS release from the GitHub repository.
  2. Extract the downloaded archive to a directory of your choice.
  3. Navigate to the extracted directory and run the following command:
./bin/dremio start
  4. Open a web browser and go to http://localhost:9047 to access the Dremio UI.
  5. Follow the setup wizard to configure your first data source and start exploring data.

For more detailed instructions, refer to the official Dremio documentation.
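
After starting the server, it can be handy to confirm that something is answering on the UI port before opening a browser. The following is a minimal sketch using only the Python standard library. The function name `dremio_ui_reachable` is hypothetical, and the check only confirms that an HTTP server responds at the address; it does not verify that the server is Dremio or that setup has completed.

```python
import urllib.error
import urllib.request


def dremio_ui_reachable(base_url="http://localhost:9047", timeout=3.0):
    """Return True if an HTTP server answers at base_url, False otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, or malformed URL.
        return False


if __name__ == "__main__":
    print("Dremio UI reachable:", dremio_ui_reachable())
```

If the function returns False shortly after startup, the server may still be initializing, so retrying after a few seconds is reasonable.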

Competitor Comparisons

1,931

Apache Drill is a distributed MPP query layer for self describing data

Pros of Drill

  • Fully open-source Apache project with a large community
  • Supports a wider range of data sources out-of-the-box
  • More flexible query execution model for complex analytics

Cons of Drill

  • Less user-friendly interface compared to Dremio
  • Slower query performance on certain workloads
  • Lacks built-in data catalog and governance features

Code Comparison

Drill query example:

SELECT * FROM dfs.`/path/to/data/file.json` WHERE age > 30;

Dremio query example:

SELECT * FROM "My Source"."file.json" WHERE age > 30;

Both Drill and Dremio use SQL-like syntax for querying data, but Dremio's approach is more intuitive with its virtual dataset concept. Drill requires specifying the storage plugin (e.g., dfs) and full path, while Dremio uses a more familiar database-like structure.
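
The quoted, dot-separated path in the Dremio example can be built programmatically. The sketch below is a hedged illustration: the helper names `quote_identifier` and `dataset_path` are hypothetical, and doubling embedded double quotes follows the standard SQL convention for quoted identifiers, which is assumed here to apply to Dremio as well.

```python
def quote_identifier(name):
    """Wrap one path component in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def dataset_path(*components):
    """Join components into a dot-separated, fully quoted dataset path."""
    if not components:
        raise ValueError("at least one path component is required")
    return ".".join(quote_identifier(c) for c in components)


# Rebuild the Dremio query shown above from its path components.
query = f"SELECT * FROM {dataset_path('My Source', 'file.json')} WHERE age > 30"
print(query)
```

Quoting every component keeps names that contain spaces or dots, such as "My Source" and "file.json", from being split into separate path segments.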

Drill and Dremio are both powerful query engines for distributed data analysis. Drill offers more flexibility and a wider range of data sources, making it suitable for complex analytics scenarios. Dremio, on the other hand, provides a more user-friendly experience with better performance for certain workloads and additional features for data management and governance.

15,889

The official home of the Presto distributed SQL query engine for big data

Pros of Presto

  • More mature and widely adopted in the industry
  • Supports a broader range of data sources out-of-the-box
  • Highly scalable for large-scale data processing

Cons of Presto

  • Requires more setup and configuration
  • Less user-friendly for non-technical users
  • Limited built-in data visualization capabilities

Code Comparison

Presto SQL query:

SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
HAVING SUM(order_total) > 1000;

Dremio SQL query:

SELECT customer_name, SUM(order_total)
FROM "Sales"."Orders" orders
JOIN "Sales"."Customers" customers ON orders.customer_id = customers.id
GROUP BY customer_name
HAVING SUM(order_total) > 1000;

The main difference in the code is how tables are referenced. Presto resolves table names against a catalog and schema, while Dremio addresses each dataset by a quoted, dot-separated path (here, the Orders and Customers datasets under Sales).

Both Presto and Dremio OSS are powerful SQL query engines for big data analytics. Presto excels in performance and scalability for large-scale data processing, while Dremio OSS offers a more user-friendly interface and built-in data curation features. The choice between the two depends on specific use cases, technical expertise, and data management requirements.

14,246

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Pros of Arrow

  • Broader scope and applicability across various data processing systems
  • More active community with frequent contributions and updates
  • Extensive language support including C++, Python, R, and more

Cons of Arrow

  • Steeper learning curve for newcomers due to its low-level nature
  • Less out-of-the-box functionality compared to Dremio's complete data lake engine

Code Comparison

Arrow (C++ example):

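#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Write an existing table to an Arrow IPC file named data.arrow.
arrow::Status WriteArrowFile(const std::shared_ptr<arrow::Table>& table) {
  ARROW_ASSIGN_OR_RAISE(auto output, arrow::io::FileOutputStream::Open("data.arrow"));
  ARROW_ASSIGN_OR_RAISE(auto writer, arrow::ipc::MakeFileWriter(output, table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  return writer->Close();
}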

Dremio (Java example):

import com.dremio.exec.store.dfs.FileSystemPlugin;
import com.dremio.exec.store.dfs.SchemaMutability;

FileSystemPlugin plugin = new FileSystemPlugin(config, context, "dfs");
plugin.start();

Summary

Arrow is a more versatile and widely-adopted project for in-memory data representation, while Dremio OSS provides a complete data lake engine built on top of Arrow. Arrow offers greater flexibility and language support, but Dremio OSS provides more immediate functionality for data lake management. The choice between them depends on specific project requirements and the level of customization needed.

39,274

Apache Spark - A unified analytics engine for large-scale data processing

Pros of Spark

  • Mature ecosystem with extensive libraries and integrations
  • Powerful distributed computing capabilities for big data processing
  • Strong community support and active development

Cons of Spark

  • Steeper learning curve, especially for complex use cases
  • Resource-intensive, requiring significant cluster resources
  • Can be overkill for smaller datasets or simpler analytics tasks

Code Comparison

Spark (Scala):

val df = spark.read.json("data.json")
df.groupBy("category").agg(avg("price").alias("avg_price"))
  .orderBy(desc("avg_price"))
  .show()

Dremio (SQL):

SELECT category, AVG(price) AS avg_price
FROM "My Source"."data.json"
GROUP BY category
ORDER BY avg_price DESC

Key Differences

  • Spark offers a more programmatic approach with support for multiple languages
  • Dremio provides a SQL-first experience, making it more accessible for SQL users
  • Spark excels in complex data processing and machine learning tasks
  • Dremio focuses on data virtualization and query acceleration

Both projects have their strengths, with Spark being more suitable for advanced big data processing and Dremio offering easier data access and management for business intelligence use cases.
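
To make concrete what both snippets compute, the sketch below runs the same aggregation with Python's built-in sqlite3 module. SQLite is used purely as a stand-in SQL engine here, not as a substitute for Dremio or Spark, and the table name `data` and the sample rows are invented for illustration.

```python
import sqlite3

# Invented sample rows: (category, price).
rows = [("books", 12.0), ("books", 18.0), ("games", 40.0), ("toys", 9.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (category TEXT, price REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?)", rows)

# Average price per category, highest average first.
result = conn.execute(
    "SELECT category, AVG(price) AS avg_price "
    "FROM data GROUP BY category ORDER BY avg_price DESC"
).fetchall()
conn.close()

for category, avg_price in result:
    print(category, avg_price)
```

The SQL text is the same shape as the Dremio query above; only the table reference differs, which reflects the point that the SQL-first approach carries over between engines more directly than the DataFrame API does.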

5,487

Apache Hive

Pros of Hive

  • Mature and widely adopted in the Hadoop ecosystem
  • Strong support for SQL-like queries on large datasets
  • Integrates well with other Apache big data tools

Cons of Hive

  • Can be slower for real-time queries compared to Dremio
  • Less user-friendly interface and setup process
  • Limited support for modern data formats and cloud-native architectures

Code Comparison

Hive query example:

SELECT customer_id, SUM(order_total)
FROM orders
GROUP BY customer_id
HAVING SUM(order_total) > 1000;

Dremio query example:

SELECT customer_id, SUM(order_total)
FROM "Sales"."Orders"
GROUP BY customer_id
HAVING SUM(order_total) > 1000;

Both Hive and Dremio support SQL-like syntax, but Dremio offers a more modern approach with its data lake engine and support for various data sources. Hive is tightly integrated with Hadoop, while Dremio provides a more flexible architecture for working with diverse data ecosystems. Dremio also offers features like data curation and acceleration that are not natively available in Hive.

1,122

Apache Impala

Pros of Impala

  • Mature and battle-tested in production environments
  • Tightly integrated with the Hadoop ecosystem
  • Supports a wide range of file formats and storage systems

Cons of Impala

  • Limited support for complex data types and nested structures
  • Requires Hadoop infrastructure, which can be complex to set up and maintain
  • Less flexible in terms of data source connectivity compared to Dremio

Code Comparison

Impala SQL query:

SELECT customer_id, SUM(order_total) AS total_sales
FROM orders
WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY customer_id
HAVING total_sales > 1000;

Dremio SQL query:

SELECT customer_id, SUM(order_total) AS total_sales
FROM "Sales"."Orders"
WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY customer_id
HAVING total_sales > 1000;

Both Impala and Dremio use SQL-like syntax for querying data. The main difference in these examples is the table reference format, where Dremio uses a more flexible naming convention with quotes and dot notation.

README

Dremio

Dremio enables organizations to unlock the value of their data.

Table of Contents

  1. Documentation
  2. Quickstart
  3. Codebase Structure
  4. Contributing
  5. Questions

Documentation

Documentation is available at https://docs.dremio.com.

Quickstart: How to build and run Dremio

(a) Prerequisites

  • JDK 11 (OpenJDK or Oracle) as the default JDK (JAVA_HOME set to it)
  • JDK 17 (OpenJDK or Oracle) in Maven toolchain, required to run certain integration tests
  • (Optional) Maven 3.9.3 or later (using Homebrew: brew install maven)

Run the following commands to verify that you have the correct versions of Maven and JDK installed:

java -version
mvn --version

Add JDK 17 to the Maven toolchain. The easiest way is to declare it in ${HOME}/.m2/toolchains.xml. Example:

<?xml version="1.0" encoding="UTF-8"?>
<toolchains>
  <toolchain>
    <type>jdk</type>
    <provides>
      <version>11</version>
      <vendor>sun</vendor>
    </provides>
    <configuration>
      <jdkHome>FULL_PATH_TO_YOUR_JAVA_11_HOME</jdkHome>
    </configuration>
  </toolchain>
  <toolchain>
    <type>jdk</type>
    <provides>
      <version>17</version>
      <vendor>sun</vendor>
    </provides>
    <configuration>
      <jdkHome>FULL_PATH_TO_YOUR_JAVA_17_HOME</jdkHome>
    </configuration>
  </toolchain>
</toolchains>
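
Before starting a long build, it may be worth checking that the toolchains file declares the JDK versions listed in the prerequisites. The following sketch parses a toolchains document with the standard library and lists the declared JDKs. The function name `declared_jdks` and the sample paths are hypothetical, and the parser assumes the file has no XML namespace, as in the example above.

```python
import xml.etree.ElementTree as ET


def declared_jdks(toolchains_xml):
    """Map each declared JDK version to its jdkHome path."""
    root = ET.fromstring(toolchains_xml)
    jdks = {}
    for toolchain in root.findall("toolchain"):
        # Skip toolchains that are not JDKs.
        if toolchain.findtext("type", default="").strip() != "jdk":
            continue
        version = toolchain.findtext("provides/version", default="").strip()
        home = toolchain.findtext("configuration/jdkHome", default="").strip()
        if version:
            jdks[version] = home
    return jdks


# Sample document with made-up install paths.
sample = """
<toolchains>
  <toolchain>
    <type>jdk</type>
    <provides><version>11</version><vendor>sun</vendor></provides>
    <configuration><jdkHome>/opt/jdk-11</jdkHome></configuration>
  </toolchain>
  <toolchain>
    <type>jdk</type>
    <provides><version>17</version><vendor>sun</vendor></provides>
    <configuration><jdkHome>/opt/jdk-17</jdkHome></configuration>
  </toolchain>
</toolchains>
"""

jdks = declared_jdks(sample)
missing = [v for v in ("11", "17") if v not in jdks]
print("declared:", jdks, "missing:", missing)
```

To check a real file, read it as bytes (for example with `pathlib.Path.home() / ".m2" / "toolchains.xml"` and `read_bytes()`) and pass the result to `declared_jdks`.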

(b) Clone the Repository

git clone https://github.com/dremio/dremio-oss.git dremio

(c) Build the Code

cd dremio
mvn clean install -DskipTests

If Maven is not installed on the machine, use the bundled wrapper instead:

./mvnw clean install -DskipTests

The "-DskipTests" option skips most of the tests. Running all tests takes a long time.

(d) Run/Install

Run

distribution/server/target/dremio-community-{DREMIO_VERSION}/dremio-community-{DREMIO_VERSION}/bin/dremio start

OR to start a server with a default user (dremio/dremio123)

mvn compile exec:exec -pl dac/daemon

Once run, the UI is accessible at:

http://localhost:9047
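
The launcher path above contains a {DREMIO_VERSION} placeholder, so the concrete path depends on the version that was built. As a hedged sketch, the helper below locates the launcher by globbing the build output instead of hard-coding the version. The function name `find_launchers` is hypothetical, and the demonstration runs against a throwaway directory tree with a made-up version number.

```python
import tempfile
from pathlib import Path


def find_launchers(repo_root, flavor="community"):
    """Return sorted bin/dremio paths under distribution/server/target.

    flavor is "community" for the default build; "oss" is assumed for an
    OSS-only build, following the directory names quoted in this README.
    """
    target = Path(repo_root) / "distribution" / "server" / "target"
    pattern = f"dremio-{flavor}-*/dremio-{flavor}-*/bin/dremio"
    return sorted(p for p in target.glob(pattern) if p.is_file())


# Demonstration on a temporary tree; "1.0.0" is a made-up version.
with tempfile.TemporaryDirectory() as tmp:
    launcher = (Path(tmp) / "distribution" / "server" / "target"
                / "dremio-community-1.0.0" / "dremio-community-1.0.0"
                / "bin" / "dremio")
    launcher.parent.mkdir(parents=True)
    launcher.touch()
    found = find_launchers(tmp)
    relative = [p.relative_to(tmp).as_posix() for p in found]
    print(relative)
```

In a real checkout, calling `find_launchers(".")` from the repository root after the build should return the path to pass to `start`.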

Production Install

(1) Unpack the tarball to install.
mkdir /opt/dremio
tar xvzf distribution/server/target/*.tar.gz --strip=1 -C /opt/dremio
(2) Start Dremio Embedded Mode
cd /opt/dremio
bin/dremio

OSS Only

To have the best possible experience with Dremio, we include a number of dependencies when building Dremio that are distributed under non-oss free (as in beer) licenses. Examples include drivers for major databases such as Oracle Database, Microsoft SQL Server, MySQL as well as enhancements to improve source pushdowns and thread scheduling. If you'd like to only include dependencies with OSS licenses, Dremio will continue to work but some features will be unavailable (such as connecting to databases that rely on these drivers).

To build Dremio with only OSS dependencies, add the following option to your Maven command line: -Ddremio.oss-only=true

The distribution directory will be distribution/server/target/dremio-oss-{DREMIO_VERSION}/dremio-oss-{DREMIO_VERSION}
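
The build variants described in this README (plain Maven or the wrapper, with or without the OSS-only option) can be assembled in one place. The sketch below only constructs the command; it does not run it. The function name `build_command` and its parameters are hypothetical, while the flags themselves are the ones quoted in this README.

```python
import shlex


def build_command(oss_only=False, use_wrapper=False, skip_tests=True):
    """Assemble the Maven build command as an argument list."""
    command = ["./mvnw" if use_wrapper else "mvn", "clean", "install"]
    if skip_tests:
        command.append("-DskipTests")
    if oss_only:
        command.append("-Ddremio.oss-only=true")
    return command


# Print the OSS-only build command in a copy-pasteable form.
print(shlex.join(build_command(oss_only=True)))
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.run` from the repository root without shell quoting concerns.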

Codebase Structure

Directory      Details
dac            Dremio Analyst Center - The Dremio management component.
common         Dremio Common
distribution   Dremio Distribution
plugins        Dremio Plugins

Contributing

If you want to contribute to Dremio, please see Contributing to Dremio.

Questions?

If you have questions, please post them on https://community.dremio.com.