trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Top Related Projects
Apache Spark - A unified analytics engine for large-scale data processing
Apache Hive
Apache Flink
Apache Drill is a distributed MPP query layer for self describing data
The official home of the Presto distributed SQL query engine for big data
Dremio - the missing link in modern data
Quick Overview
Trino (formerly PrestoSQL) is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. It's optimized for ad-hoc analysis and can process data from various sources including Hadoop, AWS S3, MySQL, and more.
Pros
- High performance and scalability for querying large datasets
- Supports a wide range of data sources and formats
- ANSI SQL compatibility for ease of use
- Active community and regular updates
Cons
- Complex setup and configuration for large-scale deployments
- Limited write capabilities compared to full-fledged databases
- Resource-intensive, requiring significant memory and CPU
- Steeper learning curve for advanced features and optimizations
Code Examples
- Simple query example:
SELECT * FROM hive.sales.transactions
WHERE amount > 1000
LIMIT 10;
This query selects all columns from the 'transactions' table in the 'sales' schema of the Hive catalog, filtering for amounts over 1000 and limiting the result to 10 rows.
- Joining data from different sources:
SELECT c.name, o.order_date, o.total_amount
FROM mysql.customers c
JOIN postgresql.orders o ON c.id = o.customer_id
WHERE o.order_date >= DATE '2023-01-01';
This query joins customer data from a MySQL database with order data from a PostgreSQL database, demonstrating Trino's ability to query across different data sources.
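Before writing cross-source joins, you can discover what is available using Trino's standard metadata statements. The catalog names below match the examples above; the public schema is the usual PostgreSQL default and is assumed here for illustration:

SHOW CATALOGS;
SHOW SCHEMAS FROM mysql;
SHOW TABLES FROM postgresql.public;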
- Creating a view:
CREATE VIEW hive.analytics.high_value_customers AS
SELECT customer_id, SUM(total_amount) as total_spent
FROM postgresql.orders
GROUP BY customer_id
HAVING SUM(total_amount) > 10000;
This example creates a view in the Hive catalog that identifies high-value customers based on their total spend from the orders table in PostgreSQL.
Getting Started
To get started with Trino:
- Download Trino server from the official website.
- Extract the tarball and configure the etc/config.properties and etc/catalog/*.properties files.
- Start the Trino server:
bin/launcher start
- Use the Trino CLI to run queries:
./trino --server localhost:8080 --catalog hive --schema default
- Run your first query:
SELECT * FROM hive.default.users LIMIT 10;
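As a point of reference, a minimal single-node configuration might look like the following sketch. The property names are standard Trino configuration keys, but all values (ports, metastore address) are illustrative and should be adapted to your environment:

# etc/config.properties — single node acting as both coordinator and worker
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://localhost:8080

# etc/catalog/hive.properties — example Hive connector catalog
connector.name=hive
hive.metastore.uri=thrift://localhost:9083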
For more detailed setup instructions, refer to the official Trino documentation.
Competitor Comparisons
Apache Spark - A unified analytics engine for large-scale data processing
Pros of Spark
- More mature ecosystem with extensive libraries and integrations
- Better support for machine learning and graph processing
- Stronger community and wider adoption in industry
Cons of Spark
- Higher memory consumption and resource requirements
- Steeper learning curve, especially for complex use cases
- Less efficient for ad-hoc SQL queries on large datasets
Code Comparison
Trino SQL query:
SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
Spark DataFrame API (Scala):
val result = spark.sql("""
SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
""")
Both Trino and Spark support SQL queries, but Spark offers additional APIs for data processing. Trino focuses on SQL and is optimized for distributed query execution, while Spark provides a more comprehensive data processing framework with support for various programming languages and paradigms.
Trino excels at fast, interactive SQL queries across multiple data sources, making it ideal for data lake analytics. Spark, on the other hand, offers a more versatile platform for large-scale data processing, including batch and stream processing, machine learning, and graph computations.
Apache Hive
Pros of Hive
- Mature ecosystem with extensive documentation and community support
- Tightly integrated with Hadoop ecosystem for big data processing
- Supports a wide range of file formats and storage systems
Cons of Hive
- Slower query performance compared to Trino, especially for interactive queries
- Limited support for real-time data processing and streaming
- Less flexible when working with diverse data sources outside the Hadoop ecosystem
Code Comparison
Hive query example:
SELECT customer_id, SUM(order_total)
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY customer_id;
Trino query example:
SELECT customer_id, SUM(order_total)
FROM hive.sales.orders
WHERE order_date >= DATE '2023-01-01'
GROUP BY customer_id;
The syntax is similar, but Trino uses a more standardized SQL dialect and supports querying across different data sources using a catalog.schema.table notation.
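The fully qualified catalog.schema.table form can be shortened by setting a session default with Trino's standard USE statement (the hive.sales schema here follows the example above):

USE hive.sales;
-- subsequent unqualified names resolve against hive.sales
SELECT customer_id, SUM(order_total)
FROM orders
WHERE order_date >= DATE '2023-01-01'
GROUP BY customer_id;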
Key Differences
- Performance: Trino is designed for fast, interactive queries, while Hive is better suited for batch processing
- Data sources: Trino can query multiple data sources simultaneously, Hive is primarily focused on Hadoop data
- Query engine: Trino uses its own distributed MPP query engine, while Hive traditionally translates queries into MapReduce or Tez jobs
- Real-time capabilities: Trino offers better support for real-time data analysis compared to Hive
Apache Flink
Pros of Flink
- Robust stream processing capabilities with low latency and high throughput
- Extensive ecosystem with support for various data sources and sinks
- Built-in support for event time processing and windowing operations
Cons of Flink
- Steeper learning curve due to complex API and concepts
- Limited support for ad-hoc querying compared to Trino's SQL-based approach
- Higher resource requirements for small-scale deployments
Code Comparison
Flink (Java):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("input.txt");
// Tokenizer is a user-defined FlatMapFunction that splits each line into (word, 1) tuples
DataStream<Tuple2<String, Integer>> counts = text
.flatMap(new Tokenizer())
.keyBy(value -> value.f0)
.sum(1);
counts.print();
Trino (SQL):
SELECT word, COUNT(*) AS count
FROM input_table
CROSS JOIN UNNEST(split(lower(line), ' ')) AS t(word)
GROUP BY word
ORDER BY count DESC;
Summary
Flink excels in stream processing and complex event processing scenarios, offering low-latency and high-throughput capabilities. It provides a rich ecosystem for various data sources and sinks. However, Flink has a steeper learning curve and may be overkill for simpler data processing tasks.
Trino, on the other hand, focuses on distributed SQL query execution across multiple data sources. It offers a more familiar SQL-based approach, making it easier for users with SQL backgrounds to get started. Trino is better suited for ad-hoc querying and data exploration tasks, while Flink shines in real-time stream processing applications.
Apache Drill is a distributed MPP query layer for self describing data
Pros of Drill
- Supports schema-free JSON data querying without predefined schemas
- Offers a distributed cache for improved query performance
- Provides built-in support for Hive metastore integration
Cons of Drill
- Less extensive connector ecosystem compared to Trino
- Slower development pace and community growth
- Limited support for complex data types and advanced SQL features
Code Comparison
Drill query example:
SELECT * FROM dfs.`/path/to/data/file.json` WHERE age > 30
Trino query example:
SELECT * FROM hive.default.users WHERE age > 30
Both systems use SQL-like syntax, but Drill can query JSON files directly without a predefined schema, while Trino typically requires a catalog and schema definition.
Key Differences
- Query engine: Drill emphasizes a schema-free, vectorized execution model, while Trino uses a pipelined, distributed MPP engine with a cost-based optimizer
- Data Format Support: Drill excels at querying schema-less data, whereas Trino is better suited for structured data sources
- Ecosystem: Trino has a more extensive connector ecosystem and broader adoption in the industry
- Performance: Trino generally offers better performance for large-scale, complex queries across multiple data sources
The official home of the Presto distributed SQL query engine for big data
Pros of Presto
- Longer history and established ecosystem
- Wider adoption in enterprise environments
- More extensive documentation and community resources
Cons of Presto
- Slower release cycle and feature development
- Less active community engagement
- Limited support for newer technologies and cloud-native environments
Code Comparison
Presto:
SELECT
orderkey,
orderstatus,
totalprice,
orderdate
FROM
orders
WHERE
orderdate >= DATE '1994-01-01'
AND totalprice > 100000;
Trino:
SELECT
orderkey,
orderstatus,
totalprice,
orderdate
FROM
orders
WHERE
orderdate >= DATE '1994-01-01'
AND totalprice > 100000;
The SQL syntax for both Presto and Trino is nearly identical in this example, as Trino was forked from Presto. However, Trino has since added more features and optimizations that may not be reflected in basic SQL queries.
While the core functionality remains similar, Trino has evolved to offer improved performance, enhanced security features, and better integration with modern data architectures. Presto, on the other hand, maintains its position as a stable and widely-used solution in enterprise environments, with a focus on compatibility and established workflows.
Dremio - the missing link in modern data
Pros of Dremio
- Built-in data catalog and data lineage features
- Supports data virtualization and data lake architecture
- User-friendly interface for data exploration and visualization
Cons of Dremio
- Less extensive ecosystem and community support
- More limited query optimization capabilities
- Steeper learning curve for advanced features
Code Comparison
Trino query example:
SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
Dremio query example:
SELECT customer_name, SUM(order_total)
FROM "Sales"."Orders"
JOIN "Sales"."Customers" ON Orders.customer_id = Customers.id
GROUP BY customer_name
Both Trino and Dremio use SQL-like syntax for querying data. The main difference lies in how data sources are referenced. Trino typically uses catalog and schema names, while Dremio uses virtual datasets and spaces.
Trino offers more flexibility in connecting to various data sources, making it easier to query across different systems. Dremio, on the other hand, provides a more integrated approach to data virtualization and management within its platform.
Overall, Trino is better suited for large-scale distributed queries across diverse data sources, while Dremio excels in data lake management and self-service analytics scenarios.
README
Trino is a fast distributed SQL query engine for big data analytics.
See the User Manual for deployment instructions and end user documentation.
Development
See DEVELOPMENT for information about code style, development process, and guidelines.
See CONTRIBUTING for contribution requirements.
Security
See the project security policy for information about reporting vulnerabilities.
Trino supports reproducible builds as of version 449.
Build requirements
- Mac OS X or Linux
- Java 22.0.0+, 64-bit
- Docker
- Disable SELinux or other systems that block write access to the local checkout, so that containers can mount parts of the Trino source tree
Building Trino
Trino is a standard Maven project. Simply run the following command from the project root directory:
./mvnw clean install -DskipTests
On the first build, Maven downloads all the dependencies from the internet and caches them in the local repository (~/.m2/repository), which can take a while, depending on your connection speed. Subsequent builds are faster.
Trino has a comprehensive set of tests that take a considerable amount of time to run, and are thus disabled by the above command. These tests are run by the CI system when you submit a pull request. We recommend only running tests locally for the areas of code that you change.
Running Trino in your IDE
Overview
After building Trino for the first time, you can load the project into your IDE and run the server. We recommend using IntelliJ IDEA. Because Trino is a standard Maven project, you can easily import it into your IDE. In IntelliJ, choose Open Project from the Quick Start box or choose Open from the File menu and select the root pom.xml file.
After opening the project in IntelliJ, double check that the Java SDK is properly configured for the project:
- Open the File menu and select Project Structure
- In the SDKs section, ensure that JDK 22 is selected (create one if none exist)
- In the Project section, ensure the Project language level is set to 22
Running a testing server
The simplest way to run Trino for development is to run the TpchQueryRunner class. It will start a development version of the server that is configured with the TPCH connector. You can then use the CLI to execute queries against this server. Many other connectors have their own *QueryRunner class that you can use when working on a specific connector.
Running the full server
Trino comes with sample configuration that should work out-of-the-box for development. Use the following options to create a run configuration:
- Main Class: io.trino.server.DevelopmentServer
- VM Options: -ea -Dconfig=etc/config.properties -Dlog.levels-file=etc/log.properties -Djdk.attach.allowAttachSelf=true
- Working directory: $MODULE_DIR$
- Use classpath of module: trino-server-dev
The working directory should be the trino-server-dev subdirectory. In IntelliJ, using $MODULE_DIR$ accomplishes this automatically.
If VM options doesn't exist in the dialog, select Modify options and enable Add VM options.
Running the CLI
Start the CLI to connect to the server and run SQL queries:
client/trino-cli/target/trino-cli-*-executable.jar
Run a query to see the nodes in the cluster:
SELECT * FROM system.runtime.nodes;
Run a query against the TPCH connector:
SELECT * FROM tpch.tiny.region;
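To see how Trino distributes work across the cluster, the plan for any query can be inspected with EXPLAIN; the DISTRIBUTED type shows the plan split into fragments:

EXPLAIN (TYPE DISTRIBUTED)
SELECT * FROM tpch.tiny.region;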