Convert Figma logo to code with AI

trinodb logotrino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

10,324
2,970
10,324
2,404

Top Related Projects

40,184

Apache Spark - A unified analytics engine for large-scale data processing

5,524

Apache Hive

23,929

Apache Flink

1,938

Apache Drill is a distributed MPP query layer for self describing data

16,153

The official home of the Presto distributed SQL query engine for big data

Dremio - the missing link in modern data

Quick Overview

Trino (formerly PrestoSQL) is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. It's optimized for ad-hoc analysis and can process data from various sources including Hadoop, AWS S3, MySQL, and more.

Pros

  • High performance and scalability for querying large datasets
  • Supports a wide range of data sources and formats
  • ANSI SQL compatibility for ease of use
  • Active community and regular updates

Cons

  • Complex setup and configuration for large-scale deployments
  • Limited write capabilities compared to full-fledged databases
  • Resource-intensive, requiring significant memory and CPU
  • Steeper learning curve for advanced features and optimizations

Code Examples

  1. Simple query example:
SELECT * FROM hive.sales.transactions
WHERE amount > 1000
LIMIT 10;

This query selects all columns from the 'transactions' table in the 'sales' schema of the Hive catalog, filtering for amounts over 1000 and limiting the result to 10 rows.

  1. Joining data from different sources:
SELECT c.name, o.order_date, o.total_amount
FROM mysql.customers c
JOIN postgresql.orders o ON c.id = o.customer_id
WHERE o.order_date >= DATE '2023-01-01';

This query joins customer data from a MySQL database with order data from a PostgreSQL database, demonstrating Trino's ability to query across different data sources.

  1. Creating a view:
CREATE VIEW hive.analytics.high_value_customers AS
SELECT customer_id, SUM(total_amount) as total_spent
FROM postgresql.orders
GROUP BY customer_id
HAVING SUM(total_amount) > 10000;

This example creates a view in the Hive catalog that identifies high-value customers based on their total spend from the orders table in PostgreSQL.

Getting Started

To get started with Trino:

  1. Download Trino server from the official website.
  2. Extract the tarball and configure etc/config.properties and etc/catalog/*.properties files.
  3. Start the Trino server:
bin/launcher start
  1. Use the Trino CLI to run queries:
./trino --server localhost:8080 --catalog hive --schema default
  1. Run your first query:
SELECT * FROM hive.default.users LIMIT 10;

For more detailed setup instructions, refer to the official Trino documentation.

Competitor Comparisons

40,184

Apache Spark - A unified analytics engine for large-scale data processing

Pros of Spark

  • More mature ecosystem with extensive libraries and integrations
  • Better support for machine learning and graph processing
  • Stronger community and wider adoption in industry

Cons of Spark

  • Higher memory consumption and resource requirements
  • Steeper learning curve, especially for complex use cases
  • Less efficient for ad-hoc SQL queries on large datasets

Code Comparison

Trino SQL query:

SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name

Spark DataFrame API (Scala):

val result = spark.sql("""
  SELECT customer_name, SUM(order_total)
  FROM orders
  JOIN customers ON orders.customer_id = customers.id
  GROUP BY customer_name
""")

Both Trino and Spark support SQL queries, but Spark offers additional APIs for data processing. Trino focuses on SQL and is optimized for distributed query execution, while Spark provides a more comprehensive data processing framework with support for various programming languages and paradigms.

Trino excels at fast, interactive SQL queries across multiple data sources, making it ideal for data lake analytics. Spark, on the other hand, offers a more versatile platform for large-scale data processing, including batch and stream processing, machine learning, and graph computations.

5,524

Apache Hive

Pros of Hive

  • Mature ecosystem with extensive documentation and community support
  • Tightly integrated with Hadoop ecosystem for big data processing
  • Supports a wide range of file formats and storage systems

Cons of Hive

  • Slower query performance compared to Trino, especially for interactive queries
  • Limited support for real-time data processing and streaming
  • Less flexible when working with diverse data sources outside the Hadoop ecosystem

Code Comparison

Hive query example:

SELECT customer_id, SUM(order_total)
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY customer_id;

Trino query example:

SELECT customer_id, SUM(order_total)
FROM hive.sales.orders
WHERE order_date >= DATE '2023-01-01'
GROUP BY customer_id;

The syntax is similar, but Trino uses a more standardized SQL dialect and supports querying across different data sources using a catalog.schema.table notation.

Key Differences

  • Performance: Trino is designed for fast, interactive queries, while Hive is better suited for batch processing
  • Data sources: Trino can query multiple data sources simultaneously, Hive is primarily focused on Hadoop data
  • Query engine: Trino uses a distributed query engine, while Hive translates queries into MapReduce jobs
  • Real-time capabilities: Trino offers better support for real-time data analysis compared to Hive
23,929

Apache Flink

Pros of Flink

  • Robust stream processing capabilities with low latency and high throughput
  • Extensive ecosystem with support for various data sources and sinks
  • Built-in support for event time processing and windowing operations

Cons of Flink

  • Steeper learning curve due to complex API and concepts
  • Limited support for ad-hoc querying compared to Trino's SQL-based approach
  • Higher resource requirements for small-scale deployments

Code Comparison

Flink (Java):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())
    .keyBy(value -> value.f0)
    .sum(1);
counts.print();

Trino (SQL):

SELECT word, COUNT(*) as count
FROM (
  SELECT UNNEST(SPLIT(LOWER(line), ' ')) AS word
  FROM input_table
)
GROUP BY word
ORDER BY count DESC

Summary

Flink excels in stream processing and complex event processing scenarios, offering low-latency and high-throughput capabilities. It provides a rich ecosystem for various data sources and sinks. However, Flink has a steeper learning curve and may be overkill for simpler data processing tasks.

Trino, on the other hand, focuses on distributed SQL query execution across multiple data sources. It offers a more familiar SQL-based approach, making it easier for users with SQL backgrounds to get started. Trino is better suited for ad-hoc querying and data exploration tasks, while Flink shines in real-time stream processing applications.

1,938

Apache Drill is a distributed MPP query layer for self describing data

Pros of Drill

  • Supports schema-free JSON data querying without predefined schemas
  • Offers a distributed cache for improved query performance
  • Provides built-in support for Hive metastore integration

Cons of Drill

  • Less extensive connector ecosystem compared to Trino
  • Slower development pace and community growth
  • Limited support for complex data types and advanced SQL features

Code Comparison

Drill query example:

SELECT * FROM dfs.`/path/to/data/file.json` WHERE age > 30

Trino query example:

SELECT * FROM hive.default.users WHERE age > 30

Both systems use SQL-like syntax, but Drill can query JSON files directly without a predefined schema, while Trino typically requires a catalog and schema definition.

Key Differences

  • Query Engine: Drill uses a vectorized execution engine, while Trino uses a distributed query engine
  • Data Format Support: Drill excels at querying schema-less data, whereas Trino is better suited for structured data sources
  • Ecosystem: Trino has a more extensive connector ecosystem and broader adoption in the industry
  • Performance: Trino generally offers better performance for large-scale, complex queries across multiple data sources
16,153

The official home of the Presto distributed SQL query engine for big data

Pros of Presto

  • Longer history and established ecosystem
  • Wider adoption in enterprise environments
  • More extensive documentation and community resources

Cons of Presto

  • Slower release cycle and feature development
  • Less active community engagement
  • Limited support for newer technologies and cloud-native environments

Code Comparison

Presto:

SELECT
  orderkey,
  orderstatus,
  totalprice,
  orderdate
FROM
  orders
WHERE
  orderdate >= DATE '1994-01-01'
  AND totalprice > 100000;

Trino:

SELECT
  orderkey,
  orderstatus,
  totalprice,
  orderdate
FROM
  orders
WHERE
  orderdate >= DATE '1994-01-01'
  AND totalprice > 100000;

The SQL syntax for both Presto and Trino is nearly identical in this example, as Trino was forked from Presto. However, Trino has since added more features and optimizations that may not be reflected in basic SQL queries.

While the core functionality remains similar, Trino has evolved to offer improved performance, enhanced security features, and better integration with modern data architectures. Presto, on the other hand, maintains its position as a stable and widely-used solution in enterprise environments, with a focus on compatibility and established workflows.

Dremio - the missing link in modern data

Pros of Dremio

  • Built-in data catalog and data lineage features
  • Supports data virtualization and data lake architecture
  • User-friendly interface for data exploration and visualization

Cons of Dremio

  • Less extensive ecosystem and community support
  • More limited query optimization capabilities
  • Steeper learning curve for advanced features

Code Comparison

Trino query example:

SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name

Dremio query example:

SELECT customer_name, SUM(order_total)
FROM "Sales"."Orders"
JOIN "Sales"."Customers" ON Orders.customer_id = Customers.id
GROUP BY customer_name

Both Trino and Dremio use SQL-like syntax for querying data. The main difference lies in how data sources are referenced. Trino typically uses catalog and schema names, while Dremio uses virtual datasets and spaces.

Trino offers more flexibility in connecting to various data sources, making it easier to query across different systems. Dremio, on the other hand, provides a more integrated approach to data virtualization and management within its platform.

Overall, Trino is better suited for large-scale distributed queries across diverse data sources, while Dremio excels in data lake management and self-service analytics scenarios.

Convert Figma logo designs to code with AI

Visual Copilot

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

README

Trino Logo

Trino is a fast distributed SQL query engine for big data analytics.

See the User Manual for deployment instructions and end user documentation.

Trino download Reproducible builds supported Trino Slack Trino: The Definitive Guide book download

Development

Learn about development for all Trino organization projects:

Further information in the development section of the website includes different roles, like contributors, reviewers, and maintainers, related processes, and other aspects.

See the Trino developer guide for information about the SPI, implementing connectors and other plugins plugins, the client protocol, writing tests and other lower level details.

See DEVELOPMENT for information about code style, development process, and guidelines.

See CONTRIBUTING for contribution requirements.

Security

See the project security policy for information about reporting vulnerabilities.

Trino supports reproducible builds as of version 449.

Build requirements

  • Mac OS X or Linux
  • Java 23.0.0+, 64-bit
  • Docker
    • Turn SELinux or other systems disabling write access to the local checkout off, to allow containers to mount parts of the Trino source tree

Building Trino

Trino is a standard Maven project. Simply run the following command from the project root directory:

./mvnw clean install -DskipTests

On the first build, Maven downloads all the dependencies from the internet and caches them in the local repository (~/.m2/repository), which can take a while, depending on your connection speed. Subsequent builds are faster.

Trino has a comprehensive set of tests that take a considerable amount of time to run, and are thus disabled by the above command. These tests are run by the CI system when you submit a pull request. We recommend only running tests locally for the areas of code that you change.

Running Trino in your IDE

Overview

After building Trino for the first time, you can load the project into your IDE and run the server. We recommend using IntelliJ IDEA. Because Trino is a standard Maven project, you easily can import it into your IDE. In IntelliJ, choose Open Project from the Quick Start box or choose Open from the File menu and select the root pom.xml file.

After opening the project in IntelliJ, double check that the Java SDK is properly configured for the project:

  • Open the File menu and select Project Structure
  • In the SDKs section, ensure that JDK 23 is selected (create one if none exist)
  • In the Project section, ensure the Project language level is set to 23

Running a testing server

The simplest way to run Trino for development is to run the TpchQueryRunner class. It will start a development version of the server that is configured with the TPCH connector. You can then use the CLI to execute queries against this server. Many other connectors have their own *QueryRunner class that you can use when working on a specific connector.

Running the full server

Trino comes with sample configuration that should work out-of-the-box for development. Use the following options to create a run configuration:

  • Main Class: io.trino.server.DevelopmentServer
  • VM Options: -ea -Dconfig=etc/config.properties -Dlog.levels-file=etc/log.properties -Djdk.attach.allowAttachSelf=true
  • Working directory: $MODULE_DIR$
  • Use classpath of module: trino-server-dev

The working directory should be the trino-server-dev subdirectory. In IntelliJ, using $MODULE_DIR$ accomplishes this automatically.

If VM options doesn't exist in the dialog, you need to select Modify options and enable Add VM options.

To adjust which plugins are enabled for the development server, adjust the value of plugin.bundles in config.properties. Each entry in this list must represent a plugin specified by one of the following options:

  • A path to a pom.xml or *.pom file describing a Maven project that produces a plugin.
  • Maven coordinates, in the form <groupId>:<artifactId>[:<extension>[:<classifier>]]:<version>. The plugin will be loaded via Maven and therefore must be available in your local repository or a remote repository.
  • A path to a plugin directory containing JAR files. See Deploying a custom plugin for more details.

If you want to use a plugin in a catalog, you must add a corresponding <catalog_name>.properties file to testing/trino-server-dev/etc/catalog.

Running the CLI

Start the CLI to connect to the server and run SQL queries:

client/trino-cli/target/trino-cli-*-executable.jar

Run a query to see the nodes in the cluster:

SELECT * FROM system.runtime.nodes;

Run a query against the TPCH connector:

SELECT * FROM tpch.tiny.region;