trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Top Related Projects
Apache Spark - A unified analytics engine for large-scale data processing
Apache Hive
Apache Flink
Apache Drill is a distributed MPP query layer for self describing data
The official home of the Presto distributed SQL query engine for big data
Dremio - the missing link in modern data
Quick Overview
Trino (formerly PrestoSQL) is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. It's optimized for ad-hoc analysis and can process data from various sources including Hadoop, AWS S3, MySQL, and more.
Pros
- High performance and scalability for querying large datasets
- Supports a wide range of data sources and formats
- ANSI SQL compatibility for ease of use
- Active community and regular updates
Cons
- Complex setup and configuration for large-scale deployments
- Limited write capabilities compared to full-fledged databases
- Resource-intensive, requiring significant memory and CPU
- Steeper learning curve for advanced features and optimizations
Code Examples
- Simple query example:
SELECT * FROM hive.sales.transactions
WHERE amount > 1000
LIMIT 10;
This query selects all columns from the 'transactions' table in the 'sales' schema of the Hive catalog, filtering for amounts over 1000 and limiting the result to 10 rows.
- Joining data from different sources:
SELECT c.name, o.order_date, o.total_amount
FROM mysql.customers c
JOIN postgresql.orders o ON c.id = o.customer_id
WHERE o.order_date >= DATE '2023-01-01';
This query joins customer data from a MySQL database with order data from a PostgreSQL database, demonstrating Trino's ability to query across different data sources.
- Creating a view:
CREATE VIEW hive.analytics.high_value_customers AS
SELECT customer_id, SUM(total_amount) as total_spent
FROM postgresql.orders
GROUP BY customer_id
HAVING SUM(total_amount) > 10000;
This example creates a view in the Hive catalog that identifies high-value customers based on their total spend from the orders table in PostgreSQL.
Getting Started
To get started with Trino:
- Download Trino server from the official website.
- Extract the tarball and configure
etc/config.properties
andetc/catalog/*.properties
files. - Start the Trino server:
bin/launcher start
- Use the Trino CLI to run queries:
./trino --server localhost:8080 --catalog hive --schema default
- Run your first query:
SELECT * FROM hive.default.users LIMIT 10;
For more detailed setup instructions, refer to the official Trino documentation.
Competitor Comparisons
Apache Spark - A unified analytics engine for large-scale data processing
Pros of Spark
- More mature ecosystem with extensive libraries and integrations
- Better support for machine learning and graph processing
- Stronger community and wider adoption in industry
Cons of Spark
- Higher memory consumption and resource requirements
- Steeper learning curve, especially for complex use cases
- Less efficient for ad-hoc SQL queries on large datasets
Code Comparison
Trino SQL query:
SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
Spark DataFrame API (Scala):
val result = spark.sql("""
SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
""")
Both Trino and Spark support SQL queries, but Spark offers additional APIs for data processing. Trino focuses on SQL and is optimized for distributed query execution, while Spark provides a more comprehensive data processing framework with support for various programming languages and paradigms.
Trino excels at fast, interactive SQL queries across multiple data sources, making it ideal for data lake analytics. Spark, on the other hand, offers a more versatile platform for large-scale data processing, including batch and stream processing, machine learning, and graph computations.
Apache Hive
Pros of Hive
- Mature ecosystem with extensive documentation and community support
- Tightly integrated with Hadoop ecosystem for big data processing
- Supports a wide range of file formats and storage systems
Cons of Hive
- Slower query performance compared to Trino, especially for interactive queries
- Limited support for real-time data processing and streaming
- Less flexible when working with diverse data sources outside the Hadoop ecosystem
Code Comparison
Hive query example:
SELECT customer_id, SUM(order_total)
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY customer_id;
Trino query example:
SELECT customer_id, SUM(order_total)
FROM hive.sales.orders
WHERE order_date >= DATE '2023-01-01'
GROUP BY customer_id;
The syntax is similar, but Trino uses a more standardized SQL dialect and supports querying across different data sources using a catalog.schema.table notation.
Key Differences
- Performance: Trino is designed for fast, interactive queries, while Hive is better suited for batch processing
- Data sources: Trino can query multiple data sources simultaneously, Hive is primarily focused on Hadoop data
- Query engine: Trino uses a distributed query engine, while Hive translates queries into MapReduce jobs
- Real-time capabilities: Trino offers better support for real-time data analysis compared to Hive
Apache Flink
Pros of Flink
- Robust stream processing capabilities with low latency and high throughput
- Extensive ecosystem with support for various data sources and sinks
- Built-in support for event time processing and windowing operations
Cons of Flink
- Steeper learning curve due to complex API and concepts
- Limited support for ad-hoc querying compared to Trino's SQL-based approach
- Higher resource requirements for small-scale deployments
Code Comparison
Flink (Java):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
.flatMap(new Tokenizer())
.keyBy(value -> value.f0)
.sum(1);
counts.print();
Trino (SQL):
SELECT word, COUNT(*) as count
FROM (
SELECT UNNEST(SPLIT(LOWER(line), ' ')) AS word
FROM input_table
)
GROUP BY word
ORDER BY count DESC
Summary
Flink excels in stream processing and complex event processing scenarios, offering low-latency and high-throughput capabilities. It provides a rich ecosystem for various data sources and sinks. However, Flink has a steeper learning curve and may be overkill for simpler data processing tasks.
Trino, on the other hand, focuses on distributed SQL query execution across multiple data sources. It offers a more familiar SQL-based approach, making it easier for users with SQL backgrounds to get started. Trino is better suited for ad-hoc querying and data exploration tasks, while Flink shines in real-time stream processing applications.
Apache Drill is a distributed MPP query layer for self describing data
Pros of Drill
- Supports schema-free JSON data querying without predefined schemas
- Offers a distributed cache for improved query performance
- Provides built-in support for Hive metastore integration
Cons of Drill
- Less extensive connector ecosystem compared to Trino
- Slower development pace and community growth
- Limited support for complex data types and advanced SQL features
Code Comparison
Drill query example:
SELECT * FROM dfs.`/path/to/data/file.json` WHERE age > 30
Trino query example:
SELECT * FROM hive.default.users WHERE age > 30
Both systems use SQL-like syntax, but Drill can query JSON files directly without a predefined schema, while Trino typically requires a catalog and schema definition.
Key Differences
- Query Engine: Drill uses a vectorized execution engine, while Trino uses a distributed query engine
- Data Format Support: Drill excels at querying schema-less data, whereas Trino is better suited for structured data sources
- Ecosystem: Trino has a more extensive connector ecosystem and broader adoption in the industry
- Performance: Trino generally offers better performance for large-scale, complex queries across multiple data sources
The official home of the Presto distributed SQL query engine for big data
Pros of Presto
- Longer history and established ecosystem
- Wider adoption in enterprise environments
- More extensive documentation and community resources
Cons of Presto
- Slower release cycle and feature development
- Less active community engagement
- Limited support for newer technologies and cloud-native environments
Code Comparison
Presto:
SELECT
orderkey,
orderstatus,
totalprice,
orderdate
FROM
orders
WHERE
orderdate >= DATE '1994-01-01'
AND totalprice > 100000;
Trino:
SELECT
orderkey,
orderstatus,
totalprice,
orderdate
FROM
orders
WHERE
orderdate >= DATE '1994-01-01'
AND totalprice > 100000;
The SQL syntax for both Presto and Trino is nearly identical in this example, as Trino was forked from Presto. However, Trino has since added more features and optimizations that may not be reflected in basic SQL queries.
While the core functionality remains similar, Trino has evolved to offer improved performance, enhanced security features, and better integration with modern data architectures. Presto, on the other hand, maintains its position as a stable and widely-used solution in enterprise environments, with a focus on compatibility and established workflows.
Dremio - the missing link in modern data
Pros of Dremio
- Built-in data catalog and data lineage features
- Supports data virtualization and data lake architecture
- User-friendly interface for data exploration and visualization
Cons of Dremio
- Less extensive ecosystem and community support
- More limited query optimization capabilities
- Steeper learning curve for advanced features
Code Comparison
Trino query example:
SELECT customer_name, SUM(order_total)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customer_name
Dremio query example:
SELECT customer_name, SUM(order_total)
FROM "Sales"."Orders"
JOIN "Sales"."Customers" ON Orders.customer_id = Customers.id
GROUP BY customer_name
Both Trino and Dremio use SQL-like syntax for querying data. The main difference lies in how data sources are referenced. Trino typically uses catalog and schema names, while Dremio uses virtual datasets and spaces.
Trino offers more flexibility in connecting to various data sources, making it easier to query across different systems. Dremio, on the other hand, provides a more integrated approach to data virtualization and management within its platform.
Overall, Trino is better suited for large-scale distributed queries across diverse data sources, while Dremio excels in data lake management and self-service analytics scenarios.
Convert designs to code with AI
Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.
Try Visual CopilotREADME
Trino is a fast distributed SQL query engine for big data analytics.
See the User Manual for deployment instructions and end user documentation.
Development
Learn about development for all Trino organization projects:
Further information in the development section of the website includes different roles, like contributors, reviewers, and maintainers, related processes, and other aspects.
See the Trino developer guide for information about the SPI, implementing connectors and other plugins plugins, the client protocol, writing tests and other lower level details.
See DEVELOPMENT for information about code style, development process, and guidelines.
See CONTRIBUTING for contribution requirements.
Security
See the project security policy for information about reporting vulnerabilities.
Trino supports reproducible builds as of version 449.
Build requirements
- Mac OS X or Linux
- Java 23.0.0+, 64-bit
- Docker
- Turn SELinux or other systems disabling write access to the local checkout off, to allow containers to mount parts of the Trino source tree
Building Trino
Trino is a standard Maven project. Simply run the following command from the project root directory:
./mvnw clean install -DskipTests
On the first build, Maven downloads all the dependencies from the internet
and caches them in the local repository (~/.m2/repository
), which can take a
while, depending on your connection speed. Subsequent builds are faster.
Trino has a comprehensive set of tests that take a considerable amount of time to run, and are thus disabled by the above command. These tests are run by the CI system when you submit a pull request. We recommend only running tests locally for the areas of code that you change.
Running Trino in your IDE
Overview
After building Trino for the first time, you can load the project into your IDE
and run the server. We recommend using
IntelliJ IDEA. Because Trino is a standard
Maven project, you easily can import it into your IDE. In IntelliJ, choose
Open Project from the Quick Start box or choose Open
from the File menu and select the root pom.xml
file.
After opening the project in IntelliJ, double check that the Java SDK is properly configured for the project:
- Open the File menu and select Project Structure
- In the SDKs section, ensure that JDK 23 is selected (create one if none exist)
- In the Project section, ensure the Project language level is set to 23
Running a testing server
The simplest way to run Trino for development is to run the TpchQueryRunner
class. It will start a development version of the server that is configured with
the TPCH connector. You can then use the CLI to execute queries against this
server. Many other connectors have their own *QueryRunner
class that you can
use when working on a specific connector.
Running the full server
Trino comes with sample configuration that should work out-of-the-box for development. Use the following options to create a run configuration:
- Main Class:
io.trino.server.DevelopmentServer
- VM Options:
-ea -Dconfig=etc/config.properties -Dlog.levels-file=etc/log.properties -Djdk.attach.allowAttachSelf=true
- Working directory:
$MODULE_DIR$
- Use classpath of module:
trino-server-dev
The working directory should be the trino-server-dev
subdirectory. In
IntelliJ, using $MODULE_DIR$
accomplishes this automatically.
If VM options
doesn't exist in the dialog, you need to select Modify options
and enable Add VM options
.
To adjust which plugins are enabled for the development server, adjust the value of
plugin.bundles
in config.properties
. Each entry in this list must represent a plugin
specified by one of the following options:
- A path to a
pom.xml
or*.pom
file describing a Maven project that produces a plugin. - Maven coordinates, in the form
<groupId>:<artifactId>[:<extension>[:<classifier>]]:<version>
. The plugin will be loaded via Maven and therefore must be available in your local repository or a remote repository. - A path to a plugin directory containing JAR files. See Deploying a custom plugin for more details.
If you want to use a plugin in a catalog, you must add a corresponding
<catalog_name>.properties
file to testing/trino-server-dev/etc/catalog
.
Running the CLI
Start the CLI to connect to the server and run SQL queries:
client/trino-cli/target/trino-cli-*-executable.jar
Run a query to see the nodes in the cluster:
SELECT * FROM system.runtime.nodes;
Run a query against the TPCH connector:
SELECT * FROM tpch.tiny.region;
Top Related Projects
Apache Spark - A unified analytics engine for large-scale data processing
Apache Hive
Apache Flink
Apache Drill is a distributed MPP query layer for self describing data
The official home of the Presto distributed SQL query engine for big data
Dremio - the missing link in modern data
Convert designs to code with AI
Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.
Try Visual Copilot