presto
The official home of the Presto distributed SQL query engine for big data
Quick Overview
Presto is an open-source distributed SQL query engine designed for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. It was developed by Facebook and is now maintained by the Presto Foundation. Presto allows querying data where it lives, including Hive, Cassandra, relational databases, and proprietary data stores.
Pros
- High performance and low latency for large-scale data processing
- Supports a wide variety of data sources and formats
- Highly scalable and can handle petabytes of data
- ANSI SQL compatible, making it easy for SQL users to adapt
Cons
- Requires significant system resources for optimal performance
- Can be complex to set up and configure for beginners
- Limited support for real-time data processing
- May not be suitable for small-scale data operations
Code Examples
- Simple SELECT query:
SELECT name, age
FROM users
WHERE country = 'USA'
LIMIT 10;
- Joining tables from different data sources (Presto table references are catalog.schema.table):
SELECT u.name, o.order_date, o.total_amount
FROM mysql.mydb.users u
JOIN hive.default.orders o ON u.id = o.user_id
WHERE o.order_date >= DATE '2023-01-01';
- Using window functions (the alias avoids shadowing the built-in rank() function):
SELECT
    product_name,
    category,
    sales_amount,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales_amount DESC) AS sales_rank
FROM sales
WHERE year = 2023;
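- Approximate aggregation, a Presto strength for interactive analytics (a sketch using Presto's built-in approx_distinct; the page_views table is hypothetical):
SELECT
    country,
    approx_distinct(user_id) AS approx_unique_users -- approximate COUNT(DISTINCT user_id)
FROM page_views -- hypothetical table
GROUP BY country;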
Getting Started
To get started with Presto:
- Download Presto from the official website or use a package manager.
- Configure etc/config.properties with your coordinator and worker settings.
- Set up catalog properties in etc/catalog/.
- Start the Presto server: bin/launcher run
- Use the Presto CLI to run queries: ./presto --server localhost:8080 --catalog hive --schema default
- Or connect using a JDBC driver in your application:
import java.sql.Connection;
import java.sql.DriverManager;
String url = "jdbc:presto://localhost:8080/hive/default";
Connection connection = DriverManager.getConnection(url, "username", null);
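Once connected (via the CLI or JDBC), you can explore the configured data sources with standard Presto statements before running analytic queries:
SHOW CATALOGS;
SHOW SCHEMAS FROM hive;
SHOW TABLES FROM hive.default;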
Competitor Comparisons
Apache Spark - A unified analytics engine for large-scale data processing
Pros of Spark
- More versatile, supporting batch processing, stream processing, machine learning, and graph processing
- Better performance for iterative algorithms and machine learning tasks
- Larger and more active community, with more frequent updates and contributions
Cons of Spark
- Higher memory consumption, especially for large datasets
- Steeper learning curve due to its broader feature set
- Less optimized for interactive SQL queries compared to Presto
Code Comparison
Presto SQL query:
SELECT customer_name, SUM(order_total)
FROM orders
GROUP BY customer_name
HAVING SUM(order_total) > 1000
Equivalent Spark query (using the PySpark DataFrame API):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("OrderAnalysis").getOrCreate()
df = spark.read.table("orders")
# Alias the aggregate so the filter can reference it by name
result = (df.groupBy("customer_name")
          .agg(F.sum("order_total").alias("total_order_sum"))
          .filter(F.col("total_order_sum") > 1000))
result.show()
Both Presto and Spark support SQL queries, but Spark offers additional APIs for data processing in various programming languages. Presto is more focused on SQL-based analytics, while Spark provides a broader set of data processing capabilities.
Apache Hive
Pros of Hive
- Mature ecosystem with extensive documentation and community support
- Tightly integrated with Hadoop ecosystem for big data processing
- Supports a wide range of file formats and storage systems
Cons of Hive
- Generally slower query performance compared to Presto
- Limited support for real-time queries and interactive analytics
- More complex setup and configuration process
Code Comparison
Hive query example:
SELECT customer_id, SUM(order_total) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(order_total) > 1000;
Presto query example:
SELECT customer_id, SUM(order_total) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(order_total) > 1000;
The SQL syntax for both Hive and Presto is similar in this case, but Presto typically executes queries faster, especially for interactive analytics. Hive is better suited for batch processing and ETL workloads, while Presto excels at ad-hoc queries and real-time analytics. Presto also offers better support for complex queries and joins across multiple data sources, making it more versatile for modern data analytics needs.
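To illustrate the multi-source point, here is a sketch of a federated join that Presto can run in a single statement but Hive alone cannot (the mysql and hive catalog, schema, and table names are hypothetical):
SELECT c.customer_name, SUM(l.bytes_served) AS total_bytes
FROM hive.web.request_logs l   -- hypothetical Hive table
JOIN mysql.crm.customers c     -- hypothetical MySQL table
  ON l.customer_id = c.customer_id
GROUP BY c.customer_name;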
Apache Flink
Pros of Flink
- Robust stream processing capabilities with low latency and high throughput
- Supports both batch and stream processing in a unified framework
- Offers built-in state management and fault tolerance features
Cons of Flink
- Steeper learning curve compared to Presto, especially for complex stream processing
- Less optimized for ad-hoc querying and interactive analytics
- Smaller ecosystem of connectors and integrations
Code Comparison
Flink (Stream Processing):
// Count words from a Kafka topic; Tokenizer emits (word, 1) tuples
DataStream<String> stream = env.addSource(new FlinkKafkaConsumer<>("topic", new SimpleStringSchema(), properties));
stream.flatMap(new Tokenizer())
    .keyBy(value -> value.f0)
    .sum(1)
    .print();
Presto (SQL Query):
SELECT customer_id, SUM(order_total)
FROM orders
WHERE order_date >= DATE '2023-01-01'
GROUP BY customer_id
HAVING SUM(order_total) > 1000;
Summary
Flink excels in stream processing and offers a unified approach to batch and stream data processing. It provides robust state management and fault tolerance but has a steeper learning curve. Presto, on the other hand, is optimized for ad-hoc querying and interactive analytics, with a more straightforward SQL-based approach. The choice between the two depends on the specific use case and requirements of the project.
Apache Impala
Pros of Impala
- Better performance for small to medium-sized queries due to its MPP architecture
- Tighter integration with Hadoop ecosystem, especially for Cloudera users
- Lower latency for interactive queries, making it suitable for ad-hoc analysis
Cons of Impala
- Limited scalability for very large datasets compared to Presto
- Less flexibility in terms of supported data sources and connectors
- Smaller community and ecosystem compared to Presto's wide adoption
Code Comparison
Impala SQL query:
SELECT customer_id, SUM(order_total) AS total_sales
FROM orders
WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY customer_id
HAVING total_sales > 1000
ORDER BY total_sales DESC
LIMIT 10;
Presto SQL query:
SELECT customer_id, SUM(order_total) AS total_sales
FROM orders
WHERE order_date BETWEEN DATE '2023-01-01' AND DATE '2023-12-31'
GROUP BY customer_id
HAVING SUM(order_total) > 1000
ORDER BY total_sales DESC
LIMIT 10;
The queries are very similar, with minor syntax differences. Presto uses the DATE keyword for date literals, while Impala does not require it. Both support standard SQL syntax for complex analytical queries.
Apache Drill - A distributed MPP query layer for self-describing data
Pros of Drill
- Supports a wider range of data sources, including NoSQL databases and file systems
- Offers schema-free querying, allowing for more flexible data exploration
- Provides better performance for ad-hoc queries on semi-structured data
Cons of Drill
- Less mature ecosystem and community support compared to Presto
- Limited optimization for complex, multi-join queries on large datasets
- Fewer connectors available for enterprise data warehouses and cloud services
Code Comparison
Presto SQL query:
SELECT customer_id, SUM(order_total)
FROM orders
WHERE order_date >= DATE '2023-01-01'
GROUP BY customer_id
HAVING SUM(order_total) > 1000;
Drill SQL query:
SELECT customer_id, SUM(CAST(order_total AS DECIMAL(10,2)))
FROM dfs.`/path/to/orders.json`
WHERE CAST(order_date AS DATE) >= DATE '2023-01-01'
GROUP BY customer_id
HAVING SUM(CAST(order_total AS DECIMAL(10,2))) > 1000;
The main difference in these queries is that Drill requires explicit casting for certain data types, especially when working with semi-structured data sources like JSON files. Presto, on the other hand, can often infer data types more easily from structured data sources.
Dremio - the missing link in modern data
Pros of Dremio
- Offers a self-service semantic layer for data lakes
- Provides advanced data acceleration and caching capabilities
- Supports a wider range of data sources, including cloud object storage
Cons of Dremio
- Less mature and smaller community compared to Presto
- More complex setup and configuration process
- Limited support for certain advanced SQL features
Code Comparison
Dremio query example:
SELECT * FROM "S3"."my_bucket"."my_data.parquet"
WHERE column1 > 100
ORDER BY column2 DESC
LIMIT 10;
Presto query example:
SELECT * FROM hive.my_schema.my_table
WHERE column1 > 100
ORDER BY column2 DESC
LIMIT 10;
Both Dremio and Presto use SQL-like syntax for querying data, but Dremio's approach allows for direct querying of files in object storage without the need for a separate metadata catalog like Hive. Presto, on the other hand, typically relies on external catalogs for metadata management.
Dremio focuses on providing a unified semantic layer and data virtualization, while Presto is primarily designed for distributed query processing across various data sources. The choice between the two depends on specific use cases, existing infrastructure, and required features.
README
Presto
Presto is a distributed SQL query engine for big data.
See the Presto installation documentation for deployment instructions.
See the Presto documentation for general documentation.
Mission and Architecture
See PrestoDB: Mission and Architecture.
Requirements
- Mac OS X or Linux
- Java 8 Update 151 or higher (8u151+), 64-bit. Both Oracle JDK and OpenJDK are supported.
- Maven 3.6.1+ (for building)
- Python 2.4+ (for running with the launcher script)
Building Presto
Overview (Java)
Presto is a standard Maven project. Simply run the following command from the project root directory:
./mvnw clean install
On the first build, Maven will download all the dependencies from the internet and cache them in the local repository (~/.m2/repository), which can take a considerable amount of time. Subsequent builds will be faster.
Presto has a comprehensive set of unit tests that can take several minutes to run. You can disable the tests when building:
./mvnw clean install -DskipTests
After building Presto for the first time, you can load the project into your IDE and run the server. We recommend using IntelliJ IDEA. Because Presto is a standard Maven project, you can import it into your IDE using the root pom.xml file. In IntelliJ, choose Open Project from the Quick Start box or choose Open from the File menu and select the root pom.xml file.
After opening the project in IntelliJ, double check that the Java SDK is properly configured for the project:
- Open the File menu and select Project Structure
- In the SDKs section, ensure that a 1.8 JDK is selected (create one if none exist)
- In the Project section, ensure the Project language level is set to 8.0 as Presto makes use of several Java 8 language features
Presto comes with sample configuration that should work out-of-the-box for development. Use the following options to create a run configuration:
- Main Class: com.facebook.presto.server.PrestoServer
- VM Options: -ea -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -Xmx2G -Dconfig=etc/config.properties -Dlog.levels-file=etc/log.properties
- Working directory: $MODULE_WORKING_DIR$ or $MODULE_DIR$ (depending on your version of IntelliJ)
- Use classpath of module: presto-main

The working directory should be the presto-main subdirectory. In IntelliJ, using $MODULE_DIR$ accomplishes this automatically.
Additionally, the Hive plugin must be configured with the location of your Hive metastore Thrift service. Add the following to the list of VM options, replacing localhost:9083 with the correct host and port (or use the value below if you do not have a Hive metastore):
-Dhive.metastore.uri=thrift://localhost:9083
Using SOCKS for Hive or HDFS
If your Hive metastore or HDFS cluster is not directly accessible from your local machine, you can use SSH port forwarding to access it. Set up a dynamic SOCKS proxy with SSH listening on local port 1080:
ssh -v -N -D 1080 server
Then add the following to the list of VM options:
-Dhive.metastore.thrift.client.socks-proxy=localhost:1080
Running the CLI
Start the CLI to connect to the server and run SQL queries:
presto-cli/target/presto-cli-*-executable.jar
Run a query to see the nodes in the cluster:
SELECT * FROM system.runtime.nodes;
In the sample configuration, the Hive connector is mounted in the hive catalog, so you can run the following query to show the tables in the Hive database default:
SHOW TABLES FROM hive.default;
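From there you can inspect and query tables directly; for example, assuming a hypothetical orders table exists in the default schema:
DESCRIBE hive.default.orders;             -- hypothetical table
SELECT COUNT(*) FROM hive.default.orders;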
Building the Documentation
To build the Presto docs, see the docs README.
Building the Presto Console
The Presto Console is composed of several React components and is written in JSX and ES6. This source code is stored in the presto-ui/ module. The compilation process generates browser-compatible JavaScript, which is added as JAR resources during the Maven build. When the resource JAR is included on the classpath of the Presto coordinator, it will be able to serve the resources.
None of the Java code relies on the Presto UI project being compiled, so it is possible to exclude the UI when building Presto. Add the property -DskipUI to the Maven command to disable building the ui Maven module:
./mvnw clean install -DskipUI
You must have Node.js and Yarn installed to build the UI. When using Maven to build the project, Node and Yarn are installed in the presto-ui/target folder. Add the node and yarn executables to the PATH environment variable.
To update Presto Console after making changes, run:
yarn --cwd presto-ui/src install
If no JavaScript dependencies have changed (i.e., no changes to package.json), it is faster to run:
yarn --cwd presto-ui/src run package
To simplify iteration, you can also run in watch mode, which automatically re-compiles when changes to source files are detected:
yarn --cwd presto-ui/src run watch
To iterate quickly, simply re-build the project in IntelliJ after packaging is complete. Project resources will be hot-reloaded and changes are reflected on browser refresh.
Presto native and Velox
Presto native is a C++ rewrite of the Presto worker. It uses Velox as its primary engine to run Presto workloads.
Velox is a C++ database library which provides reusable, extensible, and high-performance data processing components.
Check out building instructions to get started.
Contributing!
Please refer to the contribution guidelines to get started.
Questions?
Please join our Slack channel and ask in #dev.
License
By contributing to Presto, you agree that your contributions will be licensed under the Apache License Version 2.0 (APLv2).