Presto — Introduction
Fast analytic queries against data at scale
Presto is an open-source distributed query engine that provides a unified SQL interface for accessing disparate data sources. The engine uses in-memory computing to avoid disk I/O overhead, and its distributed architecture allows data to be pulled and processed in parallel. Presto decouples storage from compute, allowing each to scale independently.
Features of Presto
- Interactive & Federated
- Capability to combine different data sources
- Unified ANSI SQL interface (SQL:2011), so there is little learning curve
- Extensible plugins for the backend
- Query data where it lives; no need to ingest it first
- Runtime bytecode generation
- Vectorized columnar processing
- Optimized ORC and Parquet readers
What Presto is not
- Presto does NOT depend on Hadoop, YARN, or Spark
- Presto is NOT a database
- Presto is NOT designed for OLTP workloads
Use cases of Presto
- Interactive querying
- Reporting & Dashboards
- Federated query across databases
- Lakehouse
Components of Presto
Client
- CLI (presto-cli executable JAR)
- JDBC driver (presto-jdbc)
- Python clients (presto-python-client, PyHive)
Coordinator
- Parses, analyzes, plans & schedules queries
- Monitors workers via heartbeats
- Hosts the web server / UI
- Scheduler — part of the coordinator; it assigns tasks to the workers and monitors that they execute according to the plan the coordinator created.
Discovery Service
- Workers register themselves on startup; the coordinator uses the discovery service to locate workers and other services
Worker
- Responsible for parallelism
- Executes tasks
- Registers itself via the discovery service
Connectors
- Drivers for specific database/storage engine
- Metadata API (used by the parser/analyzer)
- Data Location API (used by the scheduler)
- Data Stream API (used by workers)
- Data Stats API (used by the coordinator)
- Plugins are written in Java
Metastore
- Hive or Glue
Presto Web UI
- Port 8889 on EMR; 8080 or 8081 otherwise
- Cluster Overview & Query Details
- Displays the live plan, stage performance, and other query stats
Admin Tools
- prestoadmin (python package)
- ambari-presto-service
- presto-query-predictor
- presto-query-analyzer
- presto-yarn integration via Apache Slider
Configuration Settings
presto
└── etc
    ├── catalog
    │   ├── hive.properties
    │   └── postgresql.properties
    ├── config.properties
    ├── jvm.config
    ├── log.properties
    └── node.properties
node.properties
node.environment
node.id
node.data-dir
catalog.config-dir
plugin.dir
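For example, a minimal node.properties might look like the following (the node id and data directory below are placeholders):

```
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
```

node.environment must be identical across the cluster, while node.id must be unique per node.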
log.properties
com.facebook.presto
jvm.config
-Xmx (up to 80% of the node's memory)
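For example, on a node with 20 GB of RAM, a jvm.config could allocate roughly 80% to the heap; the flags below are illustrative, not mandatory:

```
-server
-Xmx16G
-XX:+UseG1GC
-XX:+ExitOnOutOfMemoryError
```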
config.properties
coordinator
node-scheduler.include-coordinator
http-server.http.port
discovery-server.enabled
discovery.uri
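Putting these keys together, a coordinator's config.properties might look like this (the host and port are placeholders for your coordinator's address):

```
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://coordinator-host:8080
```

Setting node-scheduler.include-coordinator=false keeps query work off the coordinator, which is the usual choice for production clusters.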
postgresql.properties
connector.name
connection-url
connection-user
connection-password
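A catalog file such as etc/catalog/postgresql.properties combines these keys (the URL and credentials below are placeholders):

```
connector.name=postgresql
connection-url=jdbc:postgresql://db-host:5432/mydb
connection-user=presto
connection-password=secret
```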
Other configuration properties
query.max-memory-per-node
query.max-memory
distributed-joins-disabled
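These memory limits also live in config.properties; the values below are illustrative and should be tuned to the cluster's capacity:

```
query.max-memory=50GB
query.max-memory-per-node=1GB
```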
Query
- A client sends a SQL query to the coordinator
- First, the query is parsed and an AST is generated
- The optimizer/analyzer fetches table and partition info via the Metadata API
- From the AST a logical plan is generated
- A plan is made up of one or more fragments
- Each fragment is executed as a physical stage
- A task has pipelines, which act as templates for execution drivers
- Execution drivers operate on splits of data
- A split is a section of the underlying data
- A driver contains operators
- The scheduler load-balances and distributes tasks to worker processes
- Workers scan files and data blocks from remote storage
- Workers combine results and send them back to the client
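The plan/fragment/stage breakdown above can be inspected with EXPLAIN. A sketch, assuming a Hive catalog containing a hypothetical sales.orders table:

```sql
-- TYPE DISTRIBUTED shows the plan split into fragments,
-- one per physical stage; the table name is hypothetical.
EXPLAIN (TYPE DISTRIBUTED)
SELECT orderstatus, count(*) AS orders
FROM hive.sales.orders
GROUP BY orderstatus;
```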
The pattern for addressing data in Presto is:
catalog.schema.table
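Because every table is addressed this way, a single query can join tables from two catalogs. A sketch with hypothetical catalog, schema, and table names:

```sql
-- Join a Hive table with a PostgreSQL table in one federated query
SELECT u.name, count(*) AS clicks
FROM hive.web.clicks AS c
JOIN postgresql.public.users AS u
  ON c.user_id = u.id
GROUP BY u.name;
```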
Caching
Presto buffers intermediate data for the duration of a task, but this buffer is not a general caching layer. Third-party solutions such as Alluxio provide a multi-tiered caching layer for Presto.
Note: RaptorX adds hierarchical caching to Presto.
Deployment options
Security
- Authentication to Presto (none, Kerberos, LDAP)
- Authentication to the connector (depends on the implementation)
- Authorization in Presto
- Authorization in the connector system (service user)
- Support for Ranger & HTTPS
- Future pipeline (authentication for the Presto web UI, Knox, Sentry)
Limitations
- Not fault tolerant (a query must be restarted on failure)
- Memory-constrained aggregations, joins, and windowing
- Not full ANSI SQL support
- The coordinator is a single point of failure
Things to consider
- Run a background job to kill/clear long-running queries
- Integrate scaling with Presto metrics via Presto's JMX connector, the Presto REST API, or presto-metrics
- For very high traffic, put multiple clusters behind a router/gateway. Check the gateway from Lyft. Twitter uses a Presto router with intelligent scheduling; similarly, Uber uses Prism to route traffic.
Managed Offerings:
- Presto on EMR
- AWS Athena
- Starburst Presto
- Ahana cloud for Presto
Companies Using Presto
- Slack, Atlassian, Facebook, Airbnb, Netflix, Walmart, Uber, Linkedin
Happy Data Querying!!