Presto — Introduction

Amit Singh Rathore
Dev Genius


Fast analytic queries against data at scale

Presto is an open-source distributed query engine that provides a unified SQL interface for accessing disparate data sources. The engine uses in-memory processing to avoid disk I/O overhead, and it leverages distributed computing: its distributed nature allows data to be pulled in parallel. Presto decouples storage from compute, allowing each to scale independently.

Features of Presto

  • Interactive & federated querying
  • Capability to combine different data sources in a single query
  • Unified SQL interface (ANSI SQL:2011), lower learning curve
  • Extensible plugins for the backend
  • Queries data where it lives; no need to ingest it
  • Runtime bytecode generation
  • Vectorized columnar processing
  • Optimized ORC and Parquet readers

What Presto is not

  • Presto is NOT related to Hadoop, YARN, or Spark
  • Presto is NOT a database
  • Presto is NOT designed for OLTP workloads

Use cases of Presto

  • Interactive querying
  • Reporting & Dashboards
  • Federated query across databases
  • Lakehouse

Components of Presto

Client

  • CLI (presto-cli executable JAR)
  • JDBC driver (presto-jdbc)
  • Python clients (presto-python-client, PyHive)
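
Under the hood, every client speaks Presto's HTTP protocol. The sketch below, using only the Python standard library, shows the shape of that protocol: the query text is POSTed to `/v1/statement` with `X-Presto-*` session headers, and result pages are followed via the `nextUri` link in each JSON response. All host, user, and catalog values are illustrative.

```python
# Minimal sketch of Presto's HTTP protocol using only the standard
# library. A query is POSTed to /v1/statement; the server replies with
# JSON pages that are followed via their `nextUri` links.
import json
import urllib.request

def build_request(host, port, user, catalog, schema, sql):
    """Build the initial /v1/statement request (nothing is sent yet)."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/statement",
        data=sql.encode("utf-8"),
        method="POST",
        headers={
            "X-Presto-User": user,        # identifies the caller (required)
            "X-Presto-Catalog": catalog,  # default catalog for the session
            "X-Presto-Schema": schema,    # default schema for the session
        },
    )

def run_query(host, port, user, catalog, schema, sql):
    """Execute a query and yield result rows, following nextUri pages."""
    req = build_request(host, port, user, catalog, schema, sql)
    while req is not None:
        with urllib.request.urlopen(req) as resp:
            page = json.load(resp)
        yield from page.get("data", [])
        next_uri = page.get("nextUri")
        req = urllib.request.Request(next_uri) if next_uri else None
```

In practice you would use presto-python-client or PyHive rather than raw HTTP; the sketch only makes the client/coordinator handshake concrete.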

Coordinator

  • Parses, analyzes, plans & schedules queries
  • Monitors workers via heartbeat
  • Hosts the web server / UI
  • Scheduler: the part of the coordinator responsible for handing tasks to the workers; it monitors that execution proceeds according to the plan the coordinator created

Discovery Service

  • Workers register themselves on startup; the coordinator uses the discovery service to find workers and other services

Worker

  • Responsible for parallelism
  • Executes tasks
  • Registers itself with the coordinator

Connectors

  • Drivers for a specific database or storage engine
  • Metadata API (Used by Parser/ Analyzer)
  • Data Location API (Used by the scheduler)
  • Data Stream API (Used by worker)
  • Data Stats API (Used by the coordinator)
  • Plugins are written in Java

Metastore

  • Hive Or Glue

Presto Web UI

  • Runs on port 8889 (EMR), 8080, or 8081
  • Cluster Overview & Query Details
  • Displays the live plan, stage performance, and other query stats

Admin Tools

  • prestoadmin (python package)
  • ambari-presto-service
  • presto-query-predictor
  • presto-query-analyzer
  • presto-yarn (YARN integration via Apache Slider)

Configuration Settings

presto
└── etc
    ├── catalog
    │   ├── hive.properties
    │   └── postgresql.properties
    ├── config.properties
    ├── jvm.config
    ├── log.properties
    └── node.properties
node.properties

node.environment
node.id
node.data-dir
catalog.config-dir
plugin.dir
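
An illustrative node.properties, with placeholder values; node.id must be unique per node and stable across restarts:

```ini
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
```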

log.properties

com.facebook.presto
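
For example, the following line sets the minimum log level for Presto's own classes:

```ini
com.facebook.presto=INFO
```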

jvm.config

-Xmx (up to 80% of the node’s memory)
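
An illustrative jvm.config, adapted from the example in the Presto deployment docs; the heap size should stay within the 80% guideline above:

```
-server
-Xmx16G
-XX:+UseG1GC
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
```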

config.properties

coordinator
node-scheduler.include-coordinator
http-server.http.port
discovery-server.enabled
discovery.uri
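
An illustrative config.properties for a dedicated coordinator (hostname and port are placeholders):

```ini
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://coordinator.example.net:8080
```

Workers set coordinator=false and point discovery.uri at the coordinator.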

postgresql.properties

connector.name
connection-url
connection-user
connection-password
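
An illustrative catalog file, etc/catalog/postgresql.properties; the connection values are placeholders:

```ini
connector.name=postgresql
connection-url=jdbc:postgresql://pg.example.net:5432/mydb
connection-user=presto
connection-password=secret
```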

Other configuration props

query.max-memory-per-node
query.max-memory
distributed-joins-disabled
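
Illustrative memory settings (starting values suggested in the Presto docs; tune per cluster):

```ini
query.max-memory=50GB
query.max-memory-per-node=1GB
```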

Query

Query steps in Presto:
  • A client sends a SQL query to the coordinator
  • First, the query is parsed and an AST is generated
  • The optimizer/analyzer fetches partition info using the Metadata API
  • From the AST a logical plan is generated
  • A plan is made up of one or more fragments
  • A fragment is realized as a physical stage
  • A stage is executed as tasks; a task has pipelines, which serve as templates for the execution drivers
  • Execution drivers operate on splits of data
  • A split is the unit of data actually processed
  • A driver contains operators
  • The scheduler load-balances and distributes work to worker processes
  • Workers scan files and data blocks from remote storage
  • Workers combine results and send them back to the client
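
The hierarchy in the steps above (plan → fragment/stage → task → split) can be sketched as a toy model. The class and function names are illustrative, not Presto's actual classes:

```python
# Toy model of Presto's execution hierarchy: a stage (one plan fragment)
# fans out into tasks on workers, and each task processes splits of data.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Split:          # a chunk of the underlying data
    location: str

@dataclass
class Task:           # runs on one worker, drives splits through operators
    worker: str
    splits: List[Split] = field(default_factory=list)

@dataclass
class Stage:          # physical realization of one plan fragment
    name: str
    tasks: List[Task] = field(default_factory=list)

def schedule(stage_name, workers, splits):
    """Round-robin splits across workers, one task per worker."""
    tasks = [Task(worker=w) for w in workers]
    for i, split in enumerate(splits):
        tasks[i % len(tasks)].splits.append(split)
    return Stage(name=stage_name, tasks=tasks)
```

For example, scheduling a scan stage of four splits over two workers yields two tasks with two splits each, which is the parallelism the real scheduler exploits.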

The pattern used to address data in Presto is shown below:

catalog.schema.table
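
The catalog selects a connector configuration (e.g. hive.properties), the schema a namespace within it, and the table the data set itself. A tiny illustrative helper (not part of any Presto client API) makes the pattern concrete:

```python
# Illustrative helpers for Presto's three-part naming: catalog.schema.table.
def qualify(catalog: str, schema: str, table: str) -> str:
    """Build a fully-qualified table reference."""
    return f"{catalog}.{schema}.{table}"

def parse_qualified(name: str):
    """Split catalog.schema.table back into its three parts."""
    catalog, schema, table = name.split(".")
    return catalog, schema, table
```

Because the catalog is part of the name, a single query can join, say, hive.web.clicks against postgresql.public.users; this is what enables federated querying.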

Caching

Presto stores intermediate data in its buffer cache while tasks run, but this is not a general-purpose caching layer. Third-party solutions such as Alluxio provide a multi-tiered caching layer for Presto.

Note: With RaptorX, hierarchical caching can be added to Presto.

Deployment option

Security

  • Authentication to Presto (none, Kerberos, LDAP)
  • Authentication to connectors (depends on the implementation)
  • Authorization in Presto
  • Authorization in the connector system (service user)
  • Support for Ranger & HTTPS
  • Future pipeline (authentication for the Presto web UI, Knox, Sentry)

Limitations

  • Not fault tolerant (the query must be restarted on failure)
  • Memory-constrained aggregations, joins, and windowing
  • Not full SQL support
  • The coordinator is a single point of failure

Things to consider

  • Run a background job to kill/clear long-running queries
  • Integrate scaling with Presto metrics using Presto’s JMX connector, the Presto REST API, or presto-metrics
  • For very high traffic, use multiple clusters behind a router/gateway. Check out the one from Lyft. Twitter leverages a Presto router with intelligent scheduling; similarly, Uber uses Prism to route traffic.

Managed Offerings:

  • Presto on EMR
  • AWS Athena
  • Starburst Presto
  • Ahana cloud for Presto

Companies Using Presto

  • Slack, Atlassian, Facebook, Airbnb, Netflix, Walmart, Uber, LinkedIn

Happy Data Querying!!
