Presto — Introduction

Amit Singh Rathore
Dev Genius


Fast analytic queries against data at scale

Presto is an open-source distributed query engine that provides a unified SQL interface for accessing disparate data sources. The engine uses in-memory processing to avoid disk I/O overhead, and it leverages distributed computing: its distributed nature allows data to be pulled in parallel. Presto decouples storage from compute, allowing each to scale independently.

Features of Presto

  • Interactive & federated querying
  • Capability to combine different data sources in a single query
  • Unified SQL interface (ANSI SQL:2011), lower learning curve
  • Extensible plugins for the backend
  • Queries data where it lives; no need to ingest it
  • Runtime bytecode generation
  • Vectorized columnar processing
  • Optimized ORC and Parquet readers

What Presto is not

  • Presto is NOT related to Hadoop, YARN, or Spark
  • Presto is NOT a database
  • Presto is NOT designed for OLTP workloads

Use cases of Presto

  • Interactive querying
  • Reporting & Dashboards
  • Federated query across databases
  • Lakehouse

Components of Presto

Client

  • CLI (presto-cli executable JAR)
  • JDBC driver (presto-jdbc)
  • Python clients (presto-python-client, PyHive)
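
Under the hood, every client speaks Presto's HTTP protocol. The sketch below, using only the Python standard library, shows the shape of that protocol: the query text is POSTed to `/v1/statement` with `X-Presto-*` session headers, and result pages are followed via the `nextUri` link in each JSON response. All host, user, and catalog values are illustrative.

```python
# Minimal sketch of Presto's HTTP protocol using only the standard
# library. A query is POSTed to /v1/statement; the server replies with
# JSON pages that are followed via their `nextUri` links.
import json
import urllib.request

def build_request(host, port, user, catalog, schema, sql):
    """Build the initial /v1/statement request (nothing is sent yet)."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/statement",
        data=sql.encode("utf-8"),
        method="POST",
        headers={
            "X-Presto-User": user,        # identifies the caller (required)
            "X-Presto-Catalog": catalog,  # default catalog for the session
            "X-Presto-Schema": schema,    # default schema for the session
        },
    )

def run_query(host, port, user, catalog, schema, sql):
    """Execute a query and yield result rows, following nextUri pages."""
    req = build_request(host, port, user, catalog, schema, sql)
    while req is not None:
        with urllib.request.urlopen(req) as resp:
            page = json.load(resp)
        yield from page.get("data", [])
        next_uri = page.get("nextUri")
        req = urllib.request.Request(next_uri) if next_uri else None
```

In practice you would use presto-python-client or PyHive rather than raw HTTP; the sketch only makes the client/coordinator handshake concrete.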

Coordinator

  • Parses, analyzes, plans & schedules queries
  • Monitors workers via heartbeat
  • Hosts the web server / UI
  • Scheduler: the part of the coordinator responsible for handing tasks to the workers; it monitors that execution proceeds according to the plan the coordinator created

Discovery Service

  • Workers register themselves on startup; the coordinator uses the discovery service to find workers and other services

Worker

  • Responsible for parallelism
  • Executes tasks
  • Registers itself with the coordinator

Connectors

  • Drivers for a specific database or storage engine
  • Metadata API (Used by Parser/ Analyzer)
  • Data Location API (Used by the scheduler)
  • Data Stream API (Used by worker)
  • Data Stats API (Used by the coordinator)
  • Plugins are written in Java

Metastore

  • Hive Or Glue

Presto Web UI

  • Runs on port 8889 (EMR), 8080, or 8081
  • Cluster Overview & Query Details
  • Displays the live plan, stage performance, and other query stats

Admin Tools

  • prestoadmin (python package)
  • ambari-presto-service
  • presto-query-predictor
  • presto-query-analyzer
  • presto-yarn (YARN integration via Apache Slider)

Configuration Settings

presto
└── etc
    ├── catalog
    │   ├── hive.properties
    │   └── postgresql.properties
    ├── config.properties
    ├── jvm.config
    ├── log.properties
    └── node.properties
node.properties

node.environment
node.id
node.data-dir
catalog.config-dir
plugin.dir
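
An illustrative node.properties, with placeholder values; node.id must be unique per node and stable across restarts:

```ini
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
```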

log.properties

com.facebook.presto
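
For example, the following line sets the minimum log level for Presto's own classes:

```ini
com.facebook.presto=INFO
```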

jvm.config

-Xmx (up to 80% of the node’s memory)
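
An illustrative jvm.config, adapted from the example in the Presto deployment docs; the heap size should stay within the 80% guideline above:

```
-server
-Xmx16G
-XX:+UseG1GC
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
```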

config.properties

coordinator
node-scheduler.include-coordinator
http-server.http.port
discovery-server.enabled
discovery.uri
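
An illustrative config.properties for a dedicated coordinator (hostname and port are placeholders):

```ini
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://coordinator.example.net:8080
```

Workers set coordinator=false and point discovery.uri at the coordinator.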

postgresql.properties

connector.name
connection-url
connection-user
connection-password
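
An illustrative catalog file, etc/catalog/postgresql.properties; the connection values are placeholders:

```ini
connector.name=postgresql
connection-url=jdbc:postgresql://pg.example.net:5432/mydb
connection-user=presto
connection-password=secret
```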

Other configuration props

query.max-memory-per-node
query.max-memory
distributed-joins-disabled
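
Illustrative memory settings (starting values suggested in the Presto docs; tune per cluster):

```ini
query.max-memory=50GB
query.max-memory-per-node=1GB
```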

Query

Query steps in Presto:
  • A client sends a SQL query to the coordinator
  • First, the query is parsed and an AST is generated
  • The optimizer/analyzer fetches partition info using the Metadata API
  • From the AST a logical plan is generated
  • A plan is made up of one or more fragments
  • A fragment is realized as a physical stage
  • A stage is executed as tasks; a task has pipelines, which serve as templates for the execution drivers
  • Execution drivers operate on splits of data
  • A split is the unit of data actually processed
  • A driver contains operators
  • The scheduler load-balances and distributes work to worker processes
  • Workers scan files and data blocks from remote storage
  • Workers combine results and send them back to the client
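
The hierarchy in the steps above (plan → fragment/stage → task → split) can be sketched as a toy model. The class and function names are illustrative, not Presto's actual classes:

```python
# Toy model of Presto's execution hierarchy: a stage (one plan fragment)
# fans out into tasks on workers, and each task processes splits of data.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Split:          # a chunk of the underlying data
    location: str

@dataclass
class Task:           # runs on one worker, drives splits through operators
    worker: str
    splits: List[Split] = field(default_factory=list)

@dataclass
class Stage:          # physical realization of one plan fragment
    name: str
    tasks: List[Task] = field(default_factory=list)

def schedule(stage_name, workers, splits):
    """Round-robin splits across workers, one task per worker."""
    tasks = [Task(worker=w) for w in workers]
    for i, split in enumerate(splits):
        tasks[i % len(tasks)].splits.append(split)
    return Stage(name=stage_name, tasks=tasks)
```

For example, scheduling a scan stage of four splits over two workers yields two tasks with two splits each, which is the parallelism the real scheduler exploits.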

The pattern used to address data in Presto is shown below:

catalog.schema.table
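
The catalog selects a connector configuration (e.g. hive.properties), the schema a namespace within it, and the table the data set itself. A tiny illustrative helper (not part of any Presto client API) makes the pattern concrete:

```python
# Illustrative helpers for Presto's three-part naming: catalog.schema.table.
def qualify(catalog: str, schema: str, table: str) -> str:
    """Build a fully-qualified table reference."""
    return f"{catalog}.{schema}.{table}"

def parse_qualified(name: str):
    """Split catalog.schema.table back into its three parts."""
    catalog, schema, table = name.split(".")
    return catalog, schema, table
```

Because the catalog is part of the name, a single query can join, say, hive.web.clicks against postgresql.public.users; this is what enables federated querying.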

Caching

Presto stores intermediate data in its buffer cache while tasks run, but this is not a general-purpose caching layer. Third-party solutions such as Alluxio provide a multi-tiered caching layer for Presto.

Note: With RaptorX, hierarchical caching can be added to Presto.

Deployment option

Security

  • Authentication to Presto (none, Kerberos, LDAP)
  • Authentication to connectors (depends on the implementation)
  • Authorization in Presto
  • Authorization in the connector system (service user)
  • Support for Ranger & HTTPS
  • Future pipeline (authentication for the Presto web UI, Knox, Sentry)

Limitations

  • Not fault tolerant (the query must be restarted on failure)
  • Memory-constrained aggregations, joins, and windowing
  • Not full SQL support
  • The coordinator is a single point of failure

Things to consider

  • Run a background job to kill/clear long-running queries
  • Integrate scaling with Presto metrics using Presto’s JMX connector, the Presto REST API, or presto-metrics
  • For very high traffic, use multiple clusters behind a router/gateway. Check out the one from Lyft. Twitter leverages a Presto router with intelligent scheduling; similarly, Uber uses Prism to route traffic.

Managed Offerings:

  • Presto on EMR
  • AWS Athena
  • Starburst Presto
  • Ahana cloud for Presto

Companies Using Presto

  • Slack, Atlassian, Facebook, Airbnb, Netflix, Walmart, Uber, LinkedIn

Happy Data Querying!!
