
Serverless Options for a Data Platform in AWS

Amit Singh Rathore
Published in Dev Genius
5 min read, Sep 26, 2022


Reduce OpEx for your data platform by going serverless

Serverless doesn’t mean there are no servers; it means you don’t manage them.

When we say serverless, we generally mean that server management is offloaded to a third party, nowadays mostly a cloud provider. Since we are not managing servers, we eliminate that operational overhead and can reduce cost. Data platforms are often a major cost driver: on average, data platform cost is ~40% of the overall infra cost for medium to large companies. Going serverless can reduce costs in the initial phases of a platform, when the workload is not yet predictable or consistent. Even in the stable phases, we can offload ad hoc data workloads to a serverless platform. A serverless solution also has a shorter time to market than traditional “big-data” platforms.

A data platform consists of many components. The major ones are:

Integration
Storage
Catalog
Transformation
Analysis

A serverless data platform is a collection of serverless services that operationalize the components of a data platform.

In this blog, we will look at the serverless offerings of AWS in the data & analytics segment across these components.

Data Integration

Glue ETL / Studio / Data Catalog

AWS Glue ETL is a serverless data integration service that lets us run ETL jobs (Spark code), for example as soon as data lands in the data lake.

AWS Glue Studio makes it even easier by enabling us to visually create, run, and monitor AWS Glue ETL jobs. It is a low-code version of Glue ETL: we use a drag-and-drop editor and AWS Glue automatically generates the corresponding ETL code.

AWS Glue Data Catalog is a managed metadata repository that lets us collect metadata about our data and build search and governance features on top of it.
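One common way to get the "run as soon as data lands" behaviour is an S3 event notification that invokes a Lambda function, which in turn starts the Glue job. The sketch below is only an illustration under assumptions: the job name raw-to-parquet and the --input_path argument are hypothetical, and the Lambda is assumed to have permission to call Glue.

```python
from urllib.parse import unquote_plus


def glue_requests_from_s3_event(event, job_name):
    """Build one Glue start_job_run request per object in an S3 event."""
    requests = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        requests.append({
            "JobName": job_name,
            # Hypothetical job argument; the Glue script decides how to read it.
            "Arguments": {"--input_path": f"s3://{bucket}/{key}"},
        })
    return requests


def lambda_handler(event, context):
    # Imported lazily so the helper above can be exercised without AWS access.
    import boto3

    glue = boto3.client("glue")
    run_ids = []
    # "raw-to-parquet" is a hypothetical job name used for illustration.
    for request in glue_requests_from_s3_event(event, "raw-to-parquet"):
        run_ids.append(glue.start_job_run(**request)["JobRunId"])
    return {"started": run_ids}
```

Keeping the request building separate from the boto3 call makes the mapping from S3 event to job arguments easy to check locally.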

Data Lake

AWS Lake Formation is a fully managed service that makes it easy to build, secure, and manage data lakes. A data lake on AWS is built on serverless object storage (Amazon S3), which takes care of the availability and redundancy of the data. On top of it, Lake Formation provides its own permissions model that augments the IAM permissions model. It uses the Glue Data Catalog for metadata collection and search, and it can use Glue ETL for transformations such as storing data in columnar formats (Parquet, ORC) for better query performance.
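To make the permissions model a little more concrete, here is a minimal sketch of a table-level SELECT grant expressed as the arguments to the Lake Formation grant_permissions call in boto3. The role ARN, database, and table names are placeholders, not values from any real setup.

```python
def table_select_grant(principal_arn, database, table):
    """Arguments for lakeformation.grant_permissions: SELECT on one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }


def apply_grant(grant):
    # Imported lazily; requires AWS credentials with Lake Formation admin rights.
    import boto3

    boto3.client("lakeformation").grant_permissions(**grant)


# Placeholder identifiers for illustration only.
example_grant = table_select_grant(
    "arn:aws:iam::123456789012:role/analyst", "sales_db", "orders"
)
```

The grant lives next to the catalog table rather than on the S3 path, which is what lets Lake Formation layer on top of IAM.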

Distributed Computation

EMR Serverless

Amazon EMR Serverless is an offering in Amazon EMR that makes it easy and cost-effective for us to run petabyte-scale applications built using Apache Spark and Apache Hive, without having to tune, operate, secure, optimize, or manage clusters. With EMR Serverless, customers no longer have to worry about over- or under-provisioning the resources needed to run their applications. EMR Serverless automatically provisions and scales the compute and memory resources required by the applications up or down, and customers only pay for the resources they use.
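As an illustration, submitting a Spark job to an existing EMR Serverless application could look roughly like this with boto3. The application ID, role ARN, and script location are placeholders, and the application is assumed to exist already.

```python
def spark_job_request(application_id, role_arn, script_uri, args=()):
    """Arguments for the emr-serverless start_job_run call (Spark driver)."""
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,
                "entryPointArguments": list(args),
            }
        },
    }


def submit(request):
    # Imported lazily; needs AWS credentials and an existing application.
    import boto3

    client = boto3.client("emr-serverless")
    return client.start_job_run(**request)["jobRunId"]
```

Note that nothing in the request describes cluster size; capacity is left to the service, which is the point of the serverless option.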

EKS With Fargate

AWS Fargate is a serverless compute engine for containers; on EKS it provides on-demand, right-sized compute capacity for Kubernetes pods. With AWS Fargate, we don’t have to provision, configure, or scale groups of virtual machines on our own to run containers. We also don’t need to choose server types, decide when to scale our node groups, or optimize cluster packing. With the latest release, we can launch containers as large as 16 vCPU and 120 GB RAM. Fargate pricing is based on the vCPU and memory resources requested for the pod.
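Since the bill follows the requested vCPU and memory, a back-of-the-envelope estimate is simple arithmetic. The sketch below takes the hourly rates as parameters because they vary by region and change over time; the numbers in the example call are made up, not AWS prices.

```python
def fargate_pod_cost(vcpu, memory_gb, hours, vcpu_hour_rate, gb_hour_rate):
    """Rough cost of one pod: requested resources times runtime times rate."""
    return hours * (vcpu * vcpu_hour_rate + memory_gb * gb_hour_rate)


# Made-up rates purely for illustration: 2 vCPU, 4 GB, running for 10 hours.
estimate = fargate_pod_cost(2, 4, 10, vcpu_hour_rate=0.04, gb_hour_rate=0.004)
```

This ignores anything beyond vCPU and memory, so treat the result as a lower bound rather than a quote.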

SQL-based query & Analytics

Athena

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run. Athena scales automatically — running queries in parallel — so results are fast, even with large datasets and complex queries. The pricing is based on the amount of data scanned.
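A minimal sketch of running a query through boto3 is shown below, together with a helper that reads the amount of data scanned, since that figure drives the bill. The database, table, and result bucket are placeholders.

```python
def athena_query_request(sql, database, output_location):
    """Arguments for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def data_scanned_bytes(query_execution_response):
    """Extract scanned bytes from an athena.get_query_execution response."""
    return query_execution_response["QueryExecution"]["Statistics"]["DataScannedInBytes"]


def start_query(request):
    # Imported lazily; needs AWS credentials and an S3 location for results.
    import boto3

    return boto3.client("athena").start_query_execution(**request)["QueryExecutionId"]


# Placeholder names for illustration only.
request = athena_query_request(
    "SELECT order_id FROM orders LIMIT 10", "sales_db", "s3://my-athena-results/"
)
```

Because cost follows bytes scanned, partitioned tables and columnar formats such as Parquet usually reduce both latency and spend.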

Redshift Serverless

Amazon Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use. Redshift Serverless compute is measured in Redshift Processing Units (RPUs) and billed in RPU-hours on a per-second basis, with a 60-second minimum charge. There is no charge for data warehouse start-up time. Automatic scaling and comprehensive security capabilities are included. We do not need to pay for concurrency scaling and Redshift Spectrum separately because both are included with Amazon Redshift Serverless.
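The per-second billing with a 60-second minimum can be written down as a small function. This is a sketch of the rule as described above, applied to a single period of activity; the RPU count and price in the example are hypothetical, not published rates.

```python
def redshift_serverless_charge(rpus, seconds_active, price_per_rpu_hour):
    """Charge for one period of activity, with a 60-second minimum."""
    if seconds_active <= 0:
        return 0.0  # idle warehouse: no compute charge
    billable_seconds = max(seconds_active, 60)
    return rpus * (billable_seconds / 3600) * price_per_rpu_hour


# Hypothetical numbers: a 30-second query is billed as 60 seconds.
short_query = redshift_serverless_charge(32, 30, price_per_rpu_hour=0.5)
full_hour = redshift_serverless_charge(32, 3600, price_per_rpu_hour=0.5)
```

The minimum matters mostly for workloads made of many very short, widely spaced queries.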

Streaming Integration

Kinesis

Amazon Kinesis Data Streams is a serverless streaming data service that makes it easy to capture, process, and store streaming data at any scale. On-demand is a capacity mode for Kinesis Data Streams that can serve gigabytes of write and read throughput per minute without capacity planning.

Kinesis Data Analytics

Amazon Kinesis Data Analytics gives us serverless Apache Flink: we submit a Flink application (or an Apache Beam pipeline running on Flink) and the service runs and scales it for us.
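As a rough sketch, creating a stream in on-demand mode and writing a record to it with boto3 might look like the following. The stream name and event payload are placeholders.

```python
import json


def on_demand_stream_request(name):
    """Arguments for kinesis.create_stream in on-demand capacity mode."""
    return {"StreamName": name, "StreamModeDetails": {"StreamMode": "ON_DEMAND"}}


def put_record_request(stream_name, event, partition_key):
    """Arguments for kinesis.put_record; the payload is sent as JSON bytes."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": partition_key,
    }


def create_and_send(stream_name, event, partition_key):
    # Imported lazily; needs AWS credentials. A real caller would wait for
    # the stream to become ACTIVE before writing to it.
    import boto3

    kinesis = boto3.client("kinesis")
    kinesis.create_stream(**on_demand_stream_request(stream_name))
    kinesis.get_waiter("stream_exists").wait(StreamName=stream_name)
    return kinesis.put_record(**put_record_request(stream_name, event, partition_key))
```

There is no shard count in the create request, which is the practical difference from provisioned mode.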

MSK Serverless

Amazon MSK Serverless is a cluster type for Amazon MSK that makes it easy for us to run Apache Kafka without having to manage and scale cluster capacity. MSK Serverless automatically provisions and scales compute and storage resources, so we can use Apache Kafka on demand and pay for the data we stream and retain.

Databases

Amazon Keyspaces (for Apache Cassandra): long-term persistence of compacted aggregates.

Amazon Aurora Serverless: general ACID requirements.

Amazon Neptune Serverless: graph database.


Orchestration

MWAA / Step Functions

With Amazon Managed Workflows for Apache Airflow (MWAA), we can use Airflow and Python to create workflows without having to manage the underlying infrastructure for scalability, availability, and security. Managed Workflows automatically scales its workflow execution capacity to meet our needs.

AWS Step Functions is a serverless, low-code, visual workflow service that developers can use to build data and machine learning pipelines using AWS services. Workflows manage failures, retries, parallelization, service integrations, and observability so developers can focus on higher-value business logic.
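To show what "workflows manage failures and retries" means in practice, here is a sketch of a one-step state machine in Amazon States Language, built as a Python dict. It runs a Glue job through the Step Functions service integration and retries on failure. The job name, retry numbers, and role ARN are assumptions for illustration.

```python
import json


def glue_pipeline_definition(job_name, max_attempts=3):
    """State machine that runs one Glue job and retries it on any error."""
    return {
        "Comment": "Run a Glue job and wait for it to finish",
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                # .sync makes the workflow wait until the job run completes.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "Retry": [{
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": max_attempts,
                    "BackoffRate": 2.0,
                }],
                "End": True,
            }
        },
    }


def create_state_machine(name, role_arn, definition):
    # Imported lazily; needs AWS credentials and an IAM role for the workflow.
    import boto3

    sfn = boto3.client("stepfunctions")
    return sfn.create_state_machine(
        name=name, definition=json.dumps(definition), roleArn=role_arn
    )["stateMachineArn"]
```

The retry policy lives in the workflow definition, so the Glue script itself does not need any retry logic.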

BI Service

QuickSight

Amazon QuickSight is a cloud-native BI service that allows us to understand our data by asking questions in natural language, exploring interactive dashboards, or automatically looking for patterns and outliers with machine learning, without any client software or server infrastructure. QuickSight also offers pay-per-session pricing, making it cost-effective for large-scale deployments.

Auxiliary services

These services are not directly linked to the data platform but do help in certain aspects of a cloud data platform.

Lambda
DynamoDB
OpenSearch

Overall, AWS offers a good range of serverless options in the data segment. These can be combined to build data solutions without having to manage many servers, so we can focus on core business problems.

Happy cloud Data Engineering!!
