Data Engineer — Learning Path

Amit Singh Rathore

Published in Dev Genius · Aug 2, 2022

Skills you need to become a data engineer

Typically a data engineer performs the following tasks:

Ingest business-relevant datasets from different sources
Create a data model for efficient storage & retrieval of data
Write algorithms to transform data into actionable information
Automate the above tasks using data pipelines
Create new data validation methods for tests
Ensure compliance with the company’s data governance and security policies
Expose data as a service/product
Provide technical assistance to BI/BAs in their queries & dashboards

To perform these tasks, a data engineer needs certain skills. Let us go through the common, basic skills the role requires.

Programming Language

Data engineering can lead to other roles later, such as Data Analyst, Data Scientist, ML Engineer, and BI Developer, so choosing the right language is important. Python is the most versatile choice because it is widely used in all of these job profiles.

Python / Scala / Java
For Python: pandas, NumPy, Matplotlib, SciPy
YAML / JSON / Jinja

Note: If you want to stay in data engineering in the long run, you will need to work with both Java (Kafka, Flink, NiFi, Storm, and DataHub are Java based) and Scala (Spark is written in Scala), so it is good to know both. On the other hand, if you plan to gravitate towards machine learning, Python and PySpark will be your go-to options.

Operating System Basics

Data engineers use distributed tools that almost always run on Linux systems, so knowing Linux is a must for debugging and understanding the process flow. Data engineers also deal with lots of files, so file manipulation is an important skill.

Linux (filesystem vs Object storage, NFS, process, SIGTERM vs SIGKILL)
Shell scripting (vi, jq)
File manipulation (grep, sed, awk)
nohup & screen (data engineers run long jobs, so it is good to know how to run processes in the background and keep terminal sessions alive while waiting for results)
Networking — TCP, IP, DNS, SSH, Firewall
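
To make the SIGTERM vs SIGKILL point concrete: a process can catch SIGTERM and finish its current unit of work, while SIGKILL cannot be caught at all. Below is a minimal sketch of a long-running job that stops cleanly on SIGTERM. The names `run_job`, `handle_sigterm`, and `install_handler` are illustrative, not part of any tool mentioned above.

```python
import signal

# Flag flipped by the signal handler; the job loop checks it between batches.
shutdown_requested = False


def handle_sigterm(signum, frame):
    """Record the shutdown request instead of dying mid-batch."""
    global shutdown_requested
    shutdown_requested = True


def install_handler():
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_job(batches):
    """Process batches until finished or until SIGTERM asks us to stop."""
    processed = []
    for batch in batches:
        if shutdown_requested:
            break  # leave the remaining batches for the next run
        processed.append(batch)
    return processed
```

Calling `install_handler()` at startup and then sending `kill <pid>` (which sends SIGTERM by default) lets the loop exit between batches. `kill -9 <pid>` sends SIGKILL, which bypasses the handler entirely.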

Newer data platforms are hosted in cloud or virtualized environments, so knowing about virtualization is a good skill to have.

Virtualization

Docker / LXC
Kubernetes / Mesos

DSA

As a data engineer, you are expected to write fast and scalable solutions. That requires a good understanding of data structures and algorithms. In general, lean towards algorithms that can be distributed.

Array, Linked List, String, Stack, Queue, Heap, Tree
DP, Graph (BFS & DFS)
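
As one example from the list above, here is a short sketch of breadth-first search over a graph stored as an adjacency dict. The graph of pipeline stages used in the usage note is made up for illustration.

```python
from collections import deque


def bfs_order(graph, start):
    """Return nodes reachable from `start` in breadth-first order."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order
```

For a hypothetical lineage graph such as `{"raw": ["clean"], "clean": ["agg", "features"], "agg": ["report"], "features": ["report"]}`, calling `bfs_order(graph, "raw")` visits every downstream dataset exactly once, level by level.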

Data storage

As a data engineer, you will often work with databases as a source or sink, so you need to know the basics of DBMS. These topics also help with how you store and organize data (data modeling) in databases.

DBMS

SQL & DBMS concepts
MySQL / Postgres

Key Topics

JOINS, Aggregation (Group By), Window Function
ACID & Normalization
Index, Vertical and horizontal partitioning
Subquery, Correlated query, CTE (recursive)
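
A few of these topics can be tried without installing a database server. The sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL or Postgres and assumes the bundled SQLite is version 3.25 or newer (needed for window functions). The `customers` and `orders` tables are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'asha'), (2, 'ben');
INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 30.0), (3, 2, 20.0);
""")

# JOIN + GROUP BY wrapped in a CTE: total spend per customer.
totals = conn.execute("""
WITH totals AS (
    SELECT c.name AS name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
)
SELECT name, total FROM totals ORDER BY total DESC
""").fetchall()

# Window function: running total of order amounts per customer.
running = conn.execute("""
SELECT customer_id,
       amount,
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY id) AS running_total
FROM orders
ORDER BY id
""").fetchall()
```

The GROUP BY query collapses rows to one per customer, while the window function keeps every order row and adds the running total alongside it.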

NoSQL Database

In data engineering, much of the data has no defined structure, or has a changing structure and changing query patterns. So as a data engineer you will need to know NoSQL databases. Which one to learn depends on the requirement.

Cassandra
MongoDB
Elasticsearch

Data Warehouse

Data engineers build and organize data for consumption, and the majority of that output is stored in data warehouses because they are suited to analysis and aggregated results. Knowing the basics of data warehousing is needed to build the right kind of warehouse.

Star vs Snowflake schema
SCD (Types)
Hive/BigQuery/Redshift/Snowflake
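
To illustrate one SCD type: in a Type 2 slowly changing dimension, a change closes the current row and appends a new version, so history is preserved. The sketch below shows that idea in plain Python with an in-memory list. The row layout (`valid_from`, `valid_to`, `is_current`) is a common convention, and the function name `apply_scd2` is hypothetical.

```python
from datetime import date


def apply_scd2(dimension, key, new_attrs, effective):
    """Apply an SCD Type 2 change for `key` to a list of dimension rows."""
    current = next(
        (row for row in dimension if row["key"] == key and row["is_current"]),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return dimension  # no change, keep the current version open
        # Close the old version instead of overwriting it.
        current["valid_to"] = effective
        current["is_current"] = False
    dimension.append({
        "key": key,
        "attrs": dict(new_attrs),
        "valid_from": effective,
        "valid_to": None,
        "is_current": True,
    })
    return dimension
```

In a real warehouse the same logic is usually expressed as a MERGE or an UPDATE followed by an INSERT; the Python version only shows the bookkeeping.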

Apart from knowing SQL and how database engines work, we also need to know how to interact with these systems programmatically. That means understanding an ORM (SQLAlchemy) and database drivers such as psycopg2, pyhive, and cassandra-driver.

Data Processing/Storage Framework

Data engineers read, clean, enrich, transform, and store data. Doing this on large datasets requires a distributed computation framework, and there are many to choose from. We need to understand how these frameworks operate.

Hadoop & HDFS, or distributed storage such as S3, ADLS, Google Cloud Storage, or IBM GPFS
Spark / Dask / Ray
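
A rough way to see how such frameworks operate is the partition, map, reduce pattern: split the data, compute a partial result per partition, then merge the partials. The sketch below runs sequentially in one process and does not use the Spark, Dask, or Ray APIs. In a real framework each partition would be processed on a different worker.

```python
from collections import Counter
from functools import reduce


def partition(records, n):
    """Split records into n roughly equal partitions (round-robin)."""
    return [records[i::n] for i in range(n)]


def map_partition(lines):
    """Map step: count words within a single partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(records, n_partitions=2):
    """Reduce step: merge the per-partition counts into one result."""
    partials = [map_partition(p) for p in partition(records, n_partitions)]
    return reduce(lambda a, b: a + b, partials, Counter())
```

The merge works because addition of counts is associative and commutative, which is the property that lets a framework combine partial results in any order.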

Queue / Message broker

Data engineers work on large data, and a failure midway can force large-scale re-computation, so we need a buffer or queue system to decouple subsystems. In many cases data engineers also have to process huge numbers of small events, and those use cases need a robust messaging platform.

Kafka / Pulsar / Redpanda
Kinesis
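
The decoupling idea can be sketched with Python's standard `queue` module: the producer and the consumer only share a bounded buffer and never call each other directly. This is an in-process stand-in, not the Kafka or Kinesis client API, and the doubling "transformation" is a placeholder.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def producer(q, events):
    """Push events into the buffer; blocks when the buffer is full."""
    for event in events:
        q.put(event)
    q.put(SENTINEL)


def consumer(q, results):
    """Pull events from the buffer and process them one by one."""
    while True:
        event = q.get()
        if event is SENTINEL:
            break
        results.append(event * 2)  # placeholder for real processing


def run(events, buffer_size=2):
    """Wire a producer and a consumer together through a bounded queue."""
    q = queue.Queue(maxsize=buffer_size)
    results = []
    workers = [
        threading.Thread(target=producer, args=(q, events)),
        threading.Thread(target=consumer, args=(q, results)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

The bounded `maxsize` gives backpressure: a fast producer is slowed down instead of exhausting memory. A durable broker adds persistence and replay on top of the same pattern.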

Orchestrator

Data engineers work with a continuous inflow of data, so jobs need to run on the correct schedule and be automated. There are many orchestration engines that a data engineer needs to understand.

Airflow
Apache NiFi
Azkaban
Oozie (Some big enterprises still have these pipelines although the trend is to replace these with other alternatives)
AWS Step Functions
Camunda
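
Most orchestrators model a pipeline as a DAG of tasks and run each task only after its upstream tasks have finished. The sketch below shows that core idea with the standard library's `graphlib` (Python 3.9+). It is not the Airflow API, and the task names are made up.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return task names in an order that respects upstream dependencies.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


def run_pipeline(tasks, dependencies):
    """Call each task (a zero-argument callable) in dependency order."""
    log = []
    for name in execution_order(dependencies):
        tasks[name]()
        log.append(name)
    return log
```

With `{"transform": {"ingest"}, "validate": {"transform"}, "load": {"validate"}}`, the tasks run as ingest, transform, validate, load. A cycle in the dependencies raises `graphlib.CycleError`. Real orchestrators add scheduling, retries, and state tracking on top of this ordering.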

Design (system)

As a data engineer, you need to build solutions that are scalable, reliable, and fault tolerant, so system design skills come in handy.

Decoupling
Backoff / retry / backfill
Sharding / Network partitioning
Distributed designs
CAP & PACELC
Lambda & Kappa architectures
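
As an illustration of the backoff / retry item, here is a minimal sketch of retrying a flaky call with exponential backoff and jitter. The function name `retry` and its default values are assumptions chosen for the example. The `sleep` parameter exists only so the waiting can be swapped out in tests.

```python
import random
import time


def retry(func, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call `func` until it succeeds or `max_attempts` is reached.

    The wait doubles after each failure (capped at `max_delay`), plus random
    jitter so that many clients do not retry at the same moment.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

In practice you would catch only the exceptions that are known to be transient (timeouts, throttling) and let everything else fail fast.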

Cloud

Nowadays most data platforms are hosted in the cloud and use cloud offerings as auxiliary services to enrich the platform’s features. Knowing the cloud adds to the employability of data engineers.

AWS / GCP / Azure

Every individual has a different learning process. These skills are not needed all at once on day one. A good starting point would be to master the essentials. With time and experience, one should learn advanced skills.

Happy Learning!!