My First ETL Data Engineering Project

Gospel Orok · Published in Dev Genius · Aug 24, 2022

Photo by Stephen Dawson on Unsplash

During my work experience with the Nigerian Pipelines and Storage Company (NPSC), Port Harcourt Area, I was exposed to the industrial architecture and operations of pipeline pumping systems used to transport petroleum products from a source depot to destination depots.

Without realising it, I was already on a journey towards becoming a Data Engineer. Back then, in 2019 and 2020, I did not even know about data engineering, let alone ETL. Working with NPSC, Port Harcourt Area exposed me to the various pipeline pumping and line-up operations, and I was responsible for the operations department's product accounting, reconciliation and monthly operations reports.

Looking at the NPSC petroleum pipeline system of operation, I discovered how similar its workflow is to a data pipeline system. They say data is the new oil, so moving from a petroleum product pipeline career to Data Engineering, with its data pipelines and different repositories, was not a difficult transition to embrace; relating the two workflows actually enhanced my understanding of how to solve data pipeline problems.

This article explains how I developed an ETL model for the Nigerian Pipelines and Storage Company using the Python programming language. The ETL model is able to:

  • Collect traditional batch pumping parameters from a petroleum product pipeline source station
  • Transform the collected data
  • Save the transformed data in a ready-to-load CSV format for analysis

Let us talk briefly about Data Engineering, Data Pipeline and ETL

Data Engineering is concerned with making sure data is transported accurately from its disparate sources, through a dedicated pipeline, into a data repository for use.

A data pipeline is a subset of data engineering that enables data transportation from one processing point to another.

ETL (Extract, Transform and Load) is a data pipeline integration process that involves three distinct but interrelated steps and is used to move data from multiple sources into data repositories such as an RDBMS, Data Warehouse, Data Hub, or Data Lake.

Having a good understanding of the business context is one of the most important steps in building an effective ETL solution.

NPSC Petroleum Product Context of Operation

NPSC Pump Station runs traditional batch pipeline transportation of petroleum products to different depots from the refinery tank farms where the product tanks are located. Before each pumping operation commences, product stock-taking must be carried out, usually described as fiscalisation and defiscalisation of the nominated product storage tank. This can also be described as taking the initial and final tank levels to ascertain the quantity of product in stock.

Product evacuation from the nominated tanks is carried out by booster pumps and mainline pumps running in batches. The flow rate of each transfer is determined hourly by manually taking the tank level reading of the nominated tank; this reading, relative to the initial and final levels of the tank, helps ascertain the volume of product in stock and the volume transferred, and it also indicates the accumulation at the destination depots. The goal of every pumping operation is to make sure that the programmed product is accurately transferred to the end point within the measured parameters and the estimated time.
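To make the arithmetic concrete, here is a minimal sketch of how an hourly tank level reading could be turned into a transferred volume and a flow rate. It assumes, as in the parameters used later in this article, that tank levels are read in millimetres and that the tank factor converts a level difference into a product volume; the function names and the numbers are purely illustrative.

def volume_transferred(initial_level_mm, current_level_mm, tank_factor):
    # A falling tank level means product has left the tank.
    return (initial_level_mm - current_level_mm) * tank_factor

def hourly_flowrate(initial_level_mm, current_level_mm, tank_factor, elapsed_hours):
    # Volume moved divided by the elapsed pumping time.
    return volume_transferred(initial_level_mm, current_level_mm, tank_factor) / elapsed_hours

# Illustrative values only, not actual NPSC readings:
print(volume_transferred(12870, 12650, 1.164))    # volume pumped so far
print(hourly_flowrate(12870, 12650, 1.164, 1.0))  # flow rate over one hour of pumping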

Why Do We Need ETL for a Petroleum Product Pipeline System?

ETL is essential for a petroleum product pipeline transportation system in the following ways:

  • It provides a historical context of the operation.
  • It enhances Business Intelligence solutions for decision making.
  • Its data context and aggregations can help increase the company's revenue or reduce the cost of operations.
  • It enables a common data repository.
  • It allows verification of data transformation, aggregation and calculation rules.
  • It allows sample data comparison between the source and target systems.
  • It helps improve productivity, because the transformation logic is codified and can be reused without additional technical skills.

Now let’s dive into the ETL development

This project is a model of the NPSC Pump Station pipeline system, developed around its traditional batch pumping parameters. The model follows the three key ETL steps.

First, I import the necessary libraries for the ETL development.

import pandas as pd              # for data manipulation
import numpy as np               # for numerical computation
import matplotlib.pyplot as plt  # for visualization
import plotly.express as px      # for visualization
import seaborn as sns            # for visualization
from datetime import datetime    # for the date/timestamp written to the log file

Next, I set the paths where the event logs and the transformed data will be stored.

logfile = "logfile.txt"                 # all event logs will be stored in this file
targetfile = "transformed_data.csv"     # file where the transformed data is stored

Defining the extraction function:

Here the parameters are obtained manually from the pump station according to the system standard and specification. These parameters are used to develop the extraction function.

def extract():
    # Parameters obtained manually from the pump station:
    tank_name = input("Enter Tank Name: ")                               # e.g. 52TK59G
    tank_factor = float(input("Enter the Tank factor: "))                # e.g. 1.164
    batch_size = float(input("Enter the Batch size to pump: "))          # e.g. 15000
    initial_hr = int(input("Enter Initial Time (hour): "))               # e.g. 6
    initial_minute = float(input("Enter Initial Time (minutes): "))      # e.g. 0
    final_hr = int(input("Enter Final Time (hour): "))                   # e.g. 11
    final_minute = float(input("Enter Final Time (minutes): "))          # e.g. 40
    initial_tank_level = float(input("Enter the initial tank level: "))  # e.g. 12870
    final_tank_level = float(input("Enter the final tank level: "))      # e.g. 11622
    accumulated = float(input("Enter the last accumulative value: "))    # e.g. 3946, or 0 if none
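    # The original article does not show extract()'s return value, but the later call
    # `extracted_data = extract()` implies one. As an assumption, the collected
    # parameters could simply be returned as a dictionary for the transform step:
    return {
        "tank_name": tank_name,
        "tank_factor": tank_factor,
        "batch_size": batch_size,
        "initial_hr": initial_hr,
        "initial_minute": initial_minute,
        "final_hr": final_hr,
        "final_minute": final_minute,
        "initial_tank_level": initial_tank_level,
        "final_tank_level": final_tank_level,
        "accumulated": accumulated,
    }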

Developing the transform function:

def transform():

The parameters collected above can now be used to develop the transform function, which produces the time, tank level, flow rate and accumulation values used in the analysis below.
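The body of transform() is not reproduced in the article. Based on the columns used later in the analysis and plots (time, Tank_levels(mm), flowrate, accumulation), a minimal sketch could look like the one below. The hourly linear interpolation of tank levels, the exact column names and the extracted-parameters argument are my assumptions, not the author's original code; with this sketch the driver call would be transform(extracted_data).

def transform(extracted):
    # Hourly time points between the initial and final readings.
    hours = list(range(int(extracted["initial_hr"]), int(extracted["final_hr"]) + 1))
    time = [f"{h:02d}:00" for h in hours]

    # Assume the tank level falls roughly linearly between the initial and final readings.
    start, end = extracted["initial_tank_level"], extracted["final_tank_level"]
    n = len(hours)
    levels = [start + (end - start) * i / max(n - 1, 1) for i in range(n)]

    # Hourly volume moved = level drop (mm) x tank factor; accumulation is its running total.
    factor = extracted["tank_factor"]
    flowrate = [0.0] + [(levels[i - 1] - levels[i]) * factor for i in range(1, n)]
    accumulation = list(np.cumsum(flowrate) + extracted["accumulated"])

    return pd.DataFrame({
        "time": time,
        "Tank_levels(mm)": levels,
        "flowrate": flowrate,
        "accumulation": accumulation,
    })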

Defining the loading function:

def load(targetfile, data_to_load):
    data_to_load.to_csv(targetfile)
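A small optional refinement, assuming the DataFrame's row index is not needed in the output file, is to drop it when writing the CSV:

def load(targetfile, data_to_load):
    data_to_load.to_csv(targetfile, index=False)  # index=False omits the row index column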

Defining the log function:

def log(message):
    timestamp_format = '%Y-%h-%d-%H:%M:%S'  # Year-MonthName-Day-Hour:Minute:Second
    now = datetime.now()                    # get the current timestamp
    timestamp = now.strftime(timestamp_format)
    with open(logfile, "a") as f:
        f.write(timestamp + ',' + message + '\n')

Running the ETL Process

Log:

log(“\n ETL Job Started”)

Extract:

log(“Extract phase Started”)
extracted_data = extract()
log(“Extract phase Ended”)
extracted_data

Transform:

log(“Transform phase Started”)
transformed_data = transform()
log(“Transform phase Ended”)
transformed_data

Load:

log(“Load phase Started”)
load(targetfile,transformed_data)
log(“Load phase Ended”)
log(“ETL Job Ended”)

Analysis

transformed_data.head()
transformed_data.describe()

The log file result:
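The original article shows the log file as a screenshot. Given the format string used in log(), the entries would look roughly like the following (the timestamps here are illustrative, not the actual run):

2022-Aug-24-10:15:03,ETL Job Started
2022-Aug-24-10:15:03,Extract phase Started
2022-Aug-24-10:16:40,Extract phase Ended
2022-Aug-24-10:16:40,Transform phase Started
2022-Aug-24-10:16:41,Transform phase Ended
2022-Aug-24-10:16:41,Load phase Started
2022-Aug-24-10:16:42,Load phase Ended
2022-Aug-24-10:16:42,ETL Job Ended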

Visualization

The plots below can be used for line decay and time decay evaluation.

# Line decay plot
plt.figure(figsize=(5, 5))
plt.plot(transformed_data['time'], transformed_data['flowrate'])
plt.xticks(rotation='vertical');

# Time decay plot
plt.figure(figsize=(5, 5))
plt.plot(transformed_data['time'], transformed_data['Tank_levels(mm)'])

Plotting a bar chart to show the rate of accumulation

transformed_data = transformed_data.sort_values(by=['accumulation'], ascending=False)
fig1 = px.bar(transformed_data, x='accumulation', y='time', color='time', title='Accumulation per hour')
fig1.show()

Summary

The ETL model performed the three ETL steps (Extract, Transform and Load) according to the pipeline system's pumping parameters, and the results showed about 80% agreement when compared with the manually computed flow-rate log sheets of the NPSC pump station, Port Harcourt Area. The model only accepts batch pumping parameters because the pipeline system is not connected to any real-time reading devices; extending it to situations where the pumping parameters come from remote terminal sensors or digital reading devices is a case for further development. The ETL can also be integrated into big data platforms such as the Hadoop framework and Apache Spark. Developing the model was possible because of my understanding of the business context, which is an important aspect of building an efficient ETL.

The full code for this project can be found on my GitHub.

About the Author

Gospel Orok is a data specialist and AI enthusiast with working experience in the e-commerce and engineering industries. He is well acquainted with data analysis, data pipelines, data management and processing systems, with a strong desire to keep facing new challenges and expanding his skills in the Data Engineering field.

He is also an advocate for specialised skills in Data Analysis, Data Science and Data Engineering for delivering high-impact solutions. He currently leads the Port Harcourt City Data Scientists Network (DSN) AI+ community and is open to collaborations with organisations and individuals to build an AI ecosystem with a high human-capacity impact.

Thanks for reading this project article. Feel free to contact the Author on LinkedIn and Twitter: Orok Gospel
