A Story of Income Inequality and The Unexpected Encounter with Machine Learning

Published in Dev Genius · 6 min read · Jan 16, 2022

Read the previous post for my comeback story to the programming world.

After leading us through the 5-day SQL bootcamp, our Industry Leader Andrea Scaringello invited us to carry out an independent project in pairs. In this project, we could use any dataset that we liked, and present anything that we wanted. This was a do-as-you-wish kind of thing.

Having shared class rep duty for a week with my friend Joyce Zheng and worked with her on the bootcamp exercises, I felt a good connection and asked her to keep working with me. She accepted my invitation and we ended up digging into the income inequality dataset.

Income inequality is the income gap between the rich and everyone else. It is commonly measured with the Gini index, where a higher index indicates a larger gap. Earlier analyses have checked the correlation between income inequality and GDP, but they found only a weak negative correlation between the two.

Being very eager, we set out with this problem in mind:

Could it be that income inequality is correlated to other metrics that are not so obvious?

We set out to explore whether income inequality is correlated with happiness score, life expectancy, and fertility rate, using these 4 hypotheses:

A continent with a higher percentage of developed countries has lower income inequality
Countries with low income inequality are more likely to have a high life expectancy
Countries with low income inequality are more likely to have a high happiness score
Countries with low income inequality are more likely to have a low fertility rate

Here are the datasets that we used:

Gini index and fertility rate from Gapminder
Life expectancy from Kaggle
World happiness report from Kaggle

After downloading the .csv files, we checked the data and learned something very important: datasets can be very messy, especially when they come from various sources. A typical example when dealing with countries is the country name. In one dataset a country is called ‘Democratic Republic of the Congo’, in another ‘DR Congo’, and in yet another ‘Congo-Kinshasa’. With no country code present in any of the datasets, we had no choice but to clean the names manually. We also added the continents manually through Google search. Such is the life of a data analyst.
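The manual clean-up described above could also be captured in a small lookup table so it can be re-run on each dataset. This is only a sketch of that idea, not what we did in the project: the alias table and the helper names `normalize_country` and `unmatched` are made up for illustration.

```python
# Hypothetical alias table: every known spelling maps to one canonical name.
COUNTRY_ALIASES = {
    "dr congo": "Democratic Republic of the Congo",
    "congo-kinshasa": "Democratic Republic of the Congo",
    "democratic republic of the congo": "Democratic Republic of the Congo",
}


def normalize_country(name: str) -> str:
    """Return the canonical country name, or the stripped input if unknown."""
    cleaned = name.strip()
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)


def unmatched(names, known):
    """List normalized names that still have no match in the `known` names."""
    return sorted({normalize_country(n) for n in names} - set(known))


raw_names = ["DR Congo", " Congo-Kinshasa ", "Indonesia"]
print([normalize_country(n) for n in raw_names])
```

Running `unmatched` against the country list of another dataset shows which names still need a manual alias before the tables can be joined.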

After we cleaned our data, a new challenge arose. We had to figure out how to insert our .csv datasets into a SQLite database. Using my rusty 8-years-of-no-coding brain with zero Python experience, Andrea’s Google Colab, and a lot of googling, I managed to figure it out. It turned out that a lot of classmates were struggling with this part too, so I decided to make a tutorial.

Now, this is my first attempt in Python, since I did mostly Java and C-family languages back in university. It could certainly have been more efficient, but I didn’t know any better then, so this is what I wrote.
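The original snippet is not reproduced in this text. As a stand-in, here is a minimal sketch of one way to load a .csv file into SQLite using only Python's standard library. The function name `load_csv` and the table layout are assumptions for illustration, and this is not necessarily how the tutorial did it.

```python
import csv
import sqlite3


def load_csv(conn, csv_path, table):
    """Create `table` from the CSV header row and insert every data row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # first row holds the column names
        columns = ", ".join(f'"{col}"' for col in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        # Remaining rows are inserted as-is, so every value is stored as text.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


# Usage, assuming a file named gini.csv exists next to the script:
# conn = sqlite3.connect("project.db")
# load_csv(conn, "gini.csv", "gini")
```

Because the values arrive as text, numeric columns need a `CAST(... AS REAL)` in later queries, or the rows can be converted before inserting.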

With the data stored in the database, we started doing our analysis. Using SQL queries, we worked out how far each continent’s average Gini index was from the global average (mean).
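A query of this kind can be sketched as follows. The table name, column names, and all values are made-up placeholders, not the project's schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gini (country TEXT, continent TEXT, gini REAL)")
# Made-up placeholder values, not the project's data.
conn.executemany(
    "INSERT INTO gini VALUES (?, ?, ?)",
    [
        ("A", "Europe", 30.0),
        ("B", "Europe", 32.0),
        ("C", "Africa", 45.0),
        ("D", "Africa", 41.0),
    ],
)

# Per-continent mean, and its difference from the mean over all countries.
QUERY = """
SELECT continent,
       AVG(gini) AS continent_mean,
       AVG(gini) - (SELECT AVG(gini) FROM gini) AS diff_from_global
FROM gini
GROUP BY continent
ORDER BY continent
"""

results = conn.execute(QUERY).fetchall()
for continent, mean, diff in results:
    print(f"{continent}: mean={mean:.1f}, diff={diff:+.1f}")
```

The scalar subquery computes the global mean once over all rows, so each continent's row reports how far its own mean sits above or below it.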

Visualization of output using Microsoft Excel

Based on the input from Andrea, we understood that the mean could be misleading, since an outlier can pull the mean far away from the majority of the data points. He suggested that we do a box plot to see how the data is distributed. Again, with a lot of googling, I managed to figure out how to use matplotlib in Python to do a box plot. The code, however, is too inefficient, so I’ll just put the link here if you’re interested.

From the box plot, we can see that the values for North America and Africa are widely spread, meaning that we can’t really trust the mean for those continents.
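To see why a single outlier makes the mean untrustworthy, here is a small sketch that computes the numbers a box plot is built from, using only Python's standard library. The values are made up, and the 1.5 * IQR cut-off is the common box plot convention rather than something taken from our project code.

```python
import statistics


def box_summary(values):
    """Summarize one group the way a box plot does, using the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1  # interquartile range: the height of the box
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(values),
        "median": median,
        "iqr": iqr,
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


# Made-up Gini values for one hypothetical continent with one extreme country.
sample = [30, 31, 32, 33, 34, 63]
summary = box_summary(sample)
print(summary)
```

In this made-up sample the one extreme value drags the mean well above the median, which is exactly the effect the box plot makes visible. Plotting libraries may interpolate quartiles slightly differently, so the exact box edges can vary.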

Finally, we summarized our findings in Datawrapper to answer our first hypothesis: a continent with a higher percentage of developed countries has lower income inequality. It turns out that the hypothesis holds only for Europe, which has the lowest Gini index and the highest percentage of developed countries. The other continents don’t follow the same pattern.

For the remaining hypotheses, regression seemed to be the only way forward. SQL is simply not made for it, so we needed a general-purpose programming language, and Python seemed to be the answer. However, we were not familiar enough with Python to solve this problem. Simply googling code without understanding it wasn’t helping my learning journey either, so we decided to step up our game. Fortunately, Hyper has a partnership with DataCamp, where we could take coding courses tailored to data-related professions.

Joyce started suggesting some courses that could help us do this, but those courses had prerequisites, and those prerequisites had other prerequisites. I tried jumping straight into the Introduction to Regression with statsmodels in Python course, only to give up on the second exercise. And so began my DataCamp binge, doing all the prerequisites to be able to complete the course needed to test our hypotheses.

Finally, we were able to come up with this:

The first snippet is to show the graph and the second snippet is to show the statistical summary of the regression. Now, I was able to write the code but it was still hard for me to understand how to read the result. Fortunately, Joyce has a background in statistics and she taught me what the terms mean and how to read the result.

For the Gini index and life expectancy, the graph and statistical summary show a weak negative correlation that is statistically significant. We therefore concluded that the data supports the second hypothesis.
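The project code used statsmodels and is not reproduced in this text. To show what the key numbers in such a summary mean, here is a plain-Python sketch of a one-predictor linear regression. The function name `simple_regression` and the data are made up for illustration, and the made-up data is far more strongly correlated than what we actually found.

```python
import math


def simple_regression(x, y):
    """Return slope, intercept, Pearson r, and the t statistic for the slope."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # sign gives direction, size gives strength
    # t statistic with n - 2 degrees of freedom (undefined when |r| is exactly 1).
    # A large |t| means the slope is unlikely to be zero; a statistics library
    # converts it into the p-value shown in a regression summary.
    t_stat = r * math.sqrt((n - 2) / (1 - r * r))
    return slope, intercept, r, t_stat


# Made-up values, not the project's data.
gini = [25, 30, 35, 40, 45, 50]
life_expectancy = [80, 79, 75, 74, 70, 69]
slope, intercept, r, t_stat = simple_regression(gini, life_expectancy)
print(f"slope={slope:.3f} intercept={intercept:.1f} r={r:.3f} t={t_stat:.2f}")
```

A negative slope and a negative r both say that life expectancy tends to fall as the Gini index rises. Whether that relationship is statistically significant depends on the p-value derived from the t statistic.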

Using the same method, we found that the data also supports the third and fourth hypotheses.

Having tested all our hypotheses, we thought we were done with one week to spare. However, Andrea challenged us to write some machine learning code. Joyce and I were nervous, thinking that machine learning was way beyond our capability, but Andrea told us that it was just stretching a bit more. After googling and binging DataCamp some more, we were finally able to write our first prediction code:

The code works! Using matplotlib, we visualized the prediction and actual Gini index for a particular country. Though the code works, the model is extremely bad, with only 5% accuracy.

We then understood that it’s not the code, but rather the logic behind it that is flawed. Life expectancy, happiness score, and fertility rate are not the right predictors of income inequality. However, these datasets seem to be good for predicting life expectancy instead. We added education level to the predictors, predicted life expectancy, and got a model with 79% accuracy.
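The text above does not say which metric the "accuracy" figures refer to. For a regression model, a common score is R², the share of the variation in the target that the predictions explain. Assuming that is the kind of score meant here, this sketch shows how it is computed; the function name `r_squared` and the numbers are made up.

```python
def r_squared(actual, predicted):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1, 2, 3]
print(r_squared(actual, [1, 2, 3]))  # perfect predictions
print(r_squared(actual, [2, 2, 2]))  # always predicting the mean
print(r_squared(actual, [3, 2, 1]))  # worse than predicting the mean
```

Under this reading, a score near zero means the model does little better than always guessing the average, which fits predictors that carry little information about the target.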

From our project, I learned many things that I summarize into three points below:

Correlation doesn’t mean causation. There are a lot of factors that can cause certain things and it’s important to not jump to conclusions.
Machine learning is not magic. Just because the metrics show some correlation doesn’t mean that the model is going to be good. In other words, bad input will result in bad output.
Deciding what to predict and which data to use as predictors is key to increasing the chance of making a good model.

Thank you for reading until the end. You can find the code for this project on my GitHub.

Cheers from team Zheng!


Written by Abraham Setiawan
