Linear Regression using Automobile mpg Data

Spang · Published in Dev Genius · 4 min read · Oct 28, 2022

In this article, we will use a linear regression model to discover the relationship between two variables in the automobile mpg dataset. The following topics will be covered:

  • Linear Regression
  • Least Squares Method
  • Automobile mpg Dataset
  • Implementing Simple Linear Regression
  • Implementing Piecewise Linear Regression

Linear Regression

Linear regression models the relationship between two variables by fitting a linear equation to observed data.

Or, as Josh Starmer explains, the concept behind linear regression is simply fitting a line to data with least squares and R-squared.

Least-Squares Method

Credit: https://www.jmp.com/en_in/statistics-knowledge-portal/what-is-regression/the-method-of-least-squares.html

The least-squares method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line.
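In symbols, the least-squares line is the one whose intercept and slope minimize the sum of squared residuals. The standard objective and its closed-form solution for a simple linear fit (with x̄ and ȳ denoting the sample means) are:

```latex
\min_{\beta_0,\,\beta_1} \; \sum_{i=1}^{n} \bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2,
\qquad
\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
             = \frac{s_{xy}}{s_{xx}},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x}
```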

We will use these methods to find a relationship between two variables in the automobile mpg data.

Automobile mpg Dataset

The Automobile mpg dataset is from the UC Irvine Machine Learning Repository, which maintains 622 data sets as a service to the machine learning community.

https://archive.ics.uci.edu/ml/datasets/auto+mpg

As shown in the detailed description of the dataset on the website, it has 398 rows, 9 columns, and some missing values. The 9 attributes are described below.

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes (Quinlan, 1993).

Implementing Simple Linear Regression

After importing the Matplotlib, pandas, and NumPy packages, we load the dataset.
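The post shows this step as a screenshot; a minimal sketch of the loading code, assuming the raw auto-mpg.data file from the UCI page above (whitespace-delimited, no header row, with "?" marking missing horsepower values), might look like this:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Column names taken from the attribute list above; the raw file has no header row.
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "model year", "origin", "car name"]

# The URL is an assumption; the file can also be downloaded and read locally.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
df = pd.read_csv(url, sep=r"\s+", names=columns, na_values="?")
```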

Using the head() method, we can look at the first few rows of the dataset and check whether the dataframe was loaded correctly.

Using the columns, shape, and index attributes, we can check the column names and the number of rows (indices) and columns.
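A sketch of those checks, assuming the dataframe is named df as above:

```python
print(df.head())     # first few rows, to confirm the load worked
print(df.columns)    # attribute (column) names
print(df.shape)      # (number of rows, number of columns) -> (398, 9)
print(df.index)      # row index
```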

After converting the dataframe columns to NumPy arrays, we can visualize the relationship between the cars' weight and mpg.
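For example (a sketch; the exact plot styling in the post may differ):

```python
weight = df["weight"].to_numpy()
mpg = df["mpg"].to_numpy()

plt.scatter(weight, mpg, s=10)
plt.xlabel("weight")
plt.ylabel("mpg")
plt.title("Weight vs. mpg")
plt.show()
```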

The average weight and the average mpg can also be explored using the following code.
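For example, reusing the arrays from the sketch above:

```python
print("average weight:", np.mean(weight))
print("average mpg:", np.mean(mpg))
```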

Plotting a Linear Fit

Next, we'll explore the relationship between a different feature, horsepower, and mpg.

Since the horsepower column has some missing values, we'll use the dropna() method to remove those rows.
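One way to do this (the variable names x and y here are my own, not necessarily the author's):

```python
# Keep only rows where horsepower is present, then pull out the two columns of interest.
hp_mpg = df[["horsepower", "mpg"]].dropna()
x = hp_mpg["horsepower"].to_numpy(dtype=float)
y = hp_mpg["mpg"].to_numpy(dtype=float)
```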

Here, x̄ and ȳ are the sample means of horsepower and mpg, and s_xy and s_xx are the sample covariance and the variance of x. The fitted slope is s_xy / s_xx, and the intercept is ȳ minus the slope times x̄.
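A sketch of the closed-form fit; plain sums are used because the 1/(n-1) factors cancel in the ratio:

```python
xbar, ybar = x.mean(), y.mean()
s_xy = np.sum((x - xbar) * (y - ybar))   # (unnormalized) covariance of x and y
s_xx = np.sum((x - xbar) ** 2)           # (unnormalized) variance of x

beta1 = s_xy / s_xx            # slope
beta0 = ybar - beta1 * xbar    # intercept
ypred = beta0 + beta1 * x      # fitted values
```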

We plot x against y (actual) and x against ypred (predicted) to see how well the fitted line matches the actual data.
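For example:

```python
order = np.argsort(x)                    # sort so the fitted line draws cleanly
plt.scatter(x, y, s=10, label="actual")
plt.plot(x[order], ypred[order], color="red", label="predicted")
plt.xlabel("horsepower")
plt.ylabel("mpg")
plt.legend()
plt.show()
```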

The squared loss (residual sum of squares, RSS, or L2 loss) is one of the most common ways to measure the difference between the predicted (ypred) and actual (y) values.
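For example:

```python
loss_linear = np.sum((y - ypred) ** 2)   # residual sum of squares (L2 loss)
print("linear fit loss:", loss_linear)
```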

Implementing Piecewise Linear Regression

Since mpg is noticeably higher when horsepower is between 50 and 100, we will see how we can improve the fit by using piecewise linear regression.

We will fit a piecewise linear function with a single breakpoint, λ.

Since cars with horsepower between 50 and 100 have higher mpg, we will set the breakpoint λ to 100.

The new X matrix (the design matrix) will look as shown below.
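The post shows this matrix as an image; a common construction for a single-breakpoint piecewise (linear spline) fit uses an intercept column, the raw feature, and a hinge term max(0, x - λ). The exact basis is an assumption on my part:

```python
lam = 100.0   # breakpoint (lambda) chosen above

# Columns: intercept, horsepower, and a hinge term that "switches on" above the breakpoint.
X = np.column_stack([
    np.ones_like(x),
    x,
    np.maximum(0.0, x - lam),
])
```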

We then find the optimal beta using the normal equation, β = (XᵀX)⁻¹Xᵀy.
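A sketch of the solve; np.linalg.solve is used here instead of forming the inverse explicitly, but it computes the same normal-equation solution:

```python
# beta = (X^T X)^{-1} X^T y, solved as a linear system for numerical stability.
beta = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ beta    # piecewise-linear predictions
```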

Then, we can find yhat (the predicted values) from the optimal beta: yhat = Xβ.

By plotting x against y (actual) and x against yhat (predicted) again, we can visualize how well the piecewise fit matches the actual data.
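For example, reusing loss_linear from the simple fit above:

```python
order = np.argsort(x)
plt.scatter(x, y, s=10, label="actual")
plt.plot(x[order], yhat[order], color="red", label="piecewise fit")
plt.xlabel("horsepower")
plt.ylabel("mpg")
plt.legend()
plt.show()

loss_piecewise = np.sum((y - yhat) ** 2)
print("piecewise fit loss:", loss_piecewise)
print("improvement over linear fit:", loss_linear - loss_piecewise)
```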

The loss is reduced by close to 2000, a clear improvement over the simple linear fit.

Thank you for taking the time to read this post!
