Encoding Methods to encode Categorical data in Machine Learning

Muttineni Sai Rohith
Published in Dev Genius · 4 min read · Jul 18, 2022


In machine learning, data preparation is a mandatory step before modeling. It involves various tasks, and encoding categorical data is one of the most crucial. Most machine learning algorithms accept only numerical data as input. For example, the K-Nearest Neighbours algorithm calculates the Euclidean distance between two observations of a feature, so the input passed to it must be numerical. Categorical data must therefore be transformed, or encoded, into a numerical type before being fed to an algorithm, which in turn yields better results.

Categorical data can be considered as the finite possible values that are divided into groups. For Example — different blood groups, Genders, Different cities, and states. There are two types of categorical data:

  • Ordinal Data: Data that comprises a finite set of discrete values with an order or level of preferences. Example — [Low, Medium, High], [Positive, Negative], [True, False]
  • Nominal Data: Data that comprises a finite set of discrete values with no relationship between them. Example — [“India”, “America”, “England”], [“Lion”, “Monkey”, “Zebra”]

For ordinal data, after encoding the data and training the model, we often need to transform the encoded values back to their original form, since the order carries meaning for interpreting predictions. For nominal data this is not required, as there is no preference among the categories; we only need the category information itself.

Encoding categorical data is the process of converting categorical values into an integer format so that the converted data can be fed to models and improve their predictions.

Enough of theory; let's start with the coding part and the different techniques we can use for encoding categorical data. Throughout this tutorial, we will be using a processed Mushroom dataset from UCI. You can find it here.

Loading the dataset
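The loading and inspection step can be sketched with pandas as follows. The file name `mushrooms.csv` and the toy columns below are placeholders, not the actual dataset; the point is the dtype and missing-value check described next.

```python
import pandas as pd

# In practice, load the processed CSV (path is an assumption):
# data = pd.read_csv("mushrooms.csv")

# A tiny stand-in frame with the same kind of problem:
# several object-dtype (string) columns that need encoding.
data = pd.DataFrame({
    "season": ["spring", "summer", "autumn", "spring"],
    "cap_shape": ["bell", "flat", "flat", "conical"],
    "edible": [1, 0, 0, 1],
})

# Find the object-dtype columns and check for missing values,
# mirroring the inspection described in the article.
object_cols = data.select_dtypes(include="object").columns.tolist()
print(object_cols)              # the columns that need encoding
print(data.isna().sum().sum())  # total count of missing values
```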

As we can see, we have 15 columns, of which 12 have the object datatype, and there are no missing values in this data. So we need to encode these 12 features before we go for modeling.

Label Encoding

In Label Encoding, each label is converted into an integer value. The output here is one-dimensional.

#using scikit-learn
#Import LabelEncoder from sklearn
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data["season"] = le.fit_transform(data["season"])
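A self-contained sketch of what `LabelEncoder` does (the toy season values are illustrative, not the actual dataset):

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

data = pd.DataFrame({"season": ["summer", "winter", "summer", "spring"]})

le = LabelEncoder()
encoded = le.fit_transform(data["season"])

# Classes are sorted alphabetically, then mapped to 0..n-1:
# spring -> 0, summer -> 1, winter -> 2
print(list(le.classes_))  # ['spring', 'summer', 'winter']
print(list(encoded))      # [1, 2, 1, 0]

# inverse_transform recovers the original labels
print(list(le.inverse_transform(encoded)))
```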

Ordinal Encoding

Ordinal Encoding is similar to Label Encoding, but OrdinalEncoder is intended for input variables organized into rows and columns (i.e., a 2-D matrix of features), whereas LabelEncoder is meant for a single one-dimensional column.

#using scikit-learn
#Import OrdinalEncoder from sklearn
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
encoded_data["season"] = oe.fit_transform(data[["season"]])
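Unlike `LabelEncoder`, `OrdinalEncoder` lets you fix the category order explicitly via its `categories` parameter, which is what you want for truly ordinal data. A sketch on toy values (not the actual dataset):

```python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

data = pd.DataFrame({"risk": ["Low", "High", "Medium", "Low"]})

# Pass the desired order per feature; otherwise categories
# default to sorted (alphabetical) order.
oe = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = oe.fit_transform(data[["risk"]])  # note the 2-D selection

print(encoded.ravel())  # [0. 2. 1. 0.]

# inverse_transform maps the encoded floats back to the labels
print(oe.inverse_transform(encoded).ravel())
```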

One Hot Encoding

In One-Hot Encoding, each category of a categorical variable gets a new variable. It maps each category to binary values (0 or 1). This type of encoding is used when the data is nominal. The newly created binary features can be considered dummy variables, so after one-hot encoding, the number of dummy variables depends on the number of categories present in the data.

For one-hot encoding, we will be using the category_encoders package instead of sklearn, as it offers some conveniences, such as returning a DataFrame with readable category-based column names.

!pip install category_encoders

Code:

#Import OneHotEncoder from category_encoders
from category_encoders import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='return_nan', return_df=True, use_cat_names=True)
ohe_results = ohe.fit_transform(data[["season"]])
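If category_encoders is not available, the same idea can be sketched with plain pandas via `get_dummies` (the column names below follow pandas' prefix convention, which differs slightly from category_encoders' `use_cat_names` output):

```python
import pandas as pd

data = pd.DataFrame({"season": ["summer", "winter", "summer", "spring"]})

# One binary (dummy) column per category, 0/1 encoded
dummies = pd.get_dummies(data["season"], prefix="season", dtype=int)

# Columns are created in sorted category order:
print(list(dummies.columns))  # ['season_spring', 'season_summer', 'season_winter']
print(dummies)
```

Each row has exactly one 1, marking that row's category.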

MultiColumnLabelEncoder (preferred method)

The main advantage of MultiColumnLabelEncoder shows up with ordinal data: after encoding, we often need to inverse transform the data to recover the original values, and MultiColumnLabelEncoder makes that easy. As the name suggests, multiple columns are encoded here in a single step.

It works much like label encoding, but it handles multiple columns in a single call, and inverse transforming to get the original data back is straightforward.

Installation:

!pip install MultiColumnLabelEncoder

Code for Encoding:

from MultiColumnLabelEncoder import MultiColumnLabelEncoder

Mcle = MultiColumnLabelEncoder()
encoded_data = Mcle.fit_transform(encoded_data)
encoded_data.head()

By default it will encode all the object columns in the data frame. We can also pass the columns argument; if it is passed, only the named columns are encoded.

Inverse fit transform:

inverse_encoded_data = Mcle.inverse_fit_transform(encoded_data)
inverse_encoded_data.head()

The same string columns used earlier will be used for inversing, but if we want to invert only certain columns, we can do so by passing the optional columns argument.
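If the MultiColumnLabelEncoder package is unavailable, the same multi-column round trip can be sketched with one sklearn `LabelEncoder` per column (toy data for illustration):

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

data = pd.DataFrame({
    "season": ["summer", "winter", "spring"],
    "city": ["Delhi", "Pune", "Delhi"],
})

# Fit one encoder per object column, keeping each for inversion later
encoders = {}
encoded_data = data.copy()
for col in data.select_dtypes(include="object").columns:
    encoders[col] = LabelEncoder()
    encoded_data[col] = encoders[col].fit_transform(data[col])

# Invert each column with its own encoder to recover the originals
restored = encoded_data.copy()
for col, enc in encoders.items():
    restored[col] = enc.inverse_transform(encoded_data[col])

print(restored.equals(data))  # True
```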

Conclusion

There are also other encoding libraries, and pandas built-in functions such as map, replace, and apply can also be used for encoding. But the methods above are easy ways to encode the data and inverse transform it when needed, so I have skipped those alternatives.

Happy Coding…

Stay happy….
