How To Predict Customer Churn Risk using Machine Learning in Python

An in-depth tutorial using Python, pandas and scikit-learn, RFM analysis and SMOTE

Zain Baquar
Dev Genius


Photo by Riho Kroll on Unsplash

Introduction

To the untrained eye, customer behavior is difficult to predict. After all, customers are humans with erratic whims and desires. To a machine that can compute thousands of operations a second, however, trends and patterns become increasingly obvious. Businesses aim to engage with customers in a way that keeps them returning to the store repeatedly, generating revenue each time. However, it can be challenging to determine which customers are likely to return and which have lost interest in the goods or services being provided.

Here we introduce the concept of Customer Churn: a customer is considered churn-ing if they are still actively returning to the store, whereas a churn-ed customer is one who is no longer coming back for more.

Customer Churn Risk is the probability that a customer will disengage from the business. We can therefore define it as:

Churn Risk = 1 - Probability of purchase over a determined period

Understanding how your customers behave is imperative to make the most of their patronage. Today, we can leverage the volumes of data available to us to predict how likely a customer is to continue engaging with your business. This can be valuable in the following ways:

  • Provides a data-driven customer-level metric to aid in implementing proactive decisions which will impact the business.
  • Enriches the customer database by adding another dimension over which to perform a customer segmentation.
  • Can be taken a step further to predict exactly when a customer will return.
  • Understanding your inbound and outbound customers provides insight into your customer retention rate, as well as the overall health of your customer base.

In this tutorial, using the Kaggle data set found here, we will take the following steps:

  • Gather and combine our data.
  • Transform our dataset into rich features and labels using a technique known as recursive RFM (Recency-Frequency-Monetary Value).
  • Fit a model that can predict on this data.

In many ways, this approach is similar to the one we used for Customer Lifetime Value, except we are generating the labels differently. The notebook I used for this tutorial can be found here.

Step 1: Gathering Data

For our customer data, we essentially just need three columns: a customer identifier, a transaction date/time and a transaction value. We can bring in other features too, but we should make sure to aggregate them by customer in the feature engineering step. We can use the date to extract the day of the week, month, hour and other time-based features related to each transaction. If there are different categories of transactions, those columns can be brought in as well.

import pandas as pd

# Load transaction data from CSV
df = pd.read_csv(data_path) # path to your data

# Convert Date column to date-time object
df.Date = pd.to_datetime(df.Date)
df.head(10)

Output:

Output
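Optionally, since the Date column is now a datetime, we can sketch out the time-based features mentioned above. This is only an illustration; the new column names are arbitrary, and any extra features like these would need to be aggregated per customer in the feature engineering step.

# Optional: derive time-based features from each transaction's date
# (column names here are illustrative)
df['DayOfWeek'] = df.Date.dt.dayofweek  # 0 = Monday, 6 = Sunday
df['Month'] = df.Date.dt.month
df['Hour'] = df.Date.dt.hour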

Step 2: Feature Engineering

Recency, Frequency and Monetary Value (RFM)

RFM is a method of quantifying customers in a meaningful way and can serve as a good baseline when it comes to performing any analytics on customer-specific transactional data.

Recency, Frequency and Monetary value capture when the customer made their most recent transaction, how often they have returned for business and what the average sale was for each customer. We can add on to this by using any other available features (like GrossMargin, Age, CostToRetain) or other predicted features (like Lifetime Value or Sentiment Analysis).

The way it works is that we can split the training data into an observed period and a future period. If we want to predict how much a customer will spend in a year, we would set the length of the future period as one year, and the rest would come under observed (as shown below).

Observed/Future Split

This allows us to fit a model to classify which customers engaged with the business in the future period using features computed in the observed period.

# Data before cut off
observed = df[df[date_col] < cut_off]

# Data after cut off
future = df[(df[date_col] > cut_off) & (df[date_col] < cut_off + pd.Timedelta(label_period_days, unit='D'))]

Here we introduce the concept of the cut-off. This is simply where the observed period ends; it defines the date before which we calculate our features.

  • Recency: Time since the most recent transaction (hours/days/weeks). We need to set a cut-off to calculate recency: how many days before the cut-off did the customer make their most recent transaction?
# Copy transactions before the cut-off
cut_off = df.Date.max()
recency = df[df.Date < cut_off].copy()

# Get each customer's most recent transaction date
recency = recency.groupby(customer_id_column)[date_column].max()

# Days between the cut-off and the most recent transaction
recency = (cut_off - recency).dt.days.reset_index().rename(
    columns={date_column: 'recency'})
  • Frequency: Number of distinct time periods in which a customer made a transaction. This allows us to track how many transactions a customer made and when they occurred. We also retain the practice of calculating these metrics from a cut-off date, as it will be handy later.
# Copy transactions before the cut-off
cut_off = df.Date.max()
frequency = df[df.Date < cut_off].copy()

# Set date column as index
frequency.set_index(date_column, inplace=True)
frequency.index = pd.DatetimeIndex(frequency.index)

# Group transactions by customer key and by distinct period,
# and count transactions in each period
frequency = frequency.groupby(
    [customer_id_column, pd.Grouper(freq="M", level=date_column)]).count()

# (Optional) Only count the number of distinct periods in which a
# transaction occurred. Otherwise, we would be calculating the total
# number of transactions in each period instead.
frequency[value_column] = 1  # Mark each distinct period

# Sum over periods to get the number of distinct periods per customer
frequency = frequency.groupby(customer_id_column).sum().reset_index().rename(
    columns={value_column: 'frequency'})
  • Monetary Value: Average sales amount. Here we are simply calculating what the average sales amount was across all transactions for each customer. We may additionally add a ‘TotalAmountSpent’ feature by taking the sum instead of the mean in the last step.
# Copy transactions
cut_off = df.Date.max()
value = df[df.Date < cut_off].copy()

# Set date column as index
value.set_index(date_column, inplace=True)
value.index = pd.DatetimeIndex(value.index)

# Get mean or total sales amount for each customer
value = value.groupby(customer_id_column)[value_column].mean().reset_index().rename(
    columns={value_column: 'value'})
  • Age: Time since the first transaction. For this feature we simply find the number of days since each customer's first transaction. Again, we need a cut-off to calculate the time between the cut-off and the first transaction.
# Copy transactions
cut_off = df.Date.max()
age = df[df.Date < cut_off].copy()

# Get date of first transaction
first_purchase = age.groupby(customer_id_column)[date_column].min().reset_index()

# Get number of days between cut off and first transaction
first_purchase['age'] = (cut_off - first_purchase[date_column]).dt.days
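Note that the wrapper below assumes each of the snippets above has been turned into its own helper function (customer_recency, customer_frequency, customer_value and customer_age). As a sketch of what that looks like, here is the recency snippet wrapped as one such helper; the other three follow the same pattern:

def customer_recency(data, cut_off, date_column, customer_id_column):
    # Keep only transactions before the cut-off
    cut_off = pd.to_datetime(cut_off)
    recency = data[data[date_column] < cut_off].copy()

    # Days between the cut-off and each customer's most recent transaction
    recency = recency.groupby(customer_id_column)[date_column].max()
    return (cut_off - recency).dt.days.reset_index().rename(
        columns={date_column: 'recency'})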

We can wrap all these functions together with the following function:

def customer_rfm(data, cut_off, date_column, customer_id_column, value_column, freq='M'):
    cut_off = pd.to_datetime(cut_off)

    # Compute Recency
    recency = customer_recency(data, cut_off, date_column, customer_id_column)

    # Compute Frequency
    frequency = customer_frequency(data, cut_off, date_column, customer_id_column, value_column, freq=freq)

    # Compute average value
    monetary_value = customer_value(data, cut_off, date_column, customer_id_column, value_column)

    # Compute age
    age = customer_age(data, cut_off, date_column, customer_id_column)

    # Merge all columns
    return recency.merge(frequency, on=customer_id_column).merge(
        monetary_value, on=customer_id_column).merge(age, on=customer_id_column)

Ideally, this can capture information about customer retention within a certain time period. This might look something like this:

For the labels we would just set 1 for those who bought something in the future period, and 0 for everyone who didn’t.

def generate_churn_labels(future):
    future['DidBuy'] = 1
    return future[['Customer_ID', 'DidBuy']]

In some cases, performing this once over the whole data set and fitting a model to predict the labels can yield a tolerable accuracy. However, if you look closely, you might ask: what if something interesting happened during the observed period? That is exactly the right question to ask. Doing this only once over the data set ignores all the seasonality in the data and only looks at one specific label period. Here we introduce what I call Recursive RFM.

Recursive RFM

Let us apply what we know of RFM thus far and loop it through the dataset.

Let's say the data begins on the left at the beginning of a year. We'll select a frequency (for example, one month) and iterate through the data set, computing our features from the observed period (o) and generating our labels from the future period (f). The idea is to recursively compute these features so that the model can learn how customers' behavior changes over time.

Observed (o), Future (f)

For this part of the algorithm we will first get the date of each interval in the span of the data set and use each of those dates as a cut off to compute our RFM features and labels. To reiterate, we have selected a frequency of 1 month in our example.

For each cut off (co) date:

  • Compute RFM features from all rows (i) before the cut-off (i < co)
  • Compute labels from rows (i) between the cut-off and one month after the cut-off (co < i < co + frequency)
  • Outer join the features and labels on Customer ID to create a data set, filling in customers who did not make any transactions in the future period with a label of 0.

Concatenate all datasets in the loop.

This is implemented in the code below:

def recursive_rfm(data, date_col, id_col, value_col, freq='M', start_length=30, label_period_days=30):
    # Resultant list of datasets
    dset_list = []

    # Ensure the date column is a datetime
    data[date_col] = pd.to_datetime(data[date_col])

    # Get start and end dates of dataset
    start_date = data[date_col].min() + pd.Timedelta(start_length, unit="D")
    end_date = data[date_col].max() - pd.Timedelta(label_period_days, unit="D")

    # Get dates at desired interval
    dates = pd.date_range(start=start_date, end=end_date, freq=freq)

    for cut_off in dates:
        # Split by observed / future
        observed = data[data[date_col] < cut_off]
        future = data[
            (data[date_col] > cut_off) &
            (data[date_col] < cut_off + pd.Timedelta(label_period_days, unit='D'))
        ]

        # Get relevant columns
        rfm_columns = [date_col, id_col, value_col]
        _observed = observed[rfm_columns]

        # Compute features from observed
        rfm_features = customer_rfm(_observed, cut_off, date_col, id_col, value_col)

        # Set label for everyone who bought in 'future' as 1
        labels = generate_churn_labels(future)

        # Outer join features with labels so customers who did not buy
        # in 'future' are still recorded, with a label of 0
        dset = rfm_features.merge(labels, on=id_col, how='outer').fillna(0)
        dset_list.append(dset)

    # Concatenate all datasets
    full_dataset = pd.concat(dset_list, axis=0)
    res = full_dataset[full_dataset.recency != 0].dropna(axis=1, how='any')
    return res

rec_df = recursive_rfm(data_for_rfm, 'Date', 'Customer_ID', 'Sales_Amount')

Now that we have generated our dataset, all we need to do now is shuffle and perform a train/test split on our data. We’ll use 80% for training and 20% for testing.

from sklearn.model_selection import train_test_split

rec_df = rec_df.sample(frac=1) # Shuffle

# Set X and y
X = rec_df[['recency', 'frequency', 'value', 'age']]
y = rec_df[['DidBuy']].values.reshape(-1)

# Set test ratio and perform train / test split
test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42, shuffle=True)

Note: Class Imbalance

In a classification task, sometimes the classes we want to predict are imbalanced in the data set. For example, if there are 10 observations and two classes, 2 of them may be in Class_0 and the other 8 in Class_1. This could introduce bias into the model, as it sees significantly more of one class than the other. We define the minority class as the one with fewer observations, and the majority class as the one with more observations. In our tutorial, this would look something like this:

A technique to remedy this is to either under-sample the majority class or over-sample the minority class. Sampling is the practice of taking a subset of the data on which to perform some operation. Under/over-sampling is when we either remove observations (under) or duplicate observations (over) pertaining to the relevant class. It is definitely worth experimenting to see which option will work best for the task at hand and the data you are working with.
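As a quick sketch of the under-sampling option, the imbalanced-learn library (the same package that provides SMOTE below) offers RandomUnderSampler. This is only an illustration and is not used in the rest of the tutorial:

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class rows until both classes are the same size
undersample = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = undersample.fit_resample(X_train, y_train)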

For over-sampling, SMOTE (Synthetic Minority Oversampling Technique) is the tool we will use here; rather than simply duplicating existing minority-class rows, it synthesizes new ones.

from imblearn.over_sampling import SMOTE

oversample = SMOTE()
X_train_over, y_train_over = oversample.fit_resample(X_train, y_train)

pd.Series(y_train_over).value_counts()

Output: value_counts() should now show an equal number of observations in each class.

Step 3: The Model

When it comes to data science, machine learning (and all it encompasses) is just a technique used to estimate the relationship between variables. Finding the right model for your data is yet another journey to get the best results possible for your use-case. The true value of data science is in using these techniques to make well-informed decisions in the real world.

For this example we will try a Random Forest Classifier, as it is very plug-and-play in its implementation and easy to try straight away. In addition, we will see how the predictions from the model trained on over-sampled data compare.

from sklearn.ensemble import RandomForestClassifier

# Initialize and fit model on train dataset
rf = RandomForestClassifier().fit(X_train, y_train)

# Fit on over-sampled data as well
rf_over = RandomForestClassifier().fit(X_train_over, y_train_over)

Once fit, we can view our predictions on the test set in a dataframe.

from sklearn.metrics import accuracy_score

# Create Dataframe and populate with predictions and actuals
# Train set
predictions = pd.DataFrame()
predictions['true'] = y_train
predictions['preds'] = rf.predict(X_train)

# Test set
predictions_test = pd.DataFrame()
predictions_test['true'] = y_test
predictions_test['preds'] = rf.predict(X_test)
predictions_test['preds_over'] = rf_over.predict(X_test)

# Compute accuracy
train_acc = accuracy_score(predictions.true, predictions.preds)
test_acc = accuracy_score(
    predictions_test.true, predictions_test.preds)
test_acc_over = accuracy_score(
    predictions_test.true, predictions_test.preds_over)
print(f"Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}, Test Acc Oversampled: {test_acc_over:.4f}")

Output:

Train Acc: 0.9863, Test Acc: 0.8772, Test Acc Oversampled: 0.8671

Results

It's interesting that both data sets produced very similar results; in fact, the over-sampled data performed slightly worse than the imbalanced data. Here we can look at the classification reports to see how precise the predictions actually were.
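These reports can be produced with scikit-learn's classification_report, for example:

from sklearn.metrics import classification_report

# Precision, recall and F1-score per class for both models
print(classification_report(predictions_test.true, predictions_test.preds))
print(classification_report(predictions_test.true, predictions_test.preds_over))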

Classification report for predictions on imbalanced data
Classification report for predictions on over-sampled data

We can see that there was no benefit to using SMOTE here; with your data, however, that may not hold, so it is worth testing both.

Now that our model has been trained, we can use the predict_proba() function to get the probabilities associated with each prediction. Here is a plot of the predicted probability distribution. Remember, the probability predicted by the model is how likely a customer is to engage with the business, and we are looking for the probability that they won't, so we can simply subtract each probability from 1.
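As a minimal sketch, assuming the second column returned by predict_proba corresponds to the positive class (DidBuy = 1), the churn risk on the test set can be computed and plotted like this:

import matplotlib.pyplot as plt

# Probability that each test customer engages (class 1)
prob_engage = rf.predict_proba(X_test)[:, 1]

# Churn risk = 1 - probability of purchase in the label period
churn_risk = 1 - prob_engage

# Plot the distribution of churn risk across customers
plt.hist(churn_risk, bins=50)
plt.xlabel('Churn risk')
plt.ylabel('Number of customers')
plt.show()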

Histogram of probability distribution of churn risk among customers

As expected, most customers are on either end of the spectrum. However, the most meaningful and actionable insights are found between them. Customers below 0.5 are at a low risk of disengaging, so this plot indicates that most customers are healthy. On the other hand, those with a churn risk over 0.5 are more likely to disengage, so paying attention to their preferences is imperative to retain them.
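To act on this, one simple sketch is to flag the test observations whose churn risk exceeds 0.5. Note that to follow up with specific customers you would also need to keep the customer identifiers alongside the features, which were dropped when building X above:

# Flag observations with a churn risk above 0.5 (high risk of disengaging)
scored = X_test.copy()
scored['churn_risk'] = churn_risk
high_risk = scored[scored['churn_risk'] > 0.5]
print(f"{len(high_risk)} of {len(scored)} test observations are high risk")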

Conclusion

Feature engineering techniques like Recursive RFM allow for rich features that describe customers. As seen here, these features can be used to analyze their behavior and predict what they might do in the future. We also covered how to handle class imbalance, if necessary, using SMOTE. Churn Risk is just one of these predictable metrics; others include Customer Lifetime Value and Customer Segmentation. What's special about Churn Risk is that it can be taken a step further to identify the probability that customers will do something more specific, like buy a particular category of product, or their likelihood of engaging on each day of the week. The potential of customer analytics is far-reaching and ever-insightful, especially for businesses.

Be sure to check out my other articles for information on more Data Science and Machine Learning for business.

If you enjoyed this article, give me a follow for more Customer Analytics content!
