Train a classification model to estimate customer churn
In this tutorial, we’ll dive into telecom customer churn data to uncover the key reasons behind customer departures and develop targeted strategies to boost retention and satisfaction. Specifically, we will learn how to predict customer churn for a telecom provider offering telephony and internet services using CARTO Workflows. You can access the full template here.
For this use case, we will be using IBM’s Telco Customer Churn Dataset, which contains information about a fictional telco company that provides home phone and Internet services to 7043 customers in California. This dataset provides essential insights into each customer's profile, covering everything from subscribed services and tenure to socio-demographic information and sentiment data.
Before starting, let’s take a look at the data. From the map’s widget section, we can see that 26.54% of customers churned this quarter, resulting in a $3.68M revenue loss. Regions like Los Angeles and San Diego have both a large number of customers and a high number of lost customers, which makes them high-priority areas for improving customer retention.
For this tutorial, we will be using CARTO's BigQuery ML Extension Package, a powerful tool that allows users to exploit BigQuery’s ML capabilities directly from Workflows, enabling seamless integration of machine learning models into automated pipelines.
To install the Extension Package from the Workflows gallery, follow these steps:
Log into the CARTO Workspace, then head to Workflows and Create a new workflow; use the CARTO Data Warehouse connection.
Go to the Components tab, on the left-side menu, then click on Manage Extension Packages.
In the Explore tab, you will see a set of Extension Packages that CARTO has developed. Click on the BigQuery ML for Workflows box, then on Install extension.
You have successfully installed the Extension Package! Now you can click on it to navigate through the components. You can also go to the Components section and see the components from there, ready to be dragged and dropped onto the canvas.
Alternatively, you can install the extension manually by following these steps:
Go to the BigQuery ML Extension Package documentation.
Download the .zip file by clicking on Download the BigQuery ML extension package.
Log into the CARTO Workspace, then head to Workflows and Create a new workflow; use the CARTO Data Warehouse connection.
Go to Components and select Manage Extension Packages > Upload > Upload extension and upload the .zip file.
Click on Install Extension.
This type of installation is required for custom extensions and for Self-hosted users who do not have access to the Workflows gallery from their environment.
Please refer to the documentation for more details about managing Extension Packages.
Now, let's add components to our Workflow to predict customer churn. We will load the telco dataset, from which we’ve pre-selected some interesting features (e.g. those correlated with churn), and we will train a classification model to estimate which customers are prone to churn and which are not.
Drag the Get Table by Name component to the canvas and import the cartobq.docs.telco_churn_ca_template dataset. This data is publicly available in BigQuery (remember that we are using a connection to the CARTO DW, a fully-managed, default Google BigQuery project for the organization).
Use the Where component to select only those rows for which the churn_label is available (churn_label IS NOT NULL). This is the data we will split for training (roughly 70%) and evaluating (roughly 30%) our model through random sampling (RAND() < 0.7) using another Where component. Once our model is ready, we will predict the churn_label for the customers whose churn status is unknown.
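The random split described above can be sketched in Python as follows. This is only an illustration of the RAND() < 0.7 idea, not what the Where component runs internally; the function name split_rows, the seed and the row count are assumptions made for the example. Because each row is assigned independently, the resulting split is close to 70/30 but not exact.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=None):
    """Assign each row to the training set with probability train_fraction."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # rng.random() returns a float in [0, 1), mirroring RAND() < 0.7.
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


# Stand-in row ids; the count is illustrative, not the real number of labelled customers.
rows = list(range(1000))
train, test = split_rows(rows, seed=42)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, which is the property the evaluation step relies on.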
Now, we will use the training data to create a classification model, whose output will be the probability of churn (a value between 0, meaning no churn, and 1, meaning churn) for a customer given specific socio-demographic, contract type and sentiment characteristics.
Use the Drop Columns component to remove unnecessary columns that won't be used for training: geom (GEOMETRY type columns are not valid).
Connect the Create Classification Model component to the input data and set up the model’s parameters: we will train a Logistic Regression model and we will not further split the data (we have done so in step 2).
Note: You will need to give the model a Fully Qualified Name (FQN), which is where the model will be stored. In this way, it would also be possible to call the model from a different workflow using the Get Model by Name component. To find the FQN of your CARTO DW, go to the SQL tab in the lower menu and copy the project name as seen in the image below. Your FQN should look something like: carto-dw-ac-<id>.shared.telco_churn_ca_predicted.
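For readers curious about what these settings correspond to in BigQuery ML, the sketch below builds a roughly equivalent CREATE MODEL statement as a string. It is an assumption about an equivalent statement, not necessarily what the Create Classification Model component executes. The helper build_create_model_sql, the project name my-project and the training table name are hypothetical; the training table is assumed to already exclude the geom column.

```python
def build_create_model_sql(model_fqn, training_table, label_col="churn_label"):
    """Build a BigQuery ML CREATE MODEL statement for a logistic regression."""
    parts = model_fqn.split(".")
    # An FQN is expected to have three parts: project.dataset.model
    if len(parts) != 3 or not all(parts):
        raise ValueError("model_fqn must look like project.dataset.model")
    return (
        f"CREATE OR REPLACE MODEL `{model_fqn}`\n"
        "OPTIONS (\n"
        "  model_type = 'LOGISTIC_REG',\n"
        f"  input_label_cols = ['{label_col}'],\n"
        # NO_SPLIT: the train/test split was already done upstream.
        "  data_split_method = 'NO_SPLIT'\n"
        ")\n"
        f"AS SELECT * FROM `{training_table}`"
    )


# Hypothetical names, for illustration only.
sql = build_create_model_sql(
    "my-project.shared.telco_churn_ca_predicted",
    "my-project.shared.telco_churn_train",
)
print(sql)
```

The three-part check reflects the project.dataset.model shape of the FQN shown in the note above.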
Next, we will Evaluate the performance of our model using the test data.
Based on the classification metrics, the results look promising. The high accuracy indicates that the model correctly predicts the majority of instances, and the low log loss suggests that the model's probability estimates are close to the actual labels. With precision and recall both performing well, most customers flagged as churners do churn and most actual churners are detected, and the F1 score confirms a good balance between the two. Additionally, the ROC AUC score shows that the model has a strong ability to distinguish between customers who churn and those who do not. Overall, these metrics indicate that the model handles the classification task effectively.
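To make these metrics concrete, here is a minimal sketch of how accuracy, precision, recall, F1 and log loss are computed from true labels and predicted churn probabilities. The labels and probabilities are made-up toy values, not results from this tutorial, and the function name classification_metrics is an assumption. ROC AUC is left out to keep the example short.

```python
import math


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Compute basic binary classification metrics from labels and probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Clip probabilities to avoid log(0).
    eps = 1e-15
    log_loss = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for t, p in zip(y_true, y_prob)
    ) / len(y_true)

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "log_loss": log_loss,
    }


# Toy values for illustration only (1 = churn, 0 = no churn).
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
metrics = classification_metrics(y_true, y_prob)
print(metrics)
```

Log loss is the only metric here that uses the probabilities directly; the others depend on the labels obtained after thresholding.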
With a model that performs well, we can run predictions to check which customers are prone to churn. To do so, connect the Create Classification Model component and the data with no churn_label to the Predict component.
As we can see, two new columns appear in our data:
predicted_churn_label_probs: indicates the probability that a customer will churn.
predicted_churn_label: indicates whether or not the customer is predicted to churn, based on the churn probability and a threshold of 0.5.
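The relationship between the two columns can be sketched as follows: a probability is turned into a label by comparing it with the 0.5 threshold, and customers can then be ranked by risk. The customer ids and probabilities are hypothetical, the function name flag_churners is an assumption, and treating a probability of exactly 0.5 as churn is a choice made for this sketch.

```python
def flag_churners(churn_probs, threshold=0.5):
    """Return (customer_id, probability, predicted_label) tuples, riskiest first."""
    flagged = [
        (customer_id, prob, prob >= threshold)
        for customer_id, prob in churn_probs.items()
    ]
    # Sort by probability so the customers most likely to churn come first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical customer ids and probabilities, for illustration only.
example = {"cust_a": 0.82, "cust_b": 0.31, "cust_c": 0.55}
ranked = flag_churners(example)
for customer_id, prob, will_churn in ranked:
    print(customer_id, prob, will_churn)
```

Ranking by probability rather than relying only on the label helps prioritize retention efforts among the customers flagged as likely churners.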
Lastly, to better understand our model, we can take a look at the model’s explainability. This gives an estimate of each feature’s importance when it comes to churn.
Connect the Create Classification Model component to the Global Explain component. The latter provides the importance of each model predictor for each class (churn vs no churn). If the Class level explain option is not selected, the overall feature importances are given, rather than per class.
For further details, we can also use the Explain Predict component, which provides feature attributions that indicate how much each feature in the model contributed to the final prediction for each customer. You can select how many features to retrieve attributions for.
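As a rough illustration of selecting a number of features per customer, the sketch below keeps the k features with the largest attributions in absolute value. Ranking by absolute value is an assumption of this sketch, the function name top_k_attributions is hypothetical, and the attribution values are invented for the example rather than taken from the model.

```python
def top_k_attributions(attributions, k):
    """Return the k (feature, attribution) pairs with the largest absolute attribution."""
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]


# Invented attribution values for a single customer, for illustration only.
example_attributions = {
    "satisfaction_score": -1.2,
    "contract": 0.7,
    "number_of_referrals": -0.4,
    "online_security": 0.1,
}
top_two = top_k_attributions(example_attributions, 2)
print(top_two)
```

The sign of an attribution indicates whether the feature pushed the prediction towards or away from churn, which is why the ranking uses the absolute value.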
From the results for the overall feature importances, we can see that the most important features when it comes to estimating churn are the customer’s overall satisfaction rating of the company (satisfaction_score), the customer’s current contract type (contract), the number of referrals the customer has made (number_of_referrals), and whether or not the customer has subscribed to an additional online security service (online_security).
We can visualize the results in the following map, where we can see which customers are prone to churn, and with what probability.