Artificial Intelligence: Telecom Customer Attrition Prediction

Praveen Joshi
12 min read · Jun 2, 2019

Tackling an Imbalanced Dataset

All of your customers are partners in your mission. ~Shep Hyken

GitHub link for notebook: https://github.com/praveenjoshi01/cork-institute-of-technology-Practical-Machine-Learning/blob/master/TelecomCustomerAttritionPrediction/Telecom%20Customer%20Attrition%20Prediction.ipynb
DataSet: https://www.kaggle.com/blastchar/telco-customer-churn
This work is inspired by the kernel: https://www.kaggle.com/pavanraj159/telecom-customer-churn-prediction

Abstract

In this article, I propose an end-to-end pipeline composed of a hybrid machine learning system to tackle the imbalanced class classification problem. A variety of indicators from the telecom industry are used as input features to predict customer attrition. Sampling techniques are evaluated using a logistic regression classifier, and the best sampling technique is then used to provide input to different machine learning models. The results show that random undersampling with a logistic regression classifier outperforms most of the other machine learning models.

1. Introduction

The heart and soul of the telecom industry has always been the customer. Telecom organisations have long been known for how well they understand their customers: an organisation that can cater to customers' needs before its competitors do can survive for a long time, and making the right offer to the right customer at the right time holds the key to survival. Due to liberalisation and globalisation, the earlier concept of producing goods and services purely on demand has become challenging. Nowadays, the focus has shifted to creating a sense of belonging in the hearts of customers, so that they remain loyal to the organisation.

Churn is defined as the propensity of a customer to stop doing business with an organisation and subsequently move to another company in a given period [1]. For telecom industries, customer attrition is becoming a significant area of concern. Maintaining overall profitability over the long term requires current customers to stay with the organisation; it is therefore indispensable to retain customers, and essential to identify churning customers beforehand. Losing customers not only leads to a loss in revenue; attracting new customers also involves a considerable investment, further impacting the income of the organisation.

The multi-fold ability of the current era to store and process data has opened a new channel in which large volumes of data can be used to answer such questions. Commercial organisations are now trying to extract relevant information from their considerable repositories to find crisp, actionable ways to tackle the customer attrition problem.

This article uses a dataset from the telecom industry to predict customer churn. The dataset under consideration has 19 columns as independent features, 1 column as the dependent feature, 1 column as an index depicting the ID of the anonymised customer, and 7043 data points with no missing values. A detailed description of the dataset is provided underneath:

Feature Description of Dataset

Here in Figure 1, black depicts the customers who are loyal and grey the customers who will churn.

2. Research

The topic of this article is to find the best possible ways to tackle the imbalanced class problem. In this research, multiple sampling techniques are used and evaluated against each other. The best sampling technique then provides the input dataset to multiple machine learning classifiers to assess prediction performance.

For an imbalanced class dataset, there are three main ways to handle it, namely:

1. Appropriate error metrics

2. Data level approach — Resampling strategies

3. Algorithmic ensemble techniques

2.1 Appropriate Error Metrics

With the research community continuously developing algorithms that promise good results on imbalanced datasets, it becomes of great interest to have standardised evaluation metrics to assess the effectiveness of these algorithms appropriately. A few of the error metrics are critically reviewed underneath:

2.1.1 Singular Assessment Metrics

Traditionally, the most commonly used metrics are accuracy and error rate. In this article, we use the minority class as the positive class. Then, by convention, the formulas for accuracy and error rate are defined as

Accuracy = (TP + TN) / (PC + NC); Error Rate = 1 − Accuracy [2]

Although these metrics can describe a classifier's performance on given data in certain situations, they can be deceiving, as they are highly sensitive to changes in the data. In the case of an imbalanced dataset, the accuracy metric does not provide adequate information about the classifier's performance on the class of interest.

In place of accuracy, other frequently adopted evaluation metrics that provide a more comprehensive assessment of imbalanced learning problems are precision, recall, F-measure and G-mean.

Precision is a measure of exactness: correct positive predictions over all predicted positive labels.

Precision = (TP) / (TP + FP) [3]

Recall is a measure of completeness: correct positive predictions over all actual positive labels.

Recall = (TP) / (TP + FN) [4]

The F-measure metric is a weighted harmonic mean of precision and recall; the relative weight of the two is determined by the coefficient beta.

F-Measure = ((1 + beta^2) * Precision * Recall) / ((beta^2 * Precision) + Recall) [5]

Where beta is a coefficient to adjust the relative importance of precision versus recall (usually, beta =1)

The G-mean metric is the geometric mean of the positive-class accuracy and the negative-class accuracy, and it measures the degree of inductive bias.

G-mean = √(((TP) / (TP + FN)) * ((TN) / (TN + FP))) [6]
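All of these metrics derive directly from the confusion matrix. A minimal sketch with sklearn, using hypothetical label arrays where the minority class is encoded as 1:

```python
# Singular assessment metrics computed from a binary confusion matrix.
# y_true / y_pred are illustrative arrays, not the article's actual data.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / (PC + NC)
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))          # F-measure with beta = 1
g_mean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))  # geometric mean of class accuracies
print("G-mean   :", g_mean)
```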

2.1.2 Receiver Operating Characteristics (ROC) Curves

To compare performance over a range of sample distributions for different classifiers, the ROC assessment technique can be used. A ROC curve plots two evaluation metrics against each other, the false positive rate and the true positive rate, described underneath:

TP_rate = TP/Positive Class; FP_rate = FP/Negative Class [7]

For this research, amongst the assessment metrics discussed so far, the F1-score and the ROC curve are used to evaluate the sampling techniques and the models built over the dataset.
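A minimal sketch of plotting a ROC curve from predicted probabilities; the dataset here is synthetic and the logistic regression is only a placeholder classifier:

```python
# Plot TP_rate against FP_rate across all probability thresholds.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]           # probability of the positive class

fpr, tpr, _ = roc_curve(y_te, scores)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_te, scores):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")         # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```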

2.2 Data level approach — Resampling strategies

Typically, in imbalanced learning applications, sampling methods are widely used to balance the distribution of the minority class. Prior research has shown that a balanced dataset improves the overall classification performance of several base classifiers [8].

2.2.1 Over-sampling

Oversampling provides a way to generate new samples for the under-represented classes. It creates new data points for the minority classes using several techniques, some of which are described underneath:

2.2.1.1 Random Over-sampling

In random over-sampling, minority-class data points are generated by randomly sampling, with replacement, from the currently available samples.

2.2.1.2 SMOTE

In SMOTE, we pick a minority-class point at random, compute the k nearest minority-class neighbours of that point, and then add new synthetic points along the line segments between the chosen point and its neighbours.

2.2.1.3 ADASYN

In ADASYN, we iteratively pick minority-class points at random, compute the k nearest neighbours of each point, and add new points spaced between the parent point and its neighbours, with small random values added to them.
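A minimal sketch of all three over-sampling techniques using the imbalanced-learn package on synthetic data; the class ratio is illustrative:

```python
# Compare the class counts produced by each over-sampler.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

for sampler in (RandomOverSampler(random_state=0),
                SMOTE(random_state=0),
                ADASYN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```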

2.2.2 Under-sampling

Under-sampling is a technique that reduces the number of samples from the majority classes. Based on the algorithm used to reduce the majority-class data points, different under-sampling techniques are defined underneath:

2.2.2.1 Random Under-sampling

In random under-sampling, we remove data points from the majority classes at random. It allows sampling of heterogeneous data and provides a fast way to obtain a balanced dataset by randomly selecting a subset of data for the targeted classes.

2.2.2.2 Cluster Centroids

Cluster centroids makes use of K-means to reduce the number of data points in the majority class: each class is synthesised from the centroids obtained by the K-means algorithm rather than from the original samples.

2.2.2.3 Tomek’s links

Tomek's links works by finding Tomek links and removing the majority-class data point of each pair. For two samples q and w of different classes, the pair (q, w) forms a Tomek link if there is no point z such that:

d(q, z) < d(q, w) or d(w, z) < d(q, w)

where d(.) represents the distance between two samples.
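A minimal sketch of the three under-sampling techniques with imbalanced-learn; note that Tomek's links only removes majority points sitting on links, so it does not fully balance the classes:

```python
# Compare the class counts produced by each under-sampler.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids, TomekLinks

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

for sampler in (RandomUnderSampler(random_state=0),
                ClusterCentroids(random_state=0),
                TomekLinks()):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```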

2.2.3 Combination of over- and under-sampling

The oversampling techniques presented previously can generate noisy samples when interpolating new data points between marginal outliers and inliers. This issue can be resolved by cleaning the space created by over-sampling. Based on the cleaning algorithm used, two combinations of over- and under-sampling are described underneath:

2.2.3.1 SMOTETomek

In SMOTETomek, SMOTE is used to generate new samples of the minority classes, and the space is then cleaned by an under-sampling step: Tomek's links are added to the pipeline after applying SMOTE.

2.2.3.2 SMOTEENN

In SMOTEENN, SMOTE is likewise used to generate new minority-class samples, and the cleaning step is done by the edited nearest-neighbours algorithm, which is added to the pipeline after applying SMOTE.
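A minimal sketch of both combined techniques with imbalanced-learn, again on synthetic data:

```python
# Compare the class counts after SMOTE + Tomek links and SMOTE + ENN cleaning.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek, SMOTEENN

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

for sampler in (SMOTETomek(random_state=0), SMOTEENN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```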

2.3 Algorithmic approach

While sampling methods attempt to generate a balanced class distribution based on representative proportions of each class, the algorithmic approach instead makes different machine learning models cost-sensitive to misclassification.

2.3.1 Cost sensitive training

This is an algorithmic technique, also known as penalised training, in which the model is penalised for every misclassification of the minority class. It can be implemented by customising our error metric, as shown underneath:

Score = (9 × FalseNegatives + 1 × FalsePositives) / 6
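A custom score of this form can be plugged into sklearn through make_scorer; a minimal sketch, where the 9/1/6 weights simply mirror the illustrative formula above:

```python
# A hypothetical cost-sensitive penalty: false negatives (missed churners) are
# penalised nine times harder than false positives, mirroring the formula above.
from sklearn.metrics import confusion_matrix, make_scorer

def weighted_misclassification(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (9 * fn + 1 * fp) / 6

# Lower penalties are better, hence greater_is_better=False; the resulting
# scorer can be passed to GridSearchCV or cross_val_score via scoring=...
penalty_scorer = make_scorer(weighted_misclassification, greater_is_better=False)
```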

Moreover, various empirical studies have shown that for some application domains, cost-sensitive learning works better as compared to sampling techniques [9].

2.3.2 Choice of Algorithm

Algorithmic choices also help a lot in building a better model. Ensemble methods have been found to be good at handling imbalanced data, and sklearn's implementations provide an easy way to handle an imbalanced dataset by setting the class_weight parameter, as sketched below.
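A minimal sketch of the class_weight parameter; "balanced" reweights classes inversely to their training frequency, and explicit per-class costs can be given as a dict:

```python
# class_weight="balanced" reweights each class inversely proportional to its
# frequency in the training data.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

log_reg = LogisticRegression(class_weight="balanced", max_iter=1000)
forest = RandomForestClassifier(class_weight="balanced", random_state=0)

# Explicit costs per class: penalise errors on the minority class (label 1)
# nine times harder than on the majority class.
log_reg_custom = LogisticRegression(class_weight={0: 1, 1: 9}, max_iter=1000)
```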

3. Methodology

This section of the article attempts to gain insight into the different sampling techniques and the behaviour of machine learning models on the imbalanced class classification problem.

3.1 Overview of the Experiment

This section presents the experimental pipeline set up to evaluate the different sampling techniques and machine learning models on the imbalanced class classification problem.

The experiment starts with an overview of the data, followed by data manipulation and exploratory data analysis. In the exploratory data analysis, the proportion of customer attrition in the data and the distribution of variables across churn groups are examined, along with a few additional analyses to understand the dataset. Next, preprocessing is done: numerical features are scaled, binary columns are label-encoded, and multi-category columns are dummified. In model building, eight sampling techniques are evaluated with a logistic regression classifier; the best sampling method amongst them is then chosen as the input for another eight machine learning models and compared against the baseline model. The F1-score is used to select the sampling technique, whereas for model selection the F1-score together with the ROC curve is considered to report the findings.

3.1.1 Data Manipulation

This section provides information regarding the initial manipulation done over the dataset for the exploratory analysis. Section 2 in the Python notebook captures the manipulation done over the dataset. Underneath is a list of the manipulation steps (a minimal sketch follows the list):

1. Replacing spaces with null values in total charges column.

2. Dropping null values from the total charges column, which contains 0.15% missing data.

3. Converting tenure to a categorical column

4. Separating churn and non-churn customers

5. Separating categorical and numerical columns
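A minimal pandas sketch of these steps, assuming the Kaggle CSV file name and the Telco column names (customerID, tenure, MonthlyCharges, TotalCharges, Churn); the tenure bins are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# 1-2. Replace spaces with nulls in TotalCharges, then drop the ~0.15% missing rows
df["TotalCharges"] = df["TotalCharges"].replace(" ", np.nan).astype(float)
df = df.dropna(subset=["TotalCharges"])

# 3. Bin tenure into a categorical column (bin edges are an assumption)
df["tenure_group"] = pd.cut(df["tenure"], bins=[0, 12, 24, 48, 60, 72],
                            labels=["0-12", "12-24", "24-48", "48-60", "60-72"])

# 4. Separate churn and non-churn customers
churn, non_churn = df[df["Churn"] == "Yes"], df[df["Churn"] == "No"]

# 5. Separate categorical and numerical columns
num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
cat_cols = [c for c in df.columns if c not in num_cols + ["customerID"]]
```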

3.1.2 Exploratory Data Analysis

This section provides information on the exploratory data analysis done. Section 3 in the attached Python notebook should be referred to for in-depth insights. List of exploratory analyses done:

1. Customer attrition in data

2. Variables distribution in customer attrition

3. Customer attrition in tenure groups

4. Monthly charges and total charges by tenure and churn group

5. Average charges by tenure groups

6. Monthly charges, total charges and tenure in customer attrition

7. Variable Summary

8. Correlation Matrix

9. Visualizing data with principal components (a minimal sketch follows this list)

10. Binary variable distribution in customer attrition (Radar Chart)
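For item 9, a minimal sketch of the principal-components view, assuming a numeric, scaled feature matrix X and binary churn labels y (as produced by the preprocessing in Section 3.1.3):

```python
# Project the features onto the first two principal components and colour
# each customer by churn label.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pcs = PCA(n_components=2).fit_transform(X)
plt.scatter(pcs[:, 0], pcs[:, 1], c=y, cmap="coolwarm", s=8, alpha=0.4)
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.title("Customer attrition in principal-component space")
plt.show()
```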

3.1.3 Data Preprocessing

This section provides information regarding the preprocessing done over the dataset to make it compatible with machine learning algorithms. Section 4 in the Python notebook captures this preprocessing. Underneath is a list of the preprocessing steps (a sketch follows the list):

1. Dummification of multi categorical feature columns

2. Label encoding for binary columns

3. Scaling of numerical columns
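A minimal sketch of these three steps, continuing from the data-manipulation sketch above; the column groupings are assumptions based on the Telco dataset:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
binary_cols = ["gender", "SeniorCitizen", "Partner", "Dependents",
               "PhoneService", "PaperlessBilling", "Churn"]
multi_cols = [c for c in df.columns
              if c not in num_cols + binary_cols + ["customerID"]]

# 2. Label-encode the binary columns
le = LabelEncoder()
for col in binary_cols:
    df[col] = le.fit_transform(df[col])

# 1. Dummify the multi-category columns
df = pd.get_dummies(df, columns=multi_cols)

# 3. Scale the numerical columns
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```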

3.1.4 Model Building

This section provides information regarding the different models built: first for the evaluation of the sampling techniques, and then the various machine learning models run over the best-performing sampling technique. Section 5 in the Python notebook captures the models developed over the dataset.

The first model is the baseline model, built without any sampling to balance the data distribution; it is used to evaluate the performance of the sampling techniques and of the different machine learning models.

Underneath is a list of the sampling techniques used and evaluated with the logistic regression classifier (a sketch of the evaluation loop follows the list):

1. Oversampling techniques:

1.1 RandomOverSampler

1.2 SMOTE

1.3 ADASYN

2. Under sampling techniques:

2.1 RandomUnderSampler

2.2 Tomek links

2.3 ClusterCentroids

3. Combination of over- and under- sampling:

3.1 SMOTETomek

3.2 SMOTEENN
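A minimal sketch of the evaluation loop, continuing from the preprocessing sketch above: each sampler resamples only the training split, a logistic regression is fit on the resampled data, and the F1-score on an untouched test split is reported:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, ClusterCentroids
from imblearn.combine import SMOTETomek, SMOTEENN

# Build the feature matrix and labels from the preprocessed frame.
X = df.drop(columns=["customerID", "Churn"]).astype(float).values
y = df["Churn"].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

samplers = [RandomOverSampler(random_state=0), SMOTE(random_state=0),
            ADASYN(random_state=0), RandomUnderSampler(random_state=0),
            TomekLinks(), ClusterCentroids(random_state=0),
            SMOTETomek(random_state=0), SMOTEENN(random_state=0)]

for sampler in samplers:
    X_res, y_res = sampler.fit_resample(X_tr, y_tr)   # resample training data only
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    print(f"{type(sampler).__name__:20s} F1 = {f1_score(y_te, clf.predict(X_te)):.3f}")
```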

Underneath is a list of the machine learning models used and assessed (a sketch of the comparison loop follows the list):

1. XGBoost Classifier

2. LGBM Classifier

3. SVM Classifier RBF

4. SVM Classifier Linear

5. Naïve Bayes

6. Random Forest Classifier

7. KNN Classifier

8. Decision Tree
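A minimal sketch of the model comparison, continuing from the split above and assuming the xgboost and lightgbm packages are installed; each model trains on the randomly under-sampled training data:

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Under-sample the training split only; the test split stays untouched.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X_tr, y_tr)

models = [XGBClassifier(), LGBMClassifier(),
          SVC(kernel="rbf", probability=True),
          SVC(kernel="linear", probability=True),
          GaussianNB(), RandomForestClassifier(random_state=0),
          KNeighborsClassifier(), DecisionTreeClassifier(random_state=0)]

for model in models:
    model.fit(X_res, y_res)
    proba = model.predict_proba(X_te)[:, 1]
    print(f"{type(model).__name__:25s}"
          f" F1 = {f1_score(y_te, model.predict(X_te)):.3f}"
          f" ROC-AUC = {roc_auc_score(y_te, proba):.3f}")
```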

3.1.5 Model Performances

This section provides information regarding the performance of the sampling techniques and of the different machine learning models. Section 6 in the Python notebook captures the evaluation of the methods against each other.

Underneath is a list of performances captured and reported in the next section of the article:

1. Selection of Sampling Technique

2. Comparing sampling selection metrics

3. Model performance metrics

4. Compare model metrics

5. Confusion matrices for models

6. ROC- Curves for models

7. Precision recall curves

4. Evaluation

This section provides a comparative study of the performance of the sampling techniques and of the different machine learning models. Section 6 in the Python notebook captures the assessment of the methods against each other.

The evaluation study is broken down into two parts. The first part focuses on the different sampling techniques, compared against each other and against the baseline model; the F1-score is chosen as the criterion for selecting the best sampling technique. The second part focuses on the machine learning models fed with the input obtained from the best sampling technique.

4.1 Evaluation of Sampling techniques

Different sampling techniques are used to balance the class distribution in the dataset. The sampling techniques are compared against each other by measuring their performance with the logistic regression classifier. All sampling techniques are also evaluated against the baseline model, where the class distribution has not been modified.

A table comparing the different sampling techniques with the logistic regression classifier is provided underneath:

Comparative Analysis of Sampling Techniques

Based on the F1-score, all of the sampling techniques outperformed the baseline model. Amongst them, random under-sampling performed best.

4.2 Evaluation of Machine Learning models

Following the findings in the section above, we evaluate the different machine learning models on the data synthesised by under-sampling. All ML-based models are also assessed against the baseline model, where the class distribution has not been modified.

A table comparing the different ML models after applying the under-sampling technique is provided underneath:

Comparison of ML-models evaluation metrics with sampling techniques

Based on the observations in Table 3, all of the machine learning models performed well against the baseline model. Amongst them, the logistic regression model with the under-sampling technique outperformed all the other machine learning models.

5. Conclusion

For an imbalanced dataset, there is no straightforward way to improve the ROC of a machine learning classifier. One has to try multiple methods to figure out which sampling technique works best with the data in hand, and different ML models need to be evaluated exhaustively with different sampling techniques. For the evaluation of the models, the F1-score and ROC curve metrics turned out to be the best choices. This article has not evaluated the algorithmic approaches, which can be done in an extension of this research.

References

1. Bhambri, Vivek. “Data Mining as a Tool to Predict Churn Behaviour of Customers.” International Journal of Management Research (2013): 59–69.

2. N.V. Chawla, K.W. Bowyer, L.O. Hall, and W.P. Kegelmeyer, “SMOTE: Synthetic Minority Over-Sampling Technique,” J. Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

3. H. Guo and H.L. Viktor, “Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost IM Approach,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 30–39, 2004.

4. K. Woods, C. Doss, K. Bowyer, J. Solka, C. Priebe, and W. Kegelmeyer, “Comparative Evaluation of Pattern Recognition Techniques for Detection of Microcalcifications in Mammography,” Int’l J. Pattern Recognition and Artificial Intelligence, vol. 7, no. 6, pp. 1417–1436, 1993.

5. R.B. Rao, S. Krishnan, and R.S. Niculescu, “Data Mining for Improved Cardiac Care,” ACM SIGKDD Explorations Newsletter, vol. 8, no. 1, pp. 3–10, 2006.

6. P.K. Chan, W. Fan, A.L. Prodromidis, and S.J. Stolfo, “Distributed Data Mining in Credit Card Fraud Detection,” IEEE Intelligent Systems, vol. 14, no. 6, pp. 67–74, Nov./Dec. 1999.

7. P. Clifton, A. Damminda, and L. Vincent, “Minority Report in Fraud Detection: Classification of Skewed Data,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 50–59, 2004.

8. G.M. Weiss and F. Provost, “The Effect of Class Distribution on Classifier Learning: An Empirical Study,” Technical Report MLTR-43, Dept. of Computer Science, Rutgers Univ., 2001.

9. X.Y. Liu and Z.H. Zhou, “Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem,” IEEE Trans. Knowledge and Data Eng., vol. 18, no. 1, pp. 63–77, Jan. 2006.


Praveen Joshi

Director of Technology @ Speire | AI and ML consultant | Casual NLP Lecturer @ Munster Technological University | ADAPT Researcher