Data Mining – CLASSIFICATION ANALYSIS

Research Question

Which variables show the highest likelihood of customers leaving the company, based on three explanatory variables from the churn database. The ‘churn’ variable is the target variable.

Goal

Using known variables within a dataset, analysts can predict which customers are more likely to leave in the future by using KNN classification. This is helpful for decision makers to know how to adjust certain services offered and which variables are at risk to leaving, to keep customers longer.

Method Justification: Classification Method

For KNN classification algorithm for this task, is a method of classifying data points based on the distance of the closest neighbors. This is particularly useful when training the dataset for look-alike data points. Once a dataset is trained any new data points that come in can be classified based on the nearest neighbors. Expected outcomes will be using the training dataset to run accuracy of the testing data points in the Churn dataset. Based on three variables: income, tenure, and bandwidth usage this knn model will classify new customers on whether they will leave the company or not. (Sree, 2020)

Method Assumption:

In this task, KNN method assumes that within a certain set distance, that data points are similar. (Grant, 2019)

Packages and Libraries:

The Python packages and libraries used in this task are:

Scikit-learn: predictive and classification analysis is ran in this package for this task. To find the AUC score this packages was used too.
Matplotlib: one of the most common graph plots Matplotlib package, is used in this task.
Numpy: was used to compute the array operations for this task. (Koidan, 2021)

Python is the programming language used to run stats, analysis, models, and research for this task. Python comes with several packages that will be used for various parts of this analysis. Python makes it easy to run classification models and statistics on large datasets like the churn dataset. Jupyter Notebook is an online kernel to run Python using a web browser to import more packages, and libraries. Anaconda Navigator application runs Jupyter Notebook within a web browser and command prompt local sever. Anaconda Navigator has more than one thousand data sciences packages built in, which is easy to install and run. A MacBook pro laptop is used to install Anaconda application, run Python, export, and save datasets as well as save Jupyter Notebooks of Python code for future use.

Data Preparation: Data Preprocessing

Several of the variables are binary (yes/no), in order to use these they must be changed to numeric as dummy variables (0/1).

Dataset Variables

The ‘churn’ dataset has 50 variables ranging from information about the customer to details around individual subscription services. For this KNN model three variables remain that are relevant to the research question.

The dropped variables are: ‘Population’, ‘Children’, ‘Age’, ‘Outage_sec_perweek’, ‘Multiple’, ‘StreamingTV’, ‘StreamingMovies’, ‘CaseOrder’, ‘Customer_id’, ‘Interaction’, ‘UID’, ‘City’, ‘State’, ‘County’, ‘Zip’, ‘Lat’, ‘Lng’, ‘Area’, ‘MonthlyCharge’, ‘TimeZone’, ‘Job’, ‘Marital’, ‘Gender’, ‘Churn’, ‘Email’, ‘Contacts’, ‘Yearly_equip_failure’, ‘Techie’, ‘Contract’, ‘Port_modem’, ‘Tablet’, ‘InternetService’, ‘Phone’, ‘OnlineSecurity’, ‘OnlineBackup’, ‘DeviceProtection’, ‘TechSupport’, ‘PaperlessBilling’, ‘PaymentMethod’, ‘Item1’, ‘Item2’, ‘Item3’, ‘Item4’, ‘Item5’, ‘Item6’, ‘Item7’, ‘Item8’.

The variables that remain are numerical independent, including the target variable (churn): ‘Income’, ‘Tenure’, and ‘Bandwidth_GB_Year’. The mean income is 39806.926771 (with std: 28199.916702). The mean tenure months is 34.5 (std: 26.443063). And the mean bandwidth used per year is 3392.34GB/Year (std: 2185.294852).

Re-encode ‘churn’ binary (yes/no) categorical to numbers; by replacing “No” answers to “0’s” and “Yes” answers with “1’s”.

Steps for Analysis

Import Python libraries
Select: Import churn dataset into Python and determine which variables best answer the research question.
Find and treat nulls: Look for missing data and treat those.
Remove unrelated variables: Using the .drop() function, remove variables that aren’t correlated to the research question.
Find and treat outliers: Look for outliers within the remaining variables and remove those using percentile’s function.
Convert categorical data: To use categorical variables in the KNN model, these need to be converted from categorical to numeric
Extract cleaned dataset as “dataset.csv” for use in KNN model

# Standard data science imports
import numpy as np

import pandas as pd
from pandas import Series, DataFrame

# Visualization libraries
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

# Statistics packages
import pylab
from pylab import rcParams
import statsmodels.api as sm
import statistics
from scipy import stats

# Scikit-learn
import sklearn
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report

# Import chisquare from SciPy.stats
from scipy.stats import chisquare
from scipy.stats import chi2_contingency

from sklearn.datasets import make_classification

from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import statsmodels.api as sm

# Load data set into Pandas dataframe
dataset = pd.read_csv('/Users/rednebula/Desktop/churn_clean.csv')

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 50 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   CaseOrder             10000 non-null  int64  
 1   Customer_id           10000 non-null  object 
 2   Interaction           10000 non-null  object 
 3   UID                   10000 non-null  object 
 4   City                  10000 non-null  object 
 5   State                 10000 non-null  object 
 6   County                10000 non-null  object 
 7   Zip                   10000 non-null  int64  
 8   Lat                   10000 non-null  float64
 9   Lng                   10000 non-null  float64
 10  Population            10000 non-null  int64  
 11  Area                  10000 non-null  object 
 12  TimeZone              10000 non-null  object 
 13  Job                   10000 non-null  object 
 14  Children              10000 non-null  int64  
 15  Age                   10000 non-null  int64  
 16  Income                10000 non-null  float64
 17  Marital               10000 non-null  object 
 18  Gender                10000 non-null  object 
 19  Churn                 10000 non-null  object 
 20  Outage_sec_perweek    10000 non-null  float64
 21  Email                 10000 non-null  int64  
 22  Contacts              10000 non-null  int64  
 23  Yearly_equip_failure  10000 non-null  int64  
 24  Techie                10000 non-null  object 
 25  Contract              10000 non-null  object 
 26  Port_modem            10000 non-null  object 
 27  Tablet                10000 non-null  object 
 28  InternetService       10000 non-null  object 
 29  Phone                 10000 non-null  object 
 30  Multiple              10000 non-null  object 
 31  OnlineSecurity        10000 non-null  object 
 32  OnlineBackup          10000 non-null  object 
 33  DeviceProtection      10000 non-null  object 
 34  TechSupport           10000 non-null  object 
 35  StreamingTV           10000 non-null  object 
 36  StreamingMovies       10000 non-null  object 
 37  PaperlessBilling      10000 non-null  object 
 38  PaymentMethod         10000 non-null  object 
 39  Tenure                10000 non-null  float64
 40  MonthlyCharge         10000 non-null  float64
 41  Bandwidth_GB_Year     10000 non-null  float64
 42  Item1                 10000 non-null  int64  
 43  Item2                 10000 non-null  int64  
 44  Item3                 10000 non-null  int64  
 45  Item4                 10000 non-null  int64  
 46  Item5                 10000 non-null  int64  
 47  Item6                 10000 non-null  int64  
 48  Item7                 10000 non-null  int64  
 49  Item8                 10000 non-null  int64  
dtypes: float64(7), int64(16), object(27)
memory usage: 3.8+ MB

print(dataset['Churn'].value_counts())
sns.countplot(x='Churn', data=dataset, palette='hls')
plt.show()
No     7350
Yes    2650
Name: Churn, dtype: int64

# will replace "Yes" value in dataframe with 1
dataset['churn_dummy'] = [1 if v == 'Yes' else 0 for v in dataset['Churn']]

dataset.groupby('Churn').mean().round(2).T

Churn	No	Yes
CaseOrder	5710.00	3032.65
Zip	49017.78	49529.24
Lat	38.73	38.84
Lng	-90.74	-90.90
Population	9830.51	9551.46
Children	2.09	2.07
Age	53.01	53.27
Income	39706.40	40085.76
Outage_sec_perweek	10.00	10.00
Email	11.99	12.08
Contacts	0.99	1.01
Yearly_equip_failure	0.40	0.38
Tenure	42.23	13.15
MonthlyCharge	163.01	199.30
Bandwidth_GB_Year	3971.86	1785.01
Item1	3.50	3.48
Item2	3.51	3.48
Item3	3.49	3.47
Item4	3.50	3.49
Item5	3.50	3.47
Item6	3.50	3.50
Item7	3.51	3.49
Item8	3.49	3.51
churn_dummy	0.00	1.00

dataset.to_csv(r'/Users/rednebula/Desktop/dataset.csv')

knn_df = dataset[['Income', 'Tenure', 'Bandwidth_GB_Year', 'churn_dummy']]

knn_df.describe()

	Income	Tenure	Bandwidth_GB_Year	churn_dummy
count	10000.000000	10000.000000	10000.000000	10000.000000
mean	39806.926771	34.526188	3392.341550	0.265000
std	28199.916702	26.443063	2185.294852	0.441355
min	348.670000	1.000259	155.506715	0.000000
25%	19224.717500	7.917694	1236.470827	0.000000
50%	33170.605000	35.430507	3279.536903	0.000000
75%	53246.170000	61.479795	5586.141370	1.000000
max	258900.700000	71.999280	7158.981530	1.000000

‘dataset.csv.’ dataset attached

Analysis – knn
Re-encode churn to numbers
Save churn variable + 3 independent variables to knn dataframe
Train dataset – fit
K = 6
Output for churn
Accuracy score for churn
Find best k number
Auc score

Part 4: Analysis
D1. Splitting the Data

Using Python with Scikit-learn package add-on, the churn dataset was split into 80% train and 20% test data. Estimating, out of 10,000 customers, there were 2,000 in the test set out, and 8,000 were the training set model. Below is the code for achieving this as well as the KNN score of 77%. from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split 

X = knn_df.iloc[:, 1:-1].values
y = knn_df.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size = 0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=6)

knn.fit(X_train, y_train)

print(knn.predict(X_test))
[0 1 1 ... 0 0 0]

In [14]:
print(knn.score(X_test, y_test))

0.768

Calculations

Using Python with Scikit-learn package add-on, the churn dataset was split into 80% train and 20% test data. Below is the code and plot of fitting the data into the model, using KNN (K Nearest Neighbor) classification algorithm, with an output of values vs accuracy. (w3resource)

import numpy as np
import matplotlib.pyplot as plt

X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size = 0.2, random_state=42)

neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)
    
plt.plot(neighbors, test_accuracy, label = 'Test Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')

plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()

Code

from sklearn.metrics
import roc_curvepred_prob = knn.predict_proba(X_test)

fpr2, tpr2, thresh2 = roc_curve(y_test, pred_prob[:,1], pos_label=1)

random_probs = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_probs, pos_label=1)
from sklearn.metrics import roc_auc_score

auc_score = roc_auc_score(y_test, pred_prob[:,1])

print(auc_score)
0.8297858506383322

Data Summary and Implications
Accuracy & AUC

The best score for this KNN model is k = 6. The accuracy model shows the training dataset drops in half in the first 2 k neighbors. Meaning k = 5, 6, 7, 8 is better for the algorithm. The accuracy score shows 77% – 79%; where the AUC (Area Under Curve) score is 83%.

Results & Implications

This supervised machine learning algorithm model shows, based on variables (Income, Tenure, and Bandwidth per year), it can accurately classify whether someone will churn or not 83% of the time.

Limitation

Using the KNN classification method, picking different numbers for k to train the model will yield different results. This is the bedrock of this classification model and algorithm. Because of this, each analysis can be much different based on the use of this number for “k”. It’s up to the analyst which number they want to use to train their model. We need more variables to accurately train this model.

Action

It’s important for this companies’ decision makers to know that this is a classification method, a way to predict certain services will be at a greater risk for their customers to leave. For this example, finding patterns between the longevity of their customers, how much money they make and how much bandwidth they consume can steer marketers into grouping new customers coming in. Offering a discount or free month of service for these customers who have been there a long time and are all about saving money might go a long way in retaining customers.

Sources for Third-Party Code

Sree. (2020, October 26). Predict Customer Churn in Python. Towards Data Science. https://towardsdatascience.com/predict-customer-churn-in-python-e8cd6d3aaa7

(2022, August 19). Python Scikit-learn: K Nearest Neighbors – Create a plot of k values vs accuracy. W3resource. https://www.w3resource.com/machine-learning/scikit-learn/iris/python-machine-learning-k-nearest-neighbors-algorithm-exercise-8.php

Sources Grant, P. (2019, July 21). Introducing k-Nearest Neighbors. TowardDataScience. https://towardsdatascience.com/introducing-k-nearest-neighbors-7bcd10f938c5

Koidan, Kateryna. (2021, August 19). Most Popular Python Packages in 2021. LearnPython https://learnpython.com/blog/most-popular-python-packages/