Churn Modelling Marketing Data with Julia
Introduction
I prepared this analysis to learn logistic regression in Julia. The work is fairly straightforward: I model whether a customer will exit a website based on a number of features. I'll improve the model in an upcoming post. Asking if is not as interesting as asking when, so that will be my next project.
Imports
using Pkg
using DataFrames
using CSV
using Plots
using GLM
using StatsBase
using Lathe
using MLBase
using ClassImbalance
using ROCAnalysis
using PyCall
sklearn = pyimport("sklearn.metrics")
PyObject <module 'sklearn.metrics' from '/Users/jnapolitano/venvs/finance/lib/python3.9/site-packages/sklearn/metrics/__init__.py'>
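The definition of the load_csv helper did not survive in this post; a minimal sketch of what it presumably does, assuming it simply wraps CSV.read into a DataFrame:

```julia
# Hypothetical reconstruction of the lost helper; assumes a plain CSV.read wrapper
function load_csv(path)
    return CSV.read(path, DataFrame)
end
```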
load_csv (generic function with 1 method)
Loading Data
marketing_df = load_csv("Churn_Modelling.csv")  # file name assumed; the original argument was lost
first(marketing_df, 5)
| Row | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure |
|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | String31 | Int64 | String7 | String7 | Int64 | Int64 |
| 1 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 |
| 2 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 |
| 3 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 |
| 4 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 |
| 5 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 |
println(size(marketing_df))
describe(marketing_df)
(10000, 14)
| Row | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Union… | Nothing | DataType |
| 1 | RowNumber | 5000.5 | 1 | 5000.5 | 10000 | | | Int64 |
| 2 | CustomerId | 1.56909e7 | 15565701 | 1.56907e7 | 15815690 | | | Int64 |
| 3 | Surname | | Abazu | | Zuyeva | 2932 | | String31 |
| 4 | CreditScore | 650.529 | 350 | 652.0 | 850 | | | Int64 |
| 5 | Geography | | France | | Spain | 3 | | String7 |
| 6 | Gender | | Female | | Male | 2 | | String7 |
| 7 | Age | 38.9218 | 18 | 37.0 | 92 | | | Int64 |
| 8 | Tenure | 5.0128 | 0 | 5.0 | 10 | | | Int64 |
| 9 | Balance | 76485.9 | 0.0 | 97198.5 | 2.50898e5 | | | Float64 |
| 10 | NumOfProducts | 1.5302 | 1 | 1.0 | 4 | | | Int64 |
| 11 | HasCrCard | 0.7055 | 0 | 1.0 | 1 | | | Int64 |
| 12 | IsActiveMember | 0.5151 | 0 | 1.0 | 1 | | | Int64 |
| 13 | EstimatedSalary | 1.0009e5 | 11.58 | 1.00194e5 | 1.99992e5 | | | Float64 |
| 14 | Exited | 0.2037 | 0 | 0.0 | 1 | | | Int64 |
# Check column names
names(marketing_df)
14-element Vector{Symbol}:
:RowNumber
:CustomerId
:Surname
:CreditScore
:Geography
:Gender
:Age
:Tenure
:Balance
:NumOfProducts
:HasCrCard
:IsActiveMember
:EstimatedSalary
:Exited
Check Class Imbalance
# Count the classes
countmap(marketing_df.Exited)
Dict{Int64, Int64} with 2 entries:
0 => 7963
1 => 2037
Data Preprocessing
One Hot Encoding
# One-hot encode the categorical columns (arguments assumed from the model formula below)
Lathe.preprocess.OneHotEncode(marketing_df, :Gender)
Lathe.preprocess.OneHotEncode(marketing_df, :Geography)
# Drop the identifiers, the original categoricals, and one reference level of each
# dummy (Male, Germany) to avoid the dummy-variable trap
select!(marketing_df, Not([:RowNumber, :CustomerId, :Surname, :Gender, :Geography, :Male, :Germany]))
| Row | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember |
|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Float64 | Int64 | Int64 | Int64 |
| 1 | 619 | 42 | 2 | 0.0 | 1 | 1 | 1 |
| 2 | 608 | 41 | 1 | 83807.9 | 1 | 0 | 1 |
| 3 | 502 | 42 | 8 | 1.59661e5 | 3 | 1 | 0 |
| 4 | 699 | 39 | 1 | 0.0 | 2 | 0 | 0 |
| 5 | 850 | 43 | 2 | 1.25511e5 | 1 | 1 | 1 |
| 6 | 645 | 44 | 8 | 1.13756e5 | 2 | 1 | 0 |
| 7 | 822 | 50 | 7 | 0.0 | 2 | 1 | 1 |
| 8 | 376 | 29 | 4 | 1.15047e5 | 4 | 1 | 0 |
| 9 | 501 | 44 | 4 | 1.42051e5 | 2 | 0 | 1 |
| 10 | 684 | 27 | 2 | 1.34604e5 | 1 | 1 | 1 |
| 11 | 528 | 31 | 6 | 1.02017e5 | 2 | 0 | 0 |
| 12 | 497 | 24 | 3 | 0.0 | 2 | 1 | 0 |
| 13 | 476 | 34 | 10 | 0.0 | 2 | 1 | 0 |
| 14 | 549 | 25 | 5 | 0.0 | 2 | 0 | 0 |
| 15 | 635 | 35 | 7 | 0.0 | 2 | 1 | 1 |
| 16 | 616 | 45 | 3 | 1.43129e5 | 2 | 0 | 1 |
| 17 | 653 | 58 | 1 | 1.32603e5 | 1 | 1 | 0 |
| 18 | 549 | 24 | 9 | 0.0 | 2 | 1 | 1 |
| 19 | 587 | 45 | 6 | 0.0 | 1 | 0 | 0 |
| 20 | 726 | 24 | 6 | 0.0 | 2 | 1 | 1 |
| 21 | 732 | 41 | 8 | 0.0 | 2 | 1 | 1 |
| 22 | 636 | 32 | 8 | 0.0 | 2 | 1 | 0 |
| 23 | 510 | 38 | 4 | 0.0 | 1 | 1 | 0 |
| 24 | 669 | 46 | 3 | 0.0 | 2 | 0 | 1 |
| 25 | 846 | 38 | 5 | 0.0 | 1 | 1 | 1 |
| 26 | 577 | 25 | 3 | 0.0 | 2 | 0 | 1 |
| 27 | 756 | 36 | 2 | 1.36816e5 | 1 | 1 | 1 |
| 28 | 571 | 44 | 9 | 0.0 | 2 | 0 | 0 |
| 29 | 574 | 43 | 3 | 1.41349e5 | 1 | 1 | 1 |
| 30 | 411 | 29 | 0 | 59697.2 | 2 | 1 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Split Train and Test Data
# Train test split
using Lathe.preprocess: TrainTestSplit
train, test = TrainTestSplit(marketing_df, 0.75);  # split ratio assumed; consistent with the ~2,400-row test set below
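A quick sanity check on the split, assuming the 0.75 ratio above; the test set used later has 2,406 rows, which is consistent with a 75/25 split of 10,000:

```julia
# Expect roughly 7,500 training rows and 2,500 test rows
println(size(train, 1), " train rows, ", size(test, 1), " test rows")
```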
Build Model
# Train logistic regression model (formula and probit link taken from the fitted-model output below)
fm = @formula(Exited ~ CreditScore + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary + Female + France + Spain)
logit = glm(fm, train, Binomial(), ProbitLink())
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, ProbitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}
Exited ~ 1 + CreditScore + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary + Female + France + Spain
Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept) -1.90933 0.165007 -11.57 <1e-30 -2.23274 -1.58592
CreditScore -0.000321917 0.000183184 -1.76 0.0789 -0.000680951 3.71172e-5
Age 0.040893 0.00165251 24.75 <1e-99 0.0376541 0.0441318
Tenure -0.008864 0.00611129 -1.45 0.1469 -0.0208419 0.0031139
Balance 1.65933e-6 3.30286e-7 5.02 <1e-06 1.01198e-6 2.30668e-6
NumOfProducts -0.040173 0.0309946 -1.30 0.1949 -0.100921 0.0205753
HasCrCard -0.00442931 0.0386394 -0.11 0.9087 -0.0801612 0.0713026
IsActiveMember -0.557894 0.0365213 -15.28 <1e-51 -0.629475 -0.486314
EstimatedSalary 2.2925e-7 3.07604e-7 0.75 0.4561 -3.73644e-7 8.32143e-7
Female 0.301642 0.0354259 8.51 <1e-16 0.232209 0.371076
France -0.450226 0.0446176 -10.09 <1e-23 -0.537674 -0.362777
Spain -0.443184 0.051707 -8.57 <1e-16 -0.544527 -0.34184
───────────────────────────────────────────────────────────────────────────────────────
Model Predictions and Evaluation
# Predict the target variable on test data
prediction = predict(logit, test)
2406-element Vector{Union{Missing, Float64}}:
0.24401107345293602
0.1266535868551322
0.031721959583257124
0.11357816519004983
0.24824114578495612
0.024688755265128235
0.14209354336141483
0.18528877855991494
0.15470097145575007
0.25962439112051505
0.15117890643161475
0.2110682947689441
0.06358192272871947
⋮
0.24899439141513482
0.23449577199293972
0.13610439167926225
0.1737934374110589
0.1341643450975004
0.5831068095078078
0.2950497674661655
0.04139159536998556
0.06795785137729822
0.017204995327274736
0.12888818685657766
0.15310112069144077
# Convert probability score to class with a 0.5 threshold
prediction_class = [p < 0.5 ? 0 : 1 for p in prediction];
prediction_df = DataFrame(y_actual = test.Exited, y_predicted = prediction_class, prob_predicted = prediction);
prediction_df.correctly_classified = prediction_df.y_actual .== prediction_df.y_predicted
2406-element BitVector:
0
1
1
1
1
1
1
1
1
0
1
1
1
⋮
1
1
1
1
1
1
1
1
1
1
1
1
Prediction Accuracy
accuracy = mean(prediction_df.correctly_classified)
0.8100581878636741
Confusion Matrix
# confusion_matrix = confusmat(2,prediction_df.y_actual, prediction_df.y_predicted)
confusion_matrix = MLBase.roc(prediction_df.y_actual, prediction_df.y_predicted)
ROCNums{Int64}
p = 510
n = 1896
tp = 105
tn = 1844
fp = 52
fn = 405
Results
The model misses far too many exiting cases. Of the 510 true exits in the test set, it correctly catches only 105, leaving 405 false negatives, roughly four times the number it catches.
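To put numbers on that, recall and precision can be read off the ROCNums object above; a quick check using MLBase's accessors:

```julia
# Recall: share of true exits caught, tp / p = 105 / 510 ≈ 0.21
println("recall = ", MLBase.recall(confusion_matrix))
# Precision: share of predicted exits that are real, tp / (tp + fp) = 105 / 157 ≈ 0.67
println("precision = ", MLBase.precision(confusion_matrix))
```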
fpr, tpr, thresholds = sklearn.roc_curve(prediction_df.y_actual, prediction_df.prob_predicted)
([0.0, 0.0, 0.0, 0.0005274261603375527, 0.0005274261603375527, 0.0010548523206751054, 0.0010548523206751054, 0.0015822784810126582, 0.0015822784810126582, 0.0026371308016877636 … 0.8829113924050633, 0.9066455696202531, 0.9066455696202531, 0.9193037974683544, 0.9193037974683544, 0.92457805907173, 0.92457805907173, 0.9725738396624473, 0.9725738396624473, 1.0], [0.0, 0.00196078431372549, 0.00392156862745098, 0.00392156862745098, 0.00784313725490196, 0.00784313725490196, 0.01568627450980392, 0.01568627450980392, 0.03137254901960784, 0.03137254901960784 … 0.9921568627450981, 0.9921568627450981, 0.9941176470588236, 0.9941176470588236, 0.996078431372549, 0.996078431372549, 0.9980392156862745, 0.9980392156862745, 1.0, 1.0], [1.8467335270755767, 0.8467335270755767, 0.8140811888019499, 0.8092555110984978, 0.7970873802691381, 0.79684704533007, 0.7719016175181805, 0.7709263202992206, 0.7060214606993195, 0.6994801619873218 … 0.04233143871590189, 0.03786940431261241, 0.037850945580692276, 0.035665362242897694, 0.03532968973176317, 0.03416668456674327, 0.03407543014692377, 0.020932892669754958, 0.020885871157504798, 0.00597005405256463])
# Plot ROC curve
plot(fpr, tpr)
title!("ROC Curve")
The Class Imbalance Problem
# Count the classes
countmap(marketing_df.Exited)
Dict{Int64, Int64} with 2 entries:
0 => 7963
1 => 2037
SMOTE to Fix Imbalance
# SMOTE parameters assumed; these values reproduce the 6,111/6,111 class counts below
X2, y2 = smote(marketing_df[!, Not(:Exited)], marketing_df.Exited, k = 5, pct_under = 150, pct_over = 200);
df_balanced = X2;
df_balanced.Exited = y2;
df = df_balanced;
# Count the classes
countmap(df.Exited)
Dict{Int64, Int64} with 2 entries:
0 => 6111
1 => 6111
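These counts are consistent with the assumed SMOTE parameters: pct_over = 200 synthesizes two new minority points per original, and pct_under = 150 undersamples the majority to 150 percent of the synthesized points. A quick arithmetic check:

```julia
# Minority: 2,037 originals + 2 × 2,037 synthetic = 6,111 rows
minority = 2037
synthesized = 2 * minority
@assert minority + synthesized == 6111
# Majority: undersampled to 150% of the synthesized count, 1.5 × 4,074 = 6,111 rows
@assert round(Int, 1.5 * synthesized) == 6111
```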
Retest
# Train test split
train, test = TrainTestSplit(df, 0.75);
# Model Building
fm = @formula(Exited ~ CreditScore + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary + Female + France + Spain)
logit = glm(fm, train, Binomial(), ProbitLink())
# Predict the target variable on test data
prediction = predict(logit, test)
# Convert probability score to class with a 0.5 threshold
prediction_class = [p < 0.5 ? 0 : 1 for p in prediction];
prediction_df = DataFrame(y_actual = test.Exited, y_predicted = prediction_class, prob_predicted = prediction);
prediction_df.correctly_classified = prediction_df.y_actual .== prediction_df.y_predicted
# Accuracy Score
accuracy = mean(prediction_df.correctly_classified)
print("Accuracy of the model is : ", accuracy)
# Confusion Matrix
confusion_matrix = MLBase.roc(prediction_df.y_actual, prediction_df.y_predicted)
Accuracy of the model is : 0.7169563791407019
ROCNums{Int64}
p = 1550
n = 1499
tp = 1091
tn = 1095
fp = 404
fn = 459
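The same MLBase accessors show the improvement on the exit class; recall roughly triples relative to the first model:

```julia
# Recall on the balanced test set: 1091 / 1550 ≈ 0.70, up from ≈ 0.21 before SMOTE
println("recall = ", MLBase.recall(confusion_matrix))
println("f1 = ", MLBase.f1score(confusion_matrix))
```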
fpr, tpr, thresholds = sklearn.roc_curve(prediction_df.y_actual, prediction_df.prob_predicted)
([0.0, 0.0, 0.0, 0.00066711140760507, 0.00066711140760507, 0.00133422281521014, 0.00133422281521014, 0.0020013342228152103, 0.0020013342228152103, 0.00266844563042028 … 0.9846564376250834, 0.9893262174783188, 0.9893262174783188, 0.9913275517011341, 0.9913275517011341, 0.9973315543695798, 0.9986657771847899, 0.9993328885923949, 0.9993328885923949, 1.0], [0.0, 0.0006451612903225806, 0.0025806451612903226, 0.0025806451612903226, 0.005161290322580645, 0.005161290322580645, 0.007741935483870968, 0.007741935483870968, 0.00903225806451613, 0.00903225806451613 … 0.9980645161290322, 0.9980645161290322, 0.9987096774193548, 0.9987096774193548, 0.9993548387096775, 0.9993548387096775, 0.9993548387096775, 0.9993548387096775, 1.0, 1.0], [1.9907624292252022, 0.9907624292252022, 0.983731024429679, 0.97951657298985, 0.9730082291507035, 0.9713532719467679, 0.9629327481173712, 0.9604203755106321, 0.9593444340323958, 0.9584649467140461 … 0.06923199350115271, 0.06553287523911823, 0.06469253560487893, 0.058594401854125504, 0.057872556108602216, 0.034170953161915506, 0.03357051125028141, 0.03297342671224324, 0.030937011626933943, 0.023743078872535135])
# Plot ROC curve
plot(fpr, tpr)
title!("ROC Curve")
Final Discussion
When accounting for class imbalance, model accuracy drops to about 72 percent from about 81 percent.
While this seems counterintuitive, the second model is actually the better model overall.
The model with 81 percent accuracy is simply more accurate by chance: non-exits outnumber exits roughly four to one, so a model that leans toward predicting "remain" posts a high accuracy while catching few actual exits.
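A trivial baseline makes the point; always predicting the majority class already lands within two points of the first model's accuracy:

```julia
# Baseline: always predict "remain", 7963 / 10000 ≈ 0.796, vs. 0.810 for the first model
baseline_accuracy = 7963 / (7963 + 2037)
```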
When the classes are balanced, accuracy settles at about 72 percent, but the model now catches roughly 70 percent of true exits (1,091 of 1,550). Confidently, I can say that this model would generalize appropriately.
The first model, on the other hand, identifies only about 21 percent of the customers who actually exit (105 of 510), so its headline accuracy overstates its usefulness.