Churn Modelling Marketing Data with Julia

May 30, 2022

Introduction

I prepared this analysis to learn logistic regression in Julia. The work is fairly straightforward: I model whether a customer will exit based on a number of customer attributes. I'll improve the model in an upcoming post. Asking if a customer will leave is not as interesting as asking when; that will be my next project.

Imports

using Pkg
using DataFrames
using CSV
using Plots
using GLM
using StatsBase
using Lathe
using MLBase
using ClassImbalance
using ROCAnalysis
using PyCall
sklearn = pyimport("sklearn.metrics")
PyObject <module 'sklearn.metrics' from '/Users/jnapolitano/venvs/finance/lib/python3.9/site-packages/sklearn/metrics/__init__.py'>
# Helper to load the churn dataset into a DataFrame
function load_csv()
    df = DataFrame(CSV.File("./Churn_Modelling.csv"))
    return df
end
load_csv (generic function with 1 method)

Loading Data

marketing_df = load_csv()
first(marketing_df,5)

5 rows × 14 columns (omitted printing of 6 columns)

 Row │ RowNumber  CustomerId  Surname   CreditScore  Geography  Gender  Age    Tenure
     │ Int64      Int64       String31  Int64        String7    String7 Int64  Int64
─────┼────────────────────────────────────────────────────────────────────────────────
   1 │ 1          15634602    Hargrave  619          France     Female  42     2
   2 │ 2          15647311    Hill      608          Spain      Female  41     1
   3 │ 3          15619304    Onio      502          France     Female  42     8
   4 │ 4          15701354    Boni      699          France     Female  39     1
   5 │ 5          15737888    Mitchell  850          Spain      Female  43     2
println(size(marketing_df))
describe(marketing_df)
(10000, 14)

14 rows × 8 columns

 Row │ variable         mean       min       median     max        nunique  nmissing  eltype
     │ Symbol           Union…     Any       Union…     Any        Union…   Nothing   DataType
─────┼─────────────────────────────────────────────────────────────────────────────────────────
   1 │ RowNumber        5000.5     1         5000.5     10000                         Int64
   2 │ CustomerId       1.56909e7  15565701  1.56907e7  15815690                      Int64
   3 │ Surname                     Abazu                Zuyeva     2932              String31
   4 │ CreditScore      650.529    350       652.0      850                           Int64
   5 │ Geography                   France               Spain      3                 String7
   6 │ Gender                      Female               Male       2                 String7
   7 │ Age              38.9218    18        37.0       92                            Int64
   8 │ Tenure           5.0128     0         5.0        10                            Int64
   9 │ Balance          76485.9    0.0       97198.5    2.50898e5                     Float64
  10 │ NumOfProducts    1.5302     1         1.0        4                             Int64
  11 │ HasCrCard        0.7055     0         1.0        1                             Int64
  12 │ IsActiveMember   0.5151     0         1.0        1                             Int64
  13 │ EstimatedSalary  1.0009e5   11.58     1.00194e5  1.99992e5                     Float64
  14 │ Exited           0.2037     0         0.0        1                             Int64
# Check column names
names(marketing_df)
14-element Vector{Symbol}:
 :RowNumber
 :CustomerId
 :Surname
 :CreditScore
 :Geography
 :Gender
 :Age
 :Tenure
 :Balance
 :NumOfProducts
 :HasCrCard
 :IsActiveMember
 :EstimatedSalary
 :Exited

Check Class Imbalance

# Count the classes
countmap(marketing_df.Exited)
Dict{Int64, Int64} with 2 entries:
  0 => 7963
  1 => 2037

Data Preprocessing

One Hot Encoding

# One hot encoding
Lathe.preprocess.OneHotEncode(marketing_df,:Geography)
Lathe.preprocess.OneHotEncode(marketing_df,:Gender)
select!(marketing_df, Not([:RowNumber, :CustomerId,:Surname,:Geography,:Gender,:Male]))

10,000 rows × 13 columns (omitted printing of 6 columns)

 Row │ CreditScore  Age    Tenure  Balance    NumOfProducts  HasCrCard  IsActiveMember
     │ Int64        Int64  Int64   Float64    Int64          Int64      Int64
─────┼──────────────────────────────────────────────────────────────────────────────────
   1 │ 619          42     2       0.0        1              1          1
   2 │ 608          41     1       83807.9    1              0          1
   3 │ 502          42     8       1.59661e5  3              1          0
   4 │ 699          39     1       0.0        2              0          0
   5 │ 850          43     2       1.25511e5  1              1          1
   6 │ 645          44     8       1.13756e5  2              1          0
   7 │ 822          50     7       0.0        2              1          1
   8 │ 376          29     4       1.15047e5  4              1          0
   9 │ 501          44     4       1.42051e5  2              0          1
  10 │ 684          27     2       1.34604e5  1              1          1
  11 │ 528          31     6       1.02017e5  2              0          0
  12 │ 497          24     3       0.0        2              1          0
  13 │ 476          34     10      0.0        2              1          0
  14 │ 549          25     5       0.0        2              0          0
  15 │ 635          35     7       0.0        2              1          1
  16 │ 616          45     3       1.43129e5  2              0          1
  17 │ 653          58     1       1.32603e5  1              1          0
  18 │ 549          24     9       0.0        2              1          1
  19 │ 587          45     6       0.0        1              0          0
  20 │ 726          24     6       0.0        2              1          1
  21 │ 732          41     8       0.0        2              1          1
  22 │ 636          32     8       0.0        2              1          0
  23 │ 510          38     4       0.0        1              1          0
  24 │ 669          46     3       0.0        2              0          1
  25 │ 846          38     5       0.0        1              1          1
  26 │ 577          25     3       0.0        2              0          1
  27 │ 756          36     2       1.36816e5  1              1          1
  28 │ 571          44     9       0.0        2              0          0
  29 │ 574          43     3       1.41349e5  1              1          1
  30 │ 411          29     0       59697.2    2              1          1
  ⋮  │      ⋮         ⋮       ⋮        ⋮            ⋮             ⋮            ⋮

Split Train and Test Data

# Train test split
using Lathe.preprocess: TrainTestSplit
train, test = TrainTestSplit(marketing_df,.75);

Build Model

# Train the model. Note: Binomial() with ProbitLink() fits a probit model;
# swapping in LogitLink() would give a standard logistic regression.
fm = @formula(Exited ~ CreditScore + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary + Female + France + Spain)
logit = glm(fm, train, Binomial(), ProbitLink())
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, ProbitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

Exited ~ 1 + CreditScore + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary + Female + France + Spain

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
                        Coef.   Std. Error       z  Pr(>|z|)     Lower 95%    Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept)      -1.90933      0.165007     -11.57    <1e-30  -2.23274      -1.58592
CreditScore      -0.000321917  0.000183184   -1.76    0.0789  -0.000680951   3.71172e-5
Age               0.040893     0.00165251    24.75    <1e-99   0.0376541     0.0441318
Tenure           -0.008864     0.00611129    -1.45    0.1469  -0.0208419     0.0031139
Balance           1.65933e-6   3.30286e-7     5.02    <1e-06   1.01198e-6    2.30668e-6
NumOfProducts    -0.040173     0.0309946     -1.30    0.1949  -0.100921      0.0205753
HasCrCard        -0.00442931   0.0386394     -0.11    0.9087  -0.0801612     0.0713026
IsActiveMember   -0.557894     0.0365213    -15.28    <1e-51  -0.629475     -0.486314
EstimatedSalary   2.2925e-7    3.07604e-7     0.75    0.4561  -3.73644e-7    8.32143e-7
Female            0.301642     0.0354259      8.51    <1e-16   0.232209      0.371076
France           -0.450226     0.0446176    -10.09    <1e-23  -0.537674     -0.362777
Spain            -0.443184     0.051707      -8.57    <1e-16  -0.544527     -0.34184
───────────────────────────────────────────────────────────────────────────────────────

Model Predictions and Evaluation

# Predict the target variable on test data 
prediction = predict(logit,test)
2406-element Vector{Union{Missing, Float64}}:
 0.24401107345293602
 0.1266535868551322
 0.031721959583257124
 0.11357816519004983
 0.24824114578495612
 0.024688755265128235
 0.14209354336141483
 0.18528877855991494
 0.15470097145575007
 0.25962439112051505
 0.15117890643161475
 0.2110682947689441
 0.06358192272871947
 0.24899439141513482
 0.23449577199293972
 0.13610439167926225
 0.1737934374110589
 0.1341643450975004
 0.5831068095078078
 0.2950497674661655
 0.04139159536998556
 0.06795785137729822
 0.017204995327274736
 0.12888818685657766
 0.15310112069144077
# Convert probability score to class
prediction_class = [if x < 0.5 0 else 1 end for x in prediction];

prediction_df = DataFrame(y_actual = test.Exited, y_predicted = prediction_class, prob_predicted = prediction);
prediction_df.correctly_classified = prediction_df.y_actual .== prediction_df.y_predicted
2406-element BitVector:
 0
 1
 1
 1
 1
 1
 1
 1
 1
 0
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1

Prediction Accuracy

accuracy = mean(prediction_df.correctly_classified)
0.8100581878636741

Confusion Matrix

# confusion_matrix = confusmat(2,prediction_df.y_actual, prediction_df.y_predicted)
confusion_matrix = MLBase.roc(prediction_df.y_actual, prediction_df.y_predicted)
ROCNums{Int64}
  p = 510
  n = 1896
  tp = 105
  tn = 1844
  fp = 52
  fn = 405

Results

The accuracy looks respectable, but the confusion matrix tells a different story: of the 510 customers in the test set who actually exit, the model identifies only 105 and misclassifies 405 as staying. It flags just 157 customers as exits in total, far fewer than the true number, because it leans heavily toward the majority class.
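
A couple of derived numbers make this concrete. Exit-class recall and precision can be computed straight from the fields of the ROCNums value above; this is a quick sketch using the field names shown in the printout (tp, fp, p):

# Recall and precision for the exit class, from the confusion matrix above
exit_recall = confusion_matrix.tp / confusion_matrix.p                              # 105 / 510 ≈ 0.21
exit_precision = confusion_matrix.tp / (confusion_matrix.tp + confusion_matrix.fp)  # 105 / 157 ≈ 0.67
println("Exit recall: ", round(exit_recall, digits=2), ", exit precision: ", round(exit_precision, digits=2))

A recall around 0.21 means roughly four out of five exits go undetected, which motivates the class imbalance work below.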

fpr, tpr, thresholds = sklearn.roc_curve(prediction_df.y_actual, prediction_df.prob_predicted)
([0.0, 0.0, 0.0, 0.0005274261603375527, 0.0005274261603375527, 0.0010548523206751054, 0.0010548523206751054, 0.0015822784810126582, 0.0015822784810126582, 0.0026371308016877636  …  0.8829113924050633, 0.9066455696202531, 0.9066455696202531, 0.9193037974683544, 0.9193037974683544, 0.92457805907173, 0.92457805907173, 0.9725738396624473, 0.9725738396624473, 1.0], [0.0, 0.00196078431372549, 0.00392156862745098, 0.00392156862745098, 0.00784313725490196, 0.00784313725490196, 0.01568627450980392, 0.01568627450980392, 0.03137254901960784, 0.03137254901960784  …  0.9921568627450981, 0.9921568627450981, 0.9941176470588236, 0.9941176470588236, 0.996078431372549, 0.996078431372549, 0.9980392156862745, 0.9980392156862745, 1.0, 1.0], [1.8467335270755767, 0.8467335270755767, 0.8140811888019499, 0.8092555110984978, 0.7970873802691381, 0.79684704533007, 0.7719016175181805, 0.7709263202992206, 0.7060214606993195, 0.6994801619873218  …  0.04233143871590189, 0.03786940431261241, 0.037850945580692276, 0.035665362242897694, 0.03532968973176317, 0.03416668456674327, 0.03407543014692377, 0.020932892669754958, 0.020885871157504798, 0.00597005405256463])
# Plot ROC curve
plot(fpr, tpr)
title!("ROC curve")

[Figure: ROC curve]

The Class Imbalance Problem

# Count the classes
countmap(marketing_df.Exited)
Dict{Int64, Int64} with 2 entries:
  0 => 7963
  1 => 2037

SMOTE to Fix Imbalance

SMOTE generates synthetic examples of the minority class by interpolating between existing minority points; here it is combined with undersampling of the majority class so that both classes end up the same size.

X2, y2 =smote(marketing_df[!,[:CreditScore,:Age ,:Tenure, :Balance, :NumOfProducts, :HasCrCard, :IsActiveMember, :EstimatedSalary, :Female , :France, :Spain]], marketing_df.Exited, k = 5, pct_under = 150, pct_over = 200)
df_balanced = X2
df_balanced.Exited = y2;

df = df_balanced;

# Count the classes
countmap(df.Exited)
Dict{Int64, Int64} with 2 entries:
  0 => 6111
  1 => 6111

Retest

# Train test split
train, test = TrainTestSplit(df,.75);

# Model Building
fm = @formula(Exited ~ CreditScore + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary + Female + France + Spain)
logit = glm(fm, train, Binomial(), ProbitLink())

# Predict the target variable on test data 
prediction = predict(logit,test)

# Convert probability score to class
prediction_class = [if x < 0.5 0 else 1 end for x in prediction];

prediction_df = DataFrame(y_actual = test.Exited, y_predicted = prediction_class, prob_predicted = prediction);
prediction_df.correctly_classified = prediction_df.y_actual .== prediction_df.y_predicted


# Accuracy Score
accuracy = mean(prediction_df.correctly_classified)
print("Accuracy of the model is : ", accuracy)
Accuracy of the model is : 0.7169563791407019

# Confusion Matrix
confusion_matrix = MLBase.roc(prediction_df.y_actual, prediction_df.y_predicted)
ROCNums{Int64}
  p = 1550
  n = 1499
  tp = 1091
  tn = 1095
  fp = 404
  fn = 459
fpr, tpr, thresholds = sklearn.roc_curve(prediction_df.y_actual, prediction_df.prob_predicted)
([0.0, 0.0, 0.0, 0.00066711140760507, 0.00066711140760507, 0.00133422281521014, 0.00133422281521014, 0.0020013342228152103, 0.0020013342228152103, 0.00266844563042028  …  0.9846564376250834, 0.9893262174783188, 0.9893262174783188, 0.9913275517011341, 0.9913275517011341, 0.9973315543695798, 0.9986657771847899, 0.9993328885923949, 0.9993328885923949, 1.0], [0.0, 0.0006451612903225806, 0.0025806451612903226, 0.0025806451612903226, 0.005161290322580645, 0.005161290322580645, 0.007741935483870968, 0.007741935483870968, 0.00903225806451613, 0.00903225806451613  …  0.9980645161290322, 0.9980645161290322, 0.9987096774193548, 0.9987096774193548, 0.9993548387096775, 0.9993548387096775, 0.9993548387096775, 0.9993548387096775, 1.0, 1.0], [1.9907624292252022, 0.9907624292252022, 0.983731024429679, 0.97951657298985, 0.9730082291507035, 0.9713532719467679, 0.9629327481173712, 0.9604203755106321, 0.9593444340323958, 0.9584649467140461  …  0.06923199350115271, 0.06553287523911823, 0.06469253560487893, 0.058594401854125504, 0.057872556108602216, 0.034170953161915506, 0.03357051125028141, 0.03297342671224324, 0.030937011626933943, 0.023743078872535135])
# Plot ROC curve
plot(fpr, tpr)
title!("ROC curve")

[Figure: ROC curve]

Final Discussion

When accounting for class imbalance, model accuracy drops from about 81 percent to about 72 percent.

While this seems counterintuitive, the second model is the better model overall.

The first model's 81 percent accuracy is largely an artifact of the imbalance: customers who remain outnumber those who exit by roughly four to one, so a model that mostly predicts "remain" scores well even though it catches only about a fifth of the customers who actually leave (105 of 510 in the test set).

Once the classes are balanced with SMOTE, accuracy settles around 72 percent, but the model now identifies roughly 70 percent of exits (1091 of 1550). For a churn problem, that is the behaviour that matters, and it should hold up far better on new data.

The first model, by contrast, would perform much worse than its headline accuracy suggests anywhere that correctly flagging exits is the goal.
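
As a rough check on this argument, balanced accuracy (the mean of the per-class recalls) can be computed for both models from the two confusion matrices printed earlier; the sketch below just reuses those counts:

# Balanced accuracy = average of exit recall and stay recall,
# using the counts from the two ROCNums outputs above
bal_acc_original = (105 / 510 + 1844 / 1896) / 2     # ≈ 0.59
bal_acc_smote    = (1091 / 1550 + 1095 / 1499) / 2   # ≈ 0.72
println("Balanced accuracy, original data: ", round(bal_acc_original, digits=2))
println("Balanced accuracy, SMOTE data:    ", round(bal_acc_smote, digits=2))

Judged this way, the first model sits near 59 percent while the second sits near 72 percent, essentially matching its plain accuracy, which is what you would expect once the classes are balanced.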
