Data-Informed Thinking + Doing

Categorical Predictions Based on Probability

Predicting telecom customer churn (binary classification)—using logistic regression in R, Python, and Julia.

Getting Started

If you are interested in reproducing this work, here are the versions of R, Python, and Julia used (as well as the respective packages for each). Additionally, Leland Wilkinson’s approach to data visualization (Grammar of Graphics) has been adopted for this work. Finally, my coding style here is verbose, in order to trace back where functions/methods and variables are originating from, and make this a learning experience for everyone—including me.

import sys
print(sys.version)
3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
!pip install pandas==2.0.3
!pip install plotnine==0.12.1
!pip install scikit-learn==1.3.0
import random
import pandas
import plotnine
import sklearn
R.version.string
[1] "R version 4.2.3 (2023-03-15)"
require(devtools)
devtools::install_version("fst", version="0.9.4", repos="http://cran.us.r-project.org")
devtools::install_version("dplyr", version="1.0.4", repos="http://cran.us.r-project.org")
devtools::install_version("tibble", version="3.1.6", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.3.3", repos="http://cran.us.r-project.org")
devtools::install_version("ggcorrplot", version="0.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("caret", version="6.0-94", repos="http://cran.us.r-project.org")
library(dplyr)
library(tibble)
library(ggplot2)
library(ggcorrplot)
library(caret)
VERSION
v"1.9.2"
import Pkg
Pkg.add(name="CSV", version="0.9.11")
Pkg.add(name="DataFrames", version="1.3.0")
Pkg.add(name="CategoricalArrays", version="0.10.7")
Pkg.add(name="Colors", version="0.12.8")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.3.4")
Pkg.add(name="MLJ", version="0.16.11")
Pkg.add(name="GLM", version="1.5.1")
using CSV
using DataFrames
using Dates
using CategoricalArrays
using Colors
using Cairo
using Gadfly
using MLJ
using GLM

Importing and Examining Dataset

Upon importing and examining the dataset, we can see that the data frame dimension is 7043 rows and 21 columns.

# Semicolon prevents the variable from being printed
customers_jl = DataFrames.DataFrame(CSV.File("../../dataset/telecom-customer-churn.csv"));
size(customers_jl)
(7043, 21)
first(customers_jl, 7)
7×21 DataFrame
 Row │ customer_id  gender   senior_citizen  partner  dependents  tenure  phone_service  multiple_lines    internet_service  online_security  online_backup  device_protection  tech_support  streaming_tv  streaming_movies  contract        paperless_billing  payment_method             monthly_charges  total_charges  churn
     │ String15     String7  Int64           String3  String3     Int64   String3        String31          String15          String31         String31       String31           String31      String31      String31          String15        String3            String31                   Float64          String7        String3
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 7590-VHVEG   Female                0  Yes      No               1  No             No phone service  DSL               No               Yes            No                 No            No            No                Month-to-month  Yes                Electronic check                     29.85  29.85          No
   2 │ 5575-GNVDE   Male                  0  No       No              34  Yes            No                DSL               Yes              No             Yes                No            No            No                One year        No                 Mailed check                         56.95  1889.50        No
   3 │ 3668-QPYBK   Male                  0  No       No               2  Yes            No                DSL               Yes              Yes            No                 No            No            No                Month-to-month  Yes                Mailed check                         53.85  108.15         Yes
   4 │ 7795-CFOCW   Male                  0  No       No              45  No             No phone service  DSL               Yes              No             Yes                Yes           No            No                One year        No                 Bank transfer (automatic)            42.3   1840.75        No
   5 │ 9237-HQITU   Female                0  No       No               2  Yes            No                Fiber optic       No               No             No                 No            No            No                Month-to-month  Yes                Electronic check                     70.7   151.65         Yes
   6 │ 9305-CDSKC   Female                0  No       No               8  Yes            Yes               Fiber optic       No               No             Yes                No            Yes           Yes               Month-to-month  Yes                Electronic check                     99.65  820.50         Yes
   7 │ 1452-KIOVK   Male                  0  No       Yes             22  Yes            Yes               Fiber optic       No               Yes            No                 No            Yes           No                Month-to-month  Yes                Credit card (automatic)              89.1   1949.40        No
customers_py = pandas.read_csv("../../dataset/telecom-customer-churn.csv")
customers_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   customer_id        7043 non-null   object 
 1   gender             7043 non-null   object 
 2   senior_citizen     7043 non-null   int64  
 3   partner            7043 non-null   object 
 4   dependents         7043 non-null   object 
 5   tenure             7043 non-null   int64  
 6   phone_service      7043 non-null   object 
 7   multiple_lines     7043 non-null   object 
 8   internet_service   7043 non-null   object 
 9   online_security    7043 non-null   object 
 10  online_backup      7043 non-null   object 
 11  device_protection  7043 non-null   object 
 12  tech_support       7043 non-null   object 
 13  streaming_tv       7043 non-null   object 
 14  streaming_movies   7043 non-null   object 
 15  contract           7043 non-null   object 
 16  paperless_billing  7043 non-null   object 
 17  payment_method     7043 non-null   object 
 18  monthly_charges    7043 non-null   float64
 19  total_charges      7043 non-null   object 
 20  churn              7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
customers_py.head(n=7)
  customer_id  gender  senior_citizen partner dependents  tenure phone_service    multiple_lines internet_service online_security online_backup device_protection tech_support streaming_tv streaming_movies        contract paperless_billing             payment_method  monthly_charges total_charges churn
0  7590-VHVEG  Female               0     Yes         No       1            No  No phone service              DSL              No           Yes                No           No           No               No  Month-to-month               Yes           Electronic check            29.85         29.85    No
1  5575-GNVDE    Male               0      No         No      34           Yes                No              DSL             Yes            No               Yes           No           No               No        One year                No               Mailed check            56.95       1889.50    No
2  3668-QPYBK    Male               0      No         No       2           Yes                No              DSL             Yes           Yes                No           No           No               No  Month-to-month               Yes               Mailed check            53.85        108.15   Yes
3  7795-CFOCW    Male               0      No         No      45            No  No phone service              DSL             Yes            No               Yes          Yes           No               No        One year                No  Bank transfer (automatic)            42.30       1840.75    No
4  9237-HQITU  Female               0      No         No       2           Yes                No      Fiber optic              No            No                No           No           No               No  Month-to-month               Yes           Electronic check            70.70        151.65   Yes
5  9305-CDSKC  Female               0      No         No       8           Yes               Yes      Fiber optic              No            No               Yes           No          Yes              Yes  Month-to-month               Yes           Electronic check            99.65        820.50   Yes
6  1452-KIOVK    Male               0      No        Yes      22           Yes               Yes      Fiber optic              No           Yes                No           No          Yes               No  Month-to-month               Yes    Credit card (automatic)            89.10       1949.40    No
customers_r <- read.csv("../../dataset/telecom-customer-churn.csv", stringsAsFactors=TRUE)
str(object=customers_r)
'data.frame':	7043 obs. of  21 variables:
 $ customer_id      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
 $ gender           : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
 $ senior_citizen   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ partner          : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
 $ dependents       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
 $ tenure           : int  1 34 2 45 2 8 22 10 28 62 ...
 $ phone_service    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
 $ multiple_lines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
 $ internet_service : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
 $ online_security  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
 $ online_backup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
 $ device_protection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
 $ tech_support     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
 $ streaming_tv     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
 $ streaming_movies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
 $ contract         : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
 $ paperless_billing: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
 $ payment_method   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
 $ monthly_charges  : num  29.9 57 53.9 42.3 70.7 ...
 $ total_charges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
 $ churn            : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
head(x=customers_r, n=7)
  customer_id gender senior_citizen partner dependents tenure phone_service   multiple_lines internet_service online_security online_backup device_protection tech_support streaming_tv streaming_movies       contract paperless_billing            payment_method monthly_charges total_charges churn
1  7590-VHVEG Female              0     Yes         No      1            No No phone service              DSL              No           Yes                No           No           No               No Month-to-month               Yes          Electronic check              30            30    No
2  5575-GNVDE   Male              0      No         No     34           Yes               No              DSL             Yes            No               Yes           No           No               No       One year                No              Mailed check              57          1890    No
3  3668-QPYBK   Male              0      No         No      2           Yes               No              DSL             Yes           Yes                No           No           No               No Month-to-month               Yes              Mailed check              54           108   Yes
4  7795-CFOCW   Male              0      No         No     45            No No phone service              DSL             Yes            No               Yes          Yes           No               No       One year                No Bank transfer (automatic)              42          1841    No
5  9237-HQITU Female              0      No         No      2           Yes               No      Fiber optic              No            No                No           No           No               No Month-to-month               Yes          Electronic check              71           152   Yes
6  9305-CDSKC Female              0      No         No      8           Yes              Yes      Fiber optic              No            No               Yes           No          Yes              Yes Month-to-month               Yes          Electronic check             100           820   Yes
7  1452-KIOVK   Male              0      No        Yes     22           Yes              Yes      Fiber optic              No           Yes                No           No          Yes               No Month-to-month               Yes   Credit card (automatic)              89          1949    No

Wrangling Data

# Create copy of data frame
customers_clean_jl = customers_jl;

# View structure of clean data frame
first(customers_clean_jl, 7)
7×21 DataFrame
 Row │ customer_id  gender   senior_citizen  partner  dependents  tenure  phone_service  multiple_lines    internet_service  online_security  online_backup  device_protection  tech_support  streaming_tv  streaming_movies  contract        paperless_billing  payment_method             monthly_charges  total_charges  churn
     │ String15     String7  Int64           String3  String3     Int64   String3        String31          String15          String31         String31       String31           String31      String31      String31          String15        String3            String31                   Float64          String7        String3
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 7590-VHVEG   Female                0  Yes      No               1  No             No phone service  DSL               No               Yes            No                 No            No            No                Month-to-month  Yes                Electronic check                     29.85  29.85          No
   2 │ 5575-GNVDE   Male                  0  No       No              34  Yes            No                DSL               Yes              No             Yes                No            No            No                One year        No                 Mailed check                         56.95  1889.50        No
   3 │ 3668-QPYBK   Male                  0  No       No               2  Yes            No                DSL               Yes              Yes            No                 No            No            No                Month-to-month  Yes                Mailed check                         53.85  108.15         Yes
   4 │ 7795-CFOCW   Male                  0  No       No              45  No             No phone service  DSL               Yes              No             Yes                Yes           No            No                One year        No                 Bank transfer (automatic)            42.3   1840.75        No
   5 │ 9237-HQITU   Female                0  No       No               2  Yes            No                Fiber optic       No               No             No                 No            No            No                Month-to-month  Yes                Electronic check                     70.7   151.65         Yes
   6 │ 9305-CDSKC   Female                0  No       No               8  Yes            Yes               Fiber optic       No               No             Yes                No            Yes           Yes               Month-to-month  Yes                Electronic check                     99.65  820.50         Yes
   7 │ 1452-KIOVK   Male                  0  No       Yes             22  Yes            Yes               Fiber optic       No               Yes            No                 No            Yes           No                Month-to-month  Yes                Credit card (automatic)              89.1   1949.40        No
# Create copy of data frame
customers_clean_py = customers_py

# Convert value to float
customers_clean_py["total_charges"] = customers_clean_py["total_charges"].replace([" ", ""], 0)
customers_clean_py["total_charges"] = pandas.to_numeric(customers_clean_py["total_charges"], errors="coerce")

# Create dummy variables (one-hot encoding)
one_hot_multiple_lines = pandas.get_dummies(customers_clean_py["multiple_lines"], prefix="multiple_lines", prefix_sep="_")
one_hot_internet_service = pandas.get_dummies(customers_clean_py["internet_service"], prefix="internet_service", prefix_sep="_")
one_hot_online_security = pandas.get_dummies(customers_clean_py["online_security"], prefix="online_security", prefix_sep="_")
one_hot_online_backup = pandas.get_dummies(customers_clean_py["online_backup"], prefix="online_backup", prefix_sep="_")
one_hot_device_protection = pandas.get_dummies(customers_clean_py["device_protection"], prefix="device_protection", prefix_sep="_")
one_hot_tech_support = pandas.get_dummies(customers_clean_py["tech_support"], prefix="tech_support", prefix_sep="_")
one_hot_streaming_tv = pandas.get_dummies(customers_clean_py["streaming_tv"], prefix="streaming_tv", prefix_sep="_")
one_hot_streaming_movies = pandas.get_dummies(customers_clean_py["streaming_movies"], prefix="streaming_movies", prefix_sep="_")
one_hot_contract = pandas.get_dummies(customers_clean_py["contract"], prefix="contract", prefix_sep="_")
one_hot_payment_method = pandas.get_dummies(customers_clean_py["payment_method"], prefix="payment_method", prefix_sep="_")

# Concatinate vectors with data frame
customers_clean_py = pandas.concat(
    [
        customers_clean_py,
        one_hot_multiple_lines,
        one_hot_internet_service,
        one_hot_online_security,
        one_hot_online_backup,
        one_hot_device_protection,
        one_hot_tech_support,
        one_hot_streaming_tv,
        one_hot_streaming_movies,
        one_hot_contract,
        one_hot_payment_method
    ],
    axis=1
)

# Rename column name to lowercase with underscores
customers_clean_py.columns = customers_clean_py.columns.str.lower().str.replace("-", "_").str.replace(" ", "_").str.replace("(", "").str.replace(")", "")

# Drop columns with 100% variability or duplicative
customers_clean_py = customers_clean_py.drop(
    [
        "customer_id",
        "multiple_lines",
        "internet_service",
        "online_security",
        "online_backup",
        "device_protection",
        "tech_support",
        "streaming_tv",
        "streaming_movies",
        "contract",
        "payment_method"
    ],
    axis=1
)

# Sort columns in alphabetical order of column names
customers_clean_py = customers_clean_py.sort_index(axis=1)

# Convert boolean-like values to 0 or 1
boolean_mapping = {"Yes": 1, "Male": 1, True: 1, "No": 0, "Female": 0, False: 0}
customers_clean_py = customers_clean_py.replace(boolean_mapping)

# View structure of clean data frame
customers_clean_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 41 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   churn                                   7043 non-null   int64  
 1   contract_month_to_month                 7043 non-null   int64  
 2   contract_one_year                       7043 non-null   int64  
 3   contract_two_year                       7043 non-null   int64  
 4   dependents                              7043 non-null   int64  
 5   device_protection_no                    7043 non-null   int64  
 6   device_protection_no_internet_service   7043 non-null   int64  
 7   device_protection_yes                   7043 non-null   int64  
 8   gender                                  7043 non-null   int64  
 9   internet_service_dsl                    7043 non-null   int64  
 10  internet_service_fiber_optic            7043 non-null   int64  
 11  internet_service_no                     7043 non-null   int64  
 12  monthly_charges                         7043 non-null   float64
 13  multiple_lines_no                       7043 non-null   int64  
 14  multiple_lines_no_phone_service         7043 non-null   int64  
 15  multiple_lines_yes                      7043 non-null   int64  
 16  online_backup_no                        7043 non-null   int64  
 17  online_backup_no_internet_service       7043 non-null   int64  
 18  online_backup_yes                       7043 non-null   int64  
 19  online_security_no                      7043 non-null   int64  
 20  online_security_no_internet_service     7043 non-null   int64  
 21  online_security_yes                     7043 non-null   int64  
 22  paperless_billing                       7043 non-null   int64  
 23  partner                                 7043 non-null   int64  
 24  payment_method_bank_transfer_automatic  7043 non-null   int64  
 25  payment_method_credit_card_automatic    7043 non-null   int64  
 26  payment_method_electronic_check         7043 non-null   int64  
 27  payment_method_mailed_check             7043 non-null   int64  
 28  phone_service                           7043 non-null   int64  
 29  senior_citizen                          7043 non-null   int64  
 30  streaming_movies_no                     7043 non-null   int64  
 31  streaming_movies_no_internet_service    7043 non-null   int64  
 32  streaming_movies_yes                    7043 non-null   int64  
 33  streaming_tv_no                         7043 non-null   int64  
 34  streaming_tv_no_internet_service        7043 non-null   int64  
 35  streaming_tv_yes                        7043 non-null   int64  
 36  tech_support_no                         7043 non-null   int64  
 37  tech_support_no_internet_service        7043 non-null   int64  
 38  tech_support_yes                        7043 non-null   int64  
 39  tenure                                  7043 non-null   int64  
 40  total_charges                           7043 non-null   float64
dtypes: float64(2), int64(39)
memory usage: 2.2 MB
customers_clean_py.head(n=7)
   churn  contract_month_to_month  contract_one_year  contract_two_year  dependents  device_protection_no  device_protection_no_internet_service  device_protection_yes  gender  internet_service_dsl  internet_service_fiber_optic  internet_service_no  monthly_charges  multiple_lines_no  multiple_lines_no_phone_service  multiple_lines_yes  online_backup_no  online_backup_no_internet_service  online_backup_yes  online_security_no  online_security_no_internet_service  online_security_yes  paperless_billing  partner  payment_method_bank_transfer_automatic  payment_method_credit_card_automatic  payment_method_electronic_check  payment_method_mailed_check  phone_service  senior_citizen  streaming_movies_no  streaming_movies_no_internet_service  streaming_movies_yes  streaming_tv_no  streaming_tv_no_internet_service  streaming_tv_yes  tech_support_no  tech_support_no_internet_service  tech_support_yes  tenure  total_charges
0      0                        1                  0                  0           0                     1                                      0                      0       0                     1                             0                    0            29.85                  0                                1                   0                 0                                  0                  1                   1                                    0                    0                  1        1                                       0                                     0                                1                            0              0               0                    1                                     0                     0                1                                 0                 0                1                                 0                 0       1          29.85
1      0                        0                  1                  0           0                     0                                      0                      1       1                     1                             0                    0            56.95                  1                                0                   0                 1                                  0                  0                   0                                    0                    1                  0        0                                       0                                     0                                0                            1              1               0                    1                                     0                     0                1                                 0                 0                1                                 0                 0      34        1889.50
2      1                        1                  0                  0           0                     1                                      0                      0       1                     1                             0                    0            53.85                  1                                0                   0                 0                                  0                  1                   0                                    0                    1                  1        0                                       0                                     0                                0                            1              1               0                    1                                     0                     0                1                                 0                 0                1                                 0                 0       2         108.15
3      0                        0                  1                  0           0                     0                                      0                      1       1                     1                             0                    0            42.30                  0                                1                   0                 1                                  0                  0                   0                                    0                    1                  0        0                                       1                                     0                                0                            0              0               0                    1                                     0                     0                1                                 0                 0                0                                 0                 1      45        1840.75
4      1                        1                  0                  0           0                     1                                      0                      0       0                     0                             1                    0            70.70                  1                                0                   0                 1                                  0                  0                   1                                    0                    0                  1        0                                       0                                     0                                1                            0              1               0                    1                                     0                     0                1                                 0                 0                1                                 0                 0       2         151.65
5      1                        1                  0                  0           0                     0                                      0                      1       0                     0                             1                    0            99.65                  0                                0                   1                 1                                  0                  0                   1                                    0                    0                  1        0                                       0                                     0                                1                            0              1               0                    0                                     0                     1                0                                 0                 1                1                                 0                 0       8         820.50
6      0                        1                  0                  0           1                     1                                      0                      0       1                     0                             1                    0            89.10                  0                                0                   1                 0                                  0                  1                   1                                    0                    0                  1        0                                       0                                     1                                0                            0              1               0                    1                                     0                     0                0                                 0                 1                1                                 0                 0      22        1949.40
# Create copy of data frame
customers_clean_r <- customers_r

# View structure of clean data frame
str(object=customers_clean_r)
'data.frame':	7043 obs. of  21 variables:
 $ customer_id      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
 $ gender           : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
 $ senior_citizen   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ partner          : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
 $ dependents       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
 $ tenure           : int  1 34 2 45 2 8 22 10 28 62 ...
 $ phone_service    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
 $ multiple_lines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
 $ internet_service : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
 $ online_security  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
 $ online_backup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
 $ device_protection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
 $ tech_support     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
 $ streaming_tv     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
 $ streaming_movies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
 $ contract         : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
 $ paperless_billing: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
 $ payment_method   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
 $ monthly_charges  : num  29.9 57 53.9 42.3 70.7 ...
 $ total_charges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
 $ churn            : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Performing Exploratory Data Analysis (EDA)

show(describe(customers_clean_jl), allrows=true)
21×7 DataFrame
 Row │ variable           mean      min                        median  max           nmissing  eltype
     │ Symbol             Union…    Any                        Union…  Any           Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ customer_id                  0002-ORFBO                         9995-HOTOH           0  String15
   2 │ gender                       Female                             Male                 0  String7
   3 │ senior_citizen     0.162147  0                          0.0     1                    0  Int64
   4 │ partner                      No                                 Yes                  0  String3
   5 │ dependents                   No                                 Yes                  0  String3
   6 │ tenure             32.3711   0                          29.0    72                   0  Int64
   7 │ phone_service                No                                 Yes                  0  String3
   8 │ multiple_lines               No                                 Yes                  0  String31
   9 │ internet_service             DSL                                No                   0  String15
  10 │ online_security              No                                 Yes                  0  String31
  11 │ online_backup                No                                 Yes                  0  String31
  12 │ device_protection            No                                 Yes                  0  String31
  13 │ tech_support                 No                                 Yes                  0  String31
  14 │ streaming_tv                 No                                 Yes                  0  String31
  15 │ streaming_movies             No                                 Yes                  0  String31
  16 │ contract                     Month-to-month                     Two year             0  String15
  17 │ paperless_billing            No                                 Yes                  0  String3
  18 │ payment_method               Bank transfer (automatic)          Mailed check         0  String31
  19 │ monthly_charges    64.7617   18.25                      70.35   118.75               0  Float64
  20 │ total_charges                                                   999.90               0  String7
  21 │ churn                        No                                 Yes                  0  String3
customers_clean_py.describe(include="all")
             churn  contract_month_to_month  contract_one_year  contract_two_year   dependents  device_protection_no  device_protection_no_internet_service  device_protection_yes       gender  internet_service_dsl  internet_service_fiber_optic  internet_service_no  monthly_charges  multiple_lines_no  multiple_lines_no_phone_service  multiple_lines_yes  online_backup_no  online_backup_no_internet_service  online_backup_yes  online_security_no  online_security_no_internet_service  online_security_yes  paperless_billing      partner  payment_method_bank_transfer_automatic  payment_method_credit_card_automatic  payment_method_electronic_check  payment_method_mailed_check  phone_service  senior_citizen  streaming_movies_no  streaming_movies_no_internet_service  streaming_movies_yes  streaming_tv_no  streaming_tv_no_internet_service  streaming_tv_yes  tech_support_no  tech_support_no_internet_service  tech_support_yes       tenure  total_charges
count  7043.000000              7043.000000        7043.000000        7043.000000  7043.000000           7043.000000                            7043.000000            7043.000000  7043.000000           7043.000000                   7043.000000          7043.000000      7043.000000        7043.000000                      7043.000000         7043.000000       7043.000000                        7043.000000        7043.000000         7043.000000                          7043.000000          7043.000000        7043.000000  7043.000000                             7043.000000                           7043.000000                      7043.000000                  7043.000000    7043.000000     7043.000000          7043.000000                           7043.000000           7043.000000      7043.000000                       7043.000000       7043.000000      7043.000000                       7043.000000       7043.000000  7043.000000    7043.000000
mean      0.265370                 0.550192           0.209144           0.240664     0.299588              0.439443                               0.216669               0.343888     0.504756              0.343746                      0.439585             0.216669        64.761692           0.481329                         0.096834            0.421837          0.438450                           0.216669           0.344881            0.496663                             0.216669             0.286668           0.592219     0.483033                                0.219225                              0.216101                         0.335794                     0.228880       0.903166        0.162147             0.395428                              0.216669              0.387903         0.398978                          0.216669          0.384353         0.493114                          0.216669          0.290217    32.371149    2279.734304
std       0.441561                 0.497510           0.406726           0.427517     0.458110              0.496355                               0.412004               0.475038     0.500013              0.474991                      0.496372             0.412004        30.090047           0.499687                         0.295752            0.493888          0.496232                           0.412004           0.475363            0.500024                             0.412004             0.452237           0.491457     0.499748                                0.413751                              0.411613                         0.472301                     0.420141       0.295752        0.368612             0.488977                              0.412004              0.487307         0.489723                          0.412004          0.486477         0.499988                          0.412004          0.453895    24.559481    2266.794470
min       0.000000                 0.000000           0.000000           0.000000     0.000000              0.000000                               0.000000               0.000000     0.000000              0.000000                      0.000000             0.000000        18.250000           0.000000                         0.000000            0.000000          0.000000                           0.000000           0.000000            0.000000                             0.000000             0.000000           0.000000     0.000000                                0.000000                              0.000000                         0.000000                     0.000000       0.000000        0.000000             0.000000                              0.000000              0.000000         0.000000                          0.000000          0.000000         0.000000                          0.000000          0.000000     0.000000       0.000000
25%       0.000000                 0.000000           0.000000           0.000000     0.000000              0.000000                               0.000000               0.000000     0.000000              0.000000                      0.000000             0.000000        35.500000           0.000000                         0.000000            0.000000          0.000000                           0.000000           0.000000            0.000000                             0.000000             0.000000           0.000000     0.000000                                0.000000                              0.000000                         0.000000                     0.000000       1.000000        0.000000             0.000000                              0.000000              0.000000         0.000000                          0.000000          0.000000         0.000000                          0.000000          0.000000     9.000000     398.550000
50%       0.000000                 1.000000           0.000000           0.000000     0.000000              0.000000                               0.000000               0.000000     1.000000              0.000000                      0.000000             0.000000        70.350000           0.000000                         0.000000            0.000000          0.000000                           0.000000           0.000000            0.000000                             0.000000             0.000000           1.000000     0.000000                                0.000000                              0.000000                         0.000000                     0.000000       1.000000        0.000000             0.000000                              0.000000              0.000000         0.000000                          0.000000          0.000000         0.000000                          0.000000          0.000000    29.000000    1394.550000
75%       1.000000                 1.000000           0.000000           0.000000     1.000000              1.000000                               0.000000               1.000000     1.000000              1.000000                      1.000000             0.000000        89.850000           1.000000                         0.000000            1.000000          1.000000                           0.000000           1.000000            1.000000                             0.000000             1.000000           1.000000     1.000000                                0.000000                              0.000000                         1.000000                     0.000000       1.000000        0.000000             1.000000                              0.000000              1.000000         1.000000                          0.000000          1.000000         1.000000                          0.000000          1.000000    55.000000    3786.600000
max       1.000000                 1.000000           1.000000           1.000000     1.000000              1.000000                               1.000000               1.000000     1.000000              1.000000                      1.000000             1.000000       118.750000           1.000000                         1.000000            1.000000          1.000000                           1.000000           1.000000            1.000000                             1.000000             1.000000           1.000000     1.000000                                1.000000                              1.000000                         1.000000                     1.000000       1.000000        1.000000             1.000000                              1.000000              1.000000         1.000000                          1.000000          1.000000         1.000000                          1.000000          1.000000    72.000000    8684.800000
summary(customers_clean_r)
     customer_id      gender     senior_citizen partner    dependents     tenure   phone_service          multiple_lines    internet_service            online_security             online_backup            device_protection              tech_support               streaming_tv             streaming_movies           contract    paperless_billing                   payment_method monthly_charges total_charges  churn     
 0002-ORFBO:   1   Female:3488   Min.   :0.00   No :3641   No :4933   Min.   : 0   No : 682      No              :3390   DSL        :2421    No                 :3498   No                 :3088   No                 :3095    No                 :3473   No                 :2810   No                 :2785    Month-to-month:3875   No :2872          Bank transfer (automatic):1544   Min.   : 18     Min.   :  19   No :5174  
 0003-MKNFE:   1   Male  :3555   1st Qu.:0.00   Yes:3402   Yes:2110   1st Qu.: 9   Yes:6361      No phone service: 682   Fiber optic:3096    No internet service:1526   No internet service:1526   No internet service:1526    No internet service:1526   No internet service:1526   No internet service:1526    One year      :1473   Yes:4171          Credit card (automatic)  :1522   1st Qu.: 36     1st Qu.: 401   Yes:1869  
 0004-TLHLJ:   1                 Median :0.00                         Median :29                 Yes             :2971   No         :1526    Yes                :2019   Yes                :2429   Yes                :2422    Yes                :2044   Yes                :2707   Yes                :2732    Two year      :1695                     Electronic check         :2365   Median : 70     Median :1397             
 0011-IGKFF:   1                 Mean   :0.16                         Mean   :32                                                                                                                                                                                                                                                                         Mailed check             :1612   Mean   : 65     Mean   :2283             
 0013-EXCHZ:   1                 3rd Qu.:0.00                         3rd Qu.:55                                                                                                                                                                                                                                                                                                          3rd Qu.: 90     3rd Qu.:3795             
 0013-MHZWF:   1                 Max.   :1.00                         Max.   :72                                                                                                                                                                                                                                                                                                          Max.   :119     Max.   :8685             
 (Other)   :7037                                                                                                                                                                                                                                                                                                                                                                                          NA's   :11               

Splitting Data for Training and Testing

# x_py = customers_clean_py.drop("churn", axis=1).values
# y_py = customers_clean_py["churn"].values
# 
# train_x_py, test_x_py, train_y_py, test_y_py = sklearn.model_selection.train_test_split(
#     x_py,
#     y_py,
#     test_size=0.2,
#     random_state=1754,
#     stratify=y_py
# )
set.seed(1754)
train_indices <- caret::createDataPartition(customers_clean_r$churn, p=0.8, list=FALSE)
customers_clean_train_r <- customers_clean_r[train_indices, ]
customers_clean_test_r <- customers_clean_r[-train_indices, ]
# customers_clean_train_x_r <- customers_clean_r[train_index, -c("churn")]
# customers_clean_train_y_r <- customers_clean_r[train_index, "churn"]
# customers_clean_test_x_r <- customers_clean_r[-train_index, -c("churn")]
# customers_clean_test_y_r <- customers_clean_r[-train_index, "churn"]

Training the Initial Logististic Regression Model

# model_1_py = sklearn.linear_model.LogisticRegression()
# model_1_py.fit(train_x_py, train_y_py)
# model_1_r = stats::glm(formula = churn ~ ., data=customers_clean_train_r, family="binomial")
# summary(model_1_r)

Predicting the Probabilities

# model_1_y_prediction_py = model_1_py.predict(test_x_py)
# model_1_y_prediction_py[:7]
# model_1_y_probability_py = model_1_py.predict_proba(test_x_py)[:, 1]
# model_1_y_probability_py[:7]
# rate_false_positive_py, rate_true_positive_py, thresholds_py = sklearn.metrics.roc_curve(test_y_py, model_1_y_probability_py)
# rate_false_positive_py[:7]
# rate_true_positive_py[:7]
# thresholds_py[:7]
# roc_curve_model_1_py = (plotnine.ggplot()
#     + plotnine.geoms.geom_line(
#         plotnine.mapping.aes(x=rate_false_positive_py, y=rate_true_positive_py),
#         color=palette_michaelmallari_py[2],
#         size=1.5
#     )
#     + plotnine.geoms.geom_abline(
#         linetype="dashed",
#         color=palette_michaelmallari_py[12],
#         size=0.5
#     )
#     + plotnine.labels.labs(
#         title="ROC Curve for Churn Prediction",
#         x="False Positive Rate",
#         y="True Positive Rate"
#     )
#     + plotnine.scales.scale_x_continuous(expand=(0, 0))
#     + plotnine.scales.scale_y_continuous(expand=(0, 0))
#     + theme_michaelmallari_py
# )
# 
# roc_curve_model_1_py

Fitting a Simple Logististic Regression Model

Fitting a Multiple Logististic Regression Model


References

  • Shmueli, G., Patel, N. R., & Bruce, P. C. (2007). Data Mining for Business Intelligence. Wiley.
  • Albright, S. C., Winston, W. L., & Zappe, C. (2003). Data Analysis for Managers with Microsoft Excel (2nd ed.). South-Western College Publishing.
Applied Advanced Analytics & AI in Sports