Categorical Predictions Based on Probability
Predicting telecom customer churn (binary classification)—using logistic regression in R, Python, and Julia.
Getting Started
If you are interested in reproducing this work, here are the versions of R, Python, and Julia used (as well as the respective packages for each). Additionally, Leland Wilkinson’s approach to data visualization (Grammar of Graphics) has been adopted for this work. Finally, my coding style here is verbose, in order to trace back where functions/methods and variables are originating from, and make this a learning experience for everyone—including me.
import sys
print(sys.version)
3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
!pip install pandas==2.0.3
!pip install plotnine==0.12.1
!pip install scikit-learn==1.3.0
import random
import pandas
import plotnine
import sklearn
R.version.string
[1] "R version 4.2.3 (2023-03-15)"
require(devtools)
devtools::install_version("fst", version="0.9.4", repos="http://cran.us.r-project.org")
devtools::install_version("dplyr", version="1.0.4", repos="http://cran.us.r-project.org")
devtools::install_version("tibble", version="3.1.6", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.3.3", repos="http://cran.us.r-project.org")
devtools::install_version("ggcorrplot", version="0.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("caret", version="6.0-94", repos="http://cran.us.r-project.org")
library(dplyr)
library(tibble)
library(ggplot2)
library(ggcorrplot)
library(caret)
VERSION
v"1.9.2"
import Pkg
Pkg.add(name="CSV", version="0.9.11")
Pkg.add(name="DataFrames", version="1.3.0")
Pkg.add(name="CategoricalArrays", version="0.10.7")
Pkg.add(name="Colors", version="0.12.8")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.3.4")
Pkg.add(name="MLJ", version="0.16.11")
Pkg.add(name="GLM", version="1.5.1")
using CSV
using DataFrames
using Dates
using CategoricalArrays
using Colors
using Cairo
using Gadfly
using MLJ
using GLM
Importing and Examining Dataset
Upon importing and examining the dataset, we can see that the data frame dimension is 7043
rows and 21
columns.
# Semicolon prevents the variable from being printed
customers_jl = DataFrames.DataFrame(CSV.File("../../dataset/telecom-customer-churn.csv"));
size(customers_jl)
(7043, 21)
first(customers_jl, 7)
7×21 DataFrame
Row │ customer_id gender senior_citizen partner dependents tenure phone_service multiple_lines internet_service online_security online_backup device_protection tech_support streaming_tv streaming_movies contract paperless_billing payment_method monthly_charges total_charges churn
│ String15 String7 Int64 String3 String3 Int64 String3 String31 String15 String31 String31 String31 String31 String31 String31 String15 String3 String31 Float64 String7 String3
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
2 │ 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
3 │ 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
4 │ 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.3 1840.75 No
5 │ 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.7 151.65 Yes
6 │ 9305-CDSKC Female 0 No No 8 Yes Yes Fiber optic No No Yes No Yes Yes Month-to-month Yes Electronic check 99.65 820.50 Yes
7 │ 1452-KIOVK Male 0 No Yes 22 Yes Yes Fiber optic No Yes No No Yes No Month-to-month Yes Credit card (automatic) 89.1 1949.40 No
customers_py = pandas.read_csv("../../dataset/telecom-customer-churn.csv")
customers_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customer_id 7043 non-null object
1 gender 7043 non-null object
2 senior_citizen 7043 non-null int64
3 partner 7043 non-null object
4 dependents 7043 non-null object
5 tenure 7043 non-null int64
6 phone_service 7043 non-null object
7 multiple_lines 7043 non-null object
8 internet_service 7043 non-null object
9 online_security 7043 non-null object
10 online_backup 7043 non-null object
11 device_protection 7043 non-null object
12 tech_support 7043 non-null object
13 streaming_tv 7043 non-null object
14 streaming_movies 7043 non-null object
15 contract 7043 non-null object
16 paperless_billing 7043 non-null object
17 payment_method 7043 non-null object
18 monthly_charges 7043 non-null float64
19 total_charges 7043 non-null object
20 churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
customers_py.head(n=7)
customer_id gender senior_citizen partner dependents tenure phone_service multiple_lines internet_service online_security online_backup device_protection tech_support streaming_tv streaming_movies contract paperless_billing payment_method monthly_charges total_charges churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
5 9305-CDSKC Female 0 No No 8 Yes Yes Fiber optic No No Yes No Yes Yes Month-to-month Yes Electronic check 99.65 820.50 Yes
6 1452-KIOVK Male 0 No Yes 22 Yes Yes Fiber optic No Yes No No Yes No Month-to-month Yes Credit card (automatic) 89.10 1949.40 No
customers_r <- read.csv("../../dataset/telecom-customer-churn.csv", stringsAsFactors=TRUE)
str(object=customers_r)
'data.frame': 7043 obs. of 21 variables:
$ customer_id : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
$ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
$ senior_citizen : int 0 0 0 0 0 0 0 0 0 0 ...
$ partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
$ dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
$ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
$ phone_service : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
$ multiple_lines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
$ internet_service : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
$ online_security : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
$ online_backup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
$ device_protection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
$ tech_support : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
$ streaming_tv : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
$ streaming_movies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
$ contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
$ paperless_billing: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
$ payment_method : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
$ monthly_charges : num 29.9 57 53.9 42.3 70.7 ...
$ total_charges : num 29.9 1889.5 108.2 1840.8 151.7 ...
$ churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
head(x=customers_r, n=7)
customer_id gender senior_citizen partner dependents tenure phone_service multiple_lines internet_service online_security online_backup device_protection tech_support streaming_tv streaming_movies contract paperless_billing payment_method monthly_charges total_charges churn
1 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 30 30 No
2 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 57 1890 No
3 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 54 108 Yes
4 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42 1841 No
5 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 71 152 Yes
6 9305-CDSKC Female 0 No No 8 Yes Yes Fiber optic No No Yes No Yes Yes Month-to-month Yes Electronic check 100 820 Yes
7 1452-KIOVK Male 0 No Yes 22 Yes Yes Fiber optic No Yes No No Yes No Month-to-month Yes Credit card (automatic) 89 1949 No
Wrangling Data
# Create copy of data frame
customers_clean_jl = customers_jl;
# View structure of clean data frame
first(customers_clean_jl, 7)
7×21 DataFrame
Row │ customer_id gender senior_citizen partner dependents tenure phone_service multiple_lines internet_service online_security online_backup device_protection tech_support streaming_tv streaming_movies contract paperless_billing payment_method monthly_charges total_charges churn
│ String15 String7 Int64 String3 String3 Int64 String3 String31 String15 String31 String31 String31 String31 String31 String31 String15 String3 String31 Float64 String7 String3
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
2 │ 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
3 │ 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
4 │ 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.3 1840.75 No
5 │ 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.7 151.65 Yes
6 │ 9305-CDSKC Female 0 No No 8 Yes Yes Fiber optic No No Yes No Yes Yes Month-to-month Yes Electronic check 99.65 820.50 Yes
7 │ 1452-KIOVK Male 0 No Yes 22 Yes Yes Fiber optic No Yes No No Yes No Month-to-month Yes Credit card (automatic) 89.1 1949.40 No
# Create copy of data frame
customers_clean_py = customers_py
# Convert value to float
customers_clean_py["total_charges"] = customers_clean_py["total_charges"].replace([" ", ""], 0)
customers_clean_py["total_charges"] = pandas.to_numeric(customers_clean_py["total_charges"], errors="coerce")
# Create dummy variables (one-hot encoding)
one_hot_multiple_lines = pandas.get_dummies(customers_clean_py["multiple_lines"], prefix="multiple_lines", prefix_sep="_")
one_hot_internet_service = pandas.get_dummies(customers_clean_py["internet_service"], prefix="internet_service", prefix_sep="_")
one_hot_online_security = pandas.get_dummies(customers_clean_py["online_security"], prefix="online_security", prefix_sep="_")
one_hot_online_backup = pandas.get_dummies(customers_clean_py["online_backup"], prefix="online_backup", prefix_sep="_")
one_hot_device_protection = pandas.get_dummies(customers_clean_py["device_protection"], prefix="device_protection", prefix_sep="_")
one_hot_tech_support = pandas.get_dummies(customers_clean_py["tech_support"], prefix="tech_support", prefix_sep="_")
one_hot_streaming_tv = pandas.get_dummies(customers_clean_py["streaming_tv"], prefix="streaming_tv", prefix_sep="_")
one_hot_streaming_movies = pandas.get_dummies(customers_clean_py["streaming_movies"], prefix="streaming_movies", prefix_sep="_")
one_hot_contract = pandas.get_dummies(customers_clean_py["contract"], prefix="contract", prefix_sep="_")
one_hot_payment_method = pandas.get_dummies(customers_clean_py["payment_method"], prefix="payment_method", prefix_sep="_")
# Concatinate vectors with data frame
customers_clean_py = pandas.concat(
[
customers_clean_py,
one_hot_multiple_lines,
one_hot_internet_service,
one_hot_online_security,
one_hot_online_backup,
one_hot_device_protection,
one_hot_tech_support,
one_hot_streaming_tv,
one_hot_streaming_movies,
one_hot_contract,
one_hot_payment_method
],
axis=1
)
# Rename column name to lowercase with underscores
customers_clean_py.columns = customers_clean_py.columns.str.lower().str.replace("-", "_").str.replace(" ", "_").str.replace("(", "").str.replace(")", "")
# Drop columns with 100% variability or duplicative
customers_clean_py = customers_clean_py.drop(
[
"customer_id",
"multiple_lines",
"internet_service",
"online_security",
"online_backup",
"device_protection",
"tech_support",
"streaming_tv",
"streaming_movies",
"contract",
"payment_method"
],
axis=1
)
# Sort columns in alphabetical order of column names
customers_clean_py = customers_clean_py.sort_index(axis=1)
# Convert boolean-like values to 0 or 1
boolean_mapping = {"Yes": 1, "Male": 1, True: 1, "No": 0, "Female": 0, False: 0}
customers_clean_py = customers_clean_py.replace(boolean_mapping)
# View structure of clean data frame
customers_clean_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 41 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 churn 7043 non-null int64
1 contract_month_to_month 7043 non-null int64
2 contract_one_year 7043 non-null int64
3 contract_two_year 7043 non-null int64
4 dependents 7043 non-null int64
5 device_protection_no 7043 non-null int64
6 device_protection_no_internet_service 7043 non-null int64
7 device_protection_yes 7043 non-null int64
8 gender 7043 non-null int64
9 internet_service_dsl 7043 non-null int64
10 internet_service_fiber_optic 7043 non-null int64
11 internet_service_no 7043 non-null int64
12 monthly_charges 7043 non-null float64
13 multiple_lines_no 7043 non-null int64
14 multiple_lines_no_phone_service 7043 non-null int64
15 multiple_lines_yes 7043 non-null int64
16 online_backup_no 7043 non-null int64
17 online_backup_no_internet_service 7043 non-null int64
18 online_backup_yes 7043 non-null int64
19 online_security_no 7043 non-null int64
20 online_security_no_internet_service 7043 non-null int64
21 online_security_yes 7043 non-null int64
22 paperless_billing 7043 non-null int64
23 partner 7043 non-null int64
24 payment_method_bank_transfer_automatic 7043 non-null int64
25 payment_method_credit_card_automatic 7043 non-null int64
26 payment_method_electronic_check 7043 non-null int64
27 payment_method_mailed_check 7043 non-null int64
28 phone_service 7043 non-null int64
29 senior_citizen 7043 non-null int64
30 streaming_movies_no 7043 non-null int64
31 streaming_movies_no_internet_service 7043 non-null int64
32 streaming_movies_yes 7043 non-null int64
33 streaming_tv_no 7043 non-null int64
34 streaming_tv_no_internet_service 7043 non-null int64
35 streaming_tv_yes 7043 non-null int64
36 tech_support_no 7043 non-null int64
37 tech_support_no_internet_service 7043 non-null int64
38 tech_support_yes 7043 non-null int64
39 tenure 7043 non-null int64
40 total_charges 7043 non-null float64
dtypes: float64(2), int64(39)
memory usage: 2.2 MB
customers_clean_py.head(n=7)
churn contract_month_to_month contract_one_year contract_two_year dependents device_protection_no device_protection_no_internet_service device_protection_yes gender internet_service_dsl internet_service_fiber_optic internet_service_no monthly_charges multiple_lines_no multiple_lines_no_phone_service multiple_lines_yes online_backup_no online_backup_no_internet_service online_backup_yes online_security_no online_security_no_internet_service online_security_yes paperless_billing partner payment_method_bank_transfer_automatic payment_method_credit_card_automatic payment_method_electronic_check payment_method_mailed_check phone_service senior_citizen streaming_movies_no streaming_movies_no_internet_service streaming_movies_yes streaming_tv_no streaming_tv_no_internet_service streaming_tv_yes tech_support_no tech_support_no_internet_service tech_support_yes tenure total_charges
0 0 1 0 0 0 1 0 0 0 1 0 0 29.85 0 1 0 0 0 1 1 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 29.85
1 0 0 1 0 0 0 0 1 1 1 0 0 56.95 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 34 1889.50
2 1 1 0 0 0 1 0 0 1 1 0 0 53.85 1 0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 2 108.15
3 0 0 1 0 0 0 0 1 1 1 0 0 42.30 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 45 1840.75
4 1 1 0 0 0 1 0 0 0 0 1 0 70.70 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 1 0 0 1 0 0 2 151.65
5 1 1 0 0 0 0 0 1 0 0 1 0 99.65 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 8 820.50
6 0 1 0 0 1 1 0 0 1 0 1 0 89.10 0 0 1 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 22 1949.40
# Create copy of data frame
customers_clean_r <- customers_r
# View structure of clean data frame
str(object=customers_clean_r)
'data.frame': 7043 obs. of 21 variables:
$ customer_id : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
$ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
$ senior_citizen : int 0 0 0 0 0 0 0 0 0 0 ...
$ partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
$ dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
$ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
$ phone_service : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
$ multiple_lines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
$ internet_service : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
$ online_security : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
$ online_backup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
$ device_protection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
$ tech_support : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
$ streaming_tv : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
$ streaming_movies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
$ contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
$ paperless_billing: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
$ payment_method : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
$ monthly_charges : num 29.9 57 53.9 42.3 70.7 ...
$ total_charges : num 29.9 1889.5 108.2 1840.8 151.7 ...
$ churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
Performing Exploratory Data Analysis (EDA)
show(describe(customers_clean_jl), allrows=true)
21×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
1 │ customer_id 0002-ORFBO 9995-HOTOH 0 String15
2 │ gender Female Male 0 String7
3 │ senior_citizen 0.162147 0 0.0 1 0 Int64
4 │ partner No Yes 0 String3
5 │ dependents No Yes 0 String3
6 │ tenure 32.3711 0 29.0 72 0 Int64
7 │ phone_service No Yes 0 String3
8 │ multiple_lines No Yes 0 String31
9 │ internet_service DSL No 0 String15
10 │ online_security No Yes 0 String31
11 │ online_backup No Yes 0 String31
12 │ device_protection No Yes 0 String31
13 │ tech_support No Yes 0 String31
14 │ streaming_tv No Yes 0 String31
15 │ streaming_movies No Yes 0 String31
16 │ contract Month-to-month Two year 0 String15
17 │ paperless_billing No Yes 0 String3
18 │ payment_method Bank transfer (automatic) Mailed check 0 String31
19 │ monthly_charges 64.7617 18.25 70.35 118.75 0 Float64
20 │ total_charges 999.90 0 String7
21 │ churn No Yes 0 String3
customers_clean_py.describe(include="all")
churn contract_month_to_month contract_one_year contract_two_year dependents device_protection_no device_protection_no_internet_service device_protection_yes gender internet_service_dsl internet_service_fiber_optic internet_service_no monthly_charges multiple_lines_no multiple_lines_no_phone_service multiple_lines_yes online_backup_no online_backup_no_internet_service online_backup_yes online_security_no online_security_no_internet_service online_security_yes paperless_billing partner payment_method_bank_transfer_automatic payment_method_credit_card_automatic payment_method_electronic_check payment_method_mailed_check phone_service senior_citizen streaming_movies_no streaming_movies_no_internet_service streaming_movies_yes streaming_tv_no streaming_tv_no_internet_service streaming_tv_yes tech_support_no tech_support_no_internet_service tech_support_yes tenure total_charges
count 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000
mean 0.265370 0.550192 0.209144 0.240664 0.299588 0.439443 0.216669 0.343888 0.504756 0.343746 0.439585 0.216669 64.761692 0.481329 0.096834 0.421837 0.438450 0.216669 0.344881 0.496663 0.216669 0.286668 0.592219 0.483033 0.219225 0.216101 0.335794 0.228880 0.903166 0.162147 0.395428 0.216669 0.387903 0.398978 0.216669 0.384353 0.493114 0.216669 0.290217 32.371149 2279.734304
std 0.441561 0.497510 0.406726 0.427517 0.458110 0.496355 0.412004 0.475038 0.500013 0.474991 0.496372 0.412004 30.090047 0.499687 0.295752 0.493888 0.496232 0.412004 0.475363 0.500024 0.412004 0.452237 0.491457 0.499748 0.413751 0.411613 0.472301 0.420141 0.295752 0.368612 0.488977 0.412004 0.487307 0.489723 0.412004 0.486477 0.499988 0.412004 0.453895 24.559481 2266.794470
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 18.250000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 35.500000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 9.000000 398.550000
50% 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 70.350000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 29.000000 1394.550000
75% 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 89.850000 1.000000 0.000000 1.000000 1.000000 0.000000 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000 1.000000 1.000000 0.000000 1.000000 1.000000 0.000000 1.000000 55.000000 3786.600000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 118.750000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 72.000000 8684.800000
summary(customers_clean_r)
customer_id gender senior_citizen partner dependents tenure phone_service multiple_lines internet_service online_security online_backup device_protection tech_support streaming_tv streaming_movies contract paperless_billing payment_method monthly_charges total_charges churn
0002-ORFBO: 1 Female:3488 Min. :0.00 No :3641 No :4933 Min. : 0 No : 682 No :3390 DSL :2421 No :3498 No :3088 No :3095 No :3473 No :2810 No :2785 Month-to-month:3875 No :2872 Bank transfer (automatic):1544 Min. : 18 Min. : 19 No :5174
0003-MKNFE: 1 Male :3555 1st Qu.:0.00 Yes:3402 Yes:2110 1st Qu.: 9 Yes:6361 No phone service: 682 Fiber optic:3096 No internet service:1526 No internet service:1526 No internet service:1526 No internet service:1526 No internet service:1526 No internet service:1526 One year :1473 Yes:4171 Credit card (automatic) :1522 1st Qu.: 36 1st Qu.: 401 Yes:1869
0004-TLHLJ: 1 Median :0.00 Median :29 Yes :2971 No :1526 Yes :2019 Yes :2429 Yes :2422 Yes :2044 Yes :2707 Yes :2732 Two year :1695 Electronic check :2365 Median : 70 Median :1397
0011-IGKFF: 1 Mean :0.16 Mean :32 Mailed check :1612 Mean : 65 Mean :2283
0013-EXCHZ: 1 3rd Qu.:0.00 3rd Qu.:55 3rd Qu.: 90 3rd Qu.:3795
0013-MHZWF: 1 Max. :1.00 Max. :72 Max. :119 Max. :8685
(Other) :7037 NA's :11
Splitting Data for Training and Testing
# x_py = customers_clean_py.drop("churn", axis=1).values
# y_py = customers_clean_py["churn"].values
#
# train_x_py, test_x_py, train_y_py, test_y_py = sklearn.model_selection.train_test_split(
# x_py,
# y_py,
# test_size=0.2,
# random_state=1754,
# stratify=y_py
# )
set.seed(1754)
train_indices <- caret::createDataPartition(customers_clean_r$churn, p=0.8, list=FALSE)
customers_clean_train_r <- customers_clean_r[train_indices, ]
customers_clean_test_r <- customers_clean_r[-train_indices, ]
# customers_clean_train_x_r <- customers_clean_r[train_index, -c("churn")]
# customers_clean_train_y_r <- customers_clean_r[train_index, "churn"]
# customers_clean_test_x_r <- customers_clean_r[-train_index, -c("churn")]
# customers_clean_test_y_r <- customers_clean_r[-train_index, "churn"]
Training the Initial Logististic Regression Model
# model_1_py = sklearn.linear_model.LogisticRegression()
# model_1_py.fit(train_x_py, train_y_py)
# model_1_r = stats::glm(formula = churn ~ ., data=customers_clean_train_r, family="binomial")
# summary(model_1_r)
Predicting the Probabilities
# model_1_y_prediction_py = model_1_py.predict(test_x_py)
# model_1_y_prediction_py[:7]
# model_1_y_probability_py = model_1_py.predict_proba(test_x_py)[:, 1]
# model_1_y_probability_py[:7]
# rate_false_positive_py, rate_true_positive_py, thresholds_py = sklearn.metrics.roc_curve(test_y_py, model_1_y_probability_py)
# rate_false_positive_py[:7]
# rate_true_positive_py[:7]
# thresholds_py[:7]
# roc_curve_model_1_py = (plotnine.ggplot()
# + plotnine.geoms.geom_line(
# plotnine.mapping.aes(x=rate_false_positive_py, y=rate_true_positive_py),
# color=palette_michaelmallari_py[2],
# size=1.5
# )
# + plotnine.geoms.geom_abline(
# linetype="dashed",
# color=palette_michaelmallari_py[12],
# size=0.5
# )
# + plotnine.labels.labs(
# title="ROC Curve for Churn Prediction",
# x="False Positive Rate",
# y="True Positive Rate"
# )
# + plotnine.scales.scale_x_continuous(expand=(0, 0))
# + plotnine.scales.scale_y_continuous(expand=(0, 0))
# + theme_michaelmallari_py
# )
#
# roc_curve_model_1_py
Fitting a Simple Logististic Regression Model
Fitting a Multiple Logististic Regression Model
References
- Shmueli, G., Patel, N. R., & Bruce, P. C. (2007). Data Mining for Business Intelligence. Wiley.
- Albright, S. C., Winston, W. L., & Zappe, C. (2003). Data Analysis for Managers with Microsoft Excel (2nd ed.). South-Western College Publishing.