Categorical Predictions Based on Similarities
Predicting heart disease from commonly shared attributes, using the k-nearest neighbors (KNN) algorithm in R, Python, and Julia.
Getting Started
If you are interested in reproducing this work, the versions of Julia, Python, and R used are listed below, along with the packages installed for each. Leland Wilkinson’s approach to data visualization (the Grammar of Graphics) has been adopted for this work, through Gadfly in Julia, plotnine in Python, and ggplot2 in R. Finally, my coding style here is deliberately verbose, so that it is easy to trace where each function, method, and variable originates; the aim is to make this a learning experience for everyone, including me.
VERSION
v"1.9.2"
import Pkg
Pkg.add(name="CSV", version="0.10.4")
Pkg.add(name="DataFrames", version="1.3.6")
Pkg.add(name="CategoricalArrays", version="0.10.7")
Pkg.add(name="Colors", version="0.12.10")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.3.4")
Pkg.add(name="MLJ", version="0.16.11")
Pkg.add(name="GLM", version="1.5.1")
using Dates
using CSV
using DataFrames
using CategoricalArrays
using Colors
using Cairo
using Gadfly
using MLJ
using GLM
import sys
print(sys.version)
3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
!pip install pandas==1.3.4
!pip install plotnine==0.10.1
!pip install scikit-learn==1.0.1
import random
import pandas
import datetime
import plotnine
import sklearn
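Before going further, it can help to confirm that the Python environment actually matches the pinned versions above. The following is a minimal sketch using only the standard library; the `PINNED` dictionary and the `check_pins` helper are names introduced here for illustration and are not part of the original notebook.

```python
from importlib import metadata

# Versions pinned in the pip commands above.
PINNED = {
    "pandas": "1.3.4",
    "plotnine": "0.10.1",
    "scikit-learn": "1.0.1",
}


def check_pins(pins):
    """Return {package: (wanted, installed, matches)} for each pinned package.

    `installed` is None when the package is not present in the environment.
    """
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, matches) in check_pins(PINNED).items():
        status = "ok" if matches else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {installed} [{status}]")
```

A mismatch does not necessarily mean the code below will fail, but it is the first thing to rule out if your output differs from what is shown here.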
R.version.string
[1] "R version 4.2.3 (2023-03-15)"
require(devtools)
devtools::install_version("fst", version="0.9.4", repos="http://cran.us.r-project.org")
devtools::install_version("dplyr", version="1.0.4", repos="http://cran.us.r-project.org")
devtools::install_version("tibble", version="3.1.6", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.3.3", repos="http://cran.us.r-project.org")
devtools::install_version("ggcorrplot", version="0.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("rsample", version="1.1.1", repos="http://cran.us.r-project.org")
library(dplyr)
library(tibble)
library(ggplot2)
library(ggcorrplot)
library(rsample)
Importing and Examining Dataset
Upon importing and examining the dataset, we can see that the data frame has 918 rows and 12 columns.
heart_disease_jl = CSV.File("../../dataset/heart-disease.csv") |> DataFrames.DataFrame
918×12 DataFrame
Row │ Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
│ Int64 String1 String3 Int64 Int64 Int64 String7 Int64 String1 Float64 String7 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
2 │ 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
3 │ 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
4 │ 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
5 │ 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0
6 │ 39 M NAP 120 339 0 Normal 170 N 0.0 Up 0
7 │ 45 F ATA 130 237 0 Normal 170 N 0.0 Up 0
8 │ 54 M ATA 110 208 0 Normal 142 N 0.0 Up 0
9 │ 37 M ASY 140 207 0 Normal 130 Y 1.5 Flat 1
10 │ 48 F ATA 120 284 0 Normal 120 N 0.0 Up 0
11 │ 37 F NAP 130 211 0 Normal 142 N 0.0 Up 0
12 │ 58 M ATA 136 164 0 ST 99 Y 2.0 Flat 1
13 │ 39 M ATA 120 204 0 Normal 145 N 0.0 Up 0
14 │ 49 M ASY 140 234 0 Normal 140 Y 1.0 Flat 1
15 │ 42 F NAP 115 211 0 ST 137 N 0.0 Up 0
16 │ 54 F ATA 120 273 0 Normal 150 N 1.5 Flat 0
17 │ 38 M ASY 110 196 0 Normal 166 N 0.0 Flat 1
18 │ 43 F ATA 120 201 0 Normal 165 N 0.0 Up 0
19 │ 60 M ASY 100 248 0 Normal 125 N 1.0 Flat 1
20 │ 36 M ATA 120 267 0 Normal 160 N 3.0 Flat 1
21 │ 43 F TA 100 223 0 Normal 142 N 0.0 Up 0
22 │ 44 M ATA 120 184 0 Normal 142 N 1.0 Flat 0
23 │ 49 F ATA 124 201 0 Normal 164 N 0.0 Up 0
24 │ 44 M ATA 150 288 0 Normal 150 Y 3.0 Flat 1
25 │ 40 M NAP 130 215 0 Normal 138 N 0.0 Up 0
26 │ 36 M NAP 130 209 0 Normal 178 N 0.0 Up 0
27 │ 53 M ASY 124 260 0 ST 112 Y 3.0 Flat 0
28 │ 52 M ATA 120 284 0 Normal 118 N 0.0 Up 0
29 │ 53 F ATA 113 468 0 Normal 127 N 0.0 Up 0
30 │ 51 M ATA 125 188 0 Normal 145 N 0.0 Up 0
31 │ 53 M NAP 145 518 0 Normal 130 N 0.0 Flat 1
32 │ 56 M NAP 130 167 0 Normal 114 N 0.0 Up 0
33 │ 54 M ASY 125 224 0 Normal 122 N 2.0 Flat 1
34 │ 41 M ASY 130 172 0 ST 130 N 2.0 Flat 1
35 │ 43 F ATA 150 186 0 Normal 154 N 0.0 Up 0
36 │ 32 M ATA 125 254 0 Normal 155 N 0.0 Up 0
37 │ 65 M ASY 140 306 1 Normal 87 Y 1.5 Flat 1
38 │ 41 F ATA 110 250 0 ST 142 N 0.0 Up 0
39 │ 48 F ATA 120 177 1 ST 148 N 0.0 Up 0
40 │ 48 F ASY 150 227 0 Normal 130 Y 1.0 Flat 0
41 │ 54 F ATA 150 230 0 Normal 130 N 0.0 Up 0
42 │ 54 F NAP 130 294 0 ST 100 Y 0.0 Flat 1
43 │ 35 M ATA 150 264 0 Normal 168 N 0.0 Up 0
44 │ 52 M NAP 140 259 0 ST 170 N 0.0 Up 0
45 │ 43 M ASY 120 175 0 Normal 120 Y 1.0 Flat 1
46 │ 59 M NAP 130 318 0 Normal 120 Y 1.0 Flat 0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
874 │ 64 M NAP 140 335 0 Normal 158 N 0.0 Up 1
875 │ 43 M ASY 150 247 0 Normal 171 N 1.5 Up 0
876 │ 58 F NAP 120 340 0 Normal 172 N 0.0 Up 0
877 │ 60 M ASY 130 206 0 LVH 132 Y 2.4 Flat 1
878 │ 58 M ATA 120 284 0 LVH 160 N 1.8 Flat 1
879 │ 49 M ATA 130 266 0 Normal 171 N 0.6 Up 0
880 │ 48 M ATA 110 229 0 Normal 168 N 1.0 Down 1
881 │ 52 M NAP 172 199 1 Normal 162 N 0.5 Up 0
882 │ 44 M ATA 120 263 0 Normal 173 N 0.0 Up 0
883 │ 56 F ATA 140 294 0 LVH 153 N 1.3 Flat 0
884 │ 57 M ASY 140 192 0 Normal 148 N 0.4 Flat 0
885 │ 67 M ASY 160 286 0 LVH 108 Y 1.5 Flat 1
886 │ 53 F NAP 128 216 0 LVH 115 N 0.0 Up 0
887 │ 52 M NAP 138 223 0 Normal 169 N 0.0 Up 0
888 │ 43 M ASY 132 247 1 LVH 143 Y 0.1 Flat 1
889 │ 52 M ASY 128 204 1 Normal 156 Y 1.0 Flat 1
890 │ 59 M TA 134 204 0 Normal 162 N 0.8 Up 1
891 │ 64 M TA 170 227 0 LVH 155 N 0.6 Flat 0
892 │ 66 F NAP 146 278 0 LVH 152 N 0.0 Flat 0
893 │ 39 F NAP 138 220 0 Normal 152 N 0.0 Flat 0
894 │ 57 M ATA 154 232 0 LVH 164 N 0.0 Up 1
895 │ 58 F ASY 130 197 0 Normal 131 N 0.6 Flat 0
896 │ 57 M ASY 110 335 0 Normal 143 Y 3.0 Flat 1
897 │ 47 M NAP 130 253 0 Normal 179 N 0.0 Up 0
898 │ 55 F ASY 128 205 0 ST 130 Y 2.0 Flat 1
899 │ 35 M ATA 122 192 0 Normal 174 N 0.0 Up 0
900 │ 61 M ASY 148 203 0 Normal 161 N 0.0 Up 1
901 │ 58 M ASY 114 318 0 ST 140 N 4.4 Down 1
902 │ 58 F ASY 170 225 1 LVH 146 Y 2.8 Flat 1
903 │ 58 M ATA 125 220 0 Normal 144 N 0.4 Flat 0
904 │ 56 M ATA 130 221 0 LVH 163 N 0.0 Up 0
905 │ 56 M ATA 120 240 0 Normal 169 N 0.0 Down 0
906 │ 67 M NAP 152 212 0 LVH 150 N 0.8 Flat 1
907 │ 55 F ATA 132 342 0 Normal 166 N 1.2 Up 0
908 │ 44 M ASY 120 169 0 Normal 144 Y 2.8 Down 1
909 │ 63 M ASY 140 187 0 LVH 144 Y 4.0 Up 1
910 │ 63 F ASY 124 197 0 Normal 136 Y 0.0 Flat 1
911 │ 41 M ATA 120 157 0 Normal 182 N 0.0 Up 0
912 │ 59 M ASY 164 176 1 LVH 90 N 1.0 Flat 1
913 │ 57 F ASY 140 241 0 Normal 123 Y 0.2 Flat 1
914 │ 45 M TA 110 264 0 Normal 132 N 1.2 Flat 1
915 │ 68 M ASY 144 193 1 Normal 141 N 3.4 Flat 1
916 │ 57 M ASY 130 131 0 Normal 115 Y 1.2 Flat 1
917 │ 57 F ATA 130 236 0 LVH 174 N 0.0 Flat 1
918 │ 38 M NAP 138 175 0 Normal 173 N 0.0 Up 0
827 rows omitted
heart_disease_py = pandas.read_csv("../../dataset/heart-disease.csv")
heart_disease_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 918 non-null int64
1 Sex 918 non-null object
2 ChestPainType 918 non-null object
3 RestingBP 918 non-null int64
4 Cholesterol 918 non-null int64
5 FastingBS 918 non-null int64
6 RestingECG 918 non-null object
7 MaxHR 918 non-null int64
8 ExerciseAngina 918 non-null object
9 Oldpeak 918 non-null float64
10 ST_Slope 918 non-null object
11 HeartDisease 918 non-null int64
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
heart_disease_py.head(n=8)
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0
5 39 M NAP 120 339 0 Normal 170 N 0.0 Up 0
6 45 F ATA 130 237 0 Normal 170 N 0.0 Up 0
7 54 M ATA 110 208 0 Normal 142 N 0.0 Up 0
heart_disease_py.tail(n=8)
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
910 41 M ATA 120 157 0 Normal 182 N 0.0 Up 0
911 59 M ASY 164 176 1 LVH 90 N 1.0 Flat 1
912 57 F ASY 140 241 0 Normal 123 Y 0.2 Flat 1
913 45 M TA 110 264 0 Normal 132 N 1.2 Flat 1
914 68 M ASY 144 193 1 Normal 141 N 3.4 Flat 1
915 57 M ASY 130 131 0 Normal 115 Y 1.2 Flat 1
916 57 F ATA 130 236 0 LVH 174 N 0.0 Flat 1
917 38 M NAP 138 175 0 Normal 173 N 0.0 Up 0
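As a quick sanity check on the structure reported by `info()`, the sketch below counts rows, columns, and column kinds using only the standard library. It runs on the eight rows shown by `head(n=8)` rather than on the full file, so it should report 8 rows instead of 918; `SAMPLE_CSV` and `summarize` are hypothetical names used for illustration.

```python
import csv
import io

# First eight rows of the dataset, copied from the head(n=8) output above.
SAMPLE_CSV = """\
Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
39,M,NAP,120,339,0,Normal,170,N,0.0,Up,0
45,F,ATA,130,237,0,Normal,170,N,0.0,Up,0
54,M,ATA,110,208,0,Normal,142,N,0.0,Up,0
"""


def is_number(value):
    """True when the text parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def summarize(csv_text):
    """Return ((n_rows, n_columns), numeric_columns, text_columns)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    numeric = [c for c in columns if all(is_number(r[c]) for r in rows)]
    text = [c for c in columns if c not in numeric]
    return (len(rows), len(columns)), numeric, text


if __name__ == "__main__":
    shape, numeric, text = summarize(SAMPLE_CSV)
    print("shape:", shape)
    print("numeric columns:", numeric)
    print("text columns:", text)
```

The split into seven numeric and five text columns matches the `int64`, `float64`, and `object` counts in the `info()` output, which is a useful cue for the wrangling step: the text columns will need encoding before any distance can be computed on them.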
heart_disease_r <- read.csv("../../dataset/heart-disease.csv", stringsAsFactors=TRUE)
str(object=heart_disease_r)
'data.frame': 918 obs. of 12 variables:
$ Age : int 40 49 37 48 54 39 45 54 37 48 ...
$ Sex : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 1 2 2 1 ...
$ ChestPainType : Factor w/ 4 levels "ASY","ATA","NAP",..: 2 3 2 1 3 3 2 2 1 2 ...
$ RestingBP : int 140 160 130 138 150 120 130 110 140 120 ...
$ Cholesterol : int 289 180 283 214 195 339 237 208 207 284 ...
$ FastingBS : int 0 0 0 0 0 0 0 0 0 0 ...
$ RestingECG : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
$ MaxHR : int 172 156 98 108 122 170 170 142 130 120 ...
$ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
$ Oldpeak : num 0 1 0 1.5 0 0 0 0 1.5 0 ...
$ ST_Slope : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
$ HeartDisease : int 0 1 0 1 0 0 0 0 1 0 ...
head(x=heart_disease_r, n=8)
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
1 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
2 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
3 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
4 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
5 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0
6 39 M NAP 120 339 0 Normal 170 N 0.0 Up 0
7 45 F ATA 130 237 0 Normal 170 N 0.0 Up 0
8 54 M ATA 110 208 0 Normal 142 N 0.0 Up 0
tail(x=heart_disease_r, n=8)
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
911 41 M ATA 120 157 0 Normal 182 N 0.0 Up 0
912 59 M ASY 164 176 1 LVH 90 N 1.0 Flat 1
913 57 F ASY 140 241 0 Normal 123 Y 0.2 Flat 1
914 45 M TA 110 264 0 Normal 132 N 1.2 Flat 1
915 68 M ASY 144 193 1 Normal 141 N 3.4 Flat 1
916 57 M ASY 130 131 0 Normal 115 Y 1.2 Flat 1
917 57 F ATA 130 236 0 LVH 174 N 0.0 Flat 1
918 38 M NAP 138 175 0 Normal 173 N 0.0 Up 0
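To make the goal concrete before any wrangling, here is a minimal sketch of the KNN idea on the sixteen rows displayed above (the first eight and the last eight). It is an illustration only, not the modeling code for this work: it uses just five numeric columns, assumes min-max scaling and Euclidean distance, and the names `ROWS`, `min_max_scaler`, and `knn_predict` are hypothetical.

```python
import math
from collections import Counter

# (Age, RestingBP, Cholesterol, MaxHR, Oldpeak) -> HeartDisease,
# copied from the head and tail outputs above.
ROWS = [
    ((40, 140, 289, 172, 0.0), 0),
    ((49, 160, 180, 156, 1.0), 1),
    ((37, 130, 283, 98, 0.0), 0),
    ((48, 138, 214, 108, 1.5), 1),
    ((54, 150, 195, 122, 0.0), 0),
    ((39, 120, 339, 170, 0.0), 0),
    ((45, 130, 237, 170, 0.0), 0),
    ((54, 110, 208, 142, 0.0), 0),
    ((41, 120, 157, 182, 0.0), 0),
    ((59, 164, 176, 90, 1.0), 1),
    ((57, 140, 241, 123, 0.2), 1),
    ((45, 110, 264, 132, 1.2), 1),
    ((68, 144, 193, 141, 3.4), 1),
    ((57, 130, 131, 115, 1.2), 1),
    ((57, 130, 236, 174, 0.0), 1),
    ((38, 138, 175, 173, 0.0), 0),
]


def min_max_scaler(points):
    """Build a function that rescales each feature to the range [0, 1]."""
    lows = [min(column) for column in zip(*points)]
    highs = [max(column) for column in zip(*points)]

    def scale(point):
        return tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(point, lows, highs)
        )

    return scale


def knn_predict(rows, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest rows."""
    scale = min_max_scaler([features for features, _ in rows])
    scaled_query = scale(query)
    # Sort the labeled rows by Euclidean distance to the query.
    by_distance = sorted(
        rows, key=lambda row: math.dist(scale(row[0]), scaled_query)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # A made-up patient, used only to exercise the function.
    hypothetical_patient = (55, 135, 220, 120, 1.0)
    print(knn_predict(ROWS, hypothetical_patient, k=3))
```

Scaling matters here because Cholesterol spans hundreds of units while Oldpeak spans only a few, so without it the distance would be dominated by a single column. An odd `k` avoids tied votes for a two-class outcome.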
Wrangling Data