Categorical Predictions Based on Similarities

Saturday, October 25, 2008

This post was originally built in Excel/XLMiner and hosted at http://glue.umd.edu/~mmallari/ (now sunsetted). It has since been migrated and refreshed with newer datasets and references.

Getting Started

If you are interested in reproducing this work, here are the versions of R, Python, and Julia used, along with the packages for each. Leland Wilkinson’s approach to data visualization (Grammar of Graphics) has been adopted for this work. Finally, my coding style here is deliberately verbose, so that it is clear where functions, methods, and variables originate, and to make this a learning experience for everyone, including me.

Julia

VERSION

v"1.9.2"

import Pkg
Pkg.add(name="CSV", version="0.10.4")
Pkg.add(name="DataFrames", version="1.3.6")
Pkg.add(name="CategoricalArrays", version="0.10.7")
Pkg.add(name="Colors", version="0.12.10")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.3.4")
Pkg.add(name="MLJ", version="0.16.11")
Pkg.add(name="GLM", version="1.5.1")

using Dates
using CSV
using DataFrames
using CategoricalArrays
using Colors
using Cairo
using Gadfly
using MLJ
using GLM

Python

import sys
print(sys.version)

3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]

!pip install pandas==1.3.4
!pip install plotnine==0.10.1
!pip install scikit-learn==1.0.1

import random
import pandas
import datetime
import plotnine
import sklearn
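
To confirm that the packages installed locally match the versions pinned above, a check along these lines may help. This is a minimal sketch and not part of the original post: the helper name installed_versions and the PINNED dictionary are hypothetical, and it relies only on the standard library module importlib.metadata.

```python
import importlib.metadata

# Pinned versions copied from the pip commands above.
PINNED = {"pandas": "1.3.4", "plotnine": "0.10.1", "scikit-learn": "1.0.1"}


def installed_versions(package_names):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for package_name in package_names:
        try:
            versions[package_name] = importlib.metadata.version(package_name)
        except importlib.metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[package_name] = None
    return versions


# Report each pinned package next to what is actually installed.
for package_name, installed in installed_versions(PINNED).items():
    status = "ok" if installed == PINNED[package_name] else "differs"
    print(f"{package_name}: pinned {PINNED[package_name]}, installed {installed} ({status})")
```

A package reported as "differs" is not necessarily a problem, but it is the first thing to look at if results do not match the outputs shown in this post.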

R

R.version.string

[1] "R version 4.2.3 (2023-03-15)"

require(devtools)
devtools::install_version("fst", version="0.9.4", repos="http://cran.us.r-project.org")
devtools::install_version("dplyr", version="1.0.4", repos="http://cran.us.r-project.org")
devtools::install_version("tibble", version="3.1.6", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.3.3", repos="http://cran.us.r-project.org")
devtools::install_version("ggcorrplot", version="0.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("rsample", version="1.1.1", repos="http://cran.us.r-project.org")

library(dplyr)
library(tibble)
library(ggplot2)
library(ggcorrplot)
library(rsample)

Importing and Examining Dataset

Upon importing and examining the dataset, we can see that the data frame has 918 rows and 12 columns.

Julia

heart_disease_jl = CSV.File("../../dataset/heart-disease.csv") |> DataFrames.DataFrame

918×12 DataFrame
 Row │ Age    Sex      ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease
     │ Int64  String1  String3        Int64      Int64        Int64      String7     Int64  String1         Float64  String7   Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │    40  M        ATA                  140          289          0  Normal        172  N                   0.0  Up                   0
   2 │    49  F        NAP                  160          180          0  Normal        156  N                   1.0  Flat                 1
   3 │    37  M        ATA                  130          283          0  ST             98  N                   0.0  Up                   0
   4 │    48  F        ASY                  138          214          0  Normal        108  Y                   1.5  Flat                 1
   5 │    54  M        NAP                  150          195          0  Normal        122  N                   0.0  Up                   0
   6 │    39  M        NAP                  120          339          0  Normal        170  N                   0.0  Up                   0
   7 │    45  F        ATA                  130          237          0  Normal        170  N                   0.0  Up                   0
   8 │    54  M        ATA                  110          208          0  Normal        142  N                   0.0  Up                   0
   9 │    37  M        ASY                  140          207          0  Normal        130  Y                   1.5  Flat                 1
  10 │    48  F        ATA                  120          284          0  Normal        120  N                   0.0  Up                   0
  11 │    37  F        NAP                  130          211          0  Normal        142  N                   0.0  Up                   0
  12 │    58  M        ATA                  136          164          0  ST             99  Y                   2.0  Flat                 1
  13 │    39  M        ATA                  120          204          0  Normal        145  N                   0.0  Up                   0
  14 │    49  M        ASY                  140          234          0  Normal        140  Y                   1.0  Flat                 1
  15 │    42  F        NAP                  115          211          0  ST            137  N                   0.0  Up                   0
  16 │    54  F        ATA                  120          273          0  Normal        150  N                   1.5  Flat                 0
  17 │    38  M        ASY                  110          196          0  Normal        166  N                   0.0  Flat                 1
  18 │    43  F        ATA                  120          201          0  Normal        165  N                   0.0  Up                   0
  19 │    60  M        ASY                  100          248          0  Normal        125  N                   1.0  Flat                 1
  20 │    36  M        ATA                  120          267          0  Normal        160  N                   3.0  Flat                 1
  21 │    43  F        TA                   100          223          0  Normal        142  N                   0.0  Up                   0
  22 │    44  M        ATA                  120          184          0  Normal        142  N                   1.0  Flat                 0
  23 │    49  F        ATA                  124          201          0  Normal        164  N                   0.0  Up                   0
  24 │    44  M        ATA                  150          288          0  Normal        150  Y                   3.0  Flat                 1
  25 │    40  M        NAP                  130          215          0  Normal        138  N                   0.0  Up                   0
  26 │    36  M        NAP                  130          209          0  Normal        178  N                   0.0  Up                   0
  27 │    53  M        ASY                  124          260          0  ST            112  Y                   3.0  Flat                 0
  28 │    52  M        ATA                  120          284          0  Normal        118  N                   0.0  Up                   0
  29 │    53  F        ATA                  113          468          0  Normal        127  N                   0.0  Up                   0
  30 │    51  M        ATA                  125          188          0  Normal        145  N                   0.0  Up                   0
  31 │    53  M        NAP                  145          518          0  Normal        130  N                   0.0  Flat                 1
  32 │    56  M        NAP                  130          167          0  Normal        114  N                   0.0  Up                   0
  33 │    54  M        ASY                  125          224          0  Normal        122  N                   2.0  Flat                 1
  34 │    41  M        ASY                  130          172          0  ST            130  N                   2.0  Flat                 1
  35 │    43  F        ATA                  150          186          0  Normal        154  N                   0.0  Up                   0
  36 │    32  M        ATA                  125          254          0  Normal        155  N                   0.0  Up                   0
  37 │    65  M        ASY                  140          306          1  Normal         87  Y                   1.5  Flat                 1
  38 │    41  F        ATA                  110          250          0  ST            142  N                   0.0  Up                   0
  39 │    48  F        ATA                  120          177          1  ST            148  N                   0.0  Up                   0
  40 │    48  F        ASY                  150          227          0  Normal        130  Y                   1.0  Flat                 0
  41 │    54  F        ATA                  150          230          0  Normal        130  N                   0.0  Up                   0
  42 │    54  F        NAP                  130          294          0  ST            100  Y                   0.0  Flat                 1
  43 │    35  M        ATA                  150          264          0  Normal        168  N                   0.0  Up                   0
  44 │    52  M        NAP                  140          259          0  ST            170  N                   0.0  Up                   0
  45 │    43  M        ASY                  120          175          0  Normal        120  Y                   1.0  Flat                 1
  46 │    59  M        NAP                  130          318          0  Normal        120  Y                   1.0  Flat                 0
  ⋮  │   ⋮       ⋮           ⋮            ⋮           ⋮           ⋮          ⋮         ⋮          ⋮            ⋮        ⋮           ⋮
 874 │    64  M        NAP                  140          335          0  Normal        158  N                   0.0  Up                   1
 875 │    43  M        ASY                  150          247          0  Normal        171  N                   1.5  Up                   0
 876 │    58  F        NAP                  120          340          0  Normal        172  N                   0.0  Up                   0
 877 │    60  M        ASY                  130          206          0  LVH           132  Y                   2.4  Flat                 1
 878 │    58  M        ATA                  120          284          0  LVH           160  N                   1.8  Flat                 1
 879 │    49  M        ATA                  130          266          0  Normal        171  N                   0.6  Up                   0
 880 │    48  M        ATA                  110          229          0  Normal        168  N                   1.0  Down                 1
 881 │    52  M        NAP                  172          199          1  Normal        162  N                   0.5  Up                   0
 882 │    44  M        ATA                  120          263          0  Normal        173  N                   0.0  Up                   0
 883 │    56  F        ATA                  140          294          0  LVH           153  N                   1.3  Flat                 0
 884 │    57  M        ASY                  140          192          0  Normal        148  N                   0.4  Flat                 0
 885 │    67  M        ASY                  160          286          0  LVH           108  Y                   1.5  Flat                 1
 886 │    53  F        NAP                  128          216          0  LVH           115  N                   0.0  Up                   0
 887 │    52  M        NAP                  138          223          0  Normal        169  N                   0.0  Up                   0
 888 │    43  M        ASY                  132          247          1  LVH           143  Y                   0.1  Flat                 1
 889 │    52  M        ASY                  128          204          1  Normal        156  Y                   1.0  Flat                 1
 890 │    59  M        TA                   134          204          0  Normal        162  N                   0.8  Up                   1
 891 │    64  M        TA                   170          227          0  LVH           155  N                   0.6  Flat                 0
 892 │    66  F        NAP                  146          278          0  LVH           152  N                   0.0  Flat                 0
 893 │    39  F        NAP                  138          220          0  Normal        152  N                   0.0  Flat                 0
 894 │    57  M        ATA                  154          232          0  LVH           164  N                   0.0  Up                   1
 895 │    58  F        ASY                  130          197          0  Normal        131  N                   0.6  Flat                 0
 896 │    57  M        ASY                  110          335          0  Normal        143  Y                   3.0  Flat                 1
 897 │    47  M        NAP                  130          253          0  Normal        179  N                   0.0  Up                   0
 898 │    55  F        ASY                  128          205          0  ST            130  Y                   2.0  Flat                 1
 899 │    35  M        ATA                  122          192          0  Normal        174  N                   0.0  Up                   0
 900 │    61  M        ASY                  148          203          0  Normal        161  N                   0.0  Up                   1
 901 │    58  M        ASY                  114          318          0  ST            140  N                   4.4  Down                 1
 902 │    58  F        ASY                  170          225          1  LVH           146  Y                   2.8  Flat                 1
 903 │    58  M        ATA                  125          220          0  Normal        144  N                   0.4  Flat                 0
 904 │    56  M        ATA                  130          221          0  LVH           163  N                   0.0  Up                   0
 905 │    56  M        ATA                  120          240          0  Normal        169  N                   0.0  Down                 0
 906 │    67  M        NAP                  152          212          0  LVH           150  N                   0.8  Flat                 1
 907 │    55  F        ATA                  132          342          0  Normal        166  N                   1.2  Up                   0
 908 │    44  M        ASY                  120          169          0  Normal        144  Y                   2.8  Down                 1
 909 │    63  M        ASY                  140          187          0  LVH           144  Y                   4.0  Up                   1
 910 │    63  F        ASY                  124          197          0  Normal        136  Y                   0.0  Flat                 1
 911 │    41  M        ATA                  120          157          0  Normal        182  N                   0.0  Up                   0
 912 │    59  M        ASY                  164          176          1  LVH            90  N                   1.0  Flat                 1
 913 │    57  F        ASY                  140          241          0  Normal        123  Y                   0.2  Flat                 1
 914 │    45  M        TA                   110          264          0  Normal        132  N                   1.2  Flat                 1
 915 │    68  M        ASY                  144          193          1  Normal        141  N                   3.4  Flat                 1
 916 │    57  M        ASY                  130          131          0  Normal        115  Y                   1.2  Flat                 1
 917 │    57  F        ATA                  130          236          0  LVH           174  N                   0.0  Flat                 1
 918 │    38  M        NAP                  138          175          0  Normal        173  N                   0.0  Up                   0
                                                                                                                            827 rows omitted

Python

heart_disease_py = pandas.read_csv("../../dataset/heart-disease.csv")
heart_disease_py.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB

heart_disease_py.head(n=8)

   Age Sex ChestPainType  RestingBP  Cholesterol  FastingBS RestingECG  MaxHR ExerciseAngina  Oldpeak ST_Slope  HeartDisease
0   40   M           ATA        140          289          0     Normal    172              N      0.0       Up             0
1   49   F           NAP        160          180          0     Normal    156              N      1.0     Flat             1
2   37   M           ATA        130          283          0         ST     98              N      0.0       Up             0
3   48   F           ASY        138          214          0     Normal    108              Y      1.5     Flat             1
4   54   M           NAP        150          195          0     Normal    122              N      0.0       Up             0
5   39   M           NAP        120          339          0     Normal    170              N      0.0       Up             0
6   45   F           ATA        130          237          0     Normal    170              N      0.0       Up             0
7   54   M           ATA        110          208          0     Normal    142              N      0.0       Up             0

heart_disease_py.tail(n=8)

     Age Sex ChestPainType  RestingBP  Cholesterol  FastingBS RestingECG  MaxHR ExerciseAngina  Oldpeak ST_Slope  HeartDisease
910   41   M           ATA        120          157          0     Normal    182              N      0.0       Up             0
911   59   M           ASY        164          176          1        LVH     90              N      1.0     Flat             1
912   57   F           ASY        140          241          0     Normal    123              Y      0.2     Flat             1
913   45   M            TA        110          264          0     Normal    132              N      1.2     Flat             1
914   68   M           ASY        144          193          1     Normal    141              N      3.4     Flat             1
915   57   M           ASY        130          131          0     Normal    115              Y      1.2     Flat             1
916   57   F           ATA        130          236          0        LVH    174              N      0.0     Flat             1
917   38   M           NAP        138          175          0     Normal    173              N      0.0       Up             0
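
Beyond the dimensions, it can be useful to see which levels each categorical column takes and how the HeartDisease target is distributed. The sketch below is illustrative only and makes several assumptions: it uses the first eight rows from the head() output above, embedded as comma-separated text so that it runs without the dataset file, it uses the standard library rather than pandas, and the names SAMPLE_CSV, CATEGORICAL_COLUMNS, and summarize are hypothetical. The counts it produces describe only those eight rows, not the full 918-row dataset.

```python
import collections
import csv
import io

# First eight rows of the dataset, copied from the head() output above.
# Assumption: the file is comma-separated with these column names as its header.
SAMPLE_CSV = """Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
39,M,NAP,120,339,0,Normal,170,N,0.0,Up,0
45,F,ATA,130,237,0,Normal,170,N,0.0,Up,0
54,M,ATA,110,208,0,Normal,142,N,0.0,Up,0
"""

# Columns that pandas reports with dtype "object" in the info() output.
CATEGORICAL_COLUMNS = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]


def summarize(csv_text):
    """Return (rows, columns), per-column level counts, and target counts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    n_columns = len(rows[0]) if rows else 0
    level_counts = {
        column: collections.Counter(row[column] for row in rows)
        for column in CATEGORICAL_COLUMNS
    }
    # Values are read as strings, so the target levels are "0" and "1".
    target_counts = collections.Counter(row["HeartDisease"] for row in rows)
    return (len(rows), n_columns), level_counts, target_counts


shape, level_counts, target_counts = summarize(SAMPLE_CSV)
print("shape:", shape)
for column, counts in level_counts.items():
    print(column, dict(counts))
print("HeartDisease", dict(target_counts))
```

Reading the full CSV file into the same function (for example by passing the file's text) would give the level counts for the whole dataset, assuming the file layout matches the header used here.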

R

heart_disease_r <- read.csv("../../dataset/heart-disease.csv", stringsAsFactors=TRUE)
str(object=heart_disease_r)

'data.frame':	918 obs. of  12 variables:
 $ Age           : int  40 49 37 48 54 39 45 54 37 48 ...
 $ Sex           : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 1 2 2 1 ...
 $ ChestPainType : Factor w/ 4 levels "ASY","ATA","NAP",..: 2 3 2 1 3 3 2 2 1 2 ...
 $ RestingBP     : int  140 160 130 138 150 120 130 110 140 120 ...
 $ Cholesterol   : int  289 180 283 214 195 339 237 208 207 284 ...
 $ FastingBS     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ RestingECG    : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
 $ MaxHR         : int  172 156 98 108 122 170 170 142 130 120 ...
 $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
 $ Oldpeak       : num  0 1 0 1.5 0 0 0 0 1.5 0 ...
 $ ST_Slope      : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
 $ HeartDisease  : int  0 1 0 1 0 0 0 0 1 0 ...

head(x=heart_disease_r, n=8)

  Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
1  40   M           ATA       140         289         0     Normal   172              N     0.0       Up            0
2  49   F           NAP       160         180         0     Normal   156              N     1.0     Flat            1
3  37   M           ATA       130         283         0         ST    98              N     0.0       Up            0
4  48   F           ASY       138         214         0     Normal   108              Y     1.5     Flat            1
5  54   M           NAP       150         195         0     Normal   122              N     0.0       Up            0
6  39   M           NAP       120         339         0     Normal   170              N     0.0       Up            0
7  45   F           ATA       130         237         0     Normal   170              N     0.0       Up            0
8  54   M           ATA       110         208         0     Normal   142              N     0.0       Up            0

tail(x=heart_disease_r, n=8)

    Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
911  41   M           ATA       120         157         0     Normal   182              N     0.0       Up            0
912  59   M           ASY       164         176         1        LVH    90              N     1.0     Flat            1
913  57   F           ASY       140         241         0     Normal   123              Y     0.2     Flat            1
914  45   M            TA       110         264         0     Normal   132              N     1.2     Flat            1
915  68   M           ASY       144         193         1     Normal   141              N     3.4     Flat            1
916  57   M           ASY       130         131         0     Normal   115              Y     1.2     Flat            1
917  57   F           ATA       130         236         0        LVH   174              N     0.0     Flat            1
918  38   M           NAP       138         175         0     Normal   173              N     0.0       Up            0

Wrangling Data

References

Shmueli, G., Patel, N. R., & Bruce, P. C. (2007). Data Mining for Business Intelligence. Wiley.
Albright, S. C., Winston, W. L., & Zappe, C. (2003). Data Analysis for Managers with Microsoft Excel (2nd ed.). South-Western College Publishing.
