Computational Thinking + Doing

Data Types and Data Structures

Understanding the NHANES data types and data structures—one of the first steps of any data analysis—using Julia, Python, and R.

Let’s face it: an abundance of data is generated every second, but we have yet to unlock its full potential. The very first step in doing so is understanding data at its atomic level: what is it (data type), and how is it stored and accessed (data structure)? In this post, I’ll focus on the data types and data structures commonly used in data science (a field that draws heavily on computer science).
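To make the distinction concrete, here is a minimal Python sketch. The values are typed in by hand for illustration (nothing is read from the NHANES file), and the variable names are hypothetical. Each value has a data type, while the container holding the values is a data structure that decides how they are stored and accessed.

```python
# Hand-typed illustrative values; nothing here is read from the NHANES file.
age = 34            # int: a whole number
bmi = 32.22         # float: a real number
gender = "male"     # str: text
diabetes = False    # bool: true/false

# A data type answers "what is it?"
print(type(age).__name__, type(bmi).__name__,
      type(gender).__name__, type(diabetes).__name__)

# A data structure answers "how is it stored and accessed?"
ages = [34, 4, 49]                   # list: ordered, accessed by position
person = {"age": age, "bmi": bmi}    # dict: accessed by key
print(ages[0], person["bmi"])
```

A data frame, which the rest of this post works with, combines both ideas: it is a data structure made of columns, and each column has its own data type.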

Getting Started

If you are interested in reproducing this work, here are the versions of Julia, Python, and R used, along with the packages for each. In addition, this work adopts Leland Wilkinson’s approach to data visualization, the Grammar of Graphics, through Gadfly (Julia), plotnine (Python), and ggplot2 (R).

VERSION
v"1.5.0"
import Pkg
Pkg.add(name="CSV", version="0.10.4")
Pkg.add(name="DataFrames", version="1.5.0")
Pkg.add(name="CategoricalArrays", version="0.10.7")
Pkg.add(name="StatsBase", version="0.33.21")
Pkg.add(name="Colors", version="0.12.10")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.3.4")
using CSV
using DataFrames
using CategoricalArrays
using StatsBase
using Colors
using Cairo
using Gadfly
import sys
print(sys.version)
3.9.6 (v3.9.6:db3ff76da1, Jun 28 2021, 11:49:53) 
[Clang 6.0 (clang-600.0.57)]
!pip install pandas==2.0.0
!pip install plotnine==0.10.1
import pandas
import plotnine
R.version.string
[1] "R version 4.1.1 (2021-08-10)"
require(devtools)
devtools::install_version("dplyr", version="1.0.10", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.4.0", repos="http://cran.us.r-project.org")
devtools::install_version("epiDisplay", version="3.5.0.2", repos="http://cran.us.r-project.org")
library(dplyr)
library(ggplot2)
library(epiDisplay)
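Before going further, it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch for the Python side only, using the standard library's importlib.metadata. The helper names installed_version and check_versions are hypothetical and not part of any package.

```python
from importlib import metadata

# Versions listed in this post for the Python packages.
expected = {"pandas": "2.0.0", "plotnine": "0.10.1"}

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_versions(expected_versions):
    """Map each package name to (expected, installed, matches)."""
    report = {}
    for package, wanted in expected_versions.items():
        found = installed_version(package)
        report[package] = (wanted, found, found == wanted)
    return report

for package, (wanted, found, ok) in check_versions(expected).items():
    print(f"{package}: expected {wanted}, found {found}, match={ok}")
```

A mismatch does not necessarily mean the code below will fail, but printed output (column types in particular) may differ between versions.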

Importing and Examining Dataset

The U.S. National Health and Nutrition Examination Survey (NHANES) makes its data publicly available. The extract used here includes the 2009-10 and 2011-12 survey cycles (see the survey_year column). After importing this data, we look at the data structure, and the data types stored in it, to determine whether data preparation is needed. In practice, data preparation (also known as data wrangling) is often said to take up as much as 80% of a data analyst’s time. Note: I’ve already done some data preparation, so the data below will look slightly different from yours.

nhanes_jl = CSV.File("../../dataset/nhanes.csv") |> DataFrames.DataFrame
10000×76 DataFrame
   Row │ id     survey_year  gender   age    age_decade  age_months  race_1    race_3     education       marital_status  hh_income    hh_income_mid  poverty   home_rooms  home_own  work        weight    length     head_circ  height     bmi         bmi_cat_under_20yrs  bmi_who       pulse    bp_sys_ave  bp_dia_ave  bp_sys_1  BPDia1   bp_sys_2  bp_dia_2  bp_sys_3  BPDia3   Testosterone  DirectChol  TotChol     UrineVol1  UrineFlow1   UrineVol2  UrineFlow2   diabetes  DiabetesAge  health_gen  DaysPhysHlthBad  DaysMentHlthBad  little_interest  depressed  nPregnancies  nBabies  Age1stBaby  SleepHrsNight  sleep_trouble  phys_active  PhysActiveDays  tv_hrs_day  comp_hrs_day  TVHrsDayChild  CompHrsDayChild  alcohol_12_plus_yr  AlcoholDay  alcohol_year  smoke_now  smoke_100  smoke_100n  smoke_age  marijuana  age_first_marij  regular_marij  age_reg_marij  hard_drugs  sex_ever  sex_age  sex_num_partn_life  sex_num_part_year  same_sex  sex_orientation  pregnant_now
       │ Int64  String7      String7  Int64  String7?    Int64?      String15  String15?  String15?       String15?       String15?    Int64?         Float64?  Int64?      String7?  String15?   Float64?  Float64?   Float64?   Float64?   Float64?    String15?            String15?     Int64?   Int64?      Int64?      Int64?    Int64?   Int64?    Int64?    Int64?    Int64?   Float64?      Float64?    Float64?    Int64?     Float64?     Int64?     Float64?     String3?  Int64?       String15?   Int64?           Int64?           String7?         String7?   Int64?        Int64?   Int64?      Int64?         String3?       String3?     Int64?          String15?   String15?     Int64?         Int64?           String3?            Int64?      Int64?        String3?   String3?   String15?   Int64?     String3?   Int64?           String3?       Int64?         String3?    String3?  Int64?   Int64?              Int64?             String3?  String15?        String7
───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     1 │ 51624  2009_10      male        34   30-39             409  White     missing    High School     Married         25000-34999          30000      1.36           6  Own       NotWorking      87.4  missing    missing        164.7       32.22  missing              30.0_plus          70         113          85       114       88       114        88       112       82    missing           1.29        3.49        352  missing        missing  missing      No            missing  Good                      0               15  Most             Several         missing  missing     missing              4  Yes            No                  missing  missing     missing             missing          missing  Yes                    missing             0  No         Yes        Smoker             18  Yes                     17  No                   missing  Yes         Yes            16                   8                  1  No        Heterosexual
     2 │ 51624  2009_10      male        34   30-39             409  White     missing    High School     Married         25000-34999          30000      1.36           6  Own       NotWorking      87.4  missing    missing        164.7       32.22  missing              30.0_plus          70         113          85       114       88       114        88       112       82    missing           1.29        3.49        352  missing        missing  missing      No            missing  Good                      0               15  Most             Several         missing  missing     missing              4  Yes            No                  missing  missing     missing             missing          missing  Yes                    missing             0  No         Yes        Smoker             18  Yes                     17  No                   missing  Yes         Yes            16                   8                  1  No        Heterosexual
     3 │ 51624  2009_10      male        34   30-39             409  White     missing    High School     Married         25000-34999          30000      1.36           6  Own       NotWorking      87.4  missing    missing        164.7       32.22  missing              30.0_plus          70         113          85       114       88       114        88       112       82    missing           1.29        3.49        352  missing        missing  missing      No            missing  Good                      0               15  Most             Several         missing  missing     missing              4  Yes            No                  missing  missing     missing             missing          missing  Yes                    missing             0  No         Yes        Smoker             18  Yes                     17  No                   missing  Yes         Yes            16                   8                  1  No        Heterosexual
     4 │ 51625  2009_10      male         4   0-9                49  Other     missing    missing         missing         20000-24999          22500      1.07           9  Own       missing         17.0  missing    missing        105.4       15.3   missing              12.0_18.5     missing     missing     missing   missing  missing   missing   missing   missing  missing    missing     missing     missing       missing  missing        missing  missing      No            missing  missing             missing          missing  missing          missing         missing  missing     missing        missing  missing        missing             missing  missing     missing                   4                1  missing                missing       missing  missing    missing    missing       missing  missing            missing  missing              missing  missing     missing   missing             missing            missing  missing   missing
     5 │ 51630  2009_10      female      49   40-49             596  White     missing    Some College    LivePartner     35000-44999          40000      1.91           5  Rent      NotWorking      86.7  missing    missing        168.4       30.57  missing              30.0_plus          86         112          75       118       82       108        74       116       76    missing           1.16        6.7          77        0.094    missing  missing      No            missing  Good                      0               10  Several          Several               2        2          27              8  Yes            No                  missing  missing     missing             missing          missing  Yes                          2            20  Yes        Yes        Smoker             38  Yes                     18  No                   missing  Yes         Yes            12                  10                  1  Yes       Heterosexual
     6 │ 51638  2009_10      male         9   0-9               115  White     missing    missing         missing         75000-99999          87500      1.84           6  Rent      missing         29.8  missing    missing        133.1       16.82  missing              12.0_18.5          82          86          47        84       50        84        50        88       44    missing           1.34        4.86        123        1.538    missing  missing      No            missing  missing             missing          missing  missing          missing         missing  missing     missing        missing  missing        missing             missing  missing     missing                   5                0  missing                missing       missing  missing    missing    missing       missing  missing            missing  missing              missing  missing     missing   missing             missing            missing  missing   missing
     7 │ 51646  2009_10      male         8   0-9               101  White     missing    missing         missing         55000-64999          60000      2.33           7  Own       missing         35.2  missing    missing        130.6       20.64  missing              18.5_to_24.9       72         107          37       114       46       108        36       106       38    missing           1.55        4.09        238        1.322    missing  missing      No            missing  missing             missing          missing  missing          missing         missing  missing     missing        missing  missing        missing             missing  missing     missing                   1                6  missing                missing       missing  missing    missing    missing       missing  missing            missing  missing              missing  missing     missing   missing             missing            missing  missing   missing
     8 │ 51647  2009_10      female      45   40-49             541  White     missing    College Grad    Married         75000-99999          87500      5.0            6  Own       Working         75.7  missing    missing        166.7       27.24  missing              25.0_to_29.9       62         118          64       106       62       118        68       118       60    missing           2.12        5.82        106        1.116    missing  missing      No            missing  Vgood                     0                3  None             None                  1  missing     missing              8  No             Yes                       5  missing     missing             missing          missing  Yes                          3            52  missing    No         Non-Smoker    missing  Yes                     13  No                   missing  No          Yes            13                  20                  0  Yes       Bisexual
   ⋮   │   ⋮         ⋮          ⋮       ⋮        ⋮           ⋮          ⋮          ⋮            ⋮               ⋮              ⋮             ⋮           ⋮          ⋮          ⋮          ⋮          ⋮          ⋮          ⋮          ⋮          ⋮                ⋮                ⋮           ⋮         ⋮           ⋮          ⋮         ⋮        ⋮         ⋮         ⋮         ⋮          ⋮            ⋮           ⋮           ⋮           ⋮           ⋮           ⋮          ⋮           ⋮           ⋮              ⋮                ⋮                ⋮             ⋮           ⋮           ⋮         ⋮             ⋮              ⋮             ⋮             ⋮             ⋮            ⋮              ⋮               ⋮                 ⋮               ⋮            ⋮            ⋮          ⋮          ⋮           ⋮          ⋮             ⋮               ⋮              ⋮            ⋮          ⋮         ⋮             ⋮                   ⋮             ⋮             ⋮              ⋮
  9994 │ 71909  2011_12      male        28   20-29         missing  Mexican   Mexican    9 - 11th Grade  NeverMarried     5000-9999            7500      0.46           3  Rent      Working         92.3  missing    missing        177.3       29.4   missing              25.0_to_29.9       68         124          65       124       62       126        64       122       66        490.43        1.22        3.9          97        0.942    missing  missing      No            missing  missing             missing          missing  missing          missing         missing  missing     missing              6  No             Yes                 missing  1_hr        2_hr                missing          missing  missing                missing       missing  Yes        Yes        Smoker             18  missing            missing  missing              missing  missing     missing   missing             missing            missing  missing   missing
  9995 │ 71909  2011_12      male        28   20-29         missing  Mexican   Mexican    9 - 11th Grade  NeverMarried     5000-9999            7500      0.46           3  Rent      Working         92.3  missing    missing        177.3       29.4   missing              25.0_to_29.9       68         124          65       124       62       126        64       122       66        490.43        1.22        3.9          97        0.942    missing  missing      No            missing  missing             missing          missing  missing          missing         missing  missing     missing              6  No             Yes                 missing  1_hr        2_hr                missing          missing  missing                missing       missing  Yes        Yes        Smoker             18  missing            missing  missing              missing  missing     missing   missing             missing            missing  missing   missing
  9996 │ 71909  2011_12      male        28   20-29         missing  Mexican   Mexican    9 - 11th Grade  NeverMarried     5000-9999            7500      0.46           3  Rent      Working         92.3  missing    missing        177.3       29.4   missing              25.0_to_29.9       68         124          65       124       62       126        64       122       66        490.43        1.22        3.9          97        0.942    missing  missing      No            missing  missing             missing          missing  missing          missing         missing  missing     missing              6  No             Yes                 missing  1_hr        2_hr                missing          missing  missing                missing       missing  Yes        Yes        Smoker             18  missing            missing  missing              missing  missing     missing   missing             missing            missing  missing   missing
  9997 │ 71910  2011_12      female       0   0-9                 5  White     White      missing         missing         75000-99999          87500      3.37          10  Own       missing          6.7       67.6       42.2  missing    missing     missing              missing       missing     missing     missing   missing  missing   missing   missing   missing  missing    missing     missing     missing       missing  missing        missing  missing      missing       missing  missing             missing          missing  missing          missing         missing  missing     missing        missing  missing        missing             missing  missing     missing             missing          missing  missing                missing       missing  missing    missing    missing       missing  missing            missing  missing              missing  missing     missing   missing             missing            missing  missing   missing
  9998 │ 71911  2011_12      male        27   20-29         missing  Mexican   Mexican    College Grad    Married         75000-99999          87500      3.25          10  Own       Working         96.7  missing    missing        175.8       31.3   missing              30.0_plus          74         133          74       122       76       132        82       134       66        509.0         1.06        5.72         63        0.6      missing  missing      No            missing  Good                      0                2  None             None            missing  missing     missing              6  No             No                        3  1_hr        0_to_1_hr           missing          missing  Yes                          5             4  missing    No         Non-Smoker    missing  Yes                     22  No                   missing  No          Yes            21                   1                  1  No        Heterosexual
  9999 │ 71915  2011_12      male        60   60-69         missing  White     White      College Grad    NeverMarried    65000-74999          70000      5.0            4  Own       Working         78.4  missing    missing        168.8       27.5   missing              25.0_to_29.9       76         147          73       150       72       148        74       146       72        505.13        0.93        4.94        218        1.253    missing  missing      Yes                56  Good                      0                2  None             None            missing  missing     missing              6  No             No                        1  2_hr        1_hr                missing          missing  Yes                    missing             0  missing    No         Non-Smoker    missing  missing            missing  missing              missing  No          Yes            19                   2            missing  No        missing
 10000 │ 71915  2011_12      male        60   60-69         missing  White     White      College Grad    NeverMarried    65000-74999          70000      5.0            4  Own       Working         78.4  missing    missing        168.8       27.5   missing              25.0_to_29.9       76         147          73       150       72       148        74       146       72        505.13        0.93        4.94        218        1.253    missing  missing      Yes                56  Good                      0                2  None             None            missing  missing     missing              6  No             No                  missing  2_hr        1_hr                missing          missing  Yes                    missing             0  missing    No         Non-Smoker    missing  missing            missing  missing              missing  No          Yes            19                   2            missing  No        missing
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        9985 rows omitted
nhanes_py = pandas.read_csv("../../dataset/nhanes.csv")
nhanes_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 76 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   10000 non-null  int64  
 1   survey_year          10000 non-null  object 
 2   gender               10000 non-null  object 
 3   age                  10000 non-null  int64  
 4   age_decade           9667 non-null   object 
 5   age_months           4962 non-null   float64
 6   race_1               10000 non-null  object 
 7   race_3               5000 non-null   object 
 8   education            7221 non-null   object 
 9   marital_status       7231 non-null   object 
 10  hh_income            9189 non-null   object 
 11  hh_income_mid        9189 non-null   float64
 12  poverty              9274 non-null   float64
 13  home_rooms           9931 non-null   float64
 14  home_own             9937 non-null   object 
 15  work                 7771 non-null   object 
 16  weight               9922 non-null   float64
 17  length               543 non-null    float64
 18  head_circ            88 non-null     float64
 19  height               9647 non-null   float64
 20  bmi                  9634 non-null   float64
 21  bmi_cat_under_20yrs  1274 non-null   object 
 22  bmi_who              9603 non-null   object 
 23  pulse                8563 non-null   float64
 24  bp_sys_ave           8551 non-null   float64
 25  bp_dia_ave           8551 non-null   float64
 26  bp_sys_1             8237 non-null   float64
 27  BPDia1               8237 non-null   float64
 28  bp_sys_2             8353 non-null   float64
 29  bp_dia_2             8353 non-null   float64
 30  bp_sys_3             8365 non-null   float64
 31  BPDia3               8365 non-null   float64
 32  Testosterone         4126 non-null   float64
 33  DirectChol           8474 non-null   float64
 34  TotChol              8474 non-null   float64
 35  UrineVol1            9013 non-null   float64
 36  UrineFlow1           8397 non-null   float64
 37  UrineVol2            1478 non-null   float64
 38  UrineFlow2           1476 non-null   float64
 39  diabetes             9858 non-null   object 
 40  DiabetesAge          629 non-null    float64
 41  health_gen           7539 non-null   object 
 42  DaysPhysHlthBad      7532 non-null   float64
 43  DaysMentHlthBad      7534 non-null   float64
 44  little_interest      1564 non-null   object 
 45  depressed            1427 non-null   object 
 46  nPregnancies         2604 non-null   float64
 47  nBabies              2416 non-null   float64
 48  Age1stBaby           1884 non-null   float64
 49  SleepHrsNight        7755 non-null   float64
 50  sleep_trouble        7772 non-null   object 
 51  phys_active          8326 non-null   object 
 52  PhysActiveDays       4663 non-null   float64
 53  tv_hrs_day           4859 non-null   object 
 54  comp_hrs_day         4863 non-null   object 
 55  TVHrsDayChild        653 non-null    float64
 56  CompHrsDayChild      653 non-null    float64
 57  alcohol_12_plus_yr   6580 non-null   object 
 58  AlcoholDay           4914 non-null   float64
 59  alcohol_year         5922 non-null   float64
 60  smoke_now            3211 non-null   object 
 61  smoke_100            7235 non-null   object 
 62  smoke_100n           7235 non-null   object 
 63  smoke_age            3080 non-null   float64
 64  marijuana            4941 non-null   object 
 65  age_first_marij      2891 non-null   float64
 66  regular_marij        4941 non-null   object 
 67  age_reg_marij        1366 non-null   float64
 68  hard_drugs           5765 non-null   object 
 69  sex_ever             5767 non-null   object 
 70  sex_age              5540 non-null   float64
 71  sex_num_partn_life   5725 non-null   float64
 72  sex_num_part_year    4928 non-null   float64
 73  same_sex             5768 non-null   object 
 74  sex_orientation      4842 non-null   object 
 75  pregnant_now         10000 non-null  object 
dtypes: float64(43), int64(2), object(31)
memory usage: 5.8+ MB
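One detail worth noticing in the summary above: columns such as age_months hold whole numbers, yet pandas reports them as float64, while the Julia output lists the same column as Int64?. The default int64 dtype in pandas cannot hold missing values, so an integer column with gaps is read as float64 with NaN in the gaps. Here is a minimal sketch on a small made-up CSV (not the NHANES file), which also shows the nullable Int64 dtype in pandas as one possible alternative:

```python
import io
import pandas

# Tiny made-up CSV; the second row has no age_months value.
csv_text = "id,age_months\n1,409\n2,\n3,49\n"

# Default behaviour: the gap forces the whole column to float64.
default_df = pandas.read_csv(io.StringIO(csv_text))
print(default_df.dtypes)

# Opting in to the nullable integer dtype keeps whole numbers
# and marks the gap as <NA> instead of NaN.
nullable_df = pandas.read_csv(io.StringIO(csv_text), dtype={"age_months": "Int64"})
print(nullable_df.dtypes)

# Count the missing values in the nullable column.
n_missing = int(nullable_df["age_months"].isna().sum())
print(n_missing)
```

Whether to convert such columns is a judgment call. The float64 representation is harmless for most summaries, but the nullable dtype keeps the printed values as whole numbers.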
nhanes_py.head(n=8)
      id survey_year  gender  age age_decade  age_months race_1 race_3     education marital_status    hh_income  hh_income_mid  poverty  home_rooms home_own        work  weight  length  head_circ  height    bmi bmi_cat_under_20yrs       bmi_who  pulse  bp_sys_ave  bp_dia_ave  bp_sys_1  BPDia1  bp_sys_2  bp_dia_2  bp_sys_3  BPDia3  Testosterone  DirectChol  TotChol  UrineVol1  UrineFlow1  UrineVol2  UrineFlow2 diabetes  DiabetesAge health_gen  DaysPhysHlthBad  DaysMentHlthBad little_interest depressed  nPregnancies  nBabies  Age1stBaby  SleepHrsNight sleep_trouble phys_active  PhysActiveDays tv_hrs_day comp_hrs_day  TVHrsDayChild  CompHrsDayChild alcohol_12_plus_yr  AlcoholDay  alcohol_year smoke_now smoke_100  smoke_100n  smoke_age marijuana  age_first_marij regular_marij  age_reg_marij hard_drugs sex_ever  sex_age  sex_num_partn_life  sex_num_part_year same_sex sex_orientation pregnant_now
0  51624     2009_10    male   34      30-39       409.0  White    NaN   High School        Married  25000-34999        30000.0     1.36         6.0      Own  NotWorking    87.4     NaN        NaN   164.7  32.22                 NaN     30.0_plus   70.0       113.0        85.0     114.0    88.0     114.0      88.0     112.0    82.0           NaN        1.29     3.49      352.0         NaN        NaN         NaN       No          NaN       Good              0.0             15.0            Most   Several           NaN      NaN         NaN            4.0           Yes          No             NaN        NaN          NaN            NaN              NaN                Yes         NaN           0.0        No       Yes      Smoker       18.0       Yes             17.0            No            NaN        Yes      Yes     16.0                 8.0                1.0       No    Heterosexual             
1  51624     2009_10    male   34      30-39       409.0  White    NaN   High School        Married  25000-34999        30000.0     1.36         6.0      Own  NotWorking    87.4     NaN        NaN   164.7  32.22                 NaN     30.0_plus   70.0       113.0        85.0     114.0    88.0     114.0      88.0     112.0    82.0           NaN        1.29     3.49      352.0         NaN        NaN         NaN       No          NaN       Good              0.0             15.0            Most   Several           NaN      NaN         NaN            4.0           Yes          No             NaN        NaN          NaN            NaN              NaN                Yes         NaN           0.0        No       Yes      Smoker       18.0       Yes             17.0            No            NaN        Yes      Yes     16.0                 8.0                1.0       No    Heterosexual             
2  51624     2009_10    male   34      30-39       409.0  White    NaN   High School        Married  25000-34999        30000.0     1.36         6.0      Own  NotWorking    87.4     NaN        NaN   164.7  32.22                 NaN     30.0_plus   70.0       113.0        85.0     114.0    88.0     114.0      88.0     112.0    82.0           NaN        1.29     3.49      352.0         NaN        NaN         NaN       No          NaN       Good              0.0             15.0            Most   Several           NaN      NaN         NaN            4.0           Yes          No             NaN        NaN          NaN            NaN              NaN                Yes         NaN           0.0        No       Yes      Smoker       18.0       Yes             17.0            No            NaN        Yes      Yes     16.0                 8.0                1.0       No    Heterosexual             
3  51625     2009_10    male    4        0-9        49.0  Other    NaN           NaN            NaN  20000-24999        22500.0     1.07         9.0      Own         NaN    17.0     NaN        NaN   105.4  15.30                 NaN     12.0_18.5    NaN         NaN         NaN       NaN     NaN       NaN       NaN       NaN     NaN           NaN         NaN      NaN        NaN         NaN        NaN         NaN       No          NaN        NaN              NaN              NaN             NaN       NaN           NaN      NaN         NaN            NaN           NaN         NaN             NaN        NaN          NaN            4.0              1.0                NaN         NaN           NaN       NaN       NaN         NaN        NaN       NaN              NaN           NaN            NaN        NaN      NaN      NaN                 NaN                NaN      NaN             NaN             
4  51630     2009_10  female   49      40-49       596.0  White    NaN  Some College    LivePartner  35000-44999        40000.0     1.91         5.0     Rent  NotWorking    86.7     NaN        NaN   168.4  30.57                 NaN     30.0_plus   86.0       112.0        75.0     118.0    82.0     108.0      74.0     116.0    76.0           NaN        1.16     6.70       77.0       0.094        NaN         NaN       No          NaN       Good              0.0             10.0         Several   Several           2.0      2.0        27.0            8.0           Yes          No             NaN        NaN          NaN            NaN              NaN                Yes         2.0          20.0       Yes       Yes      Smoker       38.0       Yes             18.0            No            NaN        Yes      Yes     12.0                10.0                1.0      Yes    Heterosexual             
5  51638     2009_10    male    9        0-9       115.0  White    NaN           NaN            NaN  75000-99999        87500.0     1.84         6.0     Rent         NaN    29.8     NaN        NaN   133.1  16.82                 NaN     12.0_18.5   82.0        86.0        47.0      84.0    50.0      84.0      50.0      88.0    44.0           NaN        1.34     4.86      123.0       1.538        NaN         NaN       No          NaN        NaN              NaN              NaN             NaN       NaN           NaN      NaN         NaN            NaN           NaN         NaN             NaN        NaN          NaN            5.0              0.0                NaN         NaN           NaN       NaN       NaN         NaN        NaN       NaN              NaN           NaN            NaN        NaN      NaN      NaN                 NaN                NaN      NaN             NaN             
6  51646     2009_10    male    8        0-9       101.0  White    NaN           NaN            NaN  55000-64999        60000.0     2.33         7.0      Own         NaN    35.2     NaN        NaN   130.6  20.64                 NaN  18.5_to_24.9   72.0       107.0        37.0     114.0    46.0     108.0      36.0     106.0    38.0           NaN        1.55     4.09      238.0       1.322        NaN         NaN       No          NaN        NaN              NaN              NaN             NaN       NaN           NaN      NaN         NaN            NaN           NaN         NaN             NaN        NaN          NaN            1.0              6.0                NaN         NaN           NaN       NaN       NaN         NaN        NaN       NaN              NaN           NaN            NaN        NaN      NaN      NaN                 NaN                NaN      NaN             NaN             
7  51647     2009_10  female   45      40-49       541.0  White    NaN  College Grad        Married  75000-99999        87500.0     5.00         6.0      Own     Working    75.7     NaN        NaN   166.7  27.24                 NaN  25.0_to_29.9   62.0       118.0        64.0     106.0    62.0     118.0      68.0     118.0    60.0           NaN        2.12     5.82      106.0       1.116        NaN         NaN       No          NaN      Vgood              0.0              3.0             NaN       NaN           1.0      NaN         NaN            8.0            No         Yes             5.0        NaN          NaN            NaN              NaN                Yes         3.0          52.0       NaN        No  Non-Smoker        NaN       Yes             13.0            No            NaN         No      Yes     13.0                20.0                0.0      Yes        Bisexual             
nhanes_py.tail(n=8)
         id survey_year  gender  age age_decade  age_months   race_1   race_3       education marital_status    hh_income  hh_income_mid  poverty  home_rooms home_own     work  weight  length  head_circ  height   bmi bmi_cat_under_20yrs       bmi_who  pulse  bp_sys_ave  bp_dia_ave  bp_sys_1  BPDia1  bp_sys_2  bp_dia_2  bp_sys_3  BPDia3  Testosterone  DirectChol  TotChol  UrineVol1  UrineFlow1  UrineVol2  UrineFlow2 diabetes  DiabetesAge health_gen  DaysPhysHlthBad  DaysMentHlthBad little_interest depressed  nPregnancies  nBabies  Age1stBaby  SleepHrsNight sleep_trouble phys_active  PhysActiveDays tv_hrs_day comp_hrs_day  TVHrsDayChild  CompHrsDayChild alcohol_12_plus_yr  AlcoholDay  alcohol_year smoke_now smoke_100  smoke_100n  smoke_age marijuana  age_first_marij regular_marij  age_reg_marij hard_drugs sex_ever  sex_age  sex_num_partn_life  sex_num_part_year same_sex sex_orientation pregnant_now
9992  71908     2011_12  female   66      60-69         NaN    White    White    College Grad        Widowed  65000-74999        70000.0     4.55         8.0      Own  Working    88.7     NaN        NaN   159.0  35.1                 NaN     30.0_plus   76.0       114.0        70.0     110.0    74.0     114.0      68.0     114.0    72.0         26.00        1.86     6.47       29.0       0.659       94.0       0.627       No          NaN  Excellent              0.0              0.0             NaN       NaN           2.0      2.0        22.0            6.0            No          No             NaN       2_hr    0_to_1_hr            NaN              NaN                 No         1.0           5.0       NaN        No  Non-Smoker        NaN       NaN              NaN           NaN            NaN         No      Yes     18.0                 1.0                NaN       No             NaN             
9993  71909     2011_12    male   28      20-29         NaN  Mexican  Mexican  9 - 11th Grade   NeverMarried    5000-9999         7500.0     0.46         3.0     Rent  Working    92.3     NaN        NaN   177.3  29.4                 NaN  25.0_to_29.9   68.0       124.0        65.0     124.0    62.0     126.0      64.0     122.0    66.0        490.43        1.22     3.90       97.0       0.942        NaN         NaN       No          NaN        NaN              NaN              NaN             NaN       NaN           NaN      NaN         NaN            6.0            No         Yes             NaN       1_hr         2_hr            NaN              NaN                NaN         NaN           NaN       Yes       Yes      Smoker       18.0       NaN              NaN           NaN            NaN        NaN      NaN      NaN                 NaN                NaN      NaN             NaN             
9994  71909     2011_12    male   28      20-29         NaN  Mexican  Mexican  9 - 11th Grade   NeverMarried    5000-9999         7500.0     0.46         3.0     Rent  Working    92.3     NaN        NaN   177.3  29.4                 NaN  25.0_to_29.9   68.0       124.0        65.0     124.0    62.0     126.0      64.0     122.0    66.0        490.43        1.22     3.90       97.0       0.942        NaN         NaN       No          NaN        NaN              NaN              NaN             NaN       NaN           NaN      NaN         NaN            6.0            No         Yes             NaN       1_hr         2_hr            NaN              NaN                NaN         NaN           NaN       Yes       Yes      Smoker       18.0       NaN              NaN           NaN            NaN        NaN      NaN      NaN                 NaN                NaN      NaN             NaN             
9995  71909     2011_12    male   28      20-29         NaN  Mexican  Mexican  9 - 11th Grade   NeverMarried    5000-9999         7500.0     0.46         3.0     Rent  Working    92.3     NaN        NaN   177.3  29.4                 NaN  25.0_to_29.9   68.0       124.0        65.0     124.0    62.0     126.0      64.0     122.0    66.0        490.43        1.22     3.90       97.0       0.942        NaN         NaN       No          NaN        NaN              NaN              NaN             NaN       NaN           NaN      NaN         NaN            6.0            No         Yes             NaN       1_hr         2_hr            NaN              NaN                NaN         NaN           NaN       Yes       Yes      Smoker       18.0       NaN              NaN           NaN            NaN        NaN      NaN      NaN                 NaN                NaN      NaN             NaN             
9996  71910     2011_12  female    0        0-9         5.0    White    White             NaN            NaN  75000-99999        87500.0     3.37        10.0      Own      NaN     6.7    67.6       42.2     NaN   NaN                 NaN           NaN    NaN         NaN         NaN       NaN     NaN       NaN       NaN       NaN     NaN           NaN         NaN      NaN        NaN         NaN        NaN         NaN      NaN          NaN        NaN              NaN              NaN             NaN       NaN           NaN      NaN         NaN            NaN           NaN         NaN             NaN        NaN          NaN            NaN              NaN                NaN         NaN           NaN       NaN       NaN         NaN        NaN       NaN              NaN           NaN            NaN        NaN      NaN      NaN                 NaN                NaN      NaN             NaN             
9997  71911     2011_12    male   27      20-29         NaN  Mexican  Mexican    College Grad        Married  75000-99999        87500.0     3.25        10.0      Own  Working    96.7     NaN        NaN   175.8  31.3                 NaN     30.0_plus   74.0       133.0        74.0     122.0    76.0     132.0      82.0     134.0    66.0        509.00        1.06     5.72       63.0       0.600        NaN         NaN       No          NaN       Good              0.0              2.0             NaN       NaN           NaN      NaN         NaN            6.0            No          No             3.0       1_hr    0_to_1_hr            NaN              NaN                Yes         5.0           4.0       NaN        No  Non-Smoker        NaN       Yes             22.0            No            NaN         No      Yes     21.0                 1.0                1.0       No    Heterosexual             
9998  71915     2011_12    male   60      60-69         NaN    White    White    College Grad   NeverMarried  65000-74999        70000.0     5.00         4.0      Own  Working    78.4     NaN        NaN   168.8  27.5                 NaN  25.0_to_29.9   76.0       147.0        73.0     150.0    72.0     148.0      74.0     146.0    72.0        505.13        0.93     4.94      218.0       1.253        NaN         NaN      Yes         56.0       Good              0.0              2.0             NaN       NaN           NaN      NaN         NaN            6.0            No          No             1.0       2_hr         1_hr            NaN              NaN                Yes         NaN           0.0       NaN        No  Non-Smoker        NaN       NaN              NaN           NaN            NaN         No      Yes     19.0                 2.0                NaN       No             NaN             
9999  71915     2011_12    male   60      60-69         NaN    White    White    College Grad   NeverMarried  65000-74999        70000.0     5.00         4.0      Own  Working    78.4     NaN        NaN   168.8  27.5                 NaN  25.0_to_29.9   76.0       147.0        73.0     150.0    72.0     148.0      74.0     146.0    72.0        505.13        0.93     4.94      218.0       1.253        NaN         NaN      Yes         56.0       Good              0.0              2.0             NaN       NaN           NaN      NaN         NaN            6.0            No          No             NaN       2_hr         1_hr            NaN              NaN                Yes         NaN           0.0       NaN        No  Non-Smoker        NaN       NaN              NaN           NaN            NaN         No      Yes     19.0                 2.0                NaN       No             NaN             
nhanes_r <- read.csv("../../dataset/nhanes.csv", stringsAsFactors=TRUE)
str(object=nhanes_r)
'data.frame':	10000 obs. of  76 variables:
 $ id                 : int  51624 51624 51624 51625 51630 51638 51646 51647 51647 51647 ...
 $ survey_year        : Factor w/ 2 levels "2009_10","2011_12": 1 1 1 1 1 1 1 1 1 1 ...
 $ gender             : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
 $ age                : int  34 34 34 4 49 9 8 45 45 45 ...
 $ age_decade         : Factor w/ 9 levels ""," 0-9"," 10-19",..: 5 5 5 2 6 2 2 6 6 6 ...
 $ age_months         : int  409 409 409 49 596 115 101 541 541 541 ...
 $ race_1             : Factor w/ 5 levels "Black","Hispanic",..: 5 5 5 4 5 5 5 5 5 5 ...
 $ race_3             : Factor w/ 7 levels "","Asian","Black",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ education          : Factor w/ 6 levels "","8th Grade",..: 5 5 5 1 6 1 1 4 4 4 ...
 $ marital_status     : Factor w/ 7 levels "","Divorced",..: 4 4 4 1 3 1 1 4 4 4 ...
 $ hh_income          : Factor w/ 13 levels ""," 0-4999"," 5000-9999",..: 7 7 7 6 8 12 10 12 12 12 ...
 $ hh_income_mid      : int  30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
 $ poverty            : num  1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
 $ home_rooms         : int  6 6 6 9 5 6 7 6 6 6 ...
 $ home_own           : Factor w/ 4 levels "","Other","Own",..: 3 3 3 3 4 4 3 3 3 3 ...
 $ work               : Factor w/ 4 levels "","Looking","NotWorking",..: 3 3 3 1 3 1 1 4 4 4 ...
 $ weight             : num  87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
 $ length             : num  NA NA NA NA NA NA NA NA NA NA ...
 $ head_circ          : num  NA NA NA NA NA NA NA NA NA NA ...
 $ height             : num  165 165 165 105 168 ...
 $ bmi                : num  32.2 32.2 32.2 15.3 30.6 ...
 $ bmi_cat_under_20yrs: Factor w/ 5 levels "","NormWeight",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ bmi_who            : Factor w/ 5 levels "","12.0_18.5",..: 5 5 5 2 5 2 3 4 4 4 ...
 $ pulse              : int  70 70 70 NA 86 82 72 62 62 62 ...
 $ bp_sys_ave         : int  113 113 113 NA 112 86 107 118 118 118 ...
 $ bp_dia_ave         : int  85 85 85 NA 75 47 37 64 64 64 ...
 $ bp_sys_1           : int  114 114 114 NA 118 84 114 106 106 106 ...
 $ BPDia1             : int  88 88 88 NA 82 50 46 62 62 62 ...
 $ bp_sys_2           : int  114 114 114 NA 108 84 108 118 118 118 ...
 $ bp_dia_2           : int  88 88 88 NA 74 50 36 68 68 68 ...
 $ bp_sys_3           : int  112 112 112 NA 116 88 106 118 118 118 ...
 $ BPDia3             : int  82 82 82 NA 76 44 38 60 60 60 ...
 $ Testosterone       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ DirectChol         : num  1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
 $ TotChol            : num  3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
 $ UrineVol1          : int  352 352 352 NA 77 123 238 106 106 106 ...
 $ UrineFlow1         : num  NA NA NA NA 0.094 ...
 $ UrineVol2          : int  NA NA NA NA NA NA NA NA NA NA ...
 $ UrineFlow2         : num  NA NA NA NA NA NA NA NA NA NA ...
 $ diabetes           : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ DiabetesAge        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ health_gen         : Factor w/ 6 levels "","Excellent",..: 4 4 4 1 4 1 1 6 6 6 ...
 $ DaysPhysHlthBad    : int  0 0 0 NA 0 NA NA 0 0 0 ...
 $ DaysMentHlthBad    : int  15 15 15 NA 10 NA NA 3 3 3 ...
 $ little_interest    : Factor w/ 4 levels "","Most","None",..: 2 2 2 1 4 1 1 3 3 3 ...
 $ depressed          : Factor w/ 4 levels "","Most","None",..: 4 4 4 1 4 1 1 3 3 3 ...
 $ nPregnancies       : int  NA NA NA NA 2 NA NA 1 1 1 ...
 $ nBabies            : int  NA NA NA NA 2 NA NA NA NA NA ...
 $ Age1stBaby         : int  NA NA NA NA 27 NA NA NA NA NA ...
 $ SleepHrsNight      : int  4 4 4 NA 8 NA NA 8 8 8 ...
 $ sleep_trouble      : Factor w/ 3 levels "","No","Yes": 3 3 3 1 3 1 1 2 2 2 ...
 $ phys_active        : Factor w/ 3 levels "","No","Yes": 2 2 2 1 2 1 1 3 3 3 ...
 $ PhysActiveDays     : int  NA NA NA NA NA NA NA 5 5 5 ...
 $ tv_hrs_day         : Factor w/ 8 levels "","0_hrs","0_to_1_hr",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ comp_hrs_day       : Factor w/ 8 levels "","0_hrs","0_to_1_hr",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ TVHrsDayChild      : int  NA NA NA 4 NA 5 1 NA NA NA ...
 $ CompHrsDayChild    : int  NA NA NA 1 NA 0 6 NA NA NA ...
 $ alcohol_12_plus_yr : Factor w/ 3 levels "","No","Yes": 3 3 3 1 3 1 1 3 3 3 ...
 $ AlcoholDay         : int  NA NA NA NA 2 NA NA 3 3 3 ...
 $ alcohol_year       : int  0 0 0 NA 20 NA NA 52 52 52 ...
 $ smoke_now          : Factor w/ 3 levels "","No","Yes": 2 2 2 1 3 1 1 1 1 1 ...
 $ smoke_100          : Factor w/ 3 levels "","No","Yes": 3 3 3 1 3 1 1 2 2 2 ...
 $ smoke_100n         : Factor w/ 3 levels "","Non-Smoker",..: 3 3 3 1 3 1 1 2 2 2 ...
 $ smoke_age          : int  18 18 18 NA 38 NA NA NA NA NA ...
 $ marijuana          : Factor w/ 3 levels "","No","Yes": 3 3 3 1 3 1 1 3 3 3 ...
 $ age_first_marij    : int  17 17 17 NA 18 NA NA 13 13 13 ...
 $ regular_marij      : Factor w/ 3 levels "","No","Yes": 2 2 2 1 2 1 1 2 2 2 ...
 $ age_reg_marij      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ hard_drugs         : Factor w/ 3 levels "","No","Yes": 3 3 3 1 3 1 1 2 2 2 ...
 $ sex_ever           : Factor w/ 3 levels "","No","Yes": 3 3 3 1 3 1 1 3 3 3 ...
 $ sex_age            : int  16 16 16 NA 12 NA NA 13 13 13 ...
 $ sex_num_partn_life : int  8 8 8 NA 10 NA NA 20 20 20 ...
 $ sex_num_part_year  : int  1 1 1 NA 1 NA NA 0 0 0 ...
 $ same_sex           : Factor w/ 3 levels "","No","Yes": 2 2 2 1 3 1 1 3 3 3 ...
 $ sex_orientation    : Factor w/ 4 levels "","Bisexual",..: 3 3 3 1 3 1 1 2 2 2 ...
 $ pregnant_now       : Factor w/ 4 levels " ","No","Unknown",..: 1 1 1 1 1 1 1 1 1 1 ...
head(x=nhanes_r, n=8)
     id survey_year gender age age_decade age_months race_1 race_3    education marital_status   hh_income hh_income_mid poverty home_rooms home_own       work weight length head_circ height bmi bmi_cat_under_20yrs      bmi_who pulse bp_sys_ave bp_dia_ave bp_sys_1 BPDia1 bp_sys_2 bp_dia_2 bp_sys_3 BPDia3 Testosterone DirectChol TotChol UrineVol1 UrineFlow1 UrineVol2 UrineFlow2 diabetes DiabetesAge health_gen DaysPhysHlthBad DaysMentHlthBad little_interest depressed nPregnancies nBabies Age1stBaby SleepHrsNight sleep_trouble phys_active PhysActiveDays tv_hrs_day comp_hrs_day TVHrsDayChild CompHrsDayChild alcohol_12_plus_yr AlcoholDay alcohol_year smoke_now smoke_100 smoke_100n smoke_age marijuana age_first_marij regular_marij age_reg_marij hard_drugs sex_ever sex_age sex_num_partn_life sex_num_part_year same_sex sex_orientation pregnant_now
1 51624     2009_10   male  34      30-39        409  White         High School        Married 25000-34999         30000     1.4          6      Own NotWorking     87     NA        NA    165  32                        30.0_plus    70        113         85      114     88      114       88      112     82           NA        1.3     3.5       352         NA        NA         NA       No          NA       Good               0              15            Most   Several           NA      NA         NA             4           Yes          No             NA                                    NA              NA                Yes         NA            0        No       Yes     Smoker        18       Yes              17            No            NA        Yes      Yes      16                  8                 1       No    Heterosexual             
2 51624     2009_10   male  34      30-39        409  White         High School        Married 25000-34999         30000     1.4          6      Own NotWorking     87     NA        NA    165  32                        30.0_plus    70        113         85      114     88      114       88      112     82           NA        1.3     3.5       352         NA        NA         NA       No          NA       Good               0              15            Most   Several           NA      NA         NA             4           Yes          No             NA                                    NA              NA                Yes         NA            0        No       Yes     Smoker        18       Yes              17            No            NA        Yes      Yes      16                  8                 1       No    Heterosexual             
3 51624     2009_10   male  34      30-39        409  White         High School        Married 25000-34999         30000     1.4          6      Own NotWorking     87     NA        NA    165  32                        30.0_plus    70        113         85      114     88      114       88      112     82           NA        1.3     3.5       352         NA        NA         NA       No          NA       Good               0              15            Most   Several           NA      NA         NA             4           Yes          No             NA                                    NA              NA                Yes         NA            0        No       Yes     Smoker        18       Yes              17            No            NA        Yes      Yes      16                  8                 1       No    Heterosexual             
4 51625     2009_10   male   4        0-9         49  Other                                    20000-24999         22500     1.1          9      Own                17     NA        NA    105  15                        12.0_18.5    NA         NA         NA       NA     NA       NA       NA       NA     NA           NA         NA      NA        NA         NA        NA         NA       No          NA                         NA              NA                                     NA      NA         NA            NA                                       NA                                     4               1                            NA           NA                                       NA                        NA                          NA                          NA                 NA                NA                                      
5 51630     2009_10 female  49      40-49        596  White        Some College    LivePartner 35000-44999         40000     1.9          5     Rent NotWorking     87     NA        NA    168  31                        30.0_plus    86        112         75      118     82      108       74      116     76           NA        1.2     6.7        77      0.094        NA         NA       No          NA       Good               0              10         Several   Several            2       2         27             8           Yes          No             NA                                    NA              NA                Yes          2           20       Yes       Yes     Smoker        38       Yes              18            No            NA        Yes      Yes      12                 10                 1      Yes    Heterosexual             
6 51638     2009_10   male   9        0-9        115  White                                    75000-99999         87500     1.8          6     Rent                30     NA        NA    133  17                        12.0_18.5    82         86         47       84     50       84       50       88     44           NA        1.3     4.9       123      1.538        NA         NA       No          NA                         NA              NA                                     NA      NA         NA            NA                                       NA                                     5               0                            NA           NA                                       NA                        NA                          NA                          NA                 NA                NA                                      
7 51646     2009_10   male   8        0-9        101  White                                    55000-64999         60000     2.3          7      Own                35     NA        NA    131  21                     18.5_to_24.9    72        107         37      114     46      108       36      106     38           NA        1.6     4.1       238      1.322        NA         NA       No          NA                         NA              NA                                     NA      NA         NA            NA                                       NA                                     1               6                            NA           NA                                       NA                        NA                          NA                          NA                 NA                NA                                      
8 51647     2009_10 female  45      40-49        541  White        College Grad        Married 75000-99999         87500     5.0          6      Own    Working     76     NA        NA    167  27                     25.0_to_29.9    62        118         64      106     62      118       68      118     60           NA        2.1     5.8       106      1.116        NA         NA       No          NA      Vgood               0               3            None      None            1      NA         NA             8            No         Yes              5                                    NA              NA                Yes          3           52                  No Non-Smoker        NA       Yes              13            No            NA         No      Yes      13                 20                 0      Yes        Bisexual             
tail(x=nhanes_r, n=8)
         id survey_year gender age age_decade age_months  race_1  race_3      education marital_status   hh_income hh_income_mid poverty home_rooms home_own    work weight length head_circ height bmi bmi_cat_under_20yrs      bmi_who pulse bp_sys_ave bp_dia_ave bp_sys_1 BPDia1 bp_sys_2 bp_dia_2 bp_sys_3 BPDia3 Testosterone DirectChol TotChol UrineVol1 UrineFlow1 UrineVol2 UrineFlow2 diabetes DiabetesAge health_gen DaysPhysHlthBad DaysMentHlthBad little_interest depressed nPregnancies nBabies Age1stBaby SleepHrsNight sleep_trouble phys_active PhysActiveDays tv_hrs_day comp_hrs_day TVHrsDayChild CompHrsDayChild alcohol_12_plus_yr AlcoholDay alcohol_year smoke_now smoke_100 smoke_100n smoke_age marijuana age_first_marij regular_marij age_reg_marij hard_drugs sex_ever sex_age sex_num_partn_life sex_num_part_year same_sex sex_orientation pregnant_now
9993  71908     2011_12 female  66      60-69         NA   White   White   College Grad        Widowed 65000-74999         70000    4.55          8      Own Working   88.7     NA        NA    159  35                        30.0_plus    76        114         70      110     74      114       68      114     72           26       1.86     6.5        29       0.66        94       0.63       No          NA  Excellent               0               0            None      None            2       2         22             6            No          No             NA       2_hr    0_to_1_hr            NA              NA                 No          1            5                  No Non-Smoker        NA                        NA                          NA         No      Yes      18                  1                NA       No                             
9994  71909     2011_12   male  28      20-29         NA Mexican Mexican 9 - 11th Grade   NeverMarried   5000-9999          7500    0.46          3     Rent Working   92.3     NA        NA    177  29                     25.0_to_29.9    68        124         65      124     62      126       64      122     66          490       1.22     3.9        97       0.94        NA         NA       No          NA                         NA              NA                                     NA      NA         NA             6            No         Yes             NA       1_hr         2_hr            NA              NA                            NA           NA       Yes       Yes     Smoker        18                        NA                          NA                          NA                 NA                NA                                      
9995  71909     2011_12   male  28      20-29         NA Mexican Mexican 9 - 11th Grade   NeverMarried   5000-9999          7500    0.46          3     Rent Working   92.3     NA        NA    177  29                     25.0_to_29.9    68        124         65      124     62      126       64      122     66          490       1.22     3.9        97       0.94        NA         NA       No          NA                         NA              NA                                     NA      NA         NA             6            No         Yes             NA       1_hr         2_hr            NA              NA                            NA           NA       Yes       Yes     Smoker        18                        NA                          NA                          NA                 NA                NA                                      
9996  71909     2011_12   male  28      20-29         NA Mexican Mexican 9 - 11th Grade   NeverMarried   5000-9999          7500    0.46          3     Rent Working   92.3     NA        NA    177  29                     25.0_to_29.9    68        124         65      124     62      126       64      122     66          490       1.22     3.9        97       0.94        NA         NA       No          NA                         NA              NA                                     NA      NA         NA             6            No         Yes             NA       1_hr         2_hr            NA              NA                            NA           NA       Yes       Yes     Smoker        18                        NA                          NA                          NA                 NA                NA                                      
9997  71910     2011_12 female   0        0-9          5   White   White                               75000-99999         87500    3.37         10      Own            6.7     68        42     NA  NA                                     NA         NA         NA       NA     NA       NA       NA       NA     NA           NA         NA      NA        NA         NA        NA         NA                   NA                         NA              NA                                     NA      NA         NA            NA                                       NA                                    NA              NA                            NA           NA                                       NA                        NA                          NA                          NA                 NA                NA                                      
9998  71911     2011_12   male  27      20-29         NA Mexican Mexican   College Grad        Married 75000-99999         87500    3.25         10      Own Working   96.7     NA        NA    176  31                        30.0_plus    74        133         74      122     76      132       82      134     66          509       1.06     5.7        63       0.60        NA         NA       No          NA       Good               0               2            None      None           NA      NA         NA             6            No          No              3       1_hr    0_to_1_hr            NA              NA                Yes          5            4                  No Non-Smoker        NA       Yes              22            No            NA         No      Yes      21                  1                 1       No    Heterosexual             
9999  71915     2011_12   male  60      60-69         NA   White   White   College Grad   NeverMarried 65000-74999         70000    5.00          4      Own Working   78.4     NA        NA    169  28                     25.0_to_29.9    76        147         73      150     72      148       74      146     72          505       0.93     4.9       218       1.25        NA         NA      Yes          56       Good               0               2            None      None           NA      NA         NA             6            No          No              1       2_hr         1_hr            NA              NA                Yes         NA            0                  No Non-Smoker        NA                        NA                          NA         No      Yes      19                  2                NA       No                             
10000 71915     2011_12   male  60      60-69         NA   White   White   College Grad   NeverMarried 65000-74999         70000    5.00          4      Own Working   78.4     NA        NA    169  28                     25.0_to_29.9    76        147         73      150     72      148       74      146     72          505       0.93     4.9       218       1.25        NA         NA      Yes          56       Good               0               2            None      None           NA      NA         NA             6            No          No             NA       2_hr         1_hr            NA              NA                Yes         NA            0                  No Non-Smoker        NA                        NA                          NA         No      Yes      19                  2                NA       No                             

Preparing Data and Helper Functions

typeof(missing)
Missing

Missing data is still data. In Julia, the best practice is to represent an absent value with missing, the only instance of the Missing type. Wherever this data frame stores a missing value as NA, that value needs to be replaced with missing.
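The same idea carries over to the Python side of this post, where pandas marks missing entries with NaN. The snippet below is only a minimal sketch: the three-row data frame, its values, and the names raw, clean, and missing_counts are hypothetical and are not taken from the NHANES file. It assumes the sentinels to clean up are the literal string "NA" and the empty string.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; values are illustrative, not real NHANES rows.
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "education": ["High School", "NA", ""],
    "pulse": ["70", "NA", "86"],
})

# Assumption: the string "NA" and the empty string both stand for missing data.
clean = raw.replace({"NA": np.nan, "": np.nan})

# With the sentinels replaced, the text column of numbers can become numeric.
clean["pulse"] = pd.to_numeric(clean["pulse"])

# Count the missing entries in each column.
missing_counts = clean.isna().sum()
print(missing_counts)
```

Counting missing values per column right after the replacement is a quick way to confirm that every sentinel was caught before any analysis starts.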

nhanes_clean_jl = copy(nhanes_jl);  # copy, so cleaning does not also modify nhanes_jl
first(nhanes_clean_jl,7)
7×76 DataFrame
 Row │ id     survey_year  gender   age    age_decade  age_months  race_1    race_3     education     marital_status  hh_income    hh_income_mid  poverty   home_rooms  home_own  work        weight    length    head_circ  height    bmi       bmi_cat_under_20yrs  bmi_who       pulse    bp_sys_ave  bp_dia_ave  bp_sys_1  BPDia1   bp_sys_2  bp_dia_2  bp_sys_3  BPDia3   Testosterone  DirectChol  TotChol     UrineVol1  UrineFlow1   UrineVol2  UrineFlow2  diabetes  DiabetesAge  health_gen  DaysPhysHlthBad  DaysMentHlthBad  little_interest  depressed  nPregnancies  nBabies  Age1stBaby  SleepHrsNight  sleep_trouble  phys_active  PhysActiveDays  tv_hrs_day  comp_hrs_day  TVHrsDayChild  CompHrsDayChild  alcohol_12_plus_yr  AlcoholDay  alcohol_year  smoke_now  smoke_100  smoke_100n  smoke_age  marijuana  age_first_marij  regular_marij  age_reg_marij  hard_drugs  sex_ever  sex_age  sex_num_partn_life  sex_num_part_year  same_sex  sex_orientation  pregnant_now
     │ Int64  String7      String7  Int64  String7?    Int64?      String15  String15?  String15?     String15?       String15?    Int64?         Float64?  Int64?      String7?  String15?   Float64?  Float64?  Float64?   Float64?  Float64?  String15?            String15?     Int64?   Int64?      Int64?      Int64?    Int64?   Int64?    Int64?    Int64?    Int64?   Float64?      Float64?    Float64?    Int64?     Float64?     Int64?     Float64?    String3?  Int64?       String15?   Int64?           Int64?           String7?         String7?   Int64?        Int64?   Int64?      Int64?         String3?       String3?     Int64?          String15?   String15?     Int64?         Int64?           String3?            Int64?      Int64?        String3?   String3?   String15?   Int64?     String3?   Int64?           String3?       Int64?         String3?    String3?  Int64?   Int64?              Int64?             String3?  String15?        String7
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 51624  2009_10      male        34   30-39             409  White     missing    High School   Married         25000-34999          30000      1.36           6  Own       NotWorking      87.4   missing    missing     164.7     32.22  missing              30.0_plus          70         113          85       114       88       114        88       112       82       missing        1.29        3.49        352  missing        missing     missing  No            missing  Good                      0               15  Most             Several         missing  missing     missing              4  Yes            No                  missing  missing     missing             missing          missing  Yes                    missing             0  No         Yes        Smoker             18  Yes                     17  No                   missing  Yes         Yes            16                   8                  1  No        Heterosexual
   2 │ 51624  2009_10      male        34   30-39             409  White     missing    High School   Married         25000-34999          30000      1.36           6  Own       NotWorking      87.4   missing    missing     164.7     32.22  missing              30.0_plus          70         113          85       114       88       114        88       112       82       missing        1.29        3.49        352  missing        missing     missing  No            missing  Good                      0               15  Most             Several         missing  missing     missing              4  Yes            No                  missing  missing     missing             missing          missing  Yes                    missing             0  No         Yes        Smoker             18  Yes                     17  No                   missing  Yes         Yes            16                   8                  1  No        Heterosexual
   3 │ 51624  2009_10      male        34   30-39             409  White     missing    High School   Married         25000-34999          30000      1.36           6  Own       NotWorking      87.4   missing    missing     164.7     32.22  missing              30.0_plus          70         113          85       114       88       114        88       112       82       missing        1.29        3.49        352  missing        missing     missing  No            missing  Good                      0               15  Most             Several         missing  missing     missing              4  Yes            No                  missing  missing     missing             missing          missing  Yes                    missing             0  No         Yes        Smoker             18  Yes                     17  No                   missing  Yes         Yes            16                   8                  1  No        Heterosexual
   4 │ 51625  2009_10      male         4   0-9                49  Other     missing    missing       missing         20000-24999          22500      1.07           9  Own       missing         17.0   missing    missing     105.4     15.3   missing              12.0_18.5     missing     missing     missing   missing  missing   missing   missing   missing  missing       missing  missing     missing       missing  missing        missing     missing  No            missing  missing             missing          missing  missing          missing         missing  missing     missing        missing  missing        missing             missing  missing     missing                   4                1  missing                missing       missing  missing    missing    missing       missing  missing            missing  missing              missing  missing     missing   missing             missing            missing  missing   missing
   5 │ 51630  2009_10      female      49   40-49             596  White     missing    Some College  LivePartner     35000-44999          40000      1.91           5  Rent      NotWorking      86.7   missing    missing     168.4     30.57  missing              30.0_plus          86         112          75       118       82       108        74       116       76       missing        1.16        6.7          77        0.094    missing     missing  No            missing  Good                      0               10  Several          Several               2        2          27              8  Yes            No                  missing  missing     missing             missing          missing  Yes                          2            20  Yes        Yes        Smoker             38  Yes                     18  No                   missing  Yes         Yes            12                  10                  1  Yes       Heterosexual
   6 │ 51638  2009_10      male         9   0-9               115  White     missing    missing       missing         75000-99999          87500      1.84           6  Rent      missing         29.8   missing    missing     133.1     16.82  missing              12.0_18.5          82          86          47        84       50        84        50        88       44       missing        1.34        4.86        123        1.538    missing     missing  No            missing  missing             missing          missing  missing          missing         missing  missing     missing        missing  missing        missing             missing  missing     missing                   5                0  missing                missing       missing  missing    missing    missing       missing  missing            missing  missing              missing  missing     missing   missing             missing            missing  missing   missing
   7 │ 51646  2009_10      male         8   0-9               101  White     missing    missing       missing         55000-64999          60000      2.33           7  Own       missing         35.2   missing    missing     130.6     20.64  missing              18.5_to_24.9       72         107          37       114       46       108        36       106       38       missing        1.55        4.09        238        1.322    missing     missing  No            missing  missing             missing          missing  missing          missing         missing  missing     missing        missing  missing        missing             missing  missing     missing                   1                6  missing                missing       missing  missing    missing    missing       missing  missing            missing  missing              missing  missing     missing   missing             missing            missing  missing   missing
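
The same idea can be sketched in plain Python, without pandas, so it runs anywhere. This is only an illustration under assumptions: the raw_pulse values and the parse_optional_int() helper are hypothetical and not part of the NHANES workflow in this post. The NA marker is converted to None once, and every later calculation decides explicitly how to treat it.

```python
# Hypothetical raw values, as they might arrive from a CSV file (not NHANES data).
raw_pulse = ["70", "NA", "86", "82", "NA"]


def parse_optional_int(value, missing_marker="NA"):
    """Return None for the missing marker, otherwise the value as an int."""
    return None if value == missing_marker else int(value)


# Convert the marker once, up front.
pulse = [parse_optional_int(value) for value in raw_pulse]

# Summaries then skip missing values explicitly instead of silently.
observed = [value for value in pulse if value is not None]
missing_count = len(pulse) - len(observed)

print(pulse)
print(missing_count, "missing")
print(sum(observed) / len(observed))
```

Keeping the missing count next to the mean makes it clear how many observations the summary is actually based on.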

Now that we’ve handled missing data, creating a helper function (such as frequency_table_simple_categorical_jl() below) will help us summarize categorical data.

function frequency_table_simple_categorical_jl(df, df_column)
    # Count the rows in each category, then sort by descending frequency
    frequency_table_jl = sort(
        combine(groupby(df, df_column), nrow => :frequency),
        :frequency,
        rev=true
    );
    # Share of all rows in each category (missing counts as its own category)
    frequency_table_jl.percent_relative = frequency_table_jl.frequency / sum(frequency_table_jl.frequency) * 100;
    # Running total of the relative percentages
    frequency_table_jl.percent_cumulative = cumsum(frequency_table_jl.percent_relative) ./ sum(frequency_table_jl.percent_relative) * 100;
    frequency_table_jl
end
frequency_table_simple_categorical_jl (generic function with 1 method)
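
For comparison, here is a rough plain-Python sketch of the same frequency table idea. It is not the pandas code used in this post; the frequency_table() name and the toy sleep_trouble list are assumptions made for illustration. Like the Julia helper, it sorts by descending frequency and counts missing answers (None here) as their own category.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, frequency, percent_relative, percent_cumulative) rows,
    sorted by descending frequency. None is counted as its own category."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for value, frequency in counts.most_common():
        percent = frequency / total * 100
        cumulative += percent
        rows.append((value, frequency, round(percent, 2), round(cumulative, 2)))
    return rows


# Hypothetical toy column; None stands in for a missing answer.
sleep_trouble = ["No", "Yes", "No", None, "No", "Yes", None, "No"]

for row in frequency_table(sleep_trouble):
    print(row)
```
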
# Wrangling categorical data
nhanes_py["health_gen"].dtypes
dtype('O')
nhanes_py["health_gen"] = nhanes_py["health_gen"].astype("category")
nhanes_py["health_gen"].dtypes
CategoricalDtype(categories=['Excellent', 'Fair', 'Good', 'Poor', 'Vgood'], ordered=False)
nhanes_py["health_gen"].cat.categories
Index(['Excellent', 'Fair', 'Good', 'Poor', 'Vgood'], dtype='object')
# The inplace argument was removed in pandas 2.0, so assign the result back instead:
#nhanes_py["health_gen"] = nhanes_py["health_gen"].cat.reorder_categories(["Poor", "Fair", "Good", "Vgood", "Excellent"], ordered=True)
#nhanes_py["health_gen"].cat.categories
nhanes_r$health_gen <- ordered(nhanes_r$health_gen, levels = c("Poor", "Fair", "Good", "Vgood", "Excellent"))
nhanes_clean_r <- nhanes_r
str(nhanes_clean_r)
'data.frame':	10000 obs. of  77 variables:
 $ id                 : int  51624 51624 51624 51625 51630 51638 51646 51647 51647 51647 ...
 $ survey_year        : Factor w/ 2 levels "2009_10","2011_12": 1 1 1 1 1 1 1 1 1 1 ...
 $ gender             : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
 $ age                : int  34 34 34 4 49 9 8 45 45 45 ...
 $ age_decade         : Ord.factor w/ 8 levels " 0-9"<" 10-19"<..: 4 4 4 1 5 1 1 5 5 5 ...
 $ age_months         : int  409 409 409 49 596 115 101 541 541 541 ...
 $ race_1             : Factor w/ 5 levels "Black","Hispanic",..: 5 5 5 4 5 5 5 5 5 5 ...
 $ race_3             : Factor w/ 7 levels "","Asian","Black",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ education          : Ord.factor w/ 5 levels "8th Grade"<"9 - 11th Grade"<..: 3 3 3 NA 4 NA NA 5 5 5 ...
 $ marital_status     : Factor w/ 7 levels "","Divorced",..: 4 4 4 1 3 1 1 4 4 4 ...
 $ hh_income          : Ord.factor w/ 12 levels " 0-4999"<" 5000-9999"<..: 6 6 6 5 7 11 9 11 11 11 ...
 $ hh_income_mid      : int  30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
 $ poverty            : num  1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
 $ home_rooms         : int  6 6 6 9 5 6 7 6 6 6 ...
 $ home_own           : Factor w/ 4 levels "","Other","Own",..: 3 3 3 3 4 4 3 3 3 3 ...
 $ work               : Factor w/ 4 levels "","Looking","NotWorking",..: 3 3 3 1 3 1 1 4 4 4 ...
 $ weight             : num  87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
 $ length             : num  NA NA NA NA NA NA NA NA NA NA ...
 $ head_circ          : num  NA NA NA NA NA NA NA NA NA NA ...
 $ height             : num  165 165 165 105 168 ...
 $ bmi                : num  32.2 32.2 32.2 15.3 30.6 ...
 $ bmi_cat_under_20yrs: Ord.factor w/ 4 levels "UnderWeight"<..: NA NA NA NA NA NA NA NA NA NA ...
 $ bmi_who            : Ord.factor w/ 4 levels "12.0_18.5"<"18.5_to_24.9"<..: 4 4 4 1 4 1 2 3 3 3 ...
 $ pulse              : int  70 70 70 NA 86 82 72 62 62 62 ...
 $ bp_sys_ave         : int  113 113 113 NA 112 86 107 118 118 118 ...
 $ bp_dia_ave         : int  85 85 85 NA 75 47 37 64 64 64 ...
 $ bp_sys_1           : int  114 114 114 NA 118 84 114 106 106 106 ...
 $ BPDia1             : int  88 88 88 NA 82 50 46 62 62 62 ...
 $ bp_sys_2           : int  114 114 114 NA 108 84 108 118 118 118 ...
 $ bp_dia_2           : int  88 88 88 NA 74 50 36 68 68 68 ...
 $ bp_sys_3           : int  112 112 112 NA 116 88 106 118 118 118 ...
 $ BPDia3             : int  82 82 82 NA 76 44 38 60 60 60 ...
 $ Testosterone       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ DirectChol         : num  1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
 $ TotChol            : num  3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
 $ UrineVol1          : int  352 352 352 NA 77 123 238 106 106 106 ...
 $ UrineFlow1         : num  NA NA NA NA 0.094 ...
 $ UrineVol2          : int  NA NA NA NA NA NA NA NA NA NA ...
 $ UrineFlow2         : num  NA NA NA NA NA NA NA NA NA NA ...
 $ diabetes           : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ DiabetesAge        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ health_gen         : Ord.factor w/ 5 levels "Poor"<"Fair"<..: 3 3 3 NA 3 NA NA 4 4 4 ...
 $ DaysPhysHlthBad    : int  0 0 0 NA 0 NA NA 0 0 0 ...
 $ DaysMentHlthBad    : int  15 15 15 NA 10 NA NA 3 3 3 ...
 $ little_interest    : Ord.factor w/ 3 levels "None"<"Several"<..: 3 3 3 NA 2 NA NA 1 1 1 ...
 $ depressed          : Ord.factor w/ 3 levels "None"<"Several"<..: 2 2 2 NA 2 NA NA 1 1 1 ...
 $ nPregnancies       : int  NA NA NA NA 2 NA NA 1 1 1 ...
 $ nBabies            : int  NA NA NA NA 2 NA NA NA NA NA ...
 $ Age1stBaby         : int  NA NA NA NA 27 NA NA NA NA NA ...
 $ SleepHrsNight      : int  4 4 4 NA 8 NA NA 8 8 8 ...
 $ sleep_trouble      : logi  TRUE TRUE TRUE FALSE TRUE FALSE ...
 $ phys_active        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ PhysActiveDays     : int  NA NA NA NA NA NA NA 5 5 5 ...
 $ tv_hrs_day         : Ord.factor w/ 7 levels "0_hrs"<"0_to_1_hr"<..: NA NA NA NA NA NA NA NA NA NA ...
 $ comp_hrs_day       : Ord.factor w/ 7 levels "0_hrs"<"0_to_1_hr"<..: NA NA NA NA NA NA NA NA NA NA ...
 $ TVHrsDayChild      : int  NA NA NA 4 NA 5 1 NA NA NA ...
 $ CompHrsDayChild    : int  NA NA NA 1 NA 0 6 NA NA NA ...
 $ alcohol_12_plus_yr : logi  TRUE TRUE TRUE FALSE TRUE FALSE ...
 $ AlcoholDay         : int  NA NA NA NA 2 NA NA 3 3 3 ...
 $ alcohol_year       : int  0 0 0 NA 20 NA NA 52 52 52 ...
 $ smoke_now          : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ smoke_100          : logi  TRUE TRUE TRUE FALSE TRUE FALSE ...
 $ smoke_100n         : Factor w/ 3 levels "","Non-Smoker",..: 3 3 3 1 3 1 1 2 2 2 ...
 $ smoke_age          : int  18 18 18 NA 38 NA NA NA NA NA ...
 $ marijuana          : logi  TRUE TRUE TRUE FALSE TRUE FALSE ...
 $ age_first_marij    : int  17 17 17 NA 18 NA NA 13 13 13 ...
 $ regular_marij      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ age_reg_marij      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ hard_drugs         : logi  TRUE TRUE TRUE FALSE TRUE FALSE ...
 $ sex_ever           : Factor w/ 3 levels "","No","Yes": 3 3 3 1 3 1 1 3 3 3 ...
 $ sex_age            : int  16 16 16 NA 12 NA NA 13 13 13 ...
 $ sex_num_partn_life : int  8 8 8 NA 10 NA NA 20 20 20 ...
 $ sex_num_part_year  : int  1 1 1 NA 1 NA NA 0 0 0 ...
 $ same_sex           : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ sex_orientation    : Factor w/ 4 levels "","Bisexual",..: 3 3 3 1 3 1 1 2 2 2 ...
 $ pregnant_now       : Factor w/ 4 levels " ","No","Unknown",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sex_eve            : logi  TRUE TRUE TRUE FALSE TRUE FALSE ...
head(nhanes_clean_r, 7)
     id survey_year gender age age_decade age_months race_1 race_3    education marital_status   hh_income hh_income_mid poverty home_rooms home_own       work weight length head_circ height bmi bmi_cat_under_20yrs      bmi_who pulse bp_sys_ave bp_dia_ave bp_sys_1 BPDia1 bp_sys_2 bp_dia_2 bp_sys_3 BPDia3 Testosterone DirectChol TotChol UrineVol1 UrineFlow1 UrineVol2 UrineFlow2 diabetes DiabetesAge health_gen DaysPhysHlthBad DaysMentHlthBad little_interest depressed nPregnancies nBabies Age1stBaby SleepHrsNight sleep_trouble phys_active PhysActiveDays tv_hrs_day comp_hrs_day TVHrsDayChild CompHrsDayChild alcohol_12_plus_yr AlcoholDay alcohol_year smoke_now smoke_100 smoke_100n smoke_age marijuana age_first_marij regular_marij age_reg_marij hard_drugs sex_ever sex_age sex_num_partn_life sex_num_part_year same_sex sex_orientation pregnant_now sex_eve
1 51624     2009_10   male  34      30-39        409  White         High School        Married 25000-34999         30000     1.4          6      Own NotWorking     87     NA        NA    165  32                <NA>    30.0_plus    70        113         85      114     88      114       88      112     82           NA        1.3     3.5       352         NA        NA         NA    FALSE          NA       Good               0              15            Most   Several           NA      NA         NA             4          TRUE       FALSE             NA       <NA>         <NA>            NA              NA               TRUE         NA            0     FALSE      TRUE     Smoker        18      TRUE              17         FALSE            NA       TRUE      Yes      16                  8                 1    FALSE    Heterosexual                 TRUE
2 51624     2009_10   male  34      30-39        409  White         High School        Married 25000-34999         30000     1.4          6      Own NotWorking     87     NA        NA    165  32                <NA>    30.0_plus    70        113         85      114     88      114       88      112     82           NA        1.3     3.5       352         NA        NA         NA    FALSE          NA       Good               0              15            Most   Several           NA      NA         NA             4          TRUE       FALSE             NA       <NA>         <NA>            NA              NA               TRUE         NA            0     FALSE      TRUE     Smoker        18      TRUE              17         FALSE            NA       TRUE      Yes      16                  8                 1    FALSE    Heterosexual                 TRUE
3 51624     2009_10   male  34      30-39        409  White         High School        Married 25000-34999         30000     1.4          6      Own NotWorking     87     NA        NA    165  32                <NA>    30.0_plus    70        113         85      114     88      114       88      112     82           NA        1.3     3.5       352         NA        NA         NA    FALSE          NA       Good               0              15            Most   Several           NA      NA         NA             4          TRUE       FALSE             NA       <NA>         <NA>            NA              NA               TRUE         NA            0     FALSE      TRUE     Smoker        18      TRUE              17         FALSE            NA       TRUE      Yes      16                  8                 1    FALSE    Heterosexual                 TRUE
4 51625     2009_10   male   4        0-9         49  Other                <NA>                20000-24999         22500     1.1          9      Own                17     NA        NA    105  15                <NA>    12.0_18.5    NA         NA         NA       NA     NA       NA       NA       NA     NA           NA         NA      NA        NA         NA        NA         NA    FALSE          NA       <NA>              NA              NA            <NA>      <NA>           NA      NA         NA            NA         FALSE       FALSE             NA       <NA>         <NA>             4               1              FALSE         NA           NA     FALSE     FALSE                   NA     FALSE              NA         FALSE            NA      FALSE               NA                 NA                NA    FALSE                                FALSE
5 51630     2009_10 female  49      40-49        596  White        Some College    LivePartner 35000-44999         40000     1.9          5     Rent NotWorking     87     NA        NA    168  31                <NA>    30.0_plus    86        112         75      118     82      108       74      116     76           NA        1.2     6.7        77      0.094        NA         NA    FALSE          NA       Good               0              10         Several   Several            2       2         27             8          TRUE       FALSE             NA       <NA>         <NA>            NA              NA               TRUE          2           20      TRUE      TRUE     Smoker        38      TRUE              18         FALSE            NA       TRUE      Yes      12                 10                 1     TRUE    Heterosexual                 TRUE
6 51638     2009_10   male   9        0-9        115  White                <NA>                75000-99999         87500     1.8          6     Rent                30     NA        NA    133  17                <NA>    12.0_18.5    82         86         47       84     50       84       50       88     44           NA        1.3     4.9       123      1.538        NA         NA    FALSE          NA       <NA>              NA              NA            <NA>      <NA>           NA      NA         NA            NA         FALSE       FALSE             NA       <NA>         <NA>             5               0              FALSE         NA           NA     FALSE     FALSE                   NA     FALSE              NA         FALSE            NA      FALSE               NA                 NA                NA    FALSE                                FALSE
7 51646     2009_10   male   8        0-9        101  White                <NA>                55000-64999         60000     2.3          7      Own                35     NA        NA    131  21                <NA> 18.5_to_24.9    72        107         37      114     46      108       36      106     38           NA        1.6     4.1       238      1.322        NA         NA    FALSE          NA       <NA>              NA              NA            <NA>      <NA>           NA      NA         NA            NA         FALSE       FALSE             NA       <NA>         <NA>             1               6              FALSE         NA           NA     FALSE     FALSE                   NA     FALSE              NA         FALSE            NA      FALSE               NA                 NA                NA    FALSE                                FALSE

Identifiers

describe(nhanes_jl.id)
Summary Stats:
Length:         10000
Missing Count:  0
Mean:           61944.643800
Minimum:        51624.000000
1st Quartile:   56904.500000
Median:         62159.500000
3rd Quartile:   67039.000000
Maximum:        71915.000000
Type:           Int64
length(nhanes_jl.id)
10000
length(unique(nhanes_jl.id))
6779
nhanes_py.id.describe()
count    10000.00000
mean     61944.64380
std       5871.16716
min      51624.00000
25%      56904.50000
50%      62159.50000
75%      67039.00000
max      71915.00000
Name: id, dtype: float64
len(nhanes_py.id)
10000
len(pandas.unique(nhanes_py.id))
6779
summary(nhanes_r$id)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  51624   56904   62160   61945   67039   71915 
length(nhanes_r$id)
[1] 10000
length(unique(nhanes_r$id))
[1] 6779

Quantitative or Numerical Data

Discrete

describe(nhanes_jl.age)
Summary Stats:
Length:         10000
Missing Count:  0
Mean:           36.742100
Minimum:        0.000000
1st Quartile:   17.000000
Median:         36.000000
3rd Quartile:   54.000000
Maximum:        80.000000
Type:           Int64
histogram_age_jl = Gadfly.plot(
    nhanes_clean_jl,
    x=:age,
    Geom.histogram(bincount=81),
    theme_michaelmallari_jl
);

nhanes_py["age"].describe()
count    10000.000000
mean        36.742100
std         22.397566
min          0.000000
25%         17.000000
50%         36.000000
75%         54.000000
max         80.000000
Name: age, dtype: float64
histogram_age_py = (plotnine.ggplot(data=nhanes_py, mapping=plotnine.mapping.aes("age")) +
    plotnine.geoms.geom_histogram()
)

histogram_age_py
<ggplot: (8763219188682)>

/Volumes/Personal/Mami/__Netlify/hello@michaelmallari.com/www.michaelmallari.com/pythonenv/lib/python3.9/site-packages/plotnine/stats/stat_bin.py:95: PlotnineWarning: 'stat_bin()' using 'bins = 24'. Pick better value with 'binwidth'.

summary(nhanes_r$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      17      36      37      54      80 
histogram_age_r <- ggplot2::ggplot(nhanes_r, aes(x=age)) +
    geom_histogram() +
    theme_michaelmallari_r()

histogram_age_r

As with any numerical data (regardless of whether it’s discrete or continuous), we’re interested in knowing the summary statistics and visualizing the distribution of the data.
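
To make those two steps concrete, here is a small dependency-free Python sketch. The ages list is hypothetical (it is not drawn from the NHANES data), and the text histogram is only a stand-in for the plots above. It computes the same kind of summary as describe() or summary() and then bins the values by decade.

```python
import statistics
from collections import Counter

# Hypothetical ages, not taken from the NHANES data.
ages = [4, 8, 9, 34, 45, 49, 60]

# Quartiles with linear interpolation between neighbouring data points.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "min": min(ages),
    "q1": q1,
    "median": q2,
    "q3": q3,
    "max": max(ages),
}
for name, value in summary.items():
    print(f"{name:>6}: {value}")

# A crude text histogram: one bin per decade of age.
decades = Counter(age // 10 * 10 for age in ages)
for start in sorted(decades):
    print(f"{start:>2}-{start + 9:<2} {'#' * decades[start]}")
```

The bin width matters as much here as in the plots above: one bin per year and one bin per decade can tell quite different stories about the same column.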

Continuous

describe(nhanes_clean_jl.bmi)
Summary Stats:
Length:         10000
Missing Count:  366
Mean:           26.660136
Minimum:        12.880000
1st Quartile:   21.580000
Median:         25.980000
3rd Quartile:   30.890000
Maximum:        81.250000
Type:           Union{Missing, Float64}
histogram_bmi_jl = Gadfly.plot(
    nhanes_clean_jl,
    x=:bmi,
    Geom.histogram(bincount=83),
    theme_michaelmallari_jl
);

nhanes_py["bmi"].describe()
count    9634.000000
mean       26.660136
std         7.376579
min        12.880000
25%        21.580000
50%        25.980000
75%        30.890000
max        81.250000
Name: bmi, dtype: float64
summary(nhanes_r$bmi)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     13      22      26      27      31      81     366 

Qualitative or Categorical Data

Binary

CategoricalArrays.levels(nhanes_clean_jl.sleep_trouble)
2-element Array{String3,1}:
 "No"
 "Yes"
nhanes_jl[1:7, "sleep_trouble"]
7-element PooledArrays.PooledArray{Union{Missing, String3},UInt32,1,Array{UInt32,1}}:
 "Yes"
 "Yes"
 "Yes"
 missing
 "Yes"
 missing
 missing
frequency_table_simple_categorical_jl(nhanes_clean_jl, :sleep_trouble)
3×4 DataFrame
 Row │ sleep_trouble  frequency  percent_relative  percent_cumulative
     │ String3?       Int64      Float64           Float64
─────┼────────────────────────────────────────────────────────────────
   1 │ No                  5799             57.99               57.99
   2 │ missing             2228             22.28               80.27
   3 │ Yes                 1973             19.73              100.0
nhanes_py.sleep_trouble.head(7)
0    Yes
1    Yes
2    Yes
3    NaN
4    Yes
5    NaN
6    NaN
Name: sleep_trouble, dtype: object
head(nhanes_r$sleep_trouble, 7)
[1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE

Nominal

CategoricalArrays.levels(nhanes_clean_jl.sex_orientation)
3-element Array{String15,1}:
 "Bisexual"
 "Heterosexual"
 "Homosexual"
nhanes_clean_jl[1:7, "sex_orientation"]
7-element PooledArrays.PooledArray{Union{Missing, String15},UInt32,1,Array{UInt32,1}}:
 "Heterosexual"
 "Heterosexual"
 "Heterosexual"
 missing
 "Heterosexual"
 missing
 missing
frequency_table_simple_categorical_jl(nhanes_clean_jl, :sex_orientation)
4×4 DataFrame
 Row │ sex_orientation  frequency  percent_relative  percent_cumulative
     │ String15?        Int64      Float64           Float64
─────┼──────────────────────────────────────────────────────────────────
   1 │ missing               5158             51.58               51.58
   2 │ Heterosexual          4638             46.38               97.96
   3 │ Bisexual               119              1.19               99.15
   4 │ Homosexual              85              0.85              100.0
# histogram_sex_orientation_jl = Gadfly.plot(
#     nhanes_jl,
#     x = :sex_orientation,
#     Geom.histogram
# );
nhanes_py.sex_orientation.head(7)
0    Heterosexual
1    Heterosexual
2    Heterosexual
3             NaN
4    Heterosexual
5             NaN
6             NaN
Name: sex_orientation, dtype: object
head(nhanes_r$work, 7)
[1] NotWorking NotWorking NotWorking            NotWorking                      
Levels:  Looking NotWorking Working
histogram_work_r <- ggplot2::ggplot(nhanes_r, aes(y=work)) +
    # geom_bar() counts the rows in each category (stat="count" is its default)
    geom_bar(colour=palette_michaelmallari_r[19], fill=palette_michaelmallari_r[19]) +
    geom_text(
        aes(label=after_stat(count)),
        stat="count",
        hjust=1.5,
        colour=palette_michaelmallari_r[1]
    ) +
    # A single y scale: a second scale_y_discrete() call would replace this one and drop the limits.
    # The empty string is how missing answers are coded in this column (see the levels above).
    scale_y_discrete(limits=c("Working", "NotWorking", "", "Looking"), expand=c(0, 0), position="right") +
    labs(
        title="Employment Status",
        alt="Employment Status",
        subtitle="Frequency, n = 10,000",
        x=NULL,
        y=NULL,
        caption="Data Source: https://www.cdc.gov/nchs/nhanes/"
    ) +
    theme_michaelmallari_r()

histogram_work_r

summary(nhanes_r$work)
              Looking NotWorking    Working 
      2229        311       2847       4613 
epiDisplay::tab1(nhanes_r$work, sort.group="decreasing", cum.percent=TRUE, graph=FALSE)
nhanes_r$work : 
           Frequency Percent Cum. percent
Working         4613    46.1           46
NotWorking      2847    28.5           75
                2229    22.3           97
Looking          311     3.1          100
  Total        10000   100.0          100

Ordinal

CategoricalArrays.levels(nhanes_clean_jl.health_gen)
5-element Array{String15,1}:
 "Excellent"
 "Fair"
 "Good"
 "Poor"
 "Vgood"

We can see that the levels of this ordinal data (a column, or vector, in the nhanes_clean_jl data frame) are sorted alphabetically rather than in a meaningful order. Setting the correct order is one of the common data wrangling tasks needed when working with ordinal data. Let’s set the correct order as Poor < Fair < Good < Vgood < Excellent.

nhanes_clean_jl.health_gen = CategoricalArrays.CategoricalArray{Union{Missing, String}}(nhanes_clean_jl.health_gen, ordered=true);
CategoricalArrays.levels!(nhanes_clean_jl.health_gen, ["Poor", "Fair", "Good", "Vgood", "Excellent"]);
CategoricalArrays.levels(nhanes_clean_jl.health_gen)
5-element Array{String,1}:
 "Poor"
 "Fair"
 "Good"
 "Vgood"
 "Excellent"

With the health_gen levels in the correct order, the vector is “clean” and ready for data analysis, modeling, etc.
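
Why the order matters can be shown with a short plain-Python sketch. The responses list is hypothetical, and the HEALTH_ORDER and RANK names are mine rather than anything from pandas or CategoricalArrays. An alphabetical sort scrambles the scale, while sorting by an explicit rank keeps Poor < Fair < Good < Vgood < Excellent.

```python
# Meaningful order for the health_gen levels, from worst to best.
HEALTH_ORDER = ["Poor", "Fair", "Good", "Vgood", "Excellent"]
RANK = {level: position for position, level in enumerate(HEALTH_ORDER)}

# Hypothetical responses; None stands in for a missing answer.
responses = ["Good", "Excellent", None, "Poor", "Vgood", "Fair"]
observed = [response for response in responses if response is not None]

# Default string sorting is alphabetical, which ignores the meaning of the levels.
alphabetical = sorted(observed)

# Sorting by rank respects the ordinal scale.
by_rank = sorted(observed, key=RANK.__getitem__)

print(alphabetical)
print(by_rank)

# Comparisons only make sense through the rank, not through the strings themselves.
print(RANK["Good"] < RANK["Vgood"])
```

An ordered factor in R, or an ordered categorical in Julia or pandas, stores this kind of level-to-rank mapping alongside the data.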

frequency_table_health_gen_jl = frequency_table_simple_categorical_jl(nhanes_clean_jl, :health_gen)
6×4 DataFrame
 Row │ health_gen  frequency  percent_relative  percent_cumulative
     │ Cat…?       Int64      Float64           Float64
─────┼─────────────────────────────────────────────────────────────
   1 │ Good             2956             29.56               29.56
   2 │ Vgood            2508             25.08               54.64
   3 │ missing          2461             24.61               79.25
   4 │ Fair             1010             10.1                89.35
   5 │ Excellent         878              8.78               98.13
   6 │ Poor              187              1.87              100.0
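# Note: .loc slices by label and includes both endpoints, so this returns the rows labeled 1 through 7
nhanes_py.loc[1:7, "health_gen"]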
1     Good
2     Good
3      NaN
4     Good
5      NaN
6      NaN
7    Vgood
Name: health_gen, dtype: category
Categories (5, object): ['Excellent', 'Fair', 'Good', 'Poor', 'Vgood']
nhanes_r[1:7, "health_gen"]
[1] Good Good Good <NA> Good <NA> <NA>
Levels: Poor < Fair < Good < Vgood < Excellent
histogram_health_gen_r <- ggplot2::ggplot(nhanes_clean_r, aes(x=health_gen)) +
    # geom_bar() counts the rows in each category (stat="count" is its default)
    geom_bar(colour=palette_michaelmallari_r[19], fill=palette_michaelmallari_r[19]) +
    geom_text(
        aes(label=after_stat(count)),
        stat="count",
        hjust=1.5,
        colour=palette_michaelmallari_r[1]
    ) +
    # A single x scale: a second scale_x_discrete() call would replace this one and drop the limits
    scale_x_discrete(limits=rev(levels(nhanes_clean_r$health_gen)), expand=c(0, 0), position="right") +
    theme_michaelmallari_r()

histogram_health_gen_r

Applied Computing