Data-Informed Thinking + Doing Via Supervised Learning

Stroke of Insight: How Classification Trees Save Lives

Predicting stroke risk and improving preventive care (while reducing costs)—with examples of classification trees in R, Python, and Julia.

Vital Permanente, a fictional hospital system serving both rural and urban communities, is facing a critical problem: stroke-related hospitalizations are surging—putting lives, families, and the healthcare system under strain. Three alarming trends have emerged:

  1. Late Risk Identification: Traditional methods flag high-risk patients only after the damage is done.
  2. Disparities in Care: Rural patients, with fewer routine checkups, have been disproportionately affected.
  3. Complex Risk Factors: Conditions like hypertension, diabetes, and smoking are interdependent—making them difficult to evaluate comprehensively.

For doctors, the stakes are high: preventing strokes means saving lives, preserving patient independence, and restoring faith in preventive care.

Ignoring the Problem Comes at a High Cost

Without immediate intervention, the consequences could ripple through patients’ lives and the healthcare system:

  • Lives Lost or Irreversibly Changed: Stroke rates are projected to rise by 20% over the next decade, leading to untold suffering.
  • Unsustainable Costs: Stroke-related rehabilitation and long-term care threaten to overwhelm the system financially.
  • Eroding Trust in Healthcare: Physicians feel powerless without better tools—leaving patients to fall through the cracks.

Failing to act means missing the opportunity to make a meaningful difference.

How might supervised learning—or more specifically, decision trees—help doctors be more data-informed, in order to prevent strokes before they happen?


Using Decision Trees to Predict Stroke Risk

Armed with large datasets from its electronic medical records (EMR), Vital Permanente launched a data-driven mission: to leverage classification decision trees (or classification trees) to proactively identify stroke risk and empower earlier, personalized interventions. This project’s key objectives include:

  1. Building a Predictive Model: Analyze patient data with decision trees to flag high-risk individuals.
  2. Enhancing Physician Confidence: Deliver interpretable insights to guide preventive care decisions, such as medication and lifestyle counseling.
  3. Addressing Care Inequities: Ensure the solution works effectively across diverse patient populations, especially those underserved.
  4. Tracking Impact: Measure reductions in stroke incidence and healthcare costs over five years.

This initiative promises to transform Vital Permanente’s approach to preventive care—turning data into action, and potential tragedy into triumph.
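To make the first objective concrete, here is a minimal sketch of a classification tree in Python using scikit-learn. The data is synthetic: the feature names (age, hypertension, diabetes, smoking) and the risk-score formula that generates the labels are illustrative stand-ins, not Vital Permanente’s actual EMR schema.

```python
import numpy
import sklearn.tree
import sklearn.model_selection

# Synthetic cohort of 1,000 patients (illustrative only).
rng = numpy.random.default_rng(seed=42)
n = 1000
age = rng.integers(30, 90, size=n)
hypertension = rng.integers(0, 2, size=n)  # 0 = no, 1 = yes
diabetes = rng.integers(0, 2, size=n)
smoking = rng.integers(0, 2, size=n)

# Assumed label-generating process: risk rises with age and with
# each comorbidity, plus noise. Real labels would come from the EMR.
risk_score = (0.02 * (age - 30) + 0.3 * hypertension
              + 0.25 * diabetes + 0.2 * smoking)
stroke = (risk_score + rng.normal(0, 0.2, size=n) > 0.8).astype(int)

features = numpy.column_stack([age, hypertension, diabetes, smoking])
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    features, stroke, test_size=0.25, random_state=42, stratify=stroke)

# A shallow tree keeps the decision rules readable for physicians.
model = sklearn.tree.DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
print(sklearn.tree.export_text(
    model, feature_names=["age", "hypertension", "diabetes", "smoking"]))
```

Capping max_depth is the key design choice here: the fitted tree prints as a handful of if-then rules, which is exactly the kind of interpretable output the second objective—enhancing physician confidence—calls for.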

Now that we have context on this healthcare problem …

♪ Let’s get technical, technical. I wanna get technical. ♪



Appendix A: Environment, Reproducibility, and Coding Style

If you are interested in reproducing this work, here are the versions of R, Python, and Julia, and of the respective packages, that I used. Additionally, Leland Wilkinson’s approach to data visualization (the Grammar of Graphics) has been adopted for this work. Finally, my coding style here is verbose, in order to make it easy to trace where functions/methods and variables originate from—and to make this a learning experience for everyone, including me.

cat(
    R.version$version.string, "-", R.version$nickname,
    "\nOS:", Sys.info()["sysname"], R.version$platform,
    "\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
R version 4.2.3 (2023-03-15) - Shortstop Beagle 
OS: Darwin x86_64-apple-darwin17.0 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.5.0", repos="http://cran.us.r-project.org")
devtools::install_version("caret", version="6.0.94", repos="http://cran.us.r-project.org")
library(dplyr)
library(ggplot2)
library(caret)
import sys
print(sys.version)
3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
!pip install pandas==2.0.3
!pip install plotnine==0.12.1
!pip install scikit-learn==1.3.0
import random
import datetime
import pandas
import plotnine
import sklearn
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server
using Pkg
Pkg.add(name="HTTP", version="1.10.2")
Pkg.add(name="CSV", version="0.10.13")
Pkg.add(name="DataFrames", version="1.6.1")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="StatsBase", version="0.34.2")
Pkg.add(name="Gadfly", version="1.4.0")
using HTTP
using CSV
using DataFrames
using CategoricalArrays
using StatsBase
using Gadfly

Further Readings

  • Hildebrand, D. K., Ott, R. L., & Gray, J. B. (2005). Basic Statistical Ideas for Managers (2nd ed.). Thomson Brooks/Cole.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1
  • James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning: With Applications in Python. Springer. https://doi.org/10.1007/978-3-031-38747-0
  • Kuhn, M. & Johnson, K. (2013). Applied Predictive Modeling. Springer. https://doi.org/10.1007/978-1-4614-6849-3
  • Stine, R. & Foster, D. (2017). Statistics for Business: Decision Making and Analysis (3rd ed.). Pearson.