Data-Informed Thinking + Doing

Explanatory Modeling Using Linear Regression

—using R, Python, and Julia.

As demonstrated in a previous post, linear regression is well suited to making numerical predictions. By fitting a regression model to the available data, we can predict future observations, and those predictions tend to be reliable when the model's assumptions hold. Its simplicity and interpretability make it a versatile tool for both understanding the relationships between variables and making predictions from data.

Linear regression is equally useful for explanatory modeling. Here it lets us understand the relationships between variables, identify important predictors, and quantify their association with the outcome variable. When the study design supports it, it can also help uncover causal relationships and provide insight into the underlying mechanisms.
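To make the explanatory use concrete, here is a minimal Python sketch. It relies on assumptions that are not part of this post's dataset: the data are synthetic, and the names `predictor` and `outcome`, as well as the true intercept and slope, are made up for illustration. It fits a simple linear regression with `scipy.stats.linregress` and reads off the quantities an explanatory analysis usually cares about: the estimated slope, its standard error and confidence interval, the p-value, and R-squared.

```python
import numpy
from scipy import stats

# Hypothetical synthetic data (an assumption for illustration, not the post's dataset).
random_generator = numpy.random.default_rng(seed=42)
predictor = random_generator.uniform(low=0.0, high=10.0, size=200)
noise = random_generator.normal(loc=0.0, scale=1.0, size=200)
outcome = 3.0 + 2.0 * predictor + noise  # assumed true intercept 3.0 and slope 2.0

# Fit a simple linear regression by ordinary least squares.
fit = stats.linregress(x=predictor, y=outcome)

# 95% confidence interval for the slope, using the t distribution
# with n - 2 degrees of freedom.
degrees_of_freedom = predictor.size - 2
t_critical = stats.t.ppf(0.975, df=degrees_of_freedom)
slope_ci = (
    fit.slope - t_critical * fit.stderr,
    fit.slope + t_critical * fit.stderr,
)

# Explanatory summary: how strongly, and how certainly, the outcome
# is associated with the predictor.
print(f"intercept: {fit.intercept:.3f}")
print(f"slope: {fit.slope:.3f} (standard error {fit.stderr:.3f})")
print(f"95% CI for slope: [{slope_ci[0]:.3f}, {slope_ci[1]:.3f}]")
print(f"p-value for slope: {fit.pvalue:.3g}")
print(f"R-squared: {fit.rvalue ** 2:.3f}")
```

The slope is read as the expected change in the outcome for a one-unit increase in the predictor. On its own, that describes an association; a causal reading needs support from the study design.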

Let’s look at this technique by using an X dataset.

Getting Started

If you are interested in reproducing this work, here are the versions of R, Python, and Julia used, along with the packages for each. My coding style here is deliberately verbose, so that it is easy to trace where functions, methods, and variables originate, and to make this a learning experience for everyone, including me.

cat(R.version$version.string, R.version$nickname)
R version 4.2.3 (2023-03-15) Shortstop Beagle
require(devtools)
devtools::install_version("dplyr", version="1.1.2", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.4.2", repos="http://cran.us.r-project.org")
library(dplyr)
library(ggplot2)
import sys
print(sys.version)
3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install plotnine==0.12.1
!pip install scipy==1.11.1
import numpy
import pandas
import plotnine
from scipy import stats
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server
using Pkg
Pkg.add(name="CSV", version="0.10.11")
Pkg.add(name="DataFrames", version="1.5.0")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="Colors", version="0.12.10")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.3.4")
Pkg.add(name="StatsBase", version="0.33.21")
using CSV
using DataFrames
using CategoricalArrays
using Colors
using Cairo
using Gadfly
using StatsBase

Importing and Examining Dataset

# counties_r <- readRDS("../../dataset/counties.rds")
# str(object=counties_r)
# head(x=counties_r, n=8)
# tail(x=counties_r, n=8)
# import pyreadr  # not installed above; requires `pip install pyreadr`
# counties_py = pyreadr.read_r("../../dataset/counties.rds")
# counties_py = counties_py[None]
# counties_py.info()
# counties_py.head(n=8)
# counties_py.tail(n=8)
# using RData  # not added above; requires Pkg.add("RData")
# counties_jl = RData.load("../../dataset/counties.rds")

References

  • Shmueli, G., Patel, N. R., & Bruce, P. C. (2007). Data Mining for Business Intelligence. Wiley.
  • Albright, S. C., Winston, W. L., & Zappe, C. (2003). Data Analysis for Managers with Microsoft Excel (2nd ed.). South-Western College Publishing.