Data-Informed Thinking + Doing

Probabilistic Thinking

Making probabilistic inferences about on-time flight arrival—using R, Python, and Julia.

Probability-based inference is one of the most valuable frameworks for reasoning under uncertainty. Using probability theory and statistical methods, we can quantify uncertainty, make predictions, and draw meaningful conclusions from data: we can estimate parameters, test hypotheses, and assess the strength of evidence. This gives us a solid foundation for decision-making and risk assessment. These tools apply across a wide range of fields (from science and engineering to business and policy-making); here we apply them to one example, on-time flight arrival.

Getting Started

If you are interested in reproducing this work, here are the versions of R, Python, and Julia used (as well as the respective packages for each). My coding style here is deliberately verbose, so that it is easy to trace where functions/methods and variables originate, and to make this a learning experience for everyone, including me.

cat(R.version$version.string, R.version$nickname)
R version 4.2.3 (2023-03-15) Shortstop Beagle
require(devtools)
devtools::install_version("dplyr", version="1.1.2", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.4.2", repos="http://cran.us.r-project.org")
library(dplyr)
library(ggplot2)
import sys
print(sys.version)
3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
!pip install pyreadr==0.4.7
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install plotnine==0.12.1
!pip install scipy==1.11.1
import pyreadr
import numpy
import pandas
import plotnine
from scipy import stats
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server
using Pkg
Pkg.add(name="RData", version="0.8.3")
Pkg.add(name="CSV", version="0.10.11")
Pkg.add(name="DataFrames", version="1.5.0")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="Colors", version="0.12.10")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.3.4")
Pkg.add(name="StatsBase", version="0.33.21")
using RData
using CSV
using DataFrames
using CategoricalArrays
using Colors
using Cairo
using Gadfly
using StatsBase

Importing and Examining Dataset

Bernoulli Trial

A Bernoulli trial is an experiment with exactly two possible outcomes: true/false, heads/tails, success/failure.
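
In symbols, and assuming we write $p$ for the probability of success (notation introduced here for illustration, not taken from the original text), a Bernoulli random variable $X$ can be sketched as:

```latex
P(X = 1) = p, \qquad P(X = 0) = 1 - p, \qquad 0 \le p \le 1
```

For a fair coin with heads counted as success, $p = 0.5$.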

What’s the probability that four coin flips will result in four heads (our event)? In the run below, we actually got four heads in four coin flips.

numpy.random.seed(1856)  # The University of Maryland, College Park was established in 1856.
random_numbers = numpy.random.random(size = 4)
random_numbers
array([0.43757175, 0.37668039, 0.1854241 , 0.01220306])
heads = random_numbers < 0.5
heads
array([ True,  True,  True,  True])
numpy.sum(heads)
4

We know, however, that this event is unlikely to occur again. Let’s repeat the experiment 10,000 times to get an idea of what the probability is.

n_all_heads = 0
for _ in range(10000):
    heads = numpy.random.random(size = 4) < 0.5
    n_heads = numpy.sum(heads)
    if n_heads == 4:
        n_all_heads += 1
    
n_all_heads / 10000
0.066

Across 10,000 repetitions of the experiment, all four flips came up heads 6.6% of the time.
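
As a sanity check on the simulated figure, the exact probability can be computed directly: for four independent fair flips, the chance of all heads is 0.5^4 = 0.0625, so an estimate near 6% is in the right neighborhood. The sketch below is a dependency-free variant using only Python's standard library rather than numpy. The helper names `binomial_pmf` and `simulate_all_heads` are hypothetical and introduced here for illustration, and the random stream differs from numpy's, so the simulated value will not reproduce 0.066 exactly.

```python
import math
import random


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def simulate_all_heads(n_experiments, n_flips=4, p_heads=0.5, seed=1856):
    """Fraction of experiments in which every one of n_flips comes up heads."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    n_all_heads = 0
    for _ in range(n_experiments):
        # A flip counts as heads when the uniform draw falls below p_heads.
        if all(rng.random() < p_heads for _ in range(n_flips)):
            n_all_heads += 1
    return n_all_heads / n_experiments


p_exact = binomial_pmf(4, 4, 0.5)        # 0.5 ** 4
p_simulated = simulate_all_heads(10000)  # Monte Carlo estimate of the same quantity
print(p_exact, p_simulated)
```

With 10,000 repetitions the standard error of the estimate is roughly 0.0024, so simulated values a few tenths of a percentage point away from 6.25% are expected.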


References

  • Blitzstein, J. K. & Hwang, J. (2019). Introduction to Probability (2nd ed.). CRC Press.
  • Stine, R. & Foster, D. (2017). Statistics for Business: Decision Making and Analysis (3rd ed.). Pearson.
  • Hildebrand, D. K., Ott, R. L., & Gray, J. B. (2005). Basic Statistical Ideas for Managers (2nd ed.). Thomson Brooks/Cole.