Data-Informed Thinking + Doing Via Supervised Learning

Scaled & Efficient Supervised Learning with AutoML

Accelerating time-to-value by automating modeling tasks on beer consumer data—using H2O.ai in R and Python.

In continuation of the binary classification project hypothetically chartered by the Blue Moon Brewing Company 12 years ago, this project is a longitudinal study that re-validates the relevance of Blue Moon’s STP (Segmentation, Targeting, Positioning) marketing strategy for today’s evolving beer consumers.

The objective of this data analysis is the same: to infer whether demographic data on gender, age, marital status, and income continue to indicate a consumer preference for light beer. To achieve this, I (again, hypothetically) collected survey data from 1,500 beer consumers. Employing automated machine learning (AutoML), this project explores a broad set of alternative classification methods, beyond the baseline logistic regression applied in the prior project.


Data Understanding

For data understanding, I imported a CSV file with 1,500 records and 5 columns. These columns are gender (0 for female, 1 for male), marital status (0 for unmarried, 1 for married), income, age, and beer preference (0 for regular, 1 for light). This initial analysis is critical for understanding the dataset’s structure and for preparing the subsequent data exploration and modeling.
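As a sketch of this import-and-inspect step in Python (the inline three-row sample below is hypothetical, standing in for the article's actual 1,500-record CSV, which is not reproduced here):

```python
import io

import pandas as pd

# Hypothetical 3-row stand-in for the 1,500-record survey CSV
csv_text = """gender,married,income,age,prefer_light
0,0,46279,49,0
1,1,58955,62,1
0,1,52306,56,0
"""

# In practice this would be: beer_py = pd.read_csv("beer.csv")
beer_py = pd.read_csv(io.StringIO(csv_text))

# Confirm the expected structure: rows x the 5 documented columns
print(beer_py.shape)
print(list(beer_py.columns))
```

With the real file, the shape check should report (1500, 5).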

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is essential for understanding single variables (univariate), pairs (bivariate), and multiple (multivariate) interactions. It reveals trends, patterns, and anomalies, informing subsequent analysis and hypothesis development.

Univariate Analysis

Univariate analysis summarizes and identifies patterns in individual variables. It informs subsequent analysis, revealing insights into distribution, central tendency, and variability. By examining one variable at a time, it detects outliers, assesses data quality, and sets the stage for more complex bivariate and multivariate analyses.

summary(beer_r)
     gender        married         income           age      prefer_light
 Min.   :0.00   Min.   :0.00   Min.   :24796   Min.   :21   Min.   :0.0  
 1st Qu.:0.00   1st Qu.:0.00   1st Qu.:46279   1st Qu.:49   1st Qu.:0.0  
 Median :0.00   Median :0.00   Median :52306   Median :56   Median :0.5  
 Mean   :0.43   Mean   :0.45   Mean   :52561   Mean   :55   Mean   :0.5  
 3rd Qu.:1.00   3rd Qu.:1.00   3rd Qu.:58955   3rd Qu.:62   3rd Qu.:1.0  
 Max.   :1.00   Max.   :1.00   Max.   :84031   Max.   :87   Max.   :1.0  
# Equivalent of summary(beer_r) in Python
beer_py.describe(include="all")
            gender     married        income          age  prefer_light
count  1500.000000  1500.00000   1500.000000  1500.000000   1500.000000
mean      0.430000     0.45000  52560.539444    55.352281      0.500000
std       0.495241     0.49766   9487.082677    10.221699      0.500167
min       0.000000     0.00000  24795.654470    21.000000      0.000000
25%       0.000000     0.00000  46278.626699    48.705995      0.000000
50%       0.000000     0.00000  52306.099551    55.614237      0.500000
75%       1.000000     1.00000  58954.552335    62.349463      1.000000
max       1.000000     1.00000  84030.572926    87.000000      1.000000

Bivariate Analysis

Bivariate analysis examines the relationship between two variables at a time. Here, a frequency (contingency) table cross-tabulates beer preference against each demographic variable, revealing how preference varies across groups.

#' Create a frequency table as a data frame (tidy format)
#'
#' @param data A data frame
#' @param row_var Character string of the row variable name
#' @param col_var Character string of the column variable name
#' @param row_labels Named vector for row variable labels
#' @param col_labels Named vector for column variable labels
#' @param row_name Character string for row dimension name
#' @param col_name Character string for column dimension name
#' @return A data frame with frequencies in wide format
create_frequency_dataframe <- function(data,
                                       row_var,
                                       col_var,
                                       row_labels = NULL,
                                       col_labels = NULL,
                                       row_name = "Row Variable",
                                       col_name = "Column Variable") {
  
  # Create working copy
  data_copy <- data
  
  # Apply labels if provided
  if (!is.null(row_labels)) {
    data_copy[[paste0(row_var, "_labeled")]] <- row_labels[as.character(data_copy[[row_var]])]
    row_var_to_use <- paste0(row_var, "_labeled")
  } else {
    row_var_to_use <- row_var
  }
  
  if (!is.null(col_labels)) {
    data_copy[[paste0(col_var, "_labeled")]] <- col_labels[as.character(data_copy[[col_var]])]
    col_var_to_use <- paste0(col_var, "_labeled")
  } else {
    col_var_to_use <- col_var
  }
  
  # Create frequency data frame
  freq_df <- data_copy %>%
    dplyr::count(
      .data[[row_var_to_use]],
      .data[[col_var_to_use]],
      name = "Frequency"
    ) %>%
    tidyr::pivot_wider(
      names_from = .data[[col_var_to_use]],
      values_from = Frequency,
      values_fill = 0
    ) %>%
    dplyr::rename(!!row_name := .data[[row_var_to_use]])
  
  return(freq_df)
}
# Frequency table on prefer_light and gender
create_frequency_dataframe(
  data=beer_r,
  row_var="prefer_light",
  col_var="gender",
  row_labels=c("0"="Regular", "1"="Light"),
  col_labels=c("0"="Female", "1"="Male"),
  row_name="Beer Preference",
  col_name="Gender"
)
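A Python counterpart to the R helper above can lean on pandas.crosstab (shown on a hypothetical six-row sample, since the full survey frame is not reproduced here):

```python
import pandas as pd

# Hypothetical sample standing in for beer_py
beer_py = pd.DataFrame({
    "gender":       [0, 1, 0, 1, 1, 0],
    "prefer_light": [0, 0, 1, 1, 0, 1],
})

# Frequency table on prefer_light and gender, with readable labels
freq = pd.crosstab(
    beer_py["prefer_light"].map({0: "Regular", 1: "Light"}).rename("Beer Preference"),
    beer_py["gender"].map({0: "Female", 1: "Male"}).rename("Gender"),
)
print(freq)
```

pandas.crosstab handles the counting, labeling, and pivoting that the R helper performs with dplyr::count and tidyr::pivot_wider.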

Multivariate Analysis

Multivariate analysis plays a pivotal role in understanding intricate relationships within datasets. By exploring patterns, identifying outliers, and revealing underlying structure, it provides valuable insights. One essential tool is the correlation matrix, which quantifies the strength and direction of the relationship between each pair of variables: positive values indicate direct associations, while negative values imply inverse relationships. Leveraging the correlation matrix, we can make informed decisions about feature selection and hypothesis testing, detect multicollinearity, and identify potential predictors more effectively.

correlation_matrix_r <- round(cor(beer_r), 2)
head(correlation_matrix_r[5:1, 1:5])
             gender married income   age prefer_light
prefer_light  -0.09    0.05   0.40 -0.41         1.00
age            0.19    0.22   0.11  1.00        -0.41
income         0.03    0.33   1.00  0.11         0.40
married       -0.04    1.00   0.33  0.22         0.05
gender         1.00   -0.04   0.03  0.19        -0.09
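The same matrix can be produced in Python with pandas (a sketch on a hypothetical sample; the real beer_py would reproduce the values shown above):

```python
import pandas as pd

# Hypothetical sample standing in for beer_py
beer_py = pd.DataFrame({
    "gender":       [0, 1, 0, 1, 1, 0],
    "married":      [0, 1, 1, 0, 1, 0],
    "income":       [46279, 58955, 52306, 61000, 49500, 40200],
    "age":          [49, 62, 56, 45, 58, 51],
    "prefer_light": [0, 0, 1, 1, 0, 1],
})

# Pairwise Pearson correlations, rounded to 2 decimals
correlation_matrix_py = beer_py.corr().round(2)
print(correlation_matrix_py)
```

As with the R output, the diagonal is 1.00 and the matrix is symmetric.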

Data Preparation

Data Frame Conversion to H2O Data Frame

beer_r$gender <- factor(beer_r$gender, levels=c(0, 1), labels=c("Female", "Male"))
beer_r$married <- as.logical(beer_r$married)
beer_r$prefer_light <- as.logical(beer_r$prefer_light)

local_h2o <- h2o.init()
beer_h2o_r <- as.h2o(beer_r)
dim(beer_h2o_r)
[1] 1500    5
head(beer_h2o_r)
  gender married income age prefer_light
1   Male    TRUE  35885  48        FALSE
2 Female   FALSE  37737  66        FALSE
3 Female   FALSE  26388  62        FALSE
4 Female    TRUE  43483  61        FALSE
5 Female   FALSE  38079  54        FALSE
6 Female   FALSE  44328  41         TRUE

Train & Test Data Splitting

beer_splits_h2o_r <- h2o.splitFrame(data=beer_h2o_r, ratios=0.7, seed=1754)  #RoarLionRoar 🦁
beer_train_h2o_r <- beer_splits_h2o_r[[1]]
beer_test_h2o_r <- beer_splits_h2o_r[[2]]
dim(beer_train_h2o_r)
[1] 1040    5
head(beer_train_h2o_r)
  gender married income age prefer_light
1   Male    TRUE  35885  48        FALSE
2 Female   FALSE  37737  66        FALSE
3 Female    TRUE  43483  61        FALSE
4 Female   FALSE  44328  41         TRUE
5 Female    TRUE  40865  64        FALSE
6 Female   FALSE  54499  45         TRUE
dim(beer_test_h2o_r)
[1] 460   5
head(beer_test_h2o_r)
  gender married income age prefer_light
1 Female   FALSE  26388  62        FALSE
2 Female   FALSE  38079  54        FALSE
3   Male    TRUE  62118  62         TRUE
4   Male    TRUE  67201  85        FALSE
5 Female   FALSE  40382  62        FALSE
6   Male    TRUE  54339  38         TRUE
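For readers without an H2O cluster at hand, the same seeded ~70/30 split can be sketched in plain pandas (note that h2o.splitFrame samples probabilistically, which is why the H2O split above lands at 1040/460 rather than exactly 1050/450; the frame below is a hypothetical stand-in):

```python
import numpy as np
import pandas as pd

# Hypothetical 1,500-row frame standing in for beer_py
rng = np.random.default_rng(1754)
beer_py = pd.DataFrame({
    "income": rng.normal(52561, 9487, size=1500).round(2),
    "age":    rng.integers(21, 88, size=1500),
})

# Seeded 70/30 split, analogous to h2o.splitFrame(ratios=0.7)
train = beer_py.sample(frac=0.7, random_state=1754)
test = beer_py.drop(train.index)

print(len(train), len(test))  # 1050 450
```

Dropping the sampled index guarantees the two partitions are disjoint and exhaustive.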

Data Modeling

AutoML Classification Models: Training

models_classification_predictors_r <- c("gender", "married", "income", "age")
models_classification_response_r <- "prefer_light"

models_classification_r <- h2o.automl(
    x=models_classification_predictors_r,
    y=models_classification_response_r,
    training_frame=beer_train_h2o_r,
    max_models=12,
    seed=1754  #RoarLionRoar 🦁
)

AutoML Regression Models: Training

Suppose, within its brewery & restaurant in the RiNo district of Denver, Blue Moon seeks to optimize upselling opportunities by predicting a customer’s income. A numerical prediction can be made from the same data points: gender, marital status, age, and preference for light beer. Using AutoML, I can perform this regression task efficiently and accurately, going beyond a baseline linear regression.

models_regression_predictors_r <- c("gender", "married", "age", "prefer_light")
models_regression_response_r <- "income"

models_regression_r <- h2o.automl(
    x=models_regression_predictors_r,
    y=models_regression_response_r,
    training_frame=beer_train_h2o_r,
    leaderboard_frame=beer_test_h2o_r,
    max_runtime_secs=30,
    seed=1754  #RoarLionRoar 🦁
)

Model Evaluation

AutoML Classification Models

# print(models_classification_r@leaderboard, n=nrow(models_classification_r@leaderboard))
h2o.get_leaderboard(object=models_classification_r, extra_columns="ALL")
                                                 model_id  auc logloss aucpr mean_per_class_error rmse  mse training_time_ms predict_time_per_row_ms            algo
1 StackedEnsemble_BestOfFamily_1_AutoML_7_20260113_141731 0.85    0.47  0.85                 0.23 0.39 0.16             1077                  0.0209 StackedEnsemble
2                          GLM_1_AutoML_7_20260113_141731 0.85    0.47  0.85                 0.22 0.39 0.16               26                  0.0043             GLM
3    StackedEnsemble_AllModels_1_AutoML_7_20260113_141731 0.85    0.47  0.85                 0.22 0.40 0.16             1267                  0.0186 StackedEnsemble
4                          GBM_1_AutoML_7_20260113_141731 0.85    0.49  0.84                 0.24 0.40 0.16               60                  0.0115             GBM
5                 DeepLearning_1_AutoML_7_20260113_141731 0.84    0.49  0.85                 0.25 0.40 0.16               64                  0.0065    DeepLearning
6                      XGBoost_1_AutoML_7_20260113_141731 0.84    0.49  0.84                 0.23 0.40 0.16               76                  0.0060         XGBoost

[14 rows x 10 columns] 
models_classification_predictions_r <- h2o.predict(models_classification_r, beer_test_h2o_r)
head(models_classification_predictions_r)
  predict FALSE  TRUE
1   FALSE 0.979 0.021
2   FALSE 0.848 0.152
3    TRUE 0.342 0.658
4   FALSE 0.845 0.155
5   FALSE 0.924 0.076
6    TRUE 0.082 0.918
models_classification_performance_r <- h2o.performance(models_classification_r@leader, beer_test_h2o_r)
models_classification_performance_r
H2OBinomialMetrics: stackedensemble

MSE:  0.15
RMSE:  0.39
LogLoss:  0.46
Mean Per-Class Error:  0.19
AUC:  0.87
AUCPR:  0.84
Gini:  0.74

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
       FALSE TRUE    Error     Rate
FALSE    186   59 0.240816  =59/245
TRUE      31  184 0.144186  =31/215
Totals   217  243 0.195652  =90/460

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold      value idx
1                       max f1  0.486926   0.803493 211
2                       max f2  0.281684   0.863830 275
3                 max f0point5  0.730517   0.799537 138
4                 max accuracy  0.510317   0.804348 198
5                max precision  0.996886   1.000000   0
6                   max recall  0.129530   1.000000 342
7              max specificity  0.996886   1.000000   0
8             max absolute_mcc  0.486926   0.614671 211
9   max min_per_class_accuracy  0.523071   0.800000 191
10 max mean_per_class_accuracy  0.486926   0.807499 211
11                     max tns  0.996886 245.000000   0
12                     max fns  0.996886 214.000000   0
13                     max fps  0.002299 245.000000 399
14                     max tps  0.129530 215.000000 342
15                     max tnr  0.996886   1.000000   0
16                     max fnr  0.996886   0.995349   0
17                     max fpr  0.002299   1.000000 399
18                     max tpr  0.129530   1.000000 342

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
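The headline numbers above are internally consistent and can be sanity-checked by hand: Gini = 2 * AUC - 1, and the overall error rate is the confusion-matrix miss count divided by the total (the values below are copied from the report above):

```python
# Values copied from the H2O performance report above
auc = 0.87
gini = 2 * auc - 1
print(round(gini, 2))  # 0.74, matching the reported Gini

# Confusion matrix at the F1-optimal threshold: 59 FP + 31 FN out of 460
fp, fn, total = 59, 31, 460
error_rate = (fp + fn) / total
print(round(error_rate, 6))  # 0.195652, matching the Totals row
```

Checks like these help catch copy/paste or transposition errors when moving metrics from an H2O report into a write-up.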

AutoML Regression Models

print(models_regression_r@leaderboard, n=nrow(models_regression_r@leaderboard))
                                                  model_id  auc logloss aucpr mean_per_class_error rmse  mse
1     DeepLearning_grid_1_AutoML_8_20260113_141743_model_4 0.88    0.44  0.85                 0.19 0.38 0.14
2     DeepLearning_grid_1_AutoML_8_20260113_141743_model_2 0.88    0.44  0.85                 0.20 0.38 0.14
3     DeepLearning_grid_3_AutoML_8_20260113_141743_model_3 0.87    0.45  0.85                 0.20 0.38 0.14
4                           GLM_1_AutoML_8_20260113_141743 0.87    0.45  0.85                 0.20 0.38 0.15
5     DeepLearning_grid_2_AutoML_8_20260113_141743_model_3 0.87    0.45  0.85                 0.19 0.38 0.15
6     DeepLearning_grid_3_AutoML_8_20260113_141743_model_2 0.87    0.45  0.85                 0.20 0.38 0.15
7     DeepLearning_grid_2_AutoML_8_20260113_141743_model_2 0.87    0.45  0.84                 0.19 0.38 0.15
8     DeepLearning_grid_1_AutoML_8_20260113_141743_model_5 0.87    0.47  0.85                 0.19 0.39 0.15
9     DeepLearning_grid_1_AutoML_8_20260113_141743_model_1 0.87    0.46  0.85                 0.20 0.39 0.15
10    DeepLearning_grid_1_AutoML_8_20260113_141743_model_3 0.87    0.48  0.85                 0.20 0.40 0.16
11 StackedEnsemble_BestOfFamily_4_AutoML_8_20260113_141743 0.87    0.46  0.85                 0.19 0.39 0.15
12 StackedEnsemble_BestOfFamily_2_AutoML_8_20260113_141743 0.87    0.46  0.84                 0.19 0.39 0.15
13 StackedEnsemble_BestOfFamily_1_AutoML_8_20260113_141743 0.87    0.46  0.84                 0.19 0.39 0.15
14    StackedEnsemble_AllModels_1_AutoML_8_20260113_141743 0.87    0.46  0.84                 0.19 0.39 0.15
15    DeepLearning_grid_2_AutoML_8_20260113_141743_model_1 0.87    0.48  0.84                 0.20 0.40 0.16
16    DeepLearning_grid_2_AutoML_8_20260113_141743_model_4 0.87    0.46  0.85                 0.20 0.39 0.15
17 StackedEnsemble_BestOfFamily_3_AutoML_8_20260113_141743 0.87    0.46  0.84                 0.19 0.39 0.15
18    StackedEnsemble_AllModels_2_AutoML_8_20260113_141743 0.87    0.46  0.84                 0.19 0.39 0.15
19    StackedEnsemble_AllModels_4_AutoML_8_20260113_141743 0.87    0.46  0.84                 0.19 0.39 0.15
20    StackedEnsemble_AllModels_3_AutoML_8_20260113_141743 0.87    0.47  0.84                 0.19 0.39 0.15
21    DeepLearning_grid_3_AutoML_8_20260113_141743_model_1 0.87    0.47  0.83                 0.20 0.39 0.15
22    DeepLearning_grid_1_AutoML_8_20260113_141743_model_6 0.87    0.46  0.84                 0.21 0.39 0.15
23                 DeepLearning_1_AutoML_8_20260113_141743 0.86    0.50  0.83                 0.20 0.40 0.16
24        XGBoost_grid_1_AutoML_8_20260113_141743_model_14 0.86    0.47  0.81                 0.21 0.39 0.15
25         XGBoost_grid_1_AutoML_8_20260113_141743_model_6 0.86    0.47  0.82                 0.20 0.39 0.15
26            GBM_grid_1_AutoML_8_20260113_141743_model_19 0.86    0.48  0.82                 0.21 0.40 0.16
27            GBM_grid_1_AutoML_8_20260113_141743_model_17 0.86    0.47  0.83                 0.21 0.40 0.16
28                          GBM_1_AutoML_8_20260113_141743 0.86    0.48  0.81                 0.21 0.40 0.16
29                      XGBoost_1_AutoML_8_20260113_141743 0.85    0.48  0.80                 0.20 0.40 0.16
30            GBM_grid_1_AutoML_8_20260113_141743_model_13 0.85    0.49  0.81                 0.21 0.40 0.16
31         XGBoost_grid_1_AutoML_8_20260113_141743_model_3 0.85    0.48  0.81                 0.22 0.40 0.16
32         XGBoost_grid_1_AutoML_8_20260113_141743_model_7 0.85    0.48  0.79                 0.21 0.39 0.16
33         XGBoost_grid_1_AutoML_8_20260113_141743_model_2 0.85    0.49  0.80                 0.22 0.40 0.16
34         XGBoost_grid_1_AutoML_8_20260113_141743_model_1 0.85    0.49  0.81                 0.22 0.40 0.16
35             GBM_grid_1_AutoML_8_20260113_141743_model_6 0.85    0.49  0.79                 0.22 0.40 0.16
36        XGBoost_grid_1_AutoML_8_20260113_141743_model_10 0.85    0.49  0.82                 0.22 0.40 0.16
37        XGBoost_grid_1_AutoML_8_20260113_141743_model_11 0.85    0.49  0.80                 0.23 0.40 0.16
38                      XGBoost_2_AutoML_8_20260113_141743 0.85    0.49  0.80                 0.21 0.40 0.16
39             GBM_grid_1_AutoML_8_20260113_141743_model_2 0.85    0.49  0.81                 0.23 0.40 0.16
40        XGBoost_grid_1_AutoML_8_20260113_141743_model_15 0.85    0.49  0.80                 0.21 0.40 0.16
41            GBM_grid_1_AutoML_8_20260113_141743_model_10 0.85    0.49  0.79                 0.23 0.40 0.16
42                      XGBoost_3_AutoML_8_20260113_141743 0.85    0.49  0.81                 0.22 0.40 0.16
43            GBM_grid_1_AutoML_8_20260113_141743_model_11 0.84    0.49  0.80                 0.23 0.40 0.16
44        XGBoost_grid_1_AutoML_8_20260113_141743_model_13 0.84    0.50  0.80                 0.21 0.40 0.16
45         XGBoost_grid_1_AutoML_8_20260113_141743_model_8 0.84    0.50  0.80                 0.24 0.41 0.17
46             GBM_grid_1_AutoML_8_20260113_141743_model_5 0.84    0.50  0.81                 0.22 0.40 0.16
47         XGBoost_grid_1_AutoML_8_20260113_141743_model_4 0.84    0.50  0.80                 0.25 0.40 0.16
48            GBM_grid_1_AutoML_8_20260113_141743_model_16 0.84    0.51  0.79                 0.23 0.41 0.17
49                          GBM_2_AutoML_8_20260113_141743 0.84    0.50  0.80                 0.22 0.41 0.16
50             GBM_grid_1_AutoML_8_20260113_141743_model_8 0.84    0.50  0.80                 0.25 0.41 0.17
51                          GBM_4_AutoML_8_20260113_141743 0.84    0.50  0.80                 0.22 0.41 0.17
52            GBM_grid_1_AutoML_8_20260113_141743_model_15 0.83    0.52  0.81                 0.25 0.41 0.17
53                          GBM_5_AutoML_8_20260113_141743 0.83    0.51  0.79                 0.22 0.41 0.17
54                          GBM_3_AutoML_8_20260113_141743 0.83    0.51  0.79                 0.22 0.41 0.17
55             GBM_grid_1_AutoML_8_20260113_141743_model_7 0.83    0.53  0.79                 0.25 0.42 0.18
56                          XRT_1_AutoML_8_20260113_141743 0.83    0.52  0.79                 0.23 0.41 0.17
57         XGBoost_grid_1_AutoML_8_20260113_141743_model_9 0.83    0.55  0.81                 0.26 0.42 0.18
58             GBM_grid_1_AutoML_8_20260113_141743_model_3 0.83    0.52  0.79                 0.23 0.42 0.17
59        XGBoost_grid_1_AutoML_8_20260113_141743_model_12 0.82    0.55  0.80                 0.25 0.43 0.18
60            GBM_grid_1_AutoML_8_20260113_141743_model_14 0.82    0.54  0.77                 0.25 0.42 0.18
61                          DRF_1_AutoML_8_20260113_141743 0.81    0.56  0.79                 0.26 0.43 0.18
62         XGBoost_grid_1_AutoML_8_20260113_141743_model_5 0.81    0.64  0.79                 0.25 0.44 0.20
63            GBM_grid_1_AutoML_8_20260113_141743_model_18 0.81    0.54  0.79                 0.25 0.42 0.18
64             GBM_grid_1_AutoML_8_20260113_141743_model_9 0.81    0.54  0.78                 0.29 0.43 0.18
65             GBM_grid_1_AutoML_8_20260113_141743_model_1 0.81    0.54  0.78                 0.26 0.43 0.18
66            GBM_grid_1_AutoML_8_20260113_141743_model_12 0.75    0.60  0.73                 0.35 0.45 0.21
67             GBM_grid_1_AutoML_8_20260113_141743_model_4 0.74    0.60  0.71                 0.35 0.46 0.21

[67 rows x 7 columns] 
models_regression_predictions_r <- h2o.predict(models_regression_r, beer_test_h2o_r)
head(models_regression_predictions_r)
  predict FALSE   TRUE
1   FALSE 0.997 0.0025
2   FALSE 0.928 0.0724
3    TRUE 0.443 0.5566
4   FALSE 0.963 0.0370
5   FALSE 0.962 0.0378
6    TRUE 0.093 0.9069
models_regression_performance_r <- h2o.performance(models_regression_r@leader, beer_test_h2o_r)
models_regression_performance_r
H2OBinomialMetrics: deeplearning

MSE:  0.14
RMSE:  0.38
LogLoss:  0.44
Mean Per-Class Error:  0.19
AUC:  0.88
AUCPR:  0.85
Gini:  0.75

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
       FALSE TRUE    Error     Rate
FALSE    184   61 0.248980  =61/245
TRUE      30  185 0.139535  =30/215
Totals   214  246 0.197826  =91/460

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold      value idx
1                       max f1  0.407795   0.802603 218
2                       max f2  0.218208   0.872340 280
3                 max f0point5  0.598899   0.797342 152
4                 max accuracy  0.411384   0.802174 217
5                max precision  0.998098   1.000000   0
6                   max recall  0.047626   1.000000 363
7              max specificity  0.998098   1.000000   0
8             max absolute_mcc  0.407795   0.611666 218
9   max min_per_class_accuracy  0.457299   0.795349 197
10 max mean_per_class_accuracy  0.407795   0.805743 218
11                     max tns  0.998098 245.000000   0
12                     max fns  0.998098 214.000000   0
13                     max fps  0.000199 245.000000 399
14                     max tps  0.047626 215.000000 363
15                     max tnr  0.998098   1.000000   0
16                     max fnr  0.998098   0.995349   0
17                     max fpr  0.000199   1.000000 399
18                     max tpr  0.047626   1.000000 363

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Appendix A: Environment, Language & Package Versions, and Coding Style

If you are interested in reproducing this work, here are the versions of R and Python that I used (as well as the respective packages for each). Additionally, my coding style here is verbose, in order to trace where functions/methods and variables originate, and to make this a learning experience for everyone—including me. Finally, the data visualizations are mostly (if not entirely) implemented using the Grammar of Graphics framework.

cat(
    R.version$version.string, "-", R.version$nickname,
    "\nOS:", Sys.info()["sysname"], R.version$platform,
    "\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
R version 4.2.3 (2023-03-15) - Shortstop Beagle 
OS: Darwin x86_64-apple-darwin17.0 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.5.0", repos="http://cran.us.r-project.org")
devtools::install_version("h2o", version="3.44.0.3", repos="http://cran.us.r-project.org")

library(package=dplyr)
library(package=ggplot2)
library(package=h2o)
import sys
import platform
import os
import cpuinfo
print(
    "Python", sys.version,
    "\nOS:", platform.system(), platform.platform(),
    "\nCPU:", os.cpu_count(), "x", cpuinfo.get_cpu_info()["brand_raw"]
)
Python 3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)] 
OS: Darwin macOS-10.16-x86_64-i386-64bit 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install scipy==1.11.1
!pip install h2o==3.46.0.2

import numpy
import pandas
from scipy import stats
import h2o

Appendix B: H2O.ai Initiation

# Start the H2O cluster (locally)
h2o.init()
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         40 minutes 39 seconds 
    H2O cluster timezone:       America/Denver 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.44.0.3 
    H2O cluster version age:    2 years and 23 days 
    H2O cluster name:           H2O_started_from_R_michael_uvp985 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.75 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 4.2.3 (2023-03-15) 
# Start the H2O cluster (locally)
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: java version "21.0.2" 2024-01-16 LTS; Java(TM) SE Runtime Environment (build 21.0.2+13-LTS-58); Java HotSpot(TM) 64-Bit Server VM (build 21.0.2+13-LTS-58, mixed mode, sharing)
  Starting server from /Volumes/Personal/Mami/__Netlify/hello@michaelmallari.com/www.michaelmallari.com/pythonenv/v3.11.4/lib/python3.11/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/7z/m5xjlk9n4h75sm1nn2p_bpxh0000gn/T/tmptfn12y0z
  JVM stdout: /var/folders/7z/m5xjlk9n4h75sm1nn2p_bpxh0000gn/T/tmptfn12y0z/h2o_michael_started_from_python.out
  JVM stderr: /var/folders/7z/m5xjlk9n4h75sm1nn2p_bpxh0000gn/T/tmptfn12y0z/h2o_michael_started_from_python.err
  Server is running at http://127.0.0.1:54323
Connecting to H2O server at http://127.0.0.1:54323 ... successful.
Warning: Your H2O cluster version is (1 year, 7 months and 30 days) old.  There may be a newer version available.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html
--------------------------  ------------------------------
H2O_cluster_uptime:         04 secs
H2O_cluster_timezone:       America/Denver
H2O_data_parsing_timezone:  UTC
H2O_cluster_version:        3.46.0.2
H2O_cluster_version_age:    1 year, 7 months and 30 days
H2O_cluster_name:           H2O_from_python_michael_5br5z4
H2O_cluster_total_nodes:    1
H2O_cluster_free_memory:    1.983 Gb
H2O_cluster_total_cores:    8
H2O_cluster_allowed_cores:  8
H2O_cluster_status:         locked, healthy
H2O_connection_url:         http://127.0.0.1:54323
H2O_connection_proxy:       {"http": null, "https": null}
H2O_internal_security:      False
Python_version:             3.11.4 final
--------------------------  ------------------------------

Appendix C: A Case for H2O.ai for AutoML

Advantages

  • H2O.ai provides an easy-to-use interface that automates the end-to-end data science process.
  • It has efficient AutoML capabilities, making machine learning more accessible and saving time in model development.
  • H2O.ai is scalable, accommodating various data volumes for businesses.
  • It provides a “leaderboard” of H2O models which can be easily exported for use in production.

Disadvantages

  • One downside of H2O.ai is that it can be difficult to determine who is accountable for the decisions made by automated machine-learning models. This could be a concern in regulated industries such as financial services and healthcare.

Further Readings

Art of the Possible With AI