Scaled & Efficient Supervised Learning with AutoML
Accelerating time-to-value by automating modeling tasks on beer consumer data—using H2O.ai in R and Python.
In continuation of the binary classification project hypothetically chartered by the Blue Moon Brewing Company 12 years ago, this project is a longitudinal study to re-validate the relevance of Blue Moon’s STP (Segmentation, Targeting, Positioning) marketing strategy on today’s evolving beer consumers.
The objective of this data analysis is the same: to infer whether demographic data around gender, age, marital status, and income continue to indicate a consumer preference for light beer. To achieve this, I (again, hypothetically) collected survey data from 1,500 beer consumers. Employing automated machine learning (AutoML), this project explores a broad range of alternative classification methods, beyond the baseline logistic regression applied in the prior project.
Data Understanding
For data understanding, I imported a CSV file with 1,500 records and 5 columns. These columns include gender (0 for female, 1 for male), marital status (0 for unmarried, 1 for married), income, age, and beer preference (0 for regular, 1 for light). This initial analysis is critical for identifying the dataset’s structure and preparing for subsequent data exploration and modeling.
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is essential for understanding individual variables (univariate), relationships between pairs of variables (bivariate), and interactions among multiple variables (multivariate). It reveals trends, patterns, and anomalies, informing subsequent analysis and hypothesis development.
Univariate Analysis
Univariate analysis summarizes and identifies patterns in individual variables. It informs subsequent analysis, revealing insights into distribution, central tendency, and variability. By examining one variable at a time, it detects outliers, assesses data quality, and sets the stage for more complex bivariate and multivariate analyses.
summary(beer_r)
gender married income age prefer_light
Min. :0.00 Min. :0.00 Min. :24796 Min. :21 Min. :0.0
1st Qu.:0.00 1st Qu.:0.00 1st Qu.:46279 1st Qu.:49 1st Qu.:0.0
Median :0.00 Median :0.00 Median :52306 Median :56 Median :0.5
Mean :0.43 Mean :0.45 Mean :52561 Mean :55 Mean :0.5
3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:58955 3rd Qu.:62 3rd Qu.:1.0
Max. :1.00 Max. :1.00 Max. :84031 Max. :87 Max. :1.0
# Equivalent of summary(beer_r) in Python
beer_py.describe(include="all")
gender married income age prefer_light
count 1500.000000 1500.00000 1500.000000 1500.000000 1500.000000
mean 0.430000 0.45000 52560.539444 55.352281 0.500000
std 0.495241 0.49766 9487.082677 10.221699 0.500167
min 0.000000 0.00000 24795.654470 21.000000 0.000000
25% 0.000000 0.00000 46278.626699 48.705995 0.000000
50% 0.000000 0.00000 52306.099551 55.614237 0.500000
75% 1.000000 1.00000 58954.552335 62.349463 1.000000
max 1.000000 1.00000 84030.572926 87.000000 1.000000
Bivariate Analysis
#' Create a frequency table as a data frame (tidy format)
#'
#' @param data A data frame
#' @param row_var Character string of the row variable name
#' @param col_var Character string of the column variable name
#' @param row_labels Named vector for row variable labels
#' @param col_labels Named vector for column variable labels
#' @param row_name Character string for row dimension name
#' @param col_name Character string for column dimension name
#' @return A data frame with frequencies in wide format
create_frequency_dataframe <- function(data,
row_var,
col_var,
row_labels = NULL,
col_labels = NULL,
row_name = "Row Variable",
col_name = "Column Variable") {
# Create working copy
data_copy <- data
# Apply labels if provided
if (!is.null(row_labels)) {
data_copy[[paste0(row_var, "_labeled")]] <- row_labels[as.character(data_copy[[row_var]])]
row_var_to_use <- paste0(row_var, "_labeled")
} else {
row_var_to_use <- row_var
}
if (!is.null(col_labels)) {
data_copy[[paste0(col_var, "_labeled")]] <- col_labels[as.character(data_copy[[col_var]])]
col_var_to_use <- paste0(col_var, "_labeled")
} else {
col_var_to_use <- col_var
}
# Create frequency data frame
freq_df <- data_copy %>%
dplyr::count(
.data[[row_var_to_use]],
.data[[col_var_to_use]],
name = "Frequency"
) %>%
tidyr::pivot_wider(
names_from = .data[[col_var_to_use]],
values_from = Frequency,
values_fill = 0
) %>%
dplyr::rename(!!row_name := .data[[row_var_to_use]])
return(freq_df)
}
# Frequency table on prefer_light and gender
create_frequency_dataframe(
data=beer_r,
row_var="prefer_light",
col_var="gender",
row_labels=c("0"="Regular", "1"="Light"),
col_labels=c("0"="Female", "1"="Male"),
row_name="Beer Preference",
col_name="Gender"
)
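On the Python side, pandas.crosstab produces the same kind of frequency table directly. A minimal sketch on a small synthetic stand-in for beer_py (the row/column labels are illustrative, mirroring the R helper above):

```python
import pandas as pd

# Synthetic stand-in for the 1,500-row beer_py survey frame
beer_py = pd.DataFrame({
    "gender": [0, 1, 0, 1, 0, 1],
    "prefer_light": [0, 1, 0, 0, 1, 1],
})

# Cross-tabulate beer preference against gender, with readable labels
freq = pd.crosstab(
    beer_py["prefer_light"].map({0: "Regular", 1: "Light"}),
    beer_py["gender"].map({0: "Female", 1: "Male"}),
    rownames=["Beer Preference"],
    colnames=["Gender"],
)
print(freq)
```

Unlike the tidyverse helper, no reshaping step is needed: crosstab returns the wide-format contingency table in one call.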
Multivariate Analysis
Multivariate analysis plays a pivotal role in understanding intricate relationships within datasets. By exploring patterns, identifying outliers, and revealing underlying structure, it provides valuable insights. One essential tool is the correlation matrix, which quantifies the strength and direction of relationships between variables: positive values indicate direct associations, while negative values imply inverse relationships. Insights from the correlation matrix inform feature selection and hypothesis testing, and make detecting multicollinearity and identifying potential predictors more effective.
correlation_matrix_r <- round(cor(beer_r), 2)
head(correlation_matrix_r[5:1, 1:5])
gender married income age prefer_light
prefer_light -0.09 0.05 0.40 -0.41 1.00
age 0.19 0.22 0.11 1.00 -0.41
income 0.03 0.33 1.00 0.11 0.40
married -0.04 1.00 0.33 0.22 0.05
gender 1.00 -0.04 0.03 0.19 -0.09
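The Python equivalent is a single pandas call. A minimal sketch on a synthetic stand-in for beer_py (so the coefficients here will not match the 1,500-row matrix above):

```python
import pandas as pd

# Synthetic stand-in for beer_py; the real frame has 1,500 rows
beer_py = pd.DataFrame({
    "gender": [0, 1, 0, 1, 0, 1],
    "married": [0, 0, 1, 1, 0, 1],
    "income": [35885, 62118, 43483, 67201, 38079, 54339],
    "age": [48, 62, 61, 85, 54, 38],
    "prefer_light": [0, 1, 0, 0, 0, 1],
})

# Pairwise Pearson correlations, rounded to 2 decimals as in the R output
correlation_matrix_py = beer_py.corr().round(2)
print(correlation_matrix_py)
```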
Data Preparation
Data Frame Conversion to H2O Data Frame
beer_r$gender <- factor(beer_r$gender, levels=c(0, 1), labels=c("Female", "Male"))
beer_r$married <- as.logical(beer_r$married)
beer_r$prefer_light <- as.logical(beer_r$prefer_light)
local_h2o <- h2o.init()
beer_h2o_r <- as.h2o(beer_r)
dim(beer_h2o_r)
[1] 1500 5
head(beer_h2o_r)
gender married income age prefer_light
1 Male TRUE 35885 48 FALSE
2 Female FALSE 37737 66 FALSE
3 Female FALSE 26388 62 FALSE
4 Female TRUE 43483 61 FALSE
5 Female FALSE 38079 54 FALSE
6 Female FALSE 44328 41 TRUE
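For the Python workflow, a comparable preparation step (a minimal sketch; the tiny inline frame is a synthetic stand-in for beer_py) maps the coded columns to readable categories and booleans before handing the frame to H2O:

```python
import pandas as pd

# Synthetic stand-in for beer_py
beer_py = pd.DataFrame({
    "gender": [1, 0, 0],
    "married": [1, 0, 1],
    "income": [35885, 37737, 43483],
    "age": [48, 66, 61],
    "prefer_light": [0, 0, 1],
})

# Mirror the R conversions: labeled factor for gender, logicals for the rest
beer_py["gender"] = beer_py["gender"].map({0: "Female", 1: "Male"}).astype("category")
beer_py["married"] = beer_py["married"].astype(bool)
beer_py["prefer_light"] = beer_py["prefer_light"].astype(bool)
print(beer_py.dtypes)
```

The H2O frame would then be created with `h2o.H2OFrame(beer_py)` after `h2o.init()`; H2O infers enum types from the categorical/string columns, mirroring the factor conversion on the R side.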
Train & Test Data Splitting
beer_splits_h2o_r <- h2o.splitFrame(data=beer_h2o_r, ratios=0.7, seed=1754) #RoarLionRoar 🦁
beer_train_h2o_r <- beer_splits_h2o_r[[1]]
beer_test_h2o_r <- beer_splits_h2o_r[[2]]
dim(beer_train_h2o_r)
[1] 1040 5
head(beer_train_h2o_r)
gender married income age prefer_light
1 Male TRUE 35885 48 FALSE
2 Female FALSE 37737 66 FALSE
3 Female TRUE 43483 61 FALSE
4 Female FALSE 44328 41 TRUE
5 Female TRUE 40865 64 FALSE
6 Female FALSE 54499 45 TRUE
dim(beer_test_h2o_r)
[1] 460 5
head(beer_test_h2o_r)
gender married income age prefer_light
1 Female FALSE 26388 62 FALSE
2 Female FALSE 38079 54 FALSE
3 Male TRUE 62118 62 TRUE
4 Male TRUE 67201 85 FALSE
5 Female FALSE 40382 62 FALSE
6 Male TRUE 54339 38 TRUE
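The Python H2O API exposes the same operation through the H2OFrame method `split_frame(ratios=[0.7], seed=1754)`. The sketch below illustrates the underlying mechanics with plain numpy on a hypothetical stand-in frame: each row is independently assigned to the training set with probability 0.7, which is why the split is approximately, not exactly, 70/30 (here, 1,040 and 460 of 1,500):

```python
import numpy as np
import pandas as pd

beer_py = pd.DataFrame({"income": range(1500)})  # stand-in frame with 1,500 rows

# Per-row Bernoulli(0.7) assignment, as h2o.splitFrame does internally
rng = np.random.default_rng(1754)
assign = rng.random(len(beer_py)) < 0.7   # True for ~70% of rows
beer_train_py = beer_py[assign]
beer_test_py = beer_py[~assign]
print(len(beer_train_py), len(beer_test_py))
```

Note that numpy's seeding is unrelated to H2O's, so this reproduces the idea of the split, not the exact row membership above.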
Data Modeling
AutoML Classification Models: Training
models_classification_predictors_r <- c("gender", "married", "income", "age")
models_classification_response_r <- "prefer_light"
models_classification_r <- h2o.automl(
x=models_classification_predictors_r,
y=models_classification_response_r,
training_frame=beer_train_h2o_r,
max_models=12,
seed=1754 #RoarLionRoar 🦁
)
AutoML Regression Models: Training
Suppose, within its brewery & restaurant in the RiNo district of Denver, Blue Moon seeks to optimize upselling opportunities by predicting income range. A numerical prediction can be made using the same data points: gender, marital status, age range, and preference for light beer. Using AutoML, I can perform this regression task efficiently and accurately, beyond simply using a baseline linear regression.
models_regression_predictors_r <- c("gender", "married", "age", "prefer_light")
models_regression_response_r <- "income"
models_regression_r <- h2o.automl(
x=models_regression_predictors_r,
y=models_regression_response_r,
training_frame=beer_train_h2o_r,
leaderboard_frame=beer_test_h2o_r,
max_runtime_secs=30,
seed=1754 #RoarLionRoar 🦁
)
Model Evaluation
AutoML Classification Models
# print(models_classification_r@leaderboard, n=nrow(models_classification_r@leaderboard))
h2o.get_leaderboard(object=models_classification_r, extra_columns="ALL")
model_id auc logloss aucpr mean_per_class_error rmse mse training_time_ms predict_time_per_row_ms algo
1 StackedEnsemble_BestOfFamily_1_AutoML_7_20260113_141731 0.85 0.47 0.85 0.23 0.39 0.16 1077 0.0209 StackedEnsemble
2 GLM_1_AutoML_7_20260113_141731 0.85 0.47 0.85 0.22 0.39 0.16 26 0.0043 GLM
3 StackedEnsemble_AllModels_1_AutoML_7_20260113_141731 0.85 0.47 0.85 0.22 0.40 0.16 1267 0.0186 StackedEnsemble
4 GBM_1_AutoML_7_20260113_141731 0.85 0.49 0.84 0.24 0.40 0.16 60 0.0115 GBM
5 DeepLearning_1_AutoML_7_20260113_141731 0.84 0.49 0.85 0.25 0.40 0.16 64 0.0065 DeepLearning
6 XGBoost_1_AutoML_7_20260113_141731 0.84 0.49 0.84 0.23 0.40 0.16 76 0.0060 XGBoost
[14 rows x 10 columns]
models_classification_predictions_r <- h2o.predict(models_classification_r, beer_test_h2o_r)
head(models_classification_predictions_r)
predict FALSE TRUE
1 FALSE 0.979 0.021
2 FALSE 0.848 0.152
3 TRUE 0.342 0.658
4 FALSE 0.845 0.155
5 FALSE 0.924 0.076
6 TRUE 0.082 0.918
models_classification_performance_r <- h2o.performance(models_classification_r@leader, beer_test_h2o_r)
models_classification_performance_r
H2OBinomialMetrics: stackedensemble
MSE: 0.15
RMSE: 0.39
LogLoss: 0.46
Mean Per-Class Error: 0.19
AUC: 0.87
AUCPR: 0.84
Gini: 0.74
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
FALSE TRUE Error Rate
FALSE 186 59 0.240816 =59/245
TRUE 31 184 0.144186 =31/215
Totals 217 243 0.195652 =90/460
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.486926 0.803493 211
2 max f2 0.281684 0.863830 275
3 max f0point5 0.730517 0.799537 138
4 max accuracy 0.510317 0.804348 198
5 max precision 0.996886 1.000000 0
6 max recall 0.129530 1.000000 342
7 max specificity 0.996886 1.000000 0
8 max absolute_mcc 0.486926 0.614671 211
9 max min_per_class_accuracy 0.523071 0.800000 191
10 max mean_per_class_accuracy 0.486926 0.807499 211
11 max tns 0.996886 245.000000 0
12 max fns 0.996886 214.000000 0
13 max fps 0.002299 245.000000 399
14 max tps 0.129530 215.000000 342
15 max tnr 0.996886 1.000000 0
16 max fnr 0.996886 0.995349 0
17 max fpr 0.002299 1.000000 399
18 max tpr 0.129530 1.000000 342
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
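Before moving on, the leader's headline metrics can be sanity-checked by hand from the confusion matrix above; a quick Python back-of-the-envelope (counts copied from the F1-optimal-threshold matrix):

```python
# Counts from the confusion matrix (vertical: actual; across: predicted)
tn, fp = 186, 59   # actual FALSE: correctly vs. incorrectly predicted
fn, tp = 31, 184   # actual TRUE: incorrectly vs. correctly predicted

total = tn + fp + fn + tp              # 460 test rows
accuracy = (tp + tn) / total           # matches "max accuracy" ≈ 0.8043
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # matches "max f1" ≈ 0.8035
print(round(accuracy, 4), round(f1, 4))
```

Recovering the reported maximum-metric values from the raw counts confirms the confusion matrix and the metrics table describe the same threshold.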
AutoML Regression Models
print(models_regression_r@leaderboard, n=nrow(models_regression_r@leaderboard))
model_id auc logloss aucpr mean_per_class_error rmse mse
1 DeepLearning_grid_1_AutoML_8_20260113_141743_model_4 0.88 0.44 0.85 0.19 0.38 0.14
2 DeepLearning_grid_1_AutoML_8_20260113_141743_model_2 0.88 0.44 0.85 0.20 0.38 0.14
3 DeepLearning_grid_3_AutoML_8_20260113_141743_model_3 0.87 0.45 0.85 0.20 0.38 0.14
4 GLM_1_AutoML_8_20260113_141743 0.87 0.45 0.85 0.20 0.38 0.15
5 DeepLearning_grid_2_AutoML_8_20260113_141743_model_3 0.87 0.45 0.85 0.19 0.38 0.15
6 DeepLearning_grid_3_AutoML_8_20260113_141743_model_2 0.87 0.45 0.85 0.20 0.38 0.15
7 DeepLearning_grid_2_AutoML_8_20260113_141743_model_2 0.87 0.45 0.84 0.19 0.38 0.15
8 DeepLearning_grid_1_AutoML_8_20260113_141743_model_5 0.87 0.47 0.85 0.19 0.39 0.15
9 DeepLearning_grid_1_AutoML_8_20260113_141743_model_1 0.87 0.46 0.85 0.20 0.39 0.15
10 DeepLearning_grid_1_AutoML_8_20260113_141743_model_3 0.87 0.48 0.85 0.20 0.40 0.16
11 StackedEnsemble_BestOfFamily_4_AutoML_8_20260113_141743 0.87 0.46 0.85 0.19 0.39 0.15
12 StackedEnsemble_BestOfFamily_2_AutoML_8_20260113_141743 0.87 0.46 0.84 0.19 0.39 0.15
13 StackedEnsemble_BestOfFamily_1_AutoML_8_20260113_141743 0.87 0.46 0.84 0.19 0.39 0.15
14 StackedEnsemble_AllModels_1_AutoML_8_20260113_141743 0.87 0.46 0.84 0.19 0.39 0.15
15 DeepLearning_grid_2_AutoML_8_20260113_141743_model_1 0.87 0.48 0.84 0.20 0.40 0.16
16 DeepLearning_grid_2_AutoML_8_20260113_141743_model_4 0.87 0.46 0.85 0.20 0.39 0.15
17 StackedEnsemble_BestOfFamily_3_AutoML_8_20260113_141743 0.87 0.46 0.84 0.19 0.39 0.15
18 StackedEnsemble_AllModels_2_AutoML_8_20260113_141743 0.87 0.46 0.84 0.19 0.39 0.15
19 StackedEnsemble_AllModels_4_AutoML_8_20260113_141743 0.87 0.46 0.84 0.19 0.39 0.15
20 StackedEnsemble_AllModels_3_AutoML_8_20260113_141743 0.87 0.47 0.84 0.19 0.39 0.15
21 DeepLearning_grid_3_AutoML_8_20260113_141743_model_1 0.87 0.47 0.83 0.20 0.39 0.15
22 DeepLearning_grid_1_AutoML_8_20260113_141743_model_6 0.87 0.46 0.84 0.21 0.39 0.15
23 DeepLearning_1_AutoML_8_20260113_141743 0.86 0.50 0.83 0.20 0.40 0.16
24 XGBoost_grid_1_AutoML_8_20260113_141743_model_14 0.86 0.47 0.81 0.21 0.39 0.15
25 XGBoost_grid_1_AutoML_8_20260113_141743_model_6 0.86 0.47 0.82 0.20 0.39 0.15
26 GBM_grid_1_AutoML_8_20260113_141743_model_19 0.86 0.48 0.82 0.21 0.40 0.16
27 GBM_grid_1_AutoML_8_20260113_141743_model_17 0.86 0.47 0.83 0.21 0.40 0.16
28 GBM_1_AutoML_8_20260113_141743 0.86 0.48 0.81 0.21 0.40 0.16
29 XGBoost_1_AutoML_8_20260113_141743 0.85 0.48 0.80 0.20 0.40 0.16
30 GBM_grid_1_AutoML_8_20260113_141743_model_13 0.85 0.49 0.81 0.21 0.40 0.16
31 XGBoost_grid_1_AutoML_8_20260113_141743_model_3 0.85 0.48 0.81 0.22 0.40 0.16
32 XGBoost_grid_1_AutoML_8_20260113_141743_model_7 0.85 0.48 0.79 0.21 0.39 0.16
33 XGBoost_grid_1_AutoML_8_20260113_141743_model_2 0.85 0.49 0.80 0.22 0.40 0.16
34 XGBoost_grid_1_AutoML_8_20260113_141743_model_1 0.85 0.49 0.81 0.22 0.40 0.16
35 GBM_grid_1_AutoML_8_20260113_141743_model_6 0.85 0.49 0.79 0.22 0.40 0.16
36 XGBoost_grid_1_AutoML_8_20260113_141743_model_10 0.85 0.49 0.82 0.22 0.40 0.16
37 XGBoost_grid_1_AutoML_8_20260113_141743_model_11 0.85 0.49 0.80 0.23 0.40 0.16
38 XGBoost_2_AutoML_8_20260113_141743 0.85 0.49 0.80 0.21 0.40 0.16
39 GBM_grid_1_AutoML_8_20260113_141743_model_2 0.85 0.49 0.81 0.23 0.40 0.16
40 XGBoost_grid_1_AutoML_8_20260113_141743_model_15 0.85 0.49 0.80 0.21 0.40 0.16
41 GBM_grid_1_AutoML_8_20260113_141743_model_10 0.85 0.49 0.79 0.23 0.40 0.16
42 XGBoost_3_AutoML_8_20260113_141743 0.85 0.49 0.81 0.22 0.40 0.16
43 GBM_grid_1_AutoML_8_20260113_141743_model_11 0.84 0.49 0.80 0.23 0.40 0.16
44 XGBoost_grid_1_AutoML_8_20260113_141743_model_13 0.84 0.50 0.80 0.21 0.40 0.16
45 XGBoost_grid_1_AutoML_8_20260113_141743_model_8 0.84 0.50 0.80 0.24 0.41 0.17
46 GBM_grid_1_AutoML_8_20260113_141743_model_5 0.84 0.50 0.81 0.22 0.40 0.16
47 XGBoost_grid_1_AutoML_8_20260113_141743_model_4 0.84 0.50 0.80 0.25 0.40 0.16
48 GBM_grid_1_AutoML_8_20260113_141743_model_16 0.84 0.51 0.79 0.23 0.41 0.17
49 GBM_2_AutoML_8_20260113_141743 0.84 0.50 0.80 0.22 0.41 0.16
50 GBM_grid_1_AutoML_8_20260113_141743_model_8 0.84 0.50 0.80 0.25 0.41 0.17
51 GBM_4_AutoML_8_20260113_141743 0.84 0.50 0.80 0.22 0.41 0.17
52 GBM_grid_1_AutoML_8_20260113_141743_model_15 0.83 0.52 0.81 0.25 0.41 0.17
53 GBM_5_AutoML_8_20260113_141743 0.83 0.51 0.79 0.22 0.41 0.17
54 GBM_3_AutoML_8_20260113_141743 0.83 0.51 0.79 0.22 0.41 0.17
55 GBM_grid_1_AutoML_8_20260113_141743_model_7 0.83 0.53 0.79 0.25 0.42 0.18
56 XRT_1_AutoML_8_20260113_141743 0.83 0.52 0.79 0.23 0.41 0.17
57 XGBoost_grid_1_AutoML_8_20260113_141743_model_9 0.83 0.55 0.81 0.26 0.42 0.18
58 GBM_grid_1_AutoML_8_20260113_141743_model_3 0.83 0.52 0.79 0.23 0.42 0.17
59 XGBoost_grid_1_AutoML_8_20260113_141743_model_12 0.82 0.55 0.80 0.25 0.43 0.18
60 GBM_grid_1_AutoML_8_20260113_141743_model_14 0.82 0.54 0.77 0.25 0.42 0.18
61 DRF_1_AutoML_8_20260113_141743 0.81 0.56 0.79 0.26 0.43 0.18
62 XGBoost_grid_1_AutoML_8_20260113_141743_model_5 0.81 0.64 0.79 0.25 0.44 0.20
63 GBM_grid_1_AutoML_8_20260113_141743_model_18 0.81 0.54 0.79 0.25 0.42 0.18
64 GBM_grid_1_AutoML_8_20260113_141743_model_9 0.81 0.54 0.78 0.29 0.43 0.18
65 GBM_grid_1_AutoML_8_20260113_141743_model_1 0.81 0.54 0.78 0.26 0.43 0.18
66 GBM_grid_1_AutoML_8_20260113_141743_model_12 0.75 0.60 0.73 0.35 0.45 0.21
67 GBM_grid_1_AutoML_8_20260113_141743_model_4 0.74 0.60 0.71 0.35 0.46 0.21
[67 rows x 7 columns]
models_regression_predictions_r <- h2o.predict(models_regression_r, beer_test_h2o_r)
head(models_regression_predictions_r)
predict FALSE TRUE
1 FALSE 0.997 0.0025
2 FALSE 0.928 0.0724
3 TRUE 0.443 0.5566
4 FALSE 0.963 0.0370
5 FALSE 0.962 0.0378
6 TRUE 0.093 0.9069
models_regression_performance_r <- h2o.performance(models_regression_r@leader, beer_test_h2o_r)
models_regression_performance_r
H2OBinomialMetrics: deeplearning
MSE: 0.14
RMSE: 0.38
LogLoss: 0.44
Mean Per-Class Error: 0.19
AUC: 0.88
AUCPR: 0.85
Gini: 0.75
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
FALSE TRUE Error Rate
FALSE 184 61 0.248980 =61/245
TRUE 30 185 0.139535 =30/215
Totals 214 246 0.197826 =91/460
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.407795 0.802603 218
2 max f2 0.218208 0.872340 280
3 max f0point5 0.598899 0.797342 152
4 max accuracy 0.411384 0.802174 217
5 max precision 0.998098 1.000000 0
6 max recall 0.047626 1.000000 363
7 max specificity 0.998098 1.000000 0
8 max absolute_mcc 0.407795 0.611666 218
9 max min_per_class_accuracy 0.457299 0.795349 197
10 max mean_per_class_accuracy 0.407795 0.805743 218
11 max tns 0.998098 245.000000 0
12 max fns 0.998098 214.000000 0
13 max fps 0.000199 245.000000 399
14 max tps 0.047626 215.000000 363
15 max tnr 0.998098 1.000000 0
16 max fnr 0.998098 0.995349 0
17 max fpr 0.000199 1.000000 399
18 max tpr 0.047626 1.000000 363
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Appendix A: Environment, Language & Package Versions, and Coding Style
If you are interested in reproducing this work, here are the versions of R and Python that I used (as well as the respective packages for each). Additionally, my coding style here is verbose, in order to trace where functions/methods and variables originate, and to make this a learning experience for everyone—including me. Finally, the data visualizations are mostly (if not entirely) implemented using the Grammar of Graphics framework.
cat(
R.version$version.string, "-", R.version$nickname,
"\nOS:", Sys.info()["sysname"], R.version$platform,
"\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
R version 4.2.3 (2023-03-15) - Shortstop Beagle
OS: Darwin x86_64-apple-darwin17.0
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.5.0", repos="http://cran.us.r-project.org")
devtools::install_version("h2o", version="3.44.0.3", repos="http://cran.us.r-project.org")
library(package=dplyr)
library(package=ggplot2)
library(package=h2o)
import sys
import platform
import os
import cpuinfo
print(
"Python", sys.version,
"\nOS:", platform.system(), platform.platform(),
"\nCPU:", os.cpu_count(), "x", cpuinfo.get_cpu_info()["brand_raw"]
)
Python 3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
OS: Darwin macOS-10.16-x86_64-i386-64bit
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install scipy==1.11.1
!pip install h2o==3.46.0.2
import numpy
import pandas
from scipy import stats
import h2o
Appendix B: H2O.ai Initiation
# Start the H2O cluster (locally)
h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 40 minutes 39 seconds
H2O cluster timezone: America/Denver
H2O data parsing timezone: UTC
H2O cluster version: 3.44.0.3
H2O cluster version age: 2 years and 23 days
H2O cluster name: H2O_started_from_R_michael_uvp985
H2O cluster total nodes: 1
H2O cluster total memory: 1.75 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 4.2.3 (2023-03-15)
# Start the H2O cluster (locally)
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321. Not found.
Attempting to start a local H2O server...
Java Version: java version "21.0.2" 2024-01-16 LTS; Java(TM) SE Runtime Environment (build 21.0.2+13-LTS-58); Java HotSpot(TM) 64-Bit Server VM (build 21.0.2+13-LTS-58, mixed mode, sharing)
Starting server from /Volumes/Personal/Mami/__Netlify/hello@michaelmallari.com/www.michaelmallari.com/pythonenv/v3.11.4/lib/python3.11/site-packages/h2o/backend/bin/h2o.jar
Ice root: /var/folders/7z/m5xjlk9n4h75sm1nn2p_bpxh0000gn/T/tmptfn12y0z
JVM stdout: /var/folders/7z/m5xjlk9n4h75sm1nn2p_bpxh0000gn/T/tmptfn12y0z/h2o_michael_started_from_python.out
JVM stderr: /var/folders/7z/m5xjlk9n4h75sm1nn2p_bpxh0000gn/T/tmptfn12y0z/h2o_michael_started_from_python.err
Server is running at http://127.0.0.1:54323
Connecting to H2O server at http://127.0.0.1:54323 ... successful.
Warning: Your H2O cluster version is (1 year, 7 months and 30 days) old. There may be a newer version available.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html
-------------------------- ------------------------------
H2O_cluster_uptime: 04 secs
H2O_cluster_timezone: America/Denver
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.46.0.2
H2O_cluster_version_age: 1 year, 7 months and 30 days
H2O_cluster_name: H2O_from_python_michael_5br5z4
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 1.983 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8
H2O_cluster_status: locked, healthy
H2O_connection_url: http://127.0.0.1:54323
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.11.4 final
-------------------------- ------------------------------
Appendix C: A Case for H2O.ai for AutoML
Advantages
- H2O.ai provides an easy-to-use interface that automates the end-to-end data science process.
- It has efficient AutoML capabilities, making machine learning more accessible and saving time in model development.
- H2O.ai is scalable, accommodating various data volumes for businesses.
- It provides a “leaderboard” of H2O models which can be easily exported for use in production.
Disadvantages
- One downside of H2O.ai is that it can be difficult to determine who is accountable for the decisions made by automated machine learning models. This could be a concern in regulated industries like financial services or healthcare.
Further Readings
- H2O AutoML: Automatic Machine Learning. (n.d.). H2O 3.46.0.2 documentation. https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html