Dish It: How Data Mining Serves Smarter Recommendations

Saturday, November 15, 2008

Updated: Saturday, November 18, 2023
The original post has been migrated from a sunset blog host (http://glue.umd.edu/~mmallari/), ported from XLMiner to R, Python, and Julia, and refreshed with newer datasets and references. Image Credit: https://unsplash.com/photos/person-taking-picture-of-the-foods-i_xVfNtQjwI

As the ubiquity of smartphones begins to revolutionize how we interact with the world, Help!, a fictional social app startup, has emerged. Help! is designed to be the go-to platform for discovering NYC restaurants and their hidden culinary gems. At its core, Help! aims to combine crowd-sourced reviews with a hint of social networking, all while providing users with data-driven insights about where to eat.

However, as competition in the restaurant discovery space grows fierce, Help! is facing a significant challenge: users are overwhelmed by choice and noise. Despite an abundance of reviews and ratings, users evidently crave recommendations tailored to their preferences, a feature that could transform Help! into the trusted advisor users turn to for their dining decisions.

But something even more pressing has come to light: the impact of restaurant sanitary grades on user trust and behavior. Public health data reveal that diners are increasingly avoiding restaurants with lower sanitary grades (e.g., “C”), regardless of how good the food is. Moreover, emerging user feedback highlights a desire for transparency about restaurants’ cuisines and health standards. Users aren’t just looking for a place to eat; they are looking for a safe and satisfying experience.

Ignoring Health Data Puts User Confidence at Risk

Without acting on this insight, Help! risks irrelevance in the fiercely competitive food app landscape. Users could lose confidence in the app’s value if they feel it isn’t addressing their core concerns about health and safety. Competitors could seize the initiative and build features around data-driven insights, leaving Help! in the dust.

If this project fails to materialize:

Diners might continue to eat at low-graded restaurants unknowingly, sparking reputational harm for the app.
Help! would miss the chance to differentiate itself as the platform that goes beyond surface-level reviews and prioritizes informed decision-making for users.
The startup might burn its marketing budget on vague recommendations, leading to wasted effort and dwindling user retention.

In contrast, embracing association rule mining could turn raw data into real value, enabling personalized recommendations that align with user trust, safety, and satisfaction.

How might unsupervised learning—or more specifically, association rules—power valuable recommendation systems? For example:

Are certain cuisines more likely to have higher sanitary grades?
Do neighborhoods play a role in the relationship between cuisine and health compliance?

Ignoring this opportunity means risking user trust, engagement, and market share.
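To make the two questions above concrete, here is a minimal Python sketch of how a single association rule, such as {cuisine=Thai} → {grade=A}, could be scored. Support is the share of all restaurants that match both sides of the rule, confidence is the share of antecedent matches that also match the consequent, and lift is confidence divided by the consequent's overall share. The records, the item labels, and the `to_transaction` and `rule_metrics` helpers are all hypothetical and made up for illustration; they are not the inspection data or the code used in this post.

```python
# Hypothetical toy records: (cuisine, borough, grade). Not real inspection data.
RECORDS = [
    ("Chinese", "Brooklyn", "A"),
    ("Chinese", "Brooklyn", "B"),
    ("Chinese", "Queens", "A"),
    ("Pizza", "Brooklyn", "A"),
    ("Pizza", "Manhattan", "A"),
    ("Pizza", "Manhattan", "C"),
    ("Thai", "Queens", "A"),
    ("Thai", "Manhattan", "A"),
]


def to_transaction(record):
    """Turn one record into a set of 'attribute=value' items."""
    cuisine, borough, grade = record
    return frozenset(
        {f"cuisine={cuisine}", f"borough={borough}", f"grade={grade}"}
    )


def rule_metrics(transactions, antecedent, consequent):
    """Score the rule antecedent -> consequent with support, confidence, lift."""
    antecedent = frozenset(antecedent)
    consequent = frozenset(consequent)
    total = len(transactions)

    # Count transactions containing each side, and both sides together.
    count_antecedent = sum(1 for t in transactions if antecedent <= t)
    count_consequent = sum(1 for t in transactions if consequent <= t)
    count_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = count_both / total if total else 0.0
    confidence = count_both / count_antecedent if count_antecedent else 0.0
    consequent_share = count_consequent / total if total else 0.0
    lift = confidence / consequent_share if consequent_share else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


TRANSACTIONS = [to_transaction(record) for record in RECORDS]

if __name__ == "__main__":
    # Question 1: is a cuisine associated with an A grade?
    print(rule_metrics(TRANSACTIONS, {"cuisine=Thai"}, {"grade=A"}))
    # Question 2: does adding the borough change the picture?
    print(
        rule_metrics(
            TRANSACTIONS,
            {"cuisine=Chinese", "borough=Brooklyn"},
            {"grade=A"},
        )
    )
```

A lift above 1 suggests that the antecedent and an A grade co-occur more often than chance would predict, and a lift below 1 suggests the opposite. On a toy sample this small, the numbers only illustrate the arithmetic.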

Leverage Data to Prioritize Health-Conscious Dining

The primary objective of this project is clear: to use association rule mining to uncover actionable relationships between cuisine type, location, and sanitary grades. These insights will then power Help!’s new recommendation engine, one that prioritizes health-conscious dining while providing users with deeper, data-driven transparency. This project will:

Understand User Behavior: Identify patterns in restaurant sanitary grades that influence user trust.
Enhance User Experience: Integrate insights into app features, like customizable filters (e.g., “Show only A-grade Chinese restaurants in Brooklyn”).
Strengthen Competitive Positioning: Position “Help!” as the safest and smartest choice for NYC food discovery in 2008.
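
As a rough illustration of the customizable filter mentioned above (“Show only A-grade Chinese restaurants in Brooklyn”), the sketch below filters a small in-memory list. The `Restaurant` class, the `filter_restaurants` function, and the sample restaurant names are hypothetical placeholders, not part of any real Help! codebase or dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Restaurant:
    """Hypothetical record shape, assumed for illustration only."""
    name: str
    cuisine: str
    borough: str
    grade: str


# Made-up sample entries; none of these are real inspection results.
SAMPLE = [
    Restaurant("Golden Wok", "Chinese", "Brooklyn", "A"),
    Restaurant("Lucky Noodle", "Chinese", "Brooklyn", "B"),
    Restaurant("Jade Garden", "Chinese", "Queens", "A"),
    Restaurant("Slice House", "Pizza", "Brooklyn", "A"),
]


def filter_restaurants(restaurants, cuisine=None, borough=None, grades=("A",)):
    """Keep restaurants matching every filter that is not None.

    Cuisine and borough are compared case-insensitively; `grades` is a
    collection of accepted sanitary grades (defaults to A only).
    """
    matches = []
    for restaurant in restaurants:
        if cuisine is not None and restaurant.cuisine.lower() != cuisine.lower():
            continue
        if borough is not None and restaurant.borough.lower() != borough.lower():
            continue
        if grades is not None and restaurant.grade not in grades:
            continue
        matches.append(restaurant)
    return matches


if __name__ == "__main__":
    for hit in filter_restaurants(SAMPLE, cuisine="Chinese", borough="Brooklyn"):
        print(hit.name, hit.grade)
```

Defaulting `grades` to A only reflects the health-conscious goal stated above; passing `grades=None` switches the grade filter off for users who want to see everything.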

This isn’t just about mining patterns; it’s about making data work for the people—by ensuring every dining choice feels like a safe, informed, and delightful decision.

Now that we have context on this user experience (UX) problem …

♪ Let’s get technical, technical. I wanna get technical. ♪

Appendix A: Environment, Reproducibility, and Coding Style

If you are interested in reproducing this work, here are the versions of R, Python, and Julia, and the respective packages, that I used. Additionally, Leland Wilkinson’s approach to data visualization (Grammar of Graphics) has been adopted for this work. Finally, my coding style here is deliberately verbose, so that it is easy to trace where functions/methods and variables originate, and to make this a learning experience for everyone, including me.

R

cat(
    R.version$version.string, "-", R.version$nickname,
    "\nOS:", Sys.info()["sysname"], R.version$platform,
    "\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)

R version 4.2.3 (2023-03-15) - Shortstop Beagle 
OS: Darwin x86_64-apple-darwin17.0 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz

# Pin package versions so the analysis can be reproduced later.
library(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.5.0", repos="http://cran.us.r-project.org")
# CRAN writes caret's version with a hyphen (6.0-94).
devtools::install_version("caret", version="6.0-94", repos="http://cran.us.r-project.org")

library(dplyr)
library(ggplot2)
library(caret)

Python

import sys
print(sys.version)

3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]

!pip install pandas==2.0.3
!pip install plotnine==0.12.1
!pip install scikit-learn==1.3.0

import random
import datetime
import pandas
import plotnine
import sklearn
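
Since pinned versions only help if they are actually the ones installed, a small check like the one below can confirm that the environment matches the pins. This is a sketch that assumes the three pins listed above; the `PINS` dictionary and the `check_pins` helper are names invented here, and the only library call used is the standard library's `importlib.metadata.version`.

```python
from importlib import metadata

# Pins copied from the pip commands above.
PINS = {
    "pandas": "2.0.3",
    "plotnine": "0.12.1",
    "scikit-learn": "1.3.0",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None, matches)} for each pin."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            installed = None
        report[name] = (expected, installed, installed == expected)
    return report


if __name__ == "__main__":
    for name, (expected, installed, matches) in check_pins(PINS).items():
        status = "OK" if matches else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```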

Julia

using InteractiveUtils
InteractiveUtils.versioninfo()

Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home/lib/server

using Pkg
Pkg.add(name="HTTP", version="1.10.2")
Pkg.add(name="CSV", version="0.10.13")
Pkg.add(name="DataFrames", version="1.6.1")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="StatsBase", version="0.34.2")
Pkg.add(name="Gadfly", version="1.4.0")

using HTTP
using CSV
using DataFrames
using CategoricalArrays
using StatsBase
using Gadfly

Further Readings