Data-Informed Thinking + Doing Via Supervised Learning

Don’t Wing It: Predicting Sales with Regression Trees

Identifying success factors, and making smarter, predictive decisions—with examples of regression trees in R, Python, and Julia.

Pluckeye’s Cajun Chicken, a fictional fast-food chain famous for its crispy fried chicken with spicy Cajun flavors, finds itself navigating uncharted territory. With over 500 locations across the U.S., the restaurant has built its reputation on delivering bold flavors and Southern hospitality. However, as the U.S. economy begins to show cracks following the housing market collapse, consumers are tightening their wallets. Rising food costs, declining consumer spending, and an increasingly cautious customer base are threatening the chain’s reliable revenue streams.

Every location faces its own unique challenges. A busy urban store surrounded by office workers is thriving, while a suburban outlet, only a short drive away, struggles to keep the fryer running during peak hours. Leadership is left asking:

  • Why the stark differences?
  • Is it local demographics, regional crime rates, or something else entirely?

Without clear answers, Pluckeye’s is at risk of losing its competitive edge at the worst possible time.

Pinpointing Risks to Prevent Costly Closures and Missed Opportunities

For Pluckeye’s, the stakes couldn’t be higher. Poor decision-making during this volatile time could have devastating consequences:

  • Closing Viable Locations: Struggling stores with untapped potential could be closed prematurely.
  • Investing in Underperforming Areas: Expansion or marketing efforts in the wrong regions could drain critical resources.
  • Missing Expansion Opportunities: Strong, high-potential regions might go unnoticed in the chaos.

Beyond the financial risk, there are deeper concerns. Each decision impacts local communities—store closures mean job losses, reduced dining options for loyal customers, and diminished confidence in a brand that is a household name. Without a data-driven approach, Pluckeye’s could be flying blind in a turbulent economic environment.

How might supervised learning—or more specifically, decision trees—help replicate the success of thriving locations, especially during uncertain times?


Leveraging Regression Trees to Safeguard the Brand

To weather the storm and prepare for an uncertain future, the leadership team at Pluckeye’s turns to regression decision trees (or simply regression trees), a machine learning method designed to model complex relationships and predict numerical outcomes such as sales revenue. By leveraging insights from their nationwide network of restaurants, they hope to uncover the factors driving performance and make informed decisions that will protect the chain’s legacy.

The objectives of the project are clear:

  • Identify Key Drivers of Success: Determine which factors (e.g., crime rate, property tax, average family size, or median income) most strongly influence restaurant sales.
  • Forecast Restaurant Performance: Use historical data to predict revenue for existing locations and evaluate the potential of new regions.
  • Support Better Decisions: Provide actionable, data-backed recommendations for marketing, resource allocation, and store closures or expansions.
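To make the objectives concrete, here is a minimal sketch in Python using scikit-learn (which is installed in Appendix A) of fitting a regression tree on store-level data. The feature names mirror the candidate drivers listed above, but the data itself is synthetic and the numbers are purely illustrative—not Pluckeye’s actual figures.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic store-level data (illustrative only)
rng = np.random.default_rng(42)
n_stores = 200
stores = pd.DataFrame({
    "crime_rate": rng.uniform(0, 10, n_stores),       # incidents per 1,000 residents
    "property_tax": rng.uniform(5, 25, n_stores),     # rate per $1,000 of assessed value
    "avg_family_size": rng.uniform(2, 5, n_stores),
    "median_income": rng.uniform(30, 120, n_stores),  # in $1,000s
})
# Assume sales rise with income and fall with crime, plus noise
stores["sales"] = (
    50 + 4 * stores["median_income"] - 6 * stores["crime_rate"]
    + rng.normal(0, 10, n_stores)
)

# Fit a shallow regression tree: each split partitions stores by one factor,
# and each leaf predicts the mean sales of the stores it contains
features = ["crime_rate", "property_tax", "avg_family_size", "median_income"]
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
tree.fit(stores[features], stores["sales"])

# Forecast revenue for a prospective location ...
new_site = pd.DataFrame([[3.0, 12.0, 3.1, 85.0]], columns=features)
predicted_sales = tree.predict(new_site)[0]

# ... and inspect which factors drive the splits (impurity-based importances)
importances = dict(zip(features, tree.feature_importances_))
```

Because the synthetic data generator makes median income the dominant driver, the fitted tree’s feature importances recover that structure—which is exactly the “identify key drivers” objective in miniature.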

Now that we have context on this business problem …

♪ Let’s get technical, technical. I wanna get technical. ♪



Appendix A: Environment, Reproducibility, and Coding Style

If you are interested in reproducing this work, here are the versions of R, Python, and Julia, along with the respective packages, that I used. Additionally, Leland Wilkinson’s approach to data visualization (the Grammar of Graphics) has been adopted throughout. Finally, my coding style here is verbose so that you can trace where functions, methods, and variables originate, making this a learning experience for everyone, myself included.

cat(
    R.version$version.string, "-", R.version$nickname,
    "\nOS:", Sys.info()["sysname"], R.version$platform,
    "\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
## R version 4.2.3 (2023-03-15) - Shortstop Beagle
## OS: Darwin x86_64-apple-darwin17.0
## CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz

require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.5.0", repos="http://cran.us.r-project.org")
devtools::install_version("caret", version="6.0.94", repos="http://cran.us.r-project.org")
library(dplyr)
library(ggplot2)
library(caret)
import sys
print(sys.version)
# 3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]

!pip install pandas==2.0.3
!pip install plotnine==0.12.1
!pip install scikit-learn==1.3.0
import random
import datetime
import pandas
import plotnine
import sklearn
using InteractiveUtils
InteractiveUtils.versioninfo()
# Julia Version 1.9.2
# Commit e4ee485e909 (2023-07-05 09:39 UTC)
# Platform Info:
#   OS: macOS (x86_64-apple-darwin22.4.0)
#   CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
#   WORD_SIZE: 64
#   LIBM: libopenlibm
#   LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
#   Threads: 1 on 8 virtual cores
# Environment:
#   DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server

using Pkg
Pkg.add(name="HTTP", version="1.10.2")
Pkg.add(name="CSV", version="0.10.13")
Pkg.add(name="DataFrames", version="1.6.1")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="StatsBase", version="0.34.2")
Pkg.add(name="Gadfly", version="1.4.0")
using HTTP
using CSV
using DataFrames
using CategoricalArrays
using StatsBase
using Gadfly

Further Readings

  • Hildebrand, D. K., Ott, R. L., & Gray, J. B. (2005). Basic Statistical Ideas for Managers (2nd ed.). Thomson Brooks/Cole.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1
  • James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning: With Applications in Python. Springer. https://doi.org/10.1007/978-3-031-38747-0
  • Kuhn, M. & Johnson, K. (2013). Applied Predictive Modeling. Springer. https://doi.org/10.1007/978-1-4614-6849-3
  • Stine, R. & Foster, D. (2017). Statistics for Business: Decision Making and Analysis (3rd ed.). Pearson.