Data Visualization and Random Forest (Live) – May 2025

This course covers data mining, visualization, and advanced predictive modeling. It starts by building a solid foundation in data mining, where you’ll learn about model fitting, overfitting prevention, variable selection strategies, and the nuances of generalized linear models. From there, the curriculum moves to data visualization techniques, guiding you through bivariate and multi-panel plots, added variable and partial residual plots, and the interpretation of interactions and nonlinear effects. The course then connects these insights to machine learning with Random Forests in R, covering variable importance, out-of-bag error estimation, and variable selection methods such as VSURF. By the end, you’ll be able to integrate visual analysis with linear modeling to strengthen both your exploratory data analysis and your predictive model building.
Classes are held via Zoom on Mondays from 12 pm to 2 pm EST during the first four weeks of May. Space is limited to 9 students.
Curriculum
- 5 Sections
- 7 Lessons
- Lifetime
- Canvas Access
This course is best taken on Canvas, where you can post discussion questions, take quizzes, and practice in R. The first lesson contains instructions for accessing Canvas and the passkey for doing so.
- Unit 1 Overview
Data Mining
Objectives
- Know what we buy estimates with
- Perfect fitting models
- Spent vs. unspent df
- Complexity vs. fit tradeoff
- Four reasons to do a GLM
- Why GLMs suck at the last two purposes
- Understand what data mining is
- Understand what overfitting is
- How to prevent overfitting
- Three strategies for selecting variables
- Linking hypotheses to specific parameters
- Research questions versus hypotheses (and tools appropriate for each)
- Why having more than 3 predictors is a red flag
- Unit 2 Overview
Visualizing Data
Learning Objectives
- What we’re looking for in a bivariate plot
- Problems to look out for: curvilinear relationships
- What binning is
- Rules for identifying which variable to put on the x axis
- Identifying main effects of paneled variables
- Purpose of added variable plots (AVPs)
- Weakness of AVPs
- Identifying interactions from visuals
- When you should NOT use AVPs
- Visualizing 3+ variables
- What you are looking for in multi-panel plots
- The purpose of marginal plots
- How to interpret marginal plots
- Conceptually what a three-way interaction is
- Building models from visuals vs. building visuals from models
- How to build models from visuals
- Why we specifically look for interactions and nonlinear effects
- What added variable plots (AVPs) are doing
- AVPs as approximations
- Partial residual plots (PRPs) vs. AVPs
- PRPs and showing what we failed to fit
- Dustin’s extension to PRPs
- How to do PRPs in flexplot
- Adding back fit to the residuals versus not
- What you can visualize with PRPs
- Residual dependence plots vs. partial residual plots
- Using PRPs to detect three-way interactions
- Reducing bin size for multi-panel plots
- Why reporting with tables is stupid
- Why we prefer visuals for reporting results
- What visual partitions are
- Three rules for visual partitions
- Biggest threat: missing something you could have modeled
- 5-step strategy for identifying visual partitions
- Unit 3 Overview
Random Forest and LMs
Objectives
- General strategy for RF
- How to model RF in R
- How to compute importance/OOB
- How to visualize RF in flexplot
- Different uses of RF
- Building LMs from RF visuals
- Two methods for variable selection using RF
- Basics of VSURF
- Three steps of VSURF
- Pros/Cons of VSURF
- Four steps of my approach
- Pros/Cons of my approach
- Unit 4 Overview
Objectives
- General strategy for RF
- How to model RF in R
- How to compute importance/OOB
- How to visualize RF in flexplot
- Different uses of RF
- What visual partitions are
- Rules for plotting visual partitions
- The steps for identifying visual partitions