Random Forest

Predict outcomes and identify the most important variables.

Definition

Random Forest is a supervised machine learning algorithm that combines the predictions of many decision trees, each trained on a bootstrap sample of the data. It is relatively robust to outliers and noise, and it provides a measure of variable importance as a by-product of training.
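
To make the ensemble idea concrete, here is a minimal sketch in plain Python of two ingredients of a random forest: drawing a bootstrap sample for each tree, and combining the trees' answers by majority vote. It is an illustration only, not StatsLab's implementation. The function names are hypothetical, the five tree predictions are made up, and tree training and per-split random feature selection are left out.

```python
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def out_of_bag(n_rows, sampled):
    """Rows never drawn for this tree; they act as its built-in test set."""
    drawn = set(sampled)
    return [i for i in range(n_rows) if i not in drawn]


def majority_vote(tree_predictions):
    """Combine the class predicted by each tree into one forest prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_rows = 1000

sample = bootstrap_indices(n_rows, rng)
# Roughly a third of the rows are left out of any one bootstrap sample.
oob_fraction = len(out_of_bag(n_rows, sample)) / n_rows

# Hypothetical predictions from five trees for a single student.
forest_prediction = majority_vote(["Yes", "No", "Yes", "Yes", "No"])
```

The rows a tree never sees are what the out-of-bag error is computed on, which is why a random forest can estimate its own error without a separate test set.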

When to use it

Prediction with many variables (classification or regression)
Identify the most predictive variables
Data with missing values or mixed variable types
When relationships are complex and non-linear

Requirements

Binary (classification) or continuous (regression) dependent variable
Continuous or categorical independent variables
At least 50 observations (N ≥ 50) recommended for stable results

What StatsLab computes

Variable importance (mean decrease in Gini impurity)
Importance bar chart
AUC-ROC (classification)
Confusion matrix
Out-of-bag (OOB) error rate
Precision, recall, F1-score
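
For readers who want to reproduce comparable outputs outside StatsLab, the following is a rough sketch using scikit-learn on synthetic data. It rests on assumptions: StatsLab's own implementation and settings are not documented here, the data set is generated rather than real, and the choice to compute every metric from out-of-bag predictions is ours.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)

# Synthetic stand-in data (assumption): 300 rows, 12 predictors, binary outcome.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

# oob_score=True scores each row using only the trees that did not train on it.
forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X, y)

# Gini-based importances; scikit-learn normalizes them to sum to 1.
importances = forest.feature_importances_

# OOB error rate is the complement of the OOB accuracy.
oob_error = 1.0 - forest.oob_score_

# OOB class probabilities and the class labels derived from them.
oob_proba = forest.oob_decision_function_
oob_pred = forest.classes_[oob_proba.argmax(axis=1)]

auc = roc_auc_score(y, oob_proba[:, 1])
cm = confusion_matrix(y, oob_pred)  # rows: actual class, columns: predicted class
precision, recall, f1, _ = precision_recall_fscore_support(
    y, oob_pred, average="binary"
)
```

Scoring on out-of-bag predictions rather than on the training rows avoids the optimistic bias of evaluating a forest on data its trees have already seen.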

Worked example

Context: Predicting school dropout (Yes/No) from 12 socio-demographic and academic variables.

Result: AUC = 0.89 · OOB error = 8.2% · Top variable: Absenteeism (importance = 0.31)

Interpretation: Excellent predictive power (AUC = 0.89). Absenteeism is by far the most predictive variable (importance = 0.31). Judging by the OOB error, the model correctly classifies about 91.8% of students (100% − 8.2%).
