Predict outcomes and identify the most important variables.
Definition
Random Forest is a supervised machine learning algorithm based on an ensemble of decision trees. It is robust, insensitive to outliers, and automatically provides a measure of variable importance.
When to use it
Prediction with many variables (classification or regression)
Identify the most predictive variables
Data with missing values or mixed variable types
When relationships are complex and non-linear
Requirements
Binary (classification) or continuous (regression) dependent variable
Continuous or categorical independent variables
N ≥ 50 recommended for stability
What StatsLab computes
Variable importance (Gini)
Importance bar chart
AUC-ROC (classification)
Confusion matrix
OOB error rate (Out-Of-Bag)
Precision, recall, F1-score
Worked example
Context : Predicting school dropout (Yes/No) from 12 socio-demographic and academic variables.
Result : AUC = 0.89 · OOB error = 8.2% · Top variable: Absenteeism (importance = 0.31)
Interpretation : Excellent predictive power (AUC = 0.89). Absenteeism is by far the most predictive variable. The model correctly classifies 91.8% of students.