Random Forest: an ensemble of decision trees
A single decision tree can make a lot of wrong calls because its judgments are very black-and-white. A random forest is a meta-estimator that aggregates many decision trees to smooth out those errors.
- Classification and Regression
- Every tree votes on the outcome; the outcome with the most votes wins
- Less interpretable
- More accurate prediction than CART
- Randomness element: each tree splits on a random subset of the independent variables (see the sketch after this list)
- Strong performance with high tolerance for less-cleaned data
- Quickly figures out which features matter most (see the feature-importance example at the end)
- The number of features that can be split on at each node is limited to some % of the total. This ensures that the ensemble model does not rely too heavily on any individual feature, and makes fair use of all potentially predictive features.
- Each tree is trained on a random sample drawn (with replacement) from the original data set, adding a further element of randomness that prevents overfitting.
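A minimal scikit-learn sketch of these two sources of randomness (the data and parameter values here are made up for illustration): max_features caps how many variables each split may consider, and bootstrap=True trains each tree on a random resample of the rows.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)  # toy data
forest = RandomForestClassifier(
    n_estimators=500,      # number of trees in the ensemble
    max_features="sqrt",   # each split sees only a random subset of the features
    bootstrap=True,        # each tree trains on a random resample of the rows
    random_state=42,
)
forest.fit(X, y)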
Several decision tree classifiers (nine, say) can be aggregated into a forest that combines their outputs through modal votes or averaging, producing a single ensemble model that ends up outperforming any individual decision tree's output.
In regression cases, each tree predicts a numeric value and the forest averages those predictions, as sketched below:
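A hedged sketch of the regression case on made-up data, showing that the forest's output is just the mean of its trees' predictions:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
import numpy as np

X, y = make_regression(n_samples=500, n_features=10, random_state=1)  # toy data
reg = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

# The forest's prediction is the mean of the individual trees' predictions
tree_preds = np.stack([tree.predict(X[:5]) for tree in reg.estimators_])
print(tree_preds.mean(axis=0))  # same values as below
print(reg.predict(X[:5]))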
Cross Validation:
In CART and random forests, the minbucket/nodesize parameter affects the model's out-of-sample accuracy (test set): a small minbucket overfits, while a large minbucket makes the model too simple. To prepare the model for new data, use cross-validation:
When cross-validating, the complexity knob is called cp (complexity parameter) rather than nodesize/minbucket. Like AIC or R^2, it trades off fit against complexity: a small cp allows bigger trees that might overfit, while a bigger cp gives trees that are too simple.
(try XGBoost tree?)
library(caTools)  # provides sample.split
set.seed(3000)
spl = sample.split(stevens$Reverse, SplitRatio = 0.7)
Train = subset(stevens, spl == TRUE)
Test = subset(stevens, spl == FALSE)
library(caret)
library(e1071)
library(rpart)  # needed for method = "rpart" below
# Define cross-validation experiment
numFolds = trainControl(method = "cv", number = 10)
# method = cross validation, number of folds = 10
cpGrid = expand.grid( .cp = seq(0.01,0.5,0.01))
#range of cp 0.01 – 0.5 (0.01 steps)
# Perform the cross validation
train(Reverse ~ Circuit + Issue + Petitioner + Respondent, data = Train, method = "rpart",
      trControl = numFolds, tuneGrid = cpGrid)
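For comparison, a sketch of the same idea in scikit-learn (not part of the original R workflow, with made-up toy data): GridSearchCV runs 10-fold cross-validation over DecisionTreeClassifier's ccp_alpha, which plays roughly the role of rpart's cp.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
import numpy as np

X, y = make_classification(n_samples=1000, random_state=0)  # toy data
# Small ccp_alpha -> bigger trees (overfitting risk); large ccp_alpha -> heavy pruning
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": np.arange(0.0, 0.05, 0.005)},
    cv=10,
)
grid.fit(X, y)
print(grid.best_params_)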
Random forest quickstart with scikit-learn:
from sklearn.ensemble import RandomForestClassifier
# Assumes you have X (predictors) and y (target) for the training set, and x_test (predictors) for the test set
# Create the random forest object
model = RandomForestClassifier()
# Train the model on the training set
model.fit(X, y)
# Predict the output
predicted = model.predict(x_test)
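Tying back to the note above that forests quickly surface which features matter most: after fitting, the model exposes feature_importances_ (this continues the sketch above and assumes model has already been fit).

import numpy as np

importances = model.feature_importances_      # one score per predictor column
for i in np.argsort(importances)[::-1][:5]:   # five most important features
    print(f"feature {i}: importance {importances[i]:.3f}")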