Random Forest

 

Random Forest: an ensemble of decision trees

A single decision tree can make a lot of wrong calls because its judgments are very black-and-white: each split is a hard yes/no threshold. A random forest is a meta-estimator that aggregates many decision trees to average out those individual errors.

 

  • Works for both classification and regression
  • Every tree votes on the outcome; the outcome with the most votes wins (see the sketch after this list)
  • Less interpretable than a single tree
  • More accurate predictions than CART
  • Randomness element: each tree splits on a random subset of the independent variables
  • Strong performance, with a high tolerance for less-cleaned data
  • Quickly identifies which features matter most
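
A minimal sketch of the voting idea, using scikit-learn on synthetic data (everything here, from make_classification to n_estimators=9, is illustrative rather than from these notes; scikit-learn actually averages class probabilities across trees rather than counting hard votes, so treat this as an approximation of the mechanism):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=9, random_state=0).fit(X, y)

# Ask each of the 9 trees for its prediction on one observation, then take the mode
votes = np.array([tree.predict(X[:1])[0] for tree in forest.estimators_]).astype(int)
print("individual votes:", votes)
print("majority vote:", np.bincount(votes).argmax())
print("forest prediction:", forest.predict(X[:1])[0])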

 

 


 

  1. The number of features that can be split on at each node is limited to some percentage of the total. This ensures that the ensemble does not rely too heavily on any individual feature and makes fair use of all potentially predictive features.
  2. Each tree draws a random (bootstrap) sample from the original data set when generating its splits, adding a further element of randomness that prevents overfitting (both knobs are sketched in code below).
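
In scikit-learn terms, both sources of randomness map directly to constructor arguments; a sketch (the parameter values are illustrative, not from these notes):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=200,
    max_features=0.5,  # 1. each split considers only 50% of the features
    bootstrap=True,    # 2. each tree trains on a bootstrap sample of the rows
    random_state=0,
)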

 

[Figure: outputs of 9 decision tree classifiers]

These decision tree classifiers can be aggregated into a random forest ensemble that combines their input. Think of the horizontal and vertical axes of each decision tree’s output as features x1 and x2. At certain values of each feature, the decision tree outputs a classification of “blue”, “green”, “red”, etc.


These results are aggregated, through modal votes or averaging, into a single ensemble model that ends up outperforming any individual decision tree’s output. Random forests are an excellent starting point for the modeling process, since they tend to have strong performance with a high tolerance for less-cleaned data and can be useful for figuring out which features actually matter among many features.
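
The “figuring out which features actually matter” part comes almost for free: after fitting, scikit-learn exposes impurity-based importances. A sketch on synthetic data (all names and values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# One importance score per feature; the three informative ones should dominate
for i, imp in enumerate(forest.feature_importances_):
    print(f"x{i}: {imp:.3f}")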

In regression cases: 

[Animation: random forest regression]
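
In regression, each tree predicts a number and the forest returns the average of those predictions instead of a majority vote. A minimal sketch on synthetic sine-wave data (all names are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)

reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(reg.predict([[5.0]]))  # the mean of the 100 trees' predictions at x = 5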

 

Cross Validation:

In CART and random forests, the minbucket/nodesize parameter affects the model’s out-of-sample accuracy (on the test set): a small minbucket overfits, while a large minbucket makes the model too simple. To prepare the model for new data, tune this parameter with k-fold cross-validation:

[Figure: k-fold cross-validation output, accuracy plotted against the parameter value (cp / minbucket / nodesize)]

When cross-validating CART, the tuned parameter is called cp (complexity parameter). It plays the same role as nodesize/minbucket and, like AIC or adjusted R^2, trades off model fit against complexity: a small cp allows bigger trees that may overfit, while a bigger cp gives trees that are too simple.
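
In scikit-learn, the closest knob to nodesize/minbucket is min_samples_leaf, and the k-fold search can be run with GridSearchCV; a sketch (the grid values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"min_samples_leaf": [1, 5, 10, 25, 50]},
    cv=10,  # 10-fold cross-validation, mirroring the caret example below
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)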

 

 

 

(try XGBoost tree?)
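
Following up on that note, a minimal sketch assuming the separate xgboost package is installed (pip install xgboost); its XGBClassifier follows the scikit-learn fit/predict API, and the parameter values here are illustrative:

from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)  # assumes X, y as in the sklearn sketches above
predicted = model.predict(x_test)  # x_test is hypothetical held-out data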


R:
library(caTools)
set.seed(3000)
spl = sample.split(stevens$Reverse, SplitRatio = 0.7)
Train = subset(stevens, spl==TRUE)
Test = subset(stevens, spl==FALSE)
library(randomForest)
StevensForest = randomForest(Reverse ~ Circuit + Issue +…, data = Train, ntree=200, nodesize=25)
# nodesize is like minbucket: it controls the minimum number of observations in each subset; a small nodesize means a long runtime
# If you get a warning: the target variable must be a factor for a classification model, so convert it and re-run randomForest()
Train$Reverse = as.factor(Train$Reverse)
Test$Reverse = as.factor(Test$Reverse)
PredictForest = predict(StevensForest, newdata = Test)
table(Test$Reverse, PredictForest)
confusionMatrix(table(PredictForest, Test$Reverse))  # requires library(caret)
Cross validation
library(caret)
library(e1071)

# Define cross-validation experiment
numFolds = trainControl( method = "cv", number = 10 )
# method = cross-validation, number of folds = 10
cpGrid = expand.grid( .cp = seq(0.01,0.5,0.01))
# range of cp: 0.01 to 0.5 in steps of 0.01

# Perform the cross validation
train(Reverse ~ Circuit + Issue + Petitioner + Respondent, data = Train, method = "rpart",
      trControl = numFolds, tuneGrid = cpGrid)
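
The printed train() output lists the cross-validated accuracy for each cp value and reports which one was best; refit the CART model with that cp (passed to rpart()), or use the finalModel stored in the train() result, before predicting on Test.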


P:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Random forest:
# Import library
from sklearn.ensemble import RandomForestClassifier
# Assumes you have X (predictors) and y (target) for the training set and x_test (predictors) for the test set
# Create the random forest object
model = RandomForestClassifier()
# Train the model on the training set
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)
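
A quick out-of-sample check, assuming you also held out the true test labels (called y_test here; the snippet above does not define it):

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, predicted))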