Decision Trees and Random Forest

Jay Vinay
2 min read · Sep 2, 2020

"Random Forest": an ensemble approach to finding the decision tree that best fits the training data by creating many decision trees and then determining the "average" one. The "random" part of the term refers to building each of the decision trees from a random selection of features; the "forest" refers to the set of decision trees.
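To make the "average of many trees" idea concrete, here is a minimal sketch using scikit-learn (assumed available); the synthetic dataset and hyperparameter values are illustrative, not from this post:

```python
# A minimal sketch (scikit-learn assumed available); the synthetic dataset
# and hyperparameter values are illustrative, not from this post.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Each of the 100 trees is grown on a bootstrap sample of the rows and
# considers a random subset of the features at every split.
forest = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

# The forest's regression output is the plain average of its trees' outputs.
tree_preds = np.stack([tree.predict(X) for tree in forest.estimators_])
print(np.allclose(tree_preds.mean(axis=0), forest.predict(X)))  # True
```

The check on the last line confirms that the forest's prediction really is the mean of its individual trees' predictions.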

**How does a decision tree work?**

A decision tree arrives at an estimate by asking a series of questions of the data, each question narrowing the range of possible values until the model is confident enough to make a single prediction. The order of the questions, as well as their content, is determined by the model, and every question is asked in True/False form. This is a little hard to grasp because it is not how humans naturally think, and perhaps the best way to show the difference is to build a real decision tree. Suppose x1 and x2 are two features that let us predict the target variable y by asking True/False questions. That is the entirety of creating a decision tree regressor: growth stops when some stopping condition (defined by hyperparameters) is met (both conditions are illustrated in the first sketch at the end of this section):

1. You hit a limit that was requested (for example, max_depth).
2. Your leaf nodes each contain only one sample, so no further split is possible. The training MSE will be zero, but the tree will overfit any other set, which is not a useful model.

**Birth of Random Forest**

Creating an ensemble of these trees is a remedy for the disadvantages above. Random Forest was first proposed by Tin Kam Ho at Bell Laboratories in 1995. A large number of trees can outperform an individual tree by reducing the errors that usually arise when only a single tree is considered: when one tree goes wrong, another tree may perform well. The ensemble formed this way is known as the Random Forest. The randomly split dataset is distributed among all the trees, and each tree focuses on the data it has been given. For classification, votes are collected from every tree and the most popular class is chosen as the final output; for regression, an average is taken over all the outputs and used as the final result. Unlike a single decision tree, where the best-performing features are chosen as the split nodes, a Random Forest selects these features randomly: only a random bag of features is considered at each split, and a randomized threshold is used to create each decision tree.
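To make the two stopping conditions concrete, here is a short sketch, again assuming scikit-learn; the two-feature synthetic dataset is my own illustration. An unconstrained tree drives the training MSE to zero, while a max_depth limit stops growth early:

```python
# Sketch of the two stopping conditions (synthetic two-feature data;
# scikit-learn assumed available).
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 2))                 # features x1 and x2
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=100)

# Condition 1: stop when a requested limit is hit (max_depth here).
shallow = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Condition 2: keep splitting until every leaf holds a single sample.
full = DecisionTreeRegressor().fit(X, y)

print(mean_squared_error(y, shallow.predict(X)))      # > 0: growth stopped early
print(mean_squared_error(y, full.predict(X)))         # 0.0: perfect fit, overfits
```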
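The bagging, random-feature, and voting ideas can also be sketched by hand. This is a toy illustration rather than the post's own code; the sampling scheme and the names (rows, cols, feature_sets) are my assumptions:

```python
# Hand-rolled bagging + random feature bags + majority voting.
# (Toy illustration; the scheme and names are assumptions, not the post's code.)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees, feature_sets = [], []
for _ in range(25):
    rows = rng.integers(0, len(X), size=len(X))            # bootstrap sample of rows
    cols = rng.choice(X.shape[1], size=3, replace=False)   # random bag of features
    # splitter="random" picks randomized split thresholds for each feature.
    tree = DecisionTreeClassifier(splitter="random", random_state=0)
    trees.append(tree.fit(X[rows][:, cols], y[rows]))
    feature_sets.append(cols)

# Classification: every tree votes, and the most popular class wins.
votes = np.stack([t.predict(X[:, cols]) for t, cols in zip(trees, feature_sets)])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print((majority == y).mean())                              # ensemble accuracy on the training set
```

Passing splitter="random" makes scikit-learn choose randomized split thresholds, matching the randomized-threshold idea described above.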

Originally published at https://www.jayvinay.com on September 2, 2020.


Jay Vinay

Computer Science Engineering student. Interested in psychology and cognitive sciences, and loves to code.