
Running a Random Forest – Data Analysis and Interpretation

Overview

My research focuses on Ghana, a country in the Gapminder dataset, as discussed throughout this course.

The variables in my observation dataset are all quantitative.

For this assignment, I binned my quantitative target variable, life expectancy (lifeexpectancy), into a two-level categorical target variable named lifeExpectancyCat. It is coded as 0 = low life expectancy and 1 = high life expectancy.
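The binning step can be sketched as follows. The post does not state which cutoff separates low from high life expectancy, so the median split, the helper name `bin_at_median`, and the sample values here are all assumptions made for illustration.

```python
from statistics import median


def bin_at_median(values):
    """Return 0 for values at or below the median and 1 for values above it."""
    cutoff = median(values)
    return [1 if v > cutoff else 0 for v in values]


# Hypothetical life expectancy values in years (not taken from the Gapminder data).
lifeexpectancy = [48.2, 55.1, 61.0, 66.4, 72.9, 79.3]

# 0 = low life expectancy, 1 = high life expectancy, as coded in the post.
lifeExpectancyCat = bin_at_median(lifeexpectancy)
print(lifeExpectancyCat)
```

The same helper could be applied to any other quantitative column that needs a two-level version.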

I also binned two of the quantitative predictor variables (incomeperperson and exports) for this assignment.

The quantitative predictors and their binary categorical counterparts are:

Quantitative variable    Binary categorical variable

incomeperperson          incomeLevelGrp

exports                  exportsCatGrp

I have also added two more explanatory variables, obtained from the Gapminder website (http://www.gapminder.org/data/), so that the random forest has more predictors to work with.

These new variables are:

agriculture which represents Agriculture, value added (% of GDP)

democracyscore which represents Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. It is a summary measure of a country’s democratic and free nature. -10 is the lowest value, 10 the highest.

Running a Random Forest

Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting my binary categorical response variable, lifeExpectancyCat. The following explanatory variables were included as possible contributors to the random forest:

income per person, exports, inflation, agriculture, and democracy score

The explanatory variable with the highest relative importance score was agriculture (0.4005748), followed by democracy score (0.35894682), inflation (0.12744075), and exports (0.05900944). The explanatory variable with the lowest relative importance score was income per person (0.05402819).
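For readers who want to see how a ranking like this can be produced, here is a minimal sketch using scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute. The data is synthetic, the column labels simply follow the variable list in this post, and the choice of 25 trees is an assumption; it is not the code that produced the scores reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["incomeLevelGrp", "exportsCatGrp", "inflation",
                 "agriculture", "democracyscore"]

# Synthetic stand-in data: 50 rows, 5 predictors (NOT the Gapminder values).
X = rng.normal(size=(50, 5))
# The label is built mostly from the 'agriculture' column plus some noise.
y = (X[:, 3] + 0.3 * rng.normal(size=50) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# One impurity-based score per predictor; the scores sum to 1.
importances = forest.feature_importances_
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:16s} {score:.4f}")
```

Because the scores are normalized, they can be read as shares of the total importance, which is why the five values reported in this post also add up to 1.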

 

Relative Importance Scores

image

 

The accuracy of the random forest was 95% (0.94999999999999996, which is how 0.95 is stored as a floating-point number). Growing multiple trees rather than a single tree added little to the overall accuracy of the model, which suggests that interpreting a single decision tree may be appropriate.
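One way to check how much extra trees help is to refit the forest with an increasing number of trees and record the test accuracy each time. The sketch below does this on synthetic data; the range of 1 to 25 trees, the random seeds, and the data itself are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in data: 50 rows, 5 predictors, binary target.
X = rng.normal(size=(50, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=50) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

tree_counts = range(1, 26)
accuracies = []
for n_trees in tree_counts:
    # Fit a forest with n_trees trees and score it on the held-out rows.
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, forest.predict(X_test)))

for n_trees, acc in zip(tree_counts, accuracies):
    print(n_trees, round(acc, 2))
```

If the accuracy is already high with one tree and stays roughly flat as trees are added, a single decision tree captures most of what the forest does.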

 

Random Forest Graph

image

 

Requesting the shape of my predictor training sample shows that it has 30 observations (rows), which is 60% of the original sample, and 5 explanatory variables (columns), as listed above.

image

The test sample has 20 observations (rows), which is 40% of the original sample, and 5 explanatory variables (columns).
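A 60/40 split with these shapes can be reproduced with scikit-learn's `train_test_split`. This is a sketch under assumptions: the arrays are placeholders with the same dimensions as the sample described here (50 rows, 5 predictors), and the variable names and `random_state` are chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dimensions described in the post.
predictors = np.zeros((50, 5))
targets = np.array([0, 1] * 25)

# test_size=0.4 holds out 40% of the rows for testing.
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=0.4, random_state=0)

print(pred_train.shape)
print(pred_test.shape)
```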

image

A confusion matrix was used to estimate the prediction accuracy of my model. The model correctly classified 19 of the 20 observations in the test sample and misclassified just 1. In other words, it correctly classified 95% of the test observations as having high or low life expectancy and misclassified 5%.
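The arithmetic behind these figures can be sketched without any libraries. The labels below are hypothetical: the post only states that 19 of 20 test observations were classified correctly, so the split between classes and the position of the single error are assumptions.

```python
def confusion_matrix_2x2(actual, predicted):
    """Rows are actual classes (0, 1); columns are predicted classes (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Hypothetical labels for 20 test rows (0 = low, 1 = high life expectancy).
actual = [0] * 9 + [1] * 11
predicted = list(actual)
predicted[0] = 1  # a single misclassification

matrix = confusion_matrix_2x2(actual, predicted)
# Correct predictions sit on the diagonal of the matrix.
correct = matrix[0][0] + matrix[1][1]
accuracy = correct / len(actual)

print(matrix)
print(accuracy)
```

With 19 correct predictions out of 20, the accuracy is 19 / 20 = 0.95, which matches the 95% reported here.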

The Confusion Matrix result can be seen below:

image

This can be interpreted as the model having low prediction error: it correctly classified a high percentage of the test observations and misclassified a low percentage.

Running a test accuracy score on my model confirms this: the result was 0.94999999999999996 (95%), as can also be seen below:

image

 

 

################################

PYTHON CODE

#################################

################################

CODE OUTPUT

#################################
