Chosen Dataset

I will be working with data from the Gapminder dataset.

This is the same dataset I worked with in the Data Management and Visualization course assignments. As discussed in those assignments, I have chosen to focus on one country, Ghana.

Hence I am particularly interested in data about Ghana (as dealt with in the Data Management and Visualization course assignments), primarily these variables:

a.      incomeperperson

b.     lifeexpectancy

c.      literacyrate

 

It is worth pointing out that the gapminder.csv provided for the assignment covers all the countries in the world and has no particular focus on my country of interest, Ghana. This means that only one year's value for each of these core variables is entered for Ghana in the gapminder.csv provided for the assignment.

As a result, when I run my Python program on the provided file, there are no further yearly values for each of these variables from which to build a meaningful frequency distribution specific to Ghana, unless I compare Ghana's values to those of all the other countries, which is NOT the focus of my research work.

To keep the research focused on Ghana alone, I will fetch this data from the http://www.gapminder.org/ website, specifically its data section, which can be found here: http://www.gapminder.org/data/.

 

I will therefore need to compile a new csv data file focused on Ghana, which will give me all the variables I need for my analysis. In a nutshell, this new csv data file lets me load and call the relevant variables and columns in my Python program and, more importantly, gives me the variables I will need for my research work going forward.

I will, therefore, call the new data csv file for the assignment: gapminder_ghana_updated.csv

This is the Gapminder csv data file I will be loading into my Python program.
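A minimal sketch of the load-and-convert step, assuming the file has a header row with the four column names used in this post. The `year` column and every value below are made-up placeholders (not real Gapminder figures), read from an in-memory string so the snippet runs without the actual file. `pandas.to_numeric(..., errors="coerce")` turns blank or non-numeric cells into NaN.

```python
import io

import pandas

# Hypothetical stand-in for gapminder_ghana_updated.csv; all values are placeholders.
sample_csv = io.StringIO(
    "year,incomeperperson,lifeexpectancy,literacyrate,Inflation\n"
    "2001,1900,57.1,,30.2\n"
    "2002,1950,57.4,unknown,22.8\n"
    "2003,2000,57.9,57.9,28.7\n"
)
data = pandas.read_csv(sample_csv)

# Convert each analysis column to numeric; unparseable cells become NaN.
columns = ["incomeperperson", "lifeexpectancy", "literacyrate", "Inflation"]
for column in columns:
    data[column] = pandas.to_numeric(data[column], errors="coerce")

missing_literacy = int(data["literacyrate"].isna().sum())
print(data.dtypes)
print("missing literacyrate values:", missing_literacy)
```

Here both the empty cell and the text cell in `literacyrate` end up as NaN, so later calls to `dropna()` can remove them.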

The gapminder_ghana_updated.csv dataset for this project can be viewed and downloaded here:

 

https://drive.google.com/file/d/0B2KfPRxy4ootQnRzVUZQQXdFX1U/view?usp=sharing

 

See this screenshot for a guide: http://prntscr.com/9gctxn

 

Data Variables

The data variables I worked with in the Gapminder dataset are all quantitative. However, as stated in the requirements for the Running An Analysis Of Variance assignment, I need one of my variables (the explanatory variable) to be categorical.

I have therefore added a 4th variable named Inflation, which I will categorise in order to get a categorical variable for the Analysis Of Variance test.
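As a rough illustration of categorising a quantitative variable, the sketch below bins a few inflation rates with `pandas.cut`, using the same bin edges as the program in this post. The five inflation values are placeholders, not real Gapminder data.

```python
import pandas

# Placeholder inflation rates (annual %), not real Gapminder values.
inflation = pandas.Series([10.5, 32.0, 40.0, 80.0, 120.0])

# Bin edges used in this post; each bin excludes its left edge and includes its right edge.
bins = [-4, 32, 64, 96, 128]
inflation_category = pandas.cut(inflation, bins)

# Integer code of the bin each value fell into (0 is the lowest bin).
codes = inflation_category.cat.codes.tolist()
print(inflation_category)
print(codes)
```

Note that a rate of exactly 32.0 lands in the first bin, because the bins are closed on the right.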

Inflation, GDP deflator (annual %):

According to the Gapminder codebook, inflation as measured by the annual growth rate of the GDP implicit deflator shows the rate of price change in the economy as a whole. The GDP implicit deflator is the ratio of GDP in current local currency to GDP in constant local currency. Source: World Bank national accounts data and OECD National Accounts data files.
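The definition above can be worked through with a small example. This is only a sketch: the function names and the GDP figures are hypothetical and chosen to keep the arithmetic easy to follow.

```python
def gdp_deflator(nominal_gdp, real_gdp):
    """Ratio of GDP in current prices to GDP in constant prices."""
    return nominal_gdp / real_gdp


def inflation_rate(deflator_previous, deflator_current):
    """Annual growth rate of the deflator, in percent."""
    return (deflator_current / deflator_previous - 1) * 100


# Hypothetical GDP figures in local currency units (illustration only).
deflator_year1 = gdp_deflator(nominal_gdp=100.0, real_gdp=100.0)
deflator_year2 = gdp_deflator(nominal_gdp=126.0, real_gdp=105.0)

inflation = inflation_rate(deflator_year1, deflator_year2)
print(round(inflation, 2))
```

In this made-up case GDP at current prices grows by 26% while GDP at constant prices grows by 5%, so the deflator rises from 1.0 to 1.2 and inflation for the year is 20%.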

Hence the 4 variables I will be working with on this and subsequent assignments are:

a.      incomeperperson

b.     lifeexpectancy

c.      literacyrate

d.       Inflation

 

I have updated my personal codebook to include this 4th variable. The updated codebook can be found at this link:

https://docs.google.com/document/d/177YfOjdk4oekFu20OLt4fgmu-n7cgRULAJ_Kd9KdYaM/edit?usp=sharing

 

The Research Question

For the purposes of this assignment, Running An Analysis Of Variance Test, I will slightly modify the research question I used in the previous course.

 

Hence the question I will be looking at in this assignment is: Is there an association between Inflation and Income Per Person of the Ghanaian population?

 

Hypothesis Testing

The Null and Alternate Hypotheses:

From the above research question, the Null Hypothesis (Ho) is that there is no association between Inflation and Income Per Person of the Ghanaian population.

The Alternate Hypothesis (Ha) states that there is an association between Inflation and Income Per Person of the Ghanaian population.
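In ANOVA terms, the two hypotheses are statements about group means. Writing \mu_1, ..., \mu_k for the mean Income Per Person in each of the k inflation categories (notation introduced here only for illustration), they can be stated as:

```latex
H_0:\ \mu_1 = \mu_2 = \dots = \mu_k
\qquad
H_a:\ \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j
```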

 

Sample:

The sample is the data from the Gapminder dataset with a focus on Ghana.

 

Assessing the evidence:

This is done by running an analysis of variance (ANOVA) test on the hypotheses.

I run the test using a Python program.
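To show what the ANOVA F-statistic measures, here is a small hand calculation in plain Python. It is a sketch on three made-up groups, not the method statsmodels uses internally: the helper `one_way_anova_f` is hypothetical, and it simply divides the between-group mean square (k - 1 degrees of freedom) by the within-group mean square (N - k degrees of freedom).

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)

    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2 for group in groups
    )
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum(
        (value - sum(group) / len(group)) ** 2 for group in groups for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Three small made-up groups (illustration only).
f_statistic, df_between, df_within = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_statistic, df_between, df_within)
```

With 4 inflation categories and 51 observations, as in the output later in this post, the same formulas give 3 and 47 degrees of freedom.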

 

MY PYTHON PROGRAM CODE:

# -*- coding: utf-8 -*-
"""
Created on Mon Jan  4 00:59:30 2016

@author: Bernard
"""

import numpy
import pandas
# statsmodels provides the ANOVA F test (with its p value) and the Tukey HSD test
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# load the gapminder_ghana_updated dataset csv into the program
data = pandas.read_csv('gapminder_ghana_updated.csv', low_memory=False)

# Converting data to numeric (entries that cannot be parsed become NaN).
# pandas.to_numeric is used here because convert_objects(convert_numeric=True)
# is no longer available in newer pandas releases.
data["incomeperperson"] = pandas.to_numeric(data["incomeperperson"], errors="coerce")
data["lifeexpectancy"] = pandas.to_numeric(data["lifeexpectancy"], errors="coerce")
data["literacyrate"] = pandas.to_numeric(data["literacyrate"], errors="coerce")
data["Inflation"] = pandas.to_numeric(data["Inflation"], errors="coerce")

# create a variable for inflationCategory
data["inflationCategory"] = data["Inflation"]

# categorical groupings for inflation. This is to get one categorical
# variable for the ANOVA test
data["inflationCategory"] = pandas.cut(data.inflationCategory, [-4, 32, 64, 96, 128])

# including only data relevant for our testing by dropping irrelevant data
dataSub = data[["incomeperperson", "inflationCategory"]].dropna()

# Change format from numeric to categorical
dataSub["inflationCategory"] = dataSub["inflationCategory"].astype("category")

# describe inflation category
print("describe inflation Category")
desc1 = dataSub["inflationCategory"].describe()
print(desc1)

# inflationCategory count
print("inflation category")
c1 = dataSub["inflationCategory"].value_counts(sort=False, dropna=True)
print(c1)

# print size of incomeperperson
ct1 = dataSub.groupby('incomeperperson').size()
print("Incomeperperson - 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana. ")
print(ct1)

# using the ordinary least squares (ols) function for calculating the
# F-statistic and associated p value
model1 = smf.ols(formula='incomeperperson ~ C(inflationCategory)', data=dataSub)
results1 = model1.fit()
print(results1.summary())

# Examining the means of incomeperperson by inflationCategory, hence I will be
# looking at only the variables concerned
dataSub1 = data[['incomeperperson', 'inflationCategory']].dropna()

print('means for incomeperperson by inflationCategory')
mean1 = dataSub1.groupby('inflationCategory').mean()
print(mean1)

print('standard deviations for incomeperperson by inflationCategory')
sd1 = dataSub1.groupby('inflationCategory').std()
print(sd1)

# running a post hoc test for the levels of the categorical variable
# using Tukey HSD
mc1 = multi.MultiComparison(dataSub1['incomeperperson'], dataSub1['inflationCategory'])
res1 = mc1.tukeyhsd()
print(res1.summary())

 

 

CODE OUTPUT:

<<<<<<<<<<<<<
CODE OUTPUT BEGIN
>>>>>>>>>>>>>>>>>>>

describe inflation Category
count           51
unique           4
top       (-4, 32]
freq            37
Name: inflationCategory, dtype: object
inflation category
(-4, 32]     37
(32, 64]      9
(64, 96]      4
(96, 128]     1
dtype: int64
Incomeperperson - 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana.

incomeperperson

1628    1

1710    1

1740    1

1766    1

1778    1

1813    1

1865    1

1909    1

1920    1

1960    1

1965    1

1985    1

2024    1

2036    1

2066    1

2072    1

2085    1

2090    1

2110    1

2130    1

2148    1

2177    1

2190    1

2199    1

2208    1

2226    1

2230    1

2240    1

2244    1

2267    1

2273    1

2290    1

2306    1

2316    1

2322    1

2334    1

2350    1

2396    1

2409    1

2455    1

2458    1

2479    1

2527    1

2558    1

2559    1

2652    1

2751    1

2907    1

2934    1

3091    1

3446    1

dtype: int64

                            OLS Regression Results
==============================================================================
Dep. Variable:        incomeperperson   R-squared:                       0.195
Model:                            OLS   Adj. R-squared:                  0.144
Method:                 Least Squares   F-statistic:                     3.799
Date:                Mon, 04 Jan 2016   Prob (F-statistic):             0.0161
Time:                        04:26:43   Log-Likelihood:                -366.21
No. Observations:                  51   AIC:                             740.4
Df Residuals:                      47   BIC:                             748.1
Df Model:                           3
Covariance Type:            nonrobust
=====================================================================================================
                                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------------------
Intercept                          2329.0541     54.435     42.786      0.000      2219.545  2438.563
C(inflationCategory)[T.(32, 64]]   -343.7207    123.065     -2.793      0.008      -591.296   -96.146
C(inflationCategory)[T.(64, 96]]    -98.3041    174.276     -0.564      0.575      -448.903   252.295
C(inflationCategory)[T.(96, 128]]  -701.0541    335.559     -2.089      0.042     -1376.111   -25.997
==============================================================================
Omnibus:                       14.413   Durbin-Watson:                   0.381
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               17.318
Skew:                           1.048   Prob(JB):                     0.000174
Kurtosis:                       4.938   Cond. No.                         7.40
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

means for incomeperperson by inflationCategory
                   incomeperperson
inflationCategory
(-4, 32]               2329.054054
(32, 64]               1985.333333
(64, 96]               2230.750000
(96, 128]              1628.000000
standard deviations for incomeperperson by inflationCategory
                   incomeperperson
inflationCategory
(-4, 32]                355.947948
(32, 64]                199.916858
(64, 96]                301.121653
(96, 128]                      NaN

Multiple Comparison of Means - Tukey HSD,FWER=0.05
==========================================================
 group1    group2    meandiff     lower      upper  reject
----------------------------------------------------------
(-4, 32]  (32, 64]  -343.7207  -671.5121   -15.9293  True
(-4, 32]  (64, 96]   -98.3041  -562.5005   365.8924  False
(-4, 32] (96, 128]  -701.0541 -1594.8362   192.7281  False
(32, 64]  (64, 96]   245.4167  -284.5654   775.3987  False
(32, 64] (96, 128]  -357.3333 -1286.9833   572.3167  False
(64, 96] (96, 128]  -602.75   -1588.7927   383.2927  False
----------------------------------------------------------

 

<<<<<<<<<<<<<
CODE OUTPUT ENDED
>>>>>>>>>>>>>>>>>>>

 

DRAWING CONCLUSION (SUMMARY):

 

My categorical variable (inflationCategory) has more than 2 levels or groups. It has 4 levels or groups.

These are the inflation rate ranges (in percent) that the groups fall within. Each range excludes its lower bound and includes its upper bound:

(-4, 32]

(32, 64]

(64, 96]

(96, 128]

 

Model Interpretation for ANOVA:

From the output of the code, the p-value is 0.0161, which is below the conventional significance level of 0.05 (5%), and the F-statistic is 3.799.
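As a cross-check, the p-value can be recovered from the F-statistic and the two degrees of freedom shown in the OLS summary (Df Model = 3, Df Residuals = 47). The sketch below assumes SciPy is installed and uses the survival function of the F distribution. The 0.05 threshold is the significance level used in this post.

```python
from scipy import stats

f_statistic = 3.799   # F-statistic from the OLS summary
df_model = 3          # Df Model (number of categories minus 1)
df_residuals = 47     # Df Residuals (observations minus number of categories)
alpha = 0.05          # significance level

# Survival function: probability of an F value at least this large under Ho.
p_value = stats.f.sf(f_statistic, df_model, df_residuals)
reject_null = p_value < alpha
print(round(p_value, 4), reject_null)
```

Because the F-statistic is taken from the rounded summary output, the result should agree with the reported 0.0161 only up to rounding.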

To interpret this finding fully, I examined the mean incomeperperson of Ghana for the years falling in each inflation category.

From the mean values, it can be seen that mean incomeperperson differs across the inflation categories. The means are 2329.054054, 1985.333333, 2230.750000 and 1628.000000 for the categories (-4, 32], (32, 64], (64, 96] and (96, 128] respectively, that is, from the lowest inflation category to the highest.
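One detail in the output tables is worth noting: the (96, 128] category contains a single observation, so its standard deviation is reported as NaN. The sketch below reproduces this with pandas, using the one income value (1628) that the means table shows for that category.

```python
import math

import pandas

# The single incomeperperson value in the (96, 128] inflation category.
single_group = pandas.Series([1628.0])

# pandas computes the sample standard deviation (dividing by n - 1),
# which is undefined for n = 1 and is therefore returned as NaN.
sample_std = single_group.std()
print(sample_std)
```

Any comparison involving this category therefore rests on one data point and should be read with caution.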

 

This means I can reject the Null Hypothesis (Ho) that there is no association between Inflation and Income Per Person of the Ghanaian population.

I therefore accept the Alternate Hypothesis (Ha) that there is an association between Inflation and Income Per Person of the Ghanaian population.

 

Model Interpretation for post hoc ANOVA results

However, since the inflation rate has more than two categories, the ANOVA result alone does not show which categories differ from each other. Comparing every pair of categories separately would inflate the chance of a Type I error, so a post hoc test is conducted using the Tukey HSD test.
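A short calculation shows why the Type I error rate matters here. The sketch assumes, for simplicity, that the pairwise tests are independent, which is only an approximation. It counts the pairwise comparisons among 4 categories and the chance of at least one false rejection if each were tested separately at 0.05.

```python
import math

groups = 4     # number of inflation categories
alpha = 0.05   # significance level of each individual test

# Number of distinct pairs of categories.
comparisons = math.comb(groups, 2)

# Chance of at least one false rejection across all pairs,
# under the simplifying assumption of independent tests.
familywise_error = 1 - (1 - alpha) ** comparisons
print(comparisons, round(familywise_error, 4))
```

Under that assumption the family-wise error rate is roughly 26% instead of 5%, which is what the Tukey HSD procedure (FWER=0.05 in the output) is designed to control.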

From this, it can be seen that the Null Hypothesis was rejected in only one of the six pairwise comparisons: mean incomeperperson in the (-4, 32] inflation category differs significantly from that in the (32, 64] category (mean difference -343.7207). None of the other five comparisons was significant.

This means the association between Inflation and Income Per Person of the Ghanaian population, which supports the Alternate Hypothesis, comes mainly from the difference between these two inflation categories.
