Chosen Dataset
I will be working with data from the Gapminder dataset.
This is the same dataset I worked with in the Data Management and Visualization course assignments. As discussed in those assignments, I have chosen to focus on one country, Ghana.
I will therefore be particularly interested in the following variables for Ghana:
a. incomeperperson
b. lifeexpectancy
c. literacyrate
It is worth pointing out that the gapminder.csv file provided for the assignment covers all the countries in the world rather than focusing on my country of interest, Ghana. It contains only one value (a single year) per country for each of these core variables.
As a result, when I run my Python program on Ghana alone there is only a single value for each variable, with no yearly values from which to build a meaningful frequency distribution specific to Ghana. The only alternative would be to compare Ghana with all the other countries, which is NOT the focus of my research work.
To achieve this focus on Ghana alone, I will fetch the data from the http://www.gapminder.org/ website, specifically its data section, which can be found here: http://www.gapminder.org/data/.
I will therefore compile a new csv data file focused on Ghana, which will give me all the variables I need for my analysis. In a nutshell, this new file lets me load the relevant variables and columns in my Python program and gives me the yearly values I will need for my research work going forward.
I will call the new csv data file for the assignment: gapminder_ghana_updated.csv
This is the Gapminder csv data file I will be loading into my Python program.
The gapminder_ghana_updated.csv dataset for this project can be viewed and downloaded here:
https://drive.google.com/file/d/0B2KfPRxy4ootQnRzVUZQQXdFX1U/view?usp=sharing
see screenshot here for guide (http://prntscr.com/9gctxn)
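One way the Ghana-only file described above could be compiled is sketched below. It is only an illustration under stated assumptions: it assumes each downloaded Gapminder indicator arrives as a wide table with one row per country and one column per year, the helper name ghana_long is hypothetical, and the numbers are placeholders rather than real Gapminder values.

```python
import pandas


def ghana_long(wide, value_name, country="Ghana"):
    """Turn one wide indicator table into year/value rows for a single country.

    Assumes a 'country' column plus one column per year (an assumption about
    the download layout, not something taken from the assignment files).
    """
    row = wide[wide["country"] == country]
    long = row.melt(id_vars="country", var_name="year", value_name=value_name)
    long["year"] = long["year"].astype(int)
    return long[["year", value_name]]


# Placeholder tables standing in for two downloaded indicators.
income = pandas.DataFrame(
    {"country": ["Ghana", "Togo"], "1990": [1900, 1000], "1991": [1950, 1010]}
)
inflation = pandas.DataFrame(
    {"country": ["Ghana", "Togo"], "1990": [31.2, 2.0], "1991": [20.1, 3.0]}
)

# One row per year, one column per variable, Ghana only.
ghana = ghana_long(income, "incomeperperson").merge(
    ghana_long(inflation, "Inflation"), on="year"
)
print(ghana)
# ghana.to_csv("gapminder_ghana_updated.csv", index=False)  # write the file when ready
```

The same helper could be applied to lifeexpectancy and literacyrate and merged on year in the same way.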
Data Variables
All the data variables I worked with in the Gapminder dataset are quantitative. However, as stated in the requirements for the Running An Analysis Of Variance assignment, my explanatory variable needs to be categorical.
I have therefore added a 4th variable named Inflation, which I will group into categories to obtain a categorical variable for the Analysis Of Variance test.
Inflation, GDP deflator (annual %):
According to the Gapminder codebook, inflation, as measured by the annual growth rate of the GDP implicit deflator, shows the rate of price change in the economy as a whole. The GDP implicit deflator is the ratio of GDP in current local currency to GDP in constant local currency. Source: World Bank national accounts data, and OECD National Accounts data files.
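The definition above can be illustrated with a small worked example. This is a sketch only: the function names are hypothetical and the GDP figures are placeholders, not Ghanaian data.

```python
def gdp_deflator(gdp_current, gdp_constant):
    """Ratio of GDP in current local currency to GDP in constant local currency."""
    return gdp_current / gdp_constant


def deflator_inflation(deflator_previous, deflator_current):
    """Annual growth rate of the deflator, in percent."""
    return (deflator_current / deflator_previous - 1) * 100


# Placeholder figures for two consecutive years.
deflator_year1 = gdp_deflator(110.0, 100.0)
deflator_year2 = gdp_deflator(132.0, 105.0)

inflation = deflator_inflation(deflator_year1, deflator_year2)
print(round(inflation, 2))  # annual inflation in percent for year 2
```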
Hence the 4 variables I will be working with on this and subsequent assignments are:
a. incomeperperson
b. lifeexpectancy
c. literacyrate
d. Inflation
I have updated my Personal Codebook to include this 4th variable. The updated Personal Codebook can be found at this link:
https://docs.google.com/document/d/177YfOjdk4oekFu20OLt4fgmu-n7cgRULAJ_Kd9KdYaM/edit?usp=sharing
The Research Question
For the purposes of this assignment, Running An Analysis Of Variance, I will slightly modify the research question I used in the previous course.
The question I will be looking at in this assignment is: Is there an association between Inflation and Income Per Person of the Ghanaian population?
Hypothesis Testing
The Null and Alternate Hypotheses:
From the above research question, the Null Hypothesis (Ho) is that there is no association between Inflation and Income Per Person of the Ghanaian population, that is, mean income per person is the same in every inflation category.
The Alternate Hypothesis (Ha) is that there is an association between Inflation and Income Per Person of the Ghanaian population, that is, mean income per person differs for at least one inflation category.
Sample:
The sample is the data from the Gapminder dataset with a focus on Ghana.
Assessing the evidence:
This is done by running an analysis of variance (ANOVA) test on the hypotheses.
I run the test using my Python program.
MY PYTHON PROGRAM CODE:
# -*- coding: utf-8 -*-
"""
Created on Mon Jan 4 00:59:30 2016

@author: Bernard
"""
import numpy
import pandas
# to be able to get the p value and conduct the ANOVA F test we import these libraries
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# load the gapminder_ghana_updated dataset csv into the program
data = pandas.read_csv('gapminder_ghana_updated.csv', low_memory=False)

# converting data to numeric; pandas.to_numeric replaces convert_objects,
# which has been removed from recent pandas versions
data["incomeperperson"] = pandas.to_numeric(data["incomeperperson"], errors="coerce")
data["lifeexpectancy"] = pandas.to_numeric(data["lifeexpectancy"], errors="coerce")
data["literacyrate"] = pandas.to_numeric(data["literacyrate"], errors="coerce")
data["Inflation"] = pandas.to_numeric(data["Inflation"], errors="coerce")

# create a variable for inflationCategory
data["inflationCategory"] = data["Inflation"]

# categorical groupings for inflation. This is to get one categorical variable
# for the ANOVA test
data["inflationCategory"] = pandas.cut(data.inflationCategory, [-4, 32, 64, 96, 128])

# including only data relevant for our testing by dropping irrelevant data
dataSub = data[["incomeperperson", "inflationCategory"]].dropna()

# change format from numeric to categorical
dataSub["inflationCategory"] = dataSub["inflationCategory"].astype("category")

# describe inflation category
print("describe inflation Category")
desc1 = dataSub["inflationCategory"].describe()
print(desc1)

# inflationCategory count
print("inflation category")
c1 = dataSub["inflationCategory"].value_counts(sort=False, dropna=True)
print(c1)

# print size of incomeperperson
ct1 = dataSub.groupby('incomeperperson').size()
print("Incomeperperson - 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana.")
print(ct1)

# using the ordinary least squares (ols) function for calculating the
# F-statistic and associated p value
model1 = smf.ols(formula='incomeperperson ~ C(inflationCategory)', data=dataSub)
results1 = model1.fit()
print(results1.summary())

# examining the means of incomeperperson by inflationCategory, hence I will be
# looking at only the variables concerned
dataSub1 = data[['incomeperperson', 'inflationCategory']].dropna()

print('means for incomeperperson by inflationCategory')
mean1 = dataSub1.groupby('inflationCategory').mean()
print(mean1)

print('standard deviations for incomeperperson by inflationCategory')
sd1 = dataSub1.groupby('inflationCategory').std()
print(sd1)

# running a post hoc test for the levels of the categorical variable using Tukey HSD
mc1 = multi.MultiComparison(dataSub1['incomeperperson'], dataSub1['inflationCategory'])
res1 = mc1.tukeyhsd()
print(res1.summary())
CODE OUTPUT:
<<<<<<<<<<<<<
CODE OUTPUT BEGIN
>>>>>>>>>>>>>>>>>>>
describe inflation Category
count           51
unique           4
top       (-4, 32]
freq            37
Name: inflationCategory, dtype: object
inflation category
(-4, 32] 37
(32, 64] 9
(64, 96] 4
(96, 128] 1
dtype: int64
Incomeperperson - 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana.
incomeperperson
1628 1
1710 1
1740 1
1766 1
1778 1
1813 1
1865 1
1909 1
1920 1
1960 1
1965 1
1985 1
2024 1
2036 1
2066 1
2072 1
2085 1
2090 1
2110 1
2130 1
2148 1
2177 1
2190 1
2199 1
2208 1
2226 1
2230 1
2240 1
2244 1
2267 1
2273 1
2290 1
2306 1
2316 1
2322 1
2334 1
2350 1
2396 1
2409 1
2455 1
2458 1
2479 1
2527 1
2558 1
2559 1
2652 1
2751 1
2907 1
2934 1
3091 1
3446 1
dtype: int64
                            OLS Regression Results
==============================================================================
Dep. Variable:        incomeperperson   R-squared:                       0.195
Model:                            OLS   Adj. R-squared:                  0.144
Method:                 Least Squares   F-statistic:                     3.799
Date:                Mon, 04 Jan 2016   Prob (F-statistic):             0.0161
Time:                        04:26:43   Log-Likelihood:                -366.21
No. Observations:                  51   AIC:                             740.4
Df Residuals:                      47   BIC:                             748.1
Df Model:                           3
Covariance Type:            nonrobust
=====================================================================================================
                                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------------------
Intercept                          2329.0541     54.435     42.786      0.000      2219.545  2438.563
C(inflationCategory)[T.(32, 64]]   -343.7207    123.065     -2.793      0.008      -591.296   -96.146
C(inflationCategory)[T.(64, 96]]    -98.3041    174.276     -0.564      0.575      -448.903   252.295
C(inflationCategory)[T.(96, 128]]  -701.0541    335.559     -2.089      0.042     -1376.111   -25.997
==============================================================================
Omnibus:                       14.413   Durbin-Watson:                   0.381
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               17.318
Skew:                           1.048   Prob(JB):                     0.000174
Kurtosis:                       4.938   Cond. No.                         7.40
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for incomeperperson by inflationCategory
                   incomeperperson
inflationCategory
(-4, 32]               2329.054054
(32, 64]               1985.333333
(64, 96]               2230.750000
(96, 128]              1628.000000
standard deviations for incomeperperson by inflationCategory
                   incomeperperson
inflationCategory
(-4, 32]                355.947948
(32, 64]                199.916858
(64, 96]                301.121653
(96, 128]                      NaN
Multiple Comparison of Means - Tukey HSD, FWER=0.05
==============================================================
 group1     group2     meandiff     lower       upper   reject
--------------------------------------------------------------
(-4, 32]   (32, 64]   -343.7207   -671.5121   -15.9293   True
(-4, 32]   (64, 96]    -98.3041   -562.5005   365.8924   False
(-4, 32]   (96, 128]  -701.0541  -1594.8362   192.7281   False
(32, 64]   (64, 96]    245.4167   -284.5654   775.3987   False
(32, 64]   (96, 128]  -357.3333  -1286.9833   572.3167   False
(64, 96]   (96, 128]  -602.75    -1588.7927   383.2927   False
--------------------------------------------------------------
<<<<<<<<<<<<<
CODE OUTPUT ENDED
>>>>>>>>>>>>>>>>>>>
DRAWING CONCLUSION (SUMMARY):
My categorical variable (inflationCategory) has more than 2 levels. It has 4 levels, or groups.
These are the inflation rate ranges (in percent) that the groups cover. Each range excludes its lower bound and includes its upper bound:
(-4, 32] % group
(32, 64] % group
(64, 96] % group
(96, 128] % group
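How the bin edges turn a numeric inflation rate into one of these four groups can be sketched as follows. The bin edges are the ones used in my program; the inflation values are placeholders chosen only to show which side of each boundary a value lands on.

```python
import pandas

bins = [-4, 32, 64, 96, 128]  # same edges as in the program above

# Placeholder inflation rates (percent), including one value exactly on an edge.
inflation = pandas.Series([10.0, 32.0, 32.5, 80.0, 120.0])

# pandas.cut uses right-closed intervals by default, so 32.0 falls in the
# first group and 32.5 in the second.
category = pandas.cut(inflation, bins)

codes = category.cat.codes.tolist()                   # position of each value's group
counts = category.value_counts(sort=False).tolist()   # observations per group
print(category)
print(counts)
```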
Model Interpretation for ANOVA:
From the output of the code, the F-statistic = 3.799 and the p-value = 0.0161, which is below the conventional significance level of 0.05 (or 5%).
To interpret this finding fully, I examined the mean income per person of the Ghanaian population within each inflation category.
The means differ across the inflation categories: they are 2329.054054, 1985.333333, 2230.750000 and 1628.000000 for the (-4, 32], (32, 64], (64, 96] and (96, 128] groups respectively. Income per person is highest in the lowest inflation group and lowest in the highest inflation group, although that last group contains only one year, which is why its standard deviation is NaN.
This means I can reject the Null Hypothesis (Ho) that there is no association between Inflation and Income Per Person of the Ghanaian population,
and accept the Alternate Hypothesis (Ha) that there is an association between Inflation and Income Per Person of the Ghanaian population.
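As a cross-check, the F-statistic can be recomputed by hand from the group counts, means and standard deviations printed in the code output. This is a sketch of the standard one-way ANOVA arithmetic, not a replacement for the statsmodels call; the within-group spread of the single-observation group is taken as zero because it contributes nothing to the within-group sum of squares.

```python
# (count, mean, standard deviation) per inflation category, from the code output.
groups = {
    "(-4, 32]": (37, 2329.054054, 355.947948),
    "(32, 64]": (9, 1985.333333, 199.916858),
    "(64, 96]": (4, 2230.750000, 301.121653),
    "(96, 128]": (1, 1628.000000, 0.0),  # one observation: no within-group spread
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Variation of the group means around the grand mean.
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
# Variation of the observations around their own group mean.
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

df_between = len(groups) - 1       # Df Model in the OLS output
df_within = n_total - len(groups)  # Df Residuals in the OLS output

f_statistic = (ss_between / df_between) / (ss_within / df_within)
r_squared = ss_between / (ss_between + ss_within)
print(round(f_statistic, 3), round(r_squared, 3))
```

Up to rounding, the two printed values should agree with the F-statistic and R-squared reported in the OLS Regression Results.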
Model Interpretation for post hoc ANOVA results
However, since the inflation variable has four categories, the ANOVA alone does not show which groups differ from each other. To compare every pair of inflation categories while controlling the Type I error rate across the comparisons, a post hoc test was conducted using the Tukey HSD test.
From this, it can be seen that the Null Hypothesis was rejected in only one of the six pairwise comparisons: (-4, 32] versus (32, 64], with a mean difference of -343.7207. The other five comparisons were not significant.
This means the association between Inflation and Income Per Person of the Ghanaian population found by the ANOVA comes mainly from the difference between the lowest inflation group and the (32, 64] group. The evidence supports the Alternate Hypothesis, but it does not show a difference between every pair of inflation categories.
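The reject column of the Tukey HSD table can be read directly from the confidence bounds: a pair of groups differs significantly only when the interval between lower and upper excludes zero. The short sketch below applies that rule to the bounds copied from the code output; the helper name excludes_zero is hypothetical.

```python
# (group1, group2, lower, upper) copied from the Tukey HSD table in the code output.
comparisons = [
    ("(-4, 32]", "(32, 64]", -671.5121, -15.9293),
    ("(-4, 32]", "(64, 96]", -562.5005, 365.8924),
    ("(-4, 32]", "(96, 128]", -1594.8362, 192.7281),
    ("(32, 64]", "(64, 96]", -284.5654, 775.3987),
    ("(32, 64]", "(96, 128]", -1286.9833, 572.3167),
    ("(64, 96]", "(96, 128]", -1588.7927, 383.2927),
]


def excludes_zero(lower, upper):
    """True when the confidence interval lies entirely above or below zero."""
    return lower > 0 or upper < 0


significant = [(g1, g2) for g1, g2, lo, hi in comparisons if excludes_zero(lo, hi)]
print(len(significant), "of", len(comparisons), "pairs differ:", significant)
```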