Testing a Basic Linear Regression Model
Background
My research focuses on Ghana, one of the countries in the Gapminder dataset, as discussed throughout this course.
1) Program Code and Output
# -*- coding: utf-8 -*-
"""
Created on Sat Feb 13 12:28:07 2016

@author: Bernard
"""
import pandas
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import seaborn

data = pandas.read_csv('gapminder_ghana_updated.csv')

# convert both variables to numeric; entries that cannot be parsed become NaN
# (to_numeric replaces convert_objects, which current pandas releases no longer provide)
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')

# listwise deletion of missing values (copy so new columns can be added safely)
dataSub = data[['incomeperperson', 'lifeexpectancy']].dropna().copy()

# scatterplot with fitted regression line
scat1 = seaborn.regplot(x="incomeperperson", y="lifeexpectancy", scatter=True, data=dataSub)
plt.xlabel('Income Per Person')
plt.ylabel('Life Expectancy')
plt.title('Scatterplot for the Association Between Income Per Person and Life Expectancy of the People Of Ghana')
print(scat1)

# center quantitative explanatory variable for regression analysis
dataSub['incomeperperson_c'] = (dataSub['incomeperperson'] - dataSub['incomeperperson'].mean())

print("Describe the centered quantitative Explanatory variable")
ds0 = dataSub["incomeperperson_c"].describe()
print(ds0)

# group means of the other columns for each distinct centered value
print("Mean for centered quantitative explanatory variable: incomeperperson_c")
ds1 = dataSub.groupby('incomeperperson_c').mean()
print(ds1)

# group standard deviations (NaN where a centered value occurs only once)
print("Standard deviation for centered quantitative explanatory variable:incomeperperson_c")
sd1 = dataSub.groupby('incomeperperson_c').std()
print(sd1)

print("Mean for quantitative explanatory variable: incomeperperson")
ds2 = dataSub.groupby('incomeperperson').mean()
print(ds2)

print("Checking values in incomeperperson_c")
print(dataSub["incomeperperson_c"])

# value counts; dropna=False also counts missing values
print("Counts for incomeperperson_c")
inc_c_Count = dataSub["incomeperperson_c"].value_counts(sort=False, dropna=False)
print(inc_c_Count)

print("OLS regression model for the association between Income Per Person and Life Expectancy of the People of Ghana")
reg1 = smf.ols('lifeexpectancy ~ incomeperperson_c', data=dataSub).fit()
print(reg1.summary())
#####################
OUTPUT BEGIN
#####################
Axes(0.125,0.125;0.775x0.775)
Describe the centered quantitative Explanatory variable
count 216.00
mean 0.00
std 773.73
min -692.88
25% -663.88
50% -486.38
75% 634.37
max 2710.12
Name: incomeperperson_c, dtype: float64
Mean for centered quantitative explanatory variable: incomeperperson_c
incomeperperson lifeexpectancy
incomeperperson_c
-692.88 696 28.00
-691.88 697 28.00
-690.88 698 28.00
-689.88 699 28.00
-688.88 700 28.00
-687.88 701 28.00
-686.88 702 28.00
-685.88 703 28.00
-684.88 704 28.00
-683.88 705 28.00
-682.88 706 28.00
-681.88 707 28.00
-680.88 708 28.00
-679.88 709 28.00
-678.88 710 28.00
-677.88 711 28.00
-676.88 712 28.00
-675.88 713 28.00
-674.88 714 28.00
-673.88 715 28.00
-672.88 716 28.00
-671.88 717 28.00
-670.88 718 28.00
-669.88 719 28.00
-668.88 720 28.00
-667.88 721 28.00
-666.88 722 28.00
-665.88 723 28.00
-664.88 724 28.00
-663.88 725 28.00
… …
841.12 2230 53.64
851.12 2240 54.90
855.12 2244 60.30
857.12 2246 50.61
878.12 2267 51.17
884.12 2273 60.20
901.12 2290 53.19
917.12 2306 60.30
927.12 2316 51.70
933.12 2322 52.72
945.12 2334 52.22
961.12 2350 60.50
1007.12 2396 55.80
1020.12 2409 60.80
1066.12 2455 55.30
1069.12 2458 56.10
1090.12 2479 61.10
1138.12 2527 55.50
1169.12 2558 61.50
1170.12 2559 56.40
1263.12 2652 61.80
1362.12 2751 62.30
1518.12 2907 62.70
1545.12 2934 63.10
1702.12 3091 63.50
2057.12 3446 64.00
2296.12 3685 64.60
2484.12 3873 64.90
2564.12 3953 65.20
2710.12 4099 65.50
[178 rows x 2 columns]
Standard deviation for centered quantitative explanatory variable:incomeperperson_c
incomeperperson lifeexpectancy
incomeperperson_c
-692.88 0.00 0.00
-691.88 nan nan
-690.88 0.00 0.00
-689.88 0.00 0.00
-688.88 0.00 0.00
-687.88 0.00 0.00
-686.88 nan nan
-685.88 0.00 0.00
-684.88 0.00 0.00
-683.88 0.00 0.00
-682.88 nan nan
-681.88 0.00 0.00
-680.88 0.00 0.00
-679.88 0.00 0.00
-678.88 0.00 0.00
-677.88 nan nan
-676.88 0.00 0.00
-675.88 0.00 0.00
-674.88 0.00 0.00
-673.88 nan nan
-672.88 0.00 0.00
-671.88 0.00 0.00
-670.88 0.00 0.00
-669.88 nan nan
-668.88 0.00 0.00
-667.88 0.00 0.00
-666.88 0.00 0.00
-665.88 nan nan
-664.88 0.00 0.00
-663.88 0.00 0.00
… …
841.12 nan nan
851.12 nan nan
855.12 nan nan
857.12 nan nan
878.12 nan nan
884.12 nan nan
901.12 nan nan
917.12 nan nan
927.12 nan nan
933.12 nan nan
945.12 nan nan
961.12 nan nan
1007.12 nan nan
1020.12 nan nan
1066.12 nan nan
1069.12 nan nan
1090.12 nan nan
1138.12 nan nan
1169.12 nan nan
1170.12 nan nan
1263.12 nan nan
1362.12 nan nan
1518.12 nan nan
1545.12 nan nan
1702.12 nan nan
2057.12 nan nan
2296.12 nan nan
2484.12 nan nan
2564.12 nan nan
2710.12 nan nan
[178 rows x 2 columns]
Mean for quantitative explanatory variable: incomeperperson
lifeexpectancy incomeperperson_c
incomeperperson
696 28.00 -692.88
697 28.00 -691.88
698 28.00 -690.88
699 28.00 -689.88
700 28.00 -688.88
701 28.00 -687.88
702 28.00 -686.88
703 28.00 -685.88
704 28.00 -684.88
705 28.00 -683.88
706 28.00 -682.88
707 28.00 -681.88
708 28.00 -680.88
709 28.00 -679.88
710 28.00 -678.88
711 28.00 -677.88
712 28.00 -676.88
713 28.00 -675.88
714 28.00 -674.88
715 28.00 -673.88
716 28.00 -672.88
717 28.00 -671.88
718 28.00 -670.88
719 28.00 -669.88
720 28.00 -668.88
721 28.00 -667.88
722 28.00 -666.88
723 28.00 -665.88
724 28.00 -664.88
725 28.00 -663.88
… …
2230 53.64 841.12
2240 54.90 851.12
2244 60.30 855.12
2246 50.61 857.12
2267 51.17 878.12
2273 60.20 884.12
2290 53.19 901.12
2306 60.30 917.12
2316 51.70 927.12
2322 52.72 933.12
2334 52.22 945.12
2350 60.50 961.12
2396 55.80 1007.12
2409 60.80 1020.12
2455 55.30 1066.12
2458 56.10 1069.12
2479 61.10 1090.12
2527 55.50 1138.12
2558 61.50 1169.12
2559 56.40 1170.12
2652 61.80 1263.12
2751 62.30 1362.12
2907 62.70 1518.12
2934 63.10 1545.12
3091 63.50 1702.12
3446 64.00 2057.12
3685 64.60 2296.12
3873 64.90 2484.12
3953 65.20 2564.12
4099 65.50 2710.12
[178 rows x 2 columns]
Checking values in incomeperperson_c
0 -692.88
1 -692.88
2 -691.88
3 -690.88
4 -690.88
5 -689.88
6 -689.88
7 -688.88
8 -688.88
9 -687.88
10 -687.88
11 -686.88
12 -685.88
13 -685.88
14 -684.88
15 -684.88
16 -683.88
17 -683.88
18 -682.88
19 -681.88
20 -681.88
21 -680.88
22 -680.88
23 -679.88
24 -679.88
25 -678.88
26 -678.88
27 -677.88
28 -676.88
29 -676.88
186 389.12
187 424.12
188 476.12
189 520.12
190 531.12
191 576.12
192 596.12
193 635.12
194 647.12
195 677.12
196 721.12
197 759.12
198 810.12
199 855.12
200 884.12
201 917.12
202 961.12
203 1020.12
204 1090.12
205 1169.12
206 1263.12
207 1362.12
208 1518.12
209 1545.12
210 1702.12
211 2057.12
212 2296.12
213 2484.12
214 2564.12
215 2710.12
Name: incomeperperson_c, dtype: float64
Counts for incomeperperson_c
513.12 1
2564.12 1
517.12 1
-250.88 1
263.12 1
520.12 1
1545.12 1
429.12 1
781.12 1
626.12 1
-239.88 1
563.12 1
531.12 1
-492.88 1
374.12 1
1263.12 1
634.12 1
1069.12 1
27.12 1
703.12 1
-479.88 1
677.12 1
547.12 1
351.12 1
293.12 1
927.12 1
819.12 1
46.12 1
303.12 1
48.12 1
..
1362.12 1
810.12 1
464.12 1
612.12 1
1066.12 1
788.12 1
472.12 1
1020.12 1
571.12 1
476.12 1
759.12 1
884.12 1
2057.12 1
741.12 1
696.12 1
492.12 1
1518.12 1
1007.12 1
721.12 1
424.12 1
755.12 1
756.12 1
-436.88 1
758.12 1
-521.88 1
2296.12 1
801.12 1
857.12 1
855.12 1
-257.88 1
Name: incomeperperson_c, dtype: int64
OLS regression model for the association between Income Per Person and Life Expectancy of the People of Ghana
                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.718
Model:                            OLS   Adj. R-squared:                  0.717
Method:                 Least Squares   F-statistic:                     544.9
Date:                Sun, 14 Feb 2016   Prob (F-statistic):           9.62e-61
Time:                        02:50:12   Log-Likelihood:                -732.24
No. Observations:                 216   AIC:                             1468.
Df Residuals:                     214   BIC:                             1475.
Df Model:                           1
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            37.7407      0.491     76.914      0.000        36.774    38.708
incomeperperson_c     0.0148      0.001     23.343      0.000         0.014     0.016
==============================================================================
Omnibus:                       16.740   Durbin-Watson:                   0.058
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               26.382
Skew:                          -0.461   Prob(JB):                     1.87e-06
Kurtosis:                       4.443   Cond. No.                         772.
==============================================================================
######################
OUTPUT END
######################
2) Reporting the mean for my centered explanatory variable: incomeperperson_c.
As per the assignment instructions, I centered my quantitative explanatory variable by subtracting its mean, so that the centered variable has a mean of 0 (or very close to 0), and then calculated the mean to check the centering.
My explanatory variable is incomeperperson, and this is the variable that has been centered. The centered version is stored in incomeperperson_c.
The describe() function shows that the mean of the centered quantitative explanatory variable is 0. This can be seen in the program output below:
Describe the centered quantitative Explanatory variable
count 216.00
mean 0.00
std 773.73
min -692.88
25% -663.88
50% -486.38
75% 634.37
max 2710.12
Name: incomeperperson_c, dtype: float64
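Furthermore, grouping by the centered variable with the groupby() function and taking the mean gives the output below. Each centered value differs from its original incomeperperson value by the same amount, 1388.88, which is the mean that was subtracted (for example, 696 - 1388.88 = -692.88):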
Mean for centered quantitative explanatory variable: incomeperperson_c
incomeperperson lifeexpectancy
incomeperperson_c
-692.88 696 28.00
-691.88 697 28.00
-690.88 698 28.00
-689.88 699 28.00
-688.88 700 28.00
-687.88 701 28.00
-686.88 702 28.00
-685.88 703 28.00
-684.88 704 28.00
-683.88 705 28.00
-682.88 706 28.00
-681.88 707 28.00
-680.88 708 28.00
-679.88 709 28.00
-678.88 710 28.00
-677.88 711 28.00
-676.88 712 28.00
-675.88 713 28.00
-674.88 714 28.00
-673.88 715 28.00
-672.88 716 28.00
-671.88 717 28.00
-670.88 718 28.00
-669.88 719 28.00
-668.88 720 28.00
-667.88 721 28.00
-666.88 722 28.00
-665.88 723 28.00
-664.88 724 28.00
-663.88 725 28.00
… …
841.12 2230 53.64
851.12 2240 54.90
855.12 2244 60.30
857.12 2246 50.61
878.12 2267 51.17
884.12 2273 60.20
901.12 2290 53.19
917.12 2306 60.30
927.12 2316 51.70
933.12 2322 52.72
945.12 2334 52.22
961.12 2350 60.50
1007.12 2396 55.80
1020.12 2409 60.80
1066.12 2455 55.30
1069.12 2458 56.10
1090.12 2479 61.10
1138.12 2527 55.50
1169.12 2558 61.50
1170.12 2559 56.40
1263.12 2652 61.80
1362.12 2751 62.30
1518.12 2907 62.70
1545.12 2934 63.10
1702.12 3091 63.50
2057.12 3446 64.00
2296.12 3685 64.60
2484.12 3873 64.90
2564.12 3953 65.20
2710.12 4099 65.50
[178 rows x 2 columns]
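A more direct check of the centering is to take the mean of the centered column itself rather than grouping by it. The following is a minimal sketch in plain Python on a handful of made-up income values (these are not the Ghana data, and the names incomes and centered are illustrative only):

```python
from statistics import mean

# Made-up income values for illustration; not taken from the Ghana dataset.
incomes = [696.0, 725.0, 902.5, 2230.0, 4099.0]

# Center by subtracting the sample mean from every value.
income_mean = mean(incomes)
centered = [x - income_mean for x in incomes]

# The mean of a centered variable should be 0 up to floating-point rounding.
centered_mean = mean(centered)
print("mean before centering:", income_mean)
print("mean after centering is within rounding of 0:", abs(centered_mean) < 1e-9)
```

In the program above, the equivalent one-line check would be dataSub['incomeperperson_c'].mean().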
3) Results of my linear regression analysis
The linear regression model gives an intercept of 37.7407 and a coefficient (the slope of the line) of 0.0148.
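Because the explanatory variable is centered, the intercept is the predicted life expectancy at the mean income, and in a least-squares fit with a single predictor it equals the mean of the response. The sketch below illustrates this on made-up numbers (not the Ghana data) with a hypothetical helper, fit_line, that applies the standard closed-form least-squares formulas:

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Made-up values for illustration; not the Ghana data.
xs = [700.0, 900.0, 1500.0, 2300.0, 4100.0]
ys = [28.0, 35.0, 45.0, 55.0, 65.0]

# Center the explanatory variable by subtracting its mean.
x_bar = mean(xs)
xs_c = [x - x_bar for x in xs]

slope_raw, intercept_raw = fit_line(xs, ys)
slope_c, intercept_c = fit_line(xs_c, ys)

print("slope (raw, centered):", slope_raw, slope_c)
print("intercept (raw, centered):", intercept_raw, intercept_c)
print("mean of y:", mean(ys))
```

Centering leaves the slope unchanged and only moves the intercept, which is why the slope can still be read per unit of the original income variable.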
The equation of the line has the form Y = mx + b
where Y = my response variable (lifeexpectancy)
m = coefficient (the slope of the line)
x = explanatory variable (incomeperperson_c)
b = intercept
Hence my equation of the line of best fit is:
lifeexpectancy = 0.0148 * incomeperperson_c + 37.7407
With the “m” and “b” values from the ols() function, this equation indicates that a one-unit increase or decrease in the explanatory variable, incomeperperson_c, is associated with a 0.0148 increase or decrease, respectively, in the response variable, lifeexpectancy.
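To make the interpretation concrete, the fitted equation can be evaluated for a few centered income values. This is only a sketch using the rounded coefficients from the summary table; the function name predict_life_expectancy is made up for illustration:

```python
# Rounded coefficients as reported in the OLS summary.
INTERCEPT = 37.7407
SLOPE = 0.0148

def predict_life_expectancy(income_centered):
    """Predicted life expectancy for a centered income value."""
    return INTERCEPT + SLOPE * income_centered

# At the mean income the centered value is 0, so the prediction is the intercept.
at_mean = predict_life_expectancy(0.0)

# 100 units above the mean income.
plus_100 = predict_life_expectancy(100.0)

print("predicted at mean income:", at_mean)
print("change for +100 income units:", round(plus_100 - at_mean, 4))
```

A difference of 100 income units therefore corresponds to a predicted difference of 0.0148 * 100 = 1.48 in life expectancy.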
The p-value for the explanatory variable, incomeperperson_c, is shown as P>|t| = 0.000, so it is reported as p < .0001.
This is reinforced by the overall model p-value, Prob (F-statistic) = 9.62e-61, which is far below the significance level of 0.05.
The F-statistic is also large, at 544.9. Hence I can reject the null hypothesis and accept the alternative hypothesis.
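The t-test on the slope and the overall F-test agree here because, with a single explanatory variable, the F-statistic is the square of the slope's t statistic, and R-squared can be recovered from F and the residual degrees of freedom. The short check below uses the rounded values from the summary table, so the match is only expected to within rounding:

```python
# Rounded values as reported in the OLS summary.
t_slope = 23.343
f_stat = 544.9
df_resid = 214
r_squared = 0.718

# With one explanatory variable, F equals the squared t statistic of the slope.
f_from_t = t_slope ** 2

# R-squared recovered from F and the residual degrees of freedom (df model = 1).
r2_from_f = f_stat / (f_stat + df_resid)

print("t squared:", round(f_from_t, 1), "reported F:", f_stat)
print("R-squared from F:", round(r2_from_f, 3), "reported R-squared:", r_squared)
```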
This indicates that income per person in Ghana is significantly associated with life expectancy.
The association is positive: higher income per person goes with higher life expectancy, and lower income with lower life expectancy. This can also be seen in the scatterplot from the program output, which shows a positively sloping line.