Background of the Dataset CSV file Used:
In the GapMinder Codebook the Unique Identifier = Country
Hence in this program, my Unique Identifier = Ghana
1. There are 3 chosen variables (columns) that are core to my chosen research question which is based on the country Ghana.
These are
a. incomeperperson
b. lifeexpectancy
c. literacyrate
The first 2 variables (incomeperperson and lifeexpectancy) already have values in the gapminder.csv data provided for the assignment. The problem is that the provided gapminder.csv covers all the countries in the world, without a focus on my country of interest, Ghana. This means that only one year's value for each of these core variables is entered for Ghana in the gapminder.csv data provided for the assignment.
The result is that when I run my Python program with just these single values, there are no other yearly values for each of these variables from which to build a meaningful frequency distribution specific to Ghana, unless I compare the values for Ghana to those of all the other countries, which is NOT the focus of my research work.
My research work is to find out the association between literacy rate and life
expectancy, and the association between literacy rate and income per person,
with a focus on the country Ghana ONLY.
To achieve this focus on Ghana alone, I will fetch the data from the http://www.gapminder.org/ website, specifically its data section, which can be found at http://www.gapminder.org/data/.
I will therefore need to compile a new CSV data file focused on Ghana, which will give me all the variables I need for my analysis. In a nutshell, this new CSV data file lets me load the relevant variables (columns) in my Python program and, more importantly, gives me the variables I will need for my research work going forward.
I will, therefore, call the new data csv file for the assignment: gapminder_ghana_updated.csv
This will be the Gapminder CSV data file I will be loading into my Python program.
It is also worth mentioning that the third variable, literacyrate, was not available in the gapminder.csv provided for the assignment. Since it is one of the core variables, I have added this variable and its associated values to my
new data file (gapminder_ghana_updated.csv)
to be able to make sense of the frequency distributions for my research topic.
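A minimal sketch of how such a Ghana-only file could be compiled is shown here. It assumes each indicator comes as a "wide" table (one row per country, one column per year); that layout, the helper name one_country_long, and the numbers in the toy tables are assumptions for illustration, not real Gapminder values and not the exact steps used to build gapminder_ghana_updated.csv.

```python
import pandas as pd

# Hypothetical stand-ins for two per-indicator tables in "wide" layout
# (one row per country, one column per year). The numbers are placeholders.
income_wide = pd.DataFrame(
    {"country": ["Ghana", "Kenya"], "2009": [100.0, 200.0], "2010": [110.0, 210.0]}
)
life_wide = pd.DataFrame(
    {"country": ["Ghana", "Kenya"], "2009": [60.0, 58.0], "2010": [61.0, 59.0]}
)


def one_country_long(wide, country, value_name):
    """Return one row per year for a single country (hypothetical helper)."""
    rows = wide[wide["country"] == country]
    long = rows.melt(id_vars="country", var_name="year", value_name=value_name)
    long["year"] = long["year"].astype(int)
    return long


income = one_country_long(income_wide, "Ghana", "incomeperperson")
life = one_country_long(life_wide, "Ghana", "lifeexpectancy")

# An outer merge keeps years that appear in only one indicator (they become NaN).
ghana = income.merge(life, on=["country", "year"], how="outer").sort_values("year")
print(ghana)

# Uncomment to write the combined file:
# ghana.to_csv("gapminder_ghana_updated.csv", index=False)
```

The outer merge is a deliberate choice here: a variable such as literacyrate that is recorded for only a few years would still line up by year, with the remaining years left as missing values.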
The gapminder_ghana_updated.csv dataset for this project can be viewed and downloaded here:
https://drive.google.com/file/d/0B2KfPRxy4ootbzl5N0g1dUtIVzA/view?pref=2&pli=1
See this screenshot for a guide (http://prntscr.com/9gctxn)
MY PYTHON PROGRAM CODE:
# -*- coding: utf-8 -*-
"""
Created on Sat Dec 19 11:24:10 2015

@author: Bernard
"""
# import statements
import pandas
import numpy

# load the gapminder_ghana_updated dataset csv into the program
data = pandas.read_csv('gapminder_ghana_updated.csv', low_memory=False)

# print number of observations (rows), which is the number of years
# this data has been looked at
print("number of observations(rows) which is the number of years this data has been looked at: ")
print(len(data))

# print number of variables (columns)
print("number of variables (columns) available in the dataset: ")
print(len(data.columns))

print("data index: ")
print(len(data.index))

# Converting data to numeric.
# convert_objects(convert_numeric=True) has been removed from pandas;
# pandas.to_numeric(..., errors="coerce") does the same job and turns
# unparseable entries into NaN.
data["incomeperperson"] = pandas.to_numeric(data["incomeperperson"], errors="coerce")
data["lifeexpectancy"] = pandas.to_numeric(data["lifeexpectancy"], errors="coerce")
data["literacyrate"] = pandas.to_numeric(data["literacyrate"], errors="coerce")

# displaying rows or observations in the DataFrame.
# inc_pp_count is the name that will hold the result from the incomeperperson count
# sort=False: I use the value False so that the data will not be re-sorted
# by frequency
print("counts for incomeperperson - 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana. ")
inc_pp_count = data["incomeperperson"].value_counts(sort=False)
# print the count of incomeperperson
print(inc_pp_count)

print("percentages for incomeperperson - 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana. ")
inc_pp_percent = data["incomeperperson"].value_counts(sort=False, normalize=True)
# print the proportions of incomeperperson
print(inc_pp_percent)

print("counts for lifeexpectancy- 2011 life expectancy at birth (years) of Ghana")
life_exp_count = data["lifeexpectancy"].value_counts(sort=False)
# print the count of lifeexpectancy
print(life_exp_count)

print("percentages for lifeexpectancy- 2011 life expectancy at birth (years) of Ghana ")
life_exp_percent = data["lifeexpectancy"].value_counts(sort=False, normalize=True)
# print the proportions of lifeexpectancy
print(life_exp_percent)

print("counts for literacyrate - 2010, Literacy rate, adult total (% of people ages 15 and above) of Ghana")
# dropna=False includes missing values (NaN) in the counts
lit_rate_count = data["literacyrate"].value_counts(sort=False, dropna=False)
# print the count of literacyrate
print(lit_rate_count)

print("percentages literacyrate - 2010, Literacy rate, adult total (% of people ages 15 and above) of Ghana ")
lit_rate_percent = data["literacyrate"].value_counts(sort=False, normalize=True)
# print the proportions of literacyrate
print(lit_rate_percent)
The Output Of My Python Program That Displays Three Of My Data Variables As Frequency Tables:
These variables are the
a. incomeperperson
b. lifeexpectancy
c. literacyrate
OUTPUT
<<<<<<BEGINNING OF OUTPUT >>>>>>>>
number of observations(rows) which is the number of years this data has been looked at:
216
number of variables (columns) available in the dataset:
5
data index:
216
counts for incomeperperson – 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana.
768 3
769 1
2306 1
771 1
773 1
1861 1
1287 1
1628 1
2240 1
1804 1
3873 1
2036 1
2322 1
3091 1
3685 1
789 1
1816 1
1305 1
1818 1
1778 1
4099 1
2751 1
1822 1
2079 1
1232 1
1570 1
2130 1
2072 1
2085 1
808 1
..
733 1
734 2
735 2
736 1
2273 1
738 2
739 1
741 1
742 2
743 1
744 1
745 1
746 1
747 2
749 1
2015 1
751 1
752 1
1906 1
754 1
756 2
757 1
759 1
761 1
762 1
763 1
764 1
2244 1
766 1
2559 1
Name: incomeperperson, dtype: int64
percentages for incomeperperson – 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana.
768 0.013889
769 0.004630
2306 0.004630
771 0.004630
773 0.004630
1861 0.004630
1287 0.004630
1628 0.004630
2240 0.004630
1804 0.004630
3873 0.004630
2036 0.004630
2322 0.004630
3091 0.004630
3685 0.004630
789 0.004630
1816 0.004630
1305 0.004630
1818 0.004630
1778 0.004630
4099 0.004630
2751 0.004630
1822 0.004630
2079 0.004630
1232 0.004630
1570 0.004630
2130 0.004630
2072 0.004630
2085 0.004630
808 0.004630
733 0.004630
734 0.009259
735 0.009259
736 0.004630
2273 0.004630
738 0.009259
739 0.004630
741 0.004630
742 0.009259
743 0.004630
744 0.004630
745 0.004630
746 0.004630
747 0.009259
749 0.004630
2015 0.004630
751 0.004630
752 0.004630
1906 0.004630
754 0.004630
756 0.009259
757 0.004630
759 0.004630
761 0.004630
762 0.004630
763 0.004630
764 0.004630
2244 0.004630
766 0.004630
2559 0.004630
Name: incomeperperson, dtype: float64
counts for lifeexpectancy- 2011 life expectancy at birth (years) of Ghana
29.884300 1
9.437626 1
60.300000 4
60.500000 2
58.100000 2
63.500000 1
31.816600 1
61.500000 1
28.000000 120
29.240200 1
30.528400 1
31.172500 1
32.460700 1
33.104800 1
34.393000 1
35.037100 1
36.325300 1
37.613500 1
38.257600 1
39.545800 1
40.189900 1
41.478100 1
42.122200 1
43.410400 1
44.054500 1
45.083820 1
33.748900 1
47.005480 1
48.247920 1
49.454360 1
56.700000 1
44.698600 1
61.800000 1
45.732040 1
46.372260 1
54.897780 1
57.400000 1
28.476880 1
47.630700 1
55.800000 1
48.856140 1
58.700000 1
62.700000 1
28.357660 1
56.400000 1
50.612800 1
51.704240 1
57.200000 1
59.700000 1
52.715680 1
60.200000 1
28.238440 1
53.640120 1
58.300000 1
54.490560 1
60.600000 2
55.500000 1
60.400000 1
59.400000 1
28.119220 1
Name: lifeexpectancy, dtype: int64
percentages for lifeexpectancy- 2011 life expectancy at birth (years) of Ghana
29.884300 0.004630
9.437626 0.004630
60.300000 0.018519
60.500000 0.009259
58.100000 0.009259
63.500000 0.004630
31.816600 0.004630
61.500000 0.004630
28.000000 0.555556
29.240200 0.004630
30.528400 0.004630
31.172500 0.004630
32.460700 0.004630
33.104800 0.004630
34.393000 0.004630
35.037100 0.004630
36.325300 0.004630
37.613500 0.004630
38.257600 0.004630
39.545800 0.004630
40.189900 0.004630
41.478100 0.004630
42.122200 0.004630
43.410400 0.004630
44.054500 0.004630
45.083820 0.004630
33.748900 0.004630
47.005480 0.004630
48.247920 0.004630
49.454360 0.004630
56.700000 0.004630
44.698600 0.004630
61.800000 0.004630
45.732040 0.004630
46.372260 0.004630
54.897780 0.004630
57.400000 0.004630
28.476880 0.004630
47.630700 0.004630
55.800000 0.004630
48.856140 0.004630
58.700000 0.004630
62.700000 0.004630
28.357660 0.004630
56.400000 0.004630
50.612800 0.004630
51.704240 0.004630
57.200000 0.004630
59.700000 0.004630
52.715680 0.004630
60.200000 0.004630
28.238440 0.004630
53.640120 0.004630
58.300000 0.004630
54.490560 0.004630
60.600000 0.009259
55.500000 0.004630
60.400000 0.004630
59.400000 0.004630
28.119220 0.004630
Name: lifeexpectancy, dtype: float64
counts for literacyrate – 2010, Literacy rate, adult total (% of people ages 15 and above) of Ghana
NaN 214
57.897473 1
71.497075 1
Name: literacyrate, dtype: int64
percentages literacyrate – 2010, Literacy rate, adult total (% of people ages 15 and above) of Ghana
57.897473 0.00463
71.497075 0.00463
Name: literacyrate, dtype: float64
<<<<<<END OF OUTPUT >>>>>>>>
VARIABLES AS FREQUENCY TABLES CAN BE FOUND AT THIS LINK:
https://drive.google.com/file/d/0B2KfPRxy4ootS0IxWHk0LVJZNXc/view?usp=sharing
SUMMARY:
This summary is about the 3 main variables that my program caters for.
These are the
a. incomeperperson
b. lifeexpectancy
c. literacyrate
The summary looks at the values the variables take, how often they take them, the presence of missing data, etc.
The data focus only on the country Ghana and cover a period of 216 years, from 1800 to 2015.
Incomeperperson variable
From the dataset, incomeperperson is the Gross Domestic Product per capita of Ghana in constant 2000 US$. From the frequency table, between 1800 and 2015 there were 3 years in which the incomeperperson of the people of Ghana was 768 USD.
These 3 occurrences represent a proportion of 0.013889 (about 1.39%) of the 216 yearly incomeperperson figures.
The incomeperperson values 734, 735, 738, 742, 747 and 756 USD each occurred twice over the period.
Each of these represents a proportion of 0.009259 (about 0.93%) of the 216 yearly incomeperperson figures.
The rest of the incomeperperson figures were distinct values that each occurred only once within the time frame.
Put together, these single occurrences form over 80% of the 216 yearly incomeperperson figures. This means that incomeperperson changed almost every year. That is where my research question comes in: to find out the association between income per person and literacy rate of the people of Ghana.
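The figures printed by value_counts(normalize=True) are proportions between 0 and 1, so they need to be multiplied by 100 to be read as percentages. As a small worked check (the helper name to_percent is only for illustration), the arithmetic for the counts quoted above is:

```python
TOTAL_YEARS = 216  # number of rows in the dataset (1800 to 2015)


def to_percent(count, total=TOTAL_YEARS):
    """Convert a frequency count into a percentage of all rows."""
    return count / total * 100


for label, count in [("768 USD (3 years)", 3),
                     ("a value seen in 2 years", 2),
                     ("a value seen in 1 year", 1)]:
    proportion = count / TOTAL_YEARS
    print(f"{label}: proportion = {proportion:.6f}, percent = {to_percent(count):.2f}%")
```

So 3 out of 216 is a proportion of 0.013889, which is about 1.39%, not 0.013889%.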
lifeexpectancy variable
The second variable, lifeexpectancy, is the life expectancy at birth (in years) of the population of Ghana.
From the frequency table for 1800 to 2015, it is striking that the life expectancy of the people of Ghana was recorded as 28 years for 120 distinct years! This is a proportion of 0.555556 (about 55.6%) of the 216 yearly lifeexpectancy values.
Over 30% of the lifeexpectancy values were distinct, each with a frequency of 1 and a proportion of 0.00463 (about 0.46%).
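The share of values that occur only once can be read off a frequency table directly rather than by eye. The sketch below uses a small made-up series (not the Ghana data) to show one way of doing it with value_counts:

```python
import pandas as pd

# Toy stand-in for a yearly column: 28.0 and 60.3 repeat, the rest occur once.
values = pd.Series([28.0, 28.0, 28.0, 29.2, 30.5, 60.3, 60.3, 61.5])

counts = values.value_counts()

# Values that occur exactly once, and the share of all rows they account for.
singles = counts[counts == 1]
share_of_rows = singles.sum() / len(values)

print("distinct values:", len(counts))
print("values occurring once:", len(singles))
print("share of rows that are single occurrences:", share_of_rows)
```

Applied to the real column, the same three lines would give the exact share instead of an estimate from the truncated printout.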
literacyrate variable
The third and final variable in focus in this program is literacyrate, the adult literacy rate (% of people ages 15 and above) of Ghana.
From the frequency table, this variable is only recorded twice.
Missing Data
There is missing data in the variables, most significantly in literacyrate, as the literacyrate frequency table shows. The literacy rate was not recorded for 214 of the 216 years, so all of those years count as missing data for the literacy rate of the people of Ghana.
These missing values are represented by "NaN" in the frequency distribution table.
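As a rough illustration of how the missing values affect the tables, the sketch below rebuilds a column with the same shape as literacyrate (214 NaN rows and the 2 recorded values from the output) and compares value_counts with and without dropna=False. The behaviour described in the comments is what I would expect from current pandas versions; older versions may differ.

```python
import numpy as np
import pandas as pd

# Same shape as the literacyrate column: 214 missing years, 2 recorded values.
literacy = pd.Series([np.nan] * 214 + [57.897473, 71.497075], name="literacyrate")

missing_years = literacy.isna().sum()
recorded_years = literacy.notna().sum()
print("missing years:", missing_years)
print("recorded years:", recorded_years)

# Default: NaN is dropped, so proportions are relative to recorded values only.
without_nan = literacy.value_counts(normalize=True)
print(without_nan)

# dropna=False: NaN gets its own row and proportions are relative to all 216 rows.
with_nan = literacy.value_counts(normalize=True, dropna=False)
print(with_nan)
```

The 0.00463 shown for each recorded value in the output above corresponds to 1 out of all 216 rows, which matches the dropna=False form of the calculation.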