Making Data Management Decisions
Background of the Dataset CSV file Used:
The background to the dataset CSV file was explained at length in the week 2 assignment. Rather than bore assessors and readers by repeating everything here, please
check the background information in my previous assignment, which can be accessed at this link: http://adabadata.tumblr.com/ (on Tumblr, it is the post immediately before this one).
For easy access, though, here again is the link to the actual dataset CSV used for this project.
The gapminder_ghana_updated.csv dataset can be viewed and downloaded here:
https://drive.google.com/file/d/0B2KfPRxy4ootbzl5N0g1dUtIVzA/view?pref=2&pli=1
See this screenshot for a guide: http://prntscr.com/9gctxn
Data Management Decisions
As instructed in the assignment, I will be continuing with the program I have successfully run. That will be the program written in week 2.
PYTHON PROGRAM CODE:
# -*- coding: utf-8 -*-
"""
Created on Sat Dec 26 14:26:14 2015

@author: Bernard
"""
# import statements
import pandas
import numpy

# load the gapminder_ghana_updated dataset csv into the program
data = pandas.read_csv('gapminder_ghana_updated.csv', low_memory=False)

# print the number of observations (rows), which is the number of years
# this data has been looked at
print("number of observations(rows) which is the number of years this data has been looked at: ")
print(len(data))

# print the number of variables (columns)
print("number of variables (columns) available in the dataset: ")
print(len(data.columns))

print("data index: ")
print(len(data.index))

# convert the data to numeric
# (convert_objects(convert_numeric=True) has been removed from pandas;
# to_numeric with errors="coerce" likewise turns unparseable entries into NaN)
data["incomeperperson"] = pandas.to_numeric(data["incomeperperson"], errors="coerce")
data["lifeexpectancy"] = pandas.to_numeric(data["lifeexpectancy"], errors="coerce")
data["literacyrate"] = pandas.to_numeric(data["literacyrate"], errors="coerce")

# frequency tables for each variable
# sort=False: do not reorder the table by frequency
print("counts for incomeperperson - 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana. ")
inc_pp_count = data["incomeperperson"].value_counts(sort=False)
print(inc_pp_count)

# normalize=True returns proportions (fractions of 1) instead of counts
print("percentages for incomeperperson - 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana. ")
inc_pp_percent = data["incomeperperson"].value_counts(sort=False, normalize=True)
print(inc_pp_percent)

print("counts for lifeexpectancy- 2011 life expectancy at birth (years) of Ghana")
life_exp_count = data["lifeexpectancy"].value_counts(sort=False)
print(life_exp_count)

print("percentages for lifeexpectancy- 2011 life expectancy at birth (years) of Ghana ")
life_exp_percent = data["lifeexpectancy"].value_counts(sort=False, normalize=True)
print(life_exp_percent)

# dropna=False includes the missing values (NaN) in the counts
print("counts for literacyrate - 2010, Literacy rate, adult total (% of people ages 15 and above) of Ghana")
lit_rate_count = data["literacyrate"].value_counts(sort=False, dropna=False)
print(lit_rate_count)

print("percentages literacyrate - 2010, Literacy rate, adult total (% of people ages 15 and above) of Ghana ")
lit_rate_percent = data["literacyrate"].value_counts(sort=False, normalize=True)
print(lit_rate_percent)
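The numeric-conversion step of the program can be tried out on its own. The following is only a minimal sketch, assuming a pandas version that provides pandas.to_numeric; the Series values are made up for illustration and are not taken from the Ghana dataset.

```python
import pandas

# Made-up values for illustration; they are not from the Ghana dataset.
raw = pandas.Series(["768", "769", "unknown", "2306"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
# instead of raising an exception.
numeric = pandas.to_numeric(raw, errors="coerce")

print(numeric)
print("missing values:", int(numeric.isna().sum()))
```

Any entry that cannot be read as a number ends up as NaN, which is how gaps in the CSV later appear as missing data in the frequency tables.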
The Output Of My Python Program That Displays Three Of My Data Variables As Frequency Tables:
These variables are:
a. incomeperperson
b. lifeexpectancy
c. literacyrate
OUTPUT
<<<<<<BEGINNING OF OUTPUT>>>>>>>>
number of observations(rows) which is the number of years this data has been looked at:
216
number of variables (columns) available in the dataset:
5
data index:
216
counts for incomeperperson – 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana.
768     3
769     1
2306    1
771     1
773     1
1861    1
1287    1
1628    1
2240    1
1804    1
3873    1
2036    1
2322    1
3091    1
3685    1
789     1
1816    1
1305    1
1818    1
1778    1
4099    1
2751    1
1822    1
2079    1
1232    1
1570    1
2130    1
2072    1
2085    1
808     1
..
733     1
734     2
735     2
736     1
2273    1
738     2
739     1
741     1
742     2
743     1
744     1
745     1
746     1
747     2
749     1
2015    1
751     1
752     1
1906    1
754     1
756     2
757     1
759     1
761     1
762     1
763     1
764     1
2244    1
766     1
2559    1
Name: incomeperperson, dtype: int64
percentages for incomeperperson – 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana.
768     0.013889
769     0.004630
2306    0.004630
771     0.004630
773     0.004630
1861    0.004630
1287    0.004630
1628    0.004630
2240    0.004630
1804    0.004630
3873    0.004630
2036    0.004630
2322    0.004630
3091    0.004630
3685    0.004630
789     0.004630
1816    0.004630
1305    0.004630
1818    0.004630
1778    0.004630
4099    0.004630
2751    0.004630
1822    0.004630
2079    0.004630
1232    0.004630
1570    0.004630
2130    0.004630
2072    0.004630
2085    0.004630
808     0.004630
733     0.004630
734     0.009259
735     0.009259
736     0.004630
2273    0.004630
738     0.009259
739     0.004630
741     0.004630
742     0.009259
743     0.004630
744     0.004630
745     0.004630
746     0.004630
747     0.009259
749     0.004630
2015    0.004630
751     0.004630
752     0.004630
1906    0.004630
754     0.004630
756     0.009259
757     0.004630
759     0.004630
761     0.004630
762     0.004630
763     0.004630
764     0.004630
2244    0.004630
766     0.004630
2559    0.004630
Name: incomeperperson, dtype: float64
counts for lifeexpectancy- 2011 life expectancy at birth (years) of Ghana
29.884300      1
9.437626       1
60.300000      4
60.500000      2
58.100000      2
63.500000      1
31.816600      1
61.500000      1
28.000000    120
29.240200      1
30.528400      1
31.172500      1
32.460700      1
33.104800      1
34.393000      1
35.037100      1
36.325300      1
37.613500      1
38.257600      1
39.545800      1
40.189900      1
41.478100      1
42.122200      1
43.410400      1
44.054500      1
45.083820      1
33.748900      1
47.005480      1
48.247920      1
49.454360      1
56.700000      1
44.698600      1
61.800000      1
45.732040      1
46.372260      1
54.897780      1
57.400000      1
28.476880      1
47.630700      1
55.800000      1
48.856140      1
58.700000      1
62.700000      1
28.357660      1
56.400000      1
50.612800      1
51.704240      1
57.200000      1
59.700000      1
52.715680      1
60.200000      1
28.238440      1
53.640120      1
58.300000      1
54.490560      1
60.600000      2
55.500000      1
60.400000      1
59.400000      1
28.119220      1
Name: lifeexpectancy, dtype: int64
percentages for lifeexpectancy- 2011 life expectancy at birth (years) of Ghana
29.884300    0.004630
9.437626     0.004630
60.300000    0.018519
60.500000    0.009259
58.100000    0.009259
63.500000    0.004630
31.816600    0.004630
61.500000    0.004630
28.000000    0.555556
29.240200    0.004630
30.528400    0.004630
31.172500    0.004630
32.460700    0.004630
33.104800    0.004630
34.393000    0.004630
35.037100    0.004630
36.325300    0.004630
37.613500    0.004630
38.257600    0.004630
39.545800    0.004630
40.189900    0.004630
41.478100    0.004630
42.122200    0.004630
43.410400    0.004630
44.054500    0.004630
45.083820    0.004630
33.748900    0.004630
47.005480    0.004630
48.247920    0.004630
49.454360    0.004630
56.700000    0.004630
44.698600    0.004630
61.800000    0.004630
45.732040    0.004630
46.372260    0.004630
54.897780    0.004630
57.400000    0.004630
28.476880    0.004630
47.630700    0.004630
55.800000    0.004630
48.856140    0.004630
58.700000    0.004630
62.700000    0.004630
28.357660    0.004630
56.400000    0.004630
50.612800    0.004630
51.704240    0.004630
57.200000    0.004630
59.700000    0.004630
52.715680    0.004630
60.200000    0.004630
28.238440    0.004630
53.640120    0.004630
58.300000    0.004630
54.490560    0.004630
60.600000    0.009259
55.500000    0.004630
60.400000    0.004630
59.400000    0.004630
28.119220    0.004630
Name: lifeexpectancy, dtype: float64
counts for literacyrate – 2010, Literacy rate, adult total (% of people ages 15 and above) of Ghana
NaN          214
57.897473      1
71.497075      1
Name: literacyrate, dtype: int64
percentages literacyrate – 2010, Literacy rate, adult total (% of people ages 15 and above) of Ghana
57.897473    0.00463
71.497075    0.00463
Name: literacyrate, dtype: float64
<<<<<<END OF OUTPUT >>>>>>>>
VARIABLES AS FREQUENCY TABLES CAN BE FOUND AT THIS LINK:
https://drive.google.com/file/d/0B2KfPRxy4ootS0IxWHk0LVJZNXc/view?usp=sharing
SUMMARY:
This summary covers the 3 main variables that my program handles:
a. incomeperperson
b. lifeexpectancy
c. literacyrate
The summary looks at the values the variables take, how often they take them, the presence of missing data, etc.
The data focus only on the country Ghana and cover a period of 216 years, from 1800 to 2015.
Incomeperperson variable
From the dataset, incomeperperson is the Gross Domestic Product per capita of Ghana in constant 2000 US$. From the frequency table, between 1800 and 2015 there were 3 years in which the incomeperperson of the people of Ghana was 768 USD.
These 3 occurrences are a proportion of 0.013889 (about 1.39%) of the 216 yearly incomeperperson figures.
The incomeperperson values 734, 735, 738, 742, 747 and 756 USD each occurred twice over the period.
Each of these is a proportion of 0.009259 (about 0.93%) of the 216 yearly figures.
The rest of the incomeperperson figures were distinct values that each occurred only once within the time frame.
Put together, these single occurrences form over 80% of the 216 yearly figures. This means that incomeperperson changed almost every year. That is where my research question comes in: to find out the association between the income per person and the literacy rate of the people of Ghana.
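The proportions in the frequency table are fractions of 1, so they need multiplying by 100 to be read as percentages. A small sketch of that arithmetic, using the counts read off the incomeperperson table and the 216 observations; the helper name share_percent is made up for this illustration.

```python
# Number of yearly observations in the dataset (1800 to 2015).
total_years = 216

def share_percent(count, total=total_years):
    """Return count as a percentage of total, rounded to 2 decimals."""
    return round(100 * count / total, 2)

print(share_percent(3))  # the value 768 occurred in 3 years
print(share_percent(2))  # values such as 734 occurred in 2 years
print(share_percent(1))  # values that occurred only once
```

So a proportion such as 0.013889 in the table corresponds to roughly 1.39% of the years, not 0.013889%.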
lifeexpectancy variable
The second variable, lifeexpectancy, is the life expectancy at birth (in years) of the population of Ghana.
From the frequency table, between 1800 and 2015 the life expectancy of the people of Ghana was recorded as 28 years in 120 separate years! This is a proportion of 0.555556 (about 55.56%) of the 216 years over which the data was collected.
Over 30% of the lifeexpectancy values were distinct, each with a frequency of 1 and a proportion of 0.00463 (about 0.46%).
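Spotting the dominant value in a frequency table can also be done in code rather than by eye. This is only a sketch on made-up life expectancy values (not the real Ghana series), using value_counts() together with idxmax() to pick out the most frequent value and its share.

```python
import pandas

# Made-up life expectancy values, not the real Ghana series.
life = pandas.Series([28.0, 28.0, 28.0, 28.0, 28.0, 60.3, 60.3, 58.1])

counts = life.value_counts()
most_common = counts.idxmax()            # the value with the highest frequency
share = 100 * counts.max() / len(life)   # its share of all observations, in percent

print(most_common, counts.max(), share)
```

Applied to the real lifeexpectancy column, the same two lines would be expected to single out the value 28 with its 120 occurrences.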
literacyrate variable
The third and final variable in focus in this program is literacyrate, which is the adult literacy rate (% of people ages 15 and above) of Ghana.
From the frequency table, this variable was recorded only twice.
Missing Data
There is missing data in the variables, most significantly in literacyrate, as the literacyrate frequency table shows. There are years in which the literacy rate was not recorded, so all those years are missing data for the literacy rate of the people of Ghana.
This is represented by a “NaN” value in the frequency distribution table.
Decision on how I will manage my variables.
To begin with, there are three variables I am working with. As indicated above, they are:
a. incomeperperson
b. lifeexpectancy
c. literacyrate
A closer look at these 3 variables reveals a great deal of missing data. This is most apparent with literacyrate: the literacy rate of the Ghanaian populace is missing for 1800 to 1999, 2001 to 2009 and 2011 to 2015.
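The years with and without a recorded literacy rate can be listed directly from the data. The sketch below uses a toy DataFrame with invented values; the column name "year" is an assumption, since the actual column names of the CSV are not shown in this post.

```python
import pandas

# Toy frame: the "year" column name is an assumption and the values are invented.
toy = pandas.DataFrame({
    "year": [1998, 1999, 2000, 2001, 2002],
    "literacyrate": [float("nan"), float("nan"), 50.0, float("nan"), float("nan")],
})

# isna() marks the rows where no literacy rate was recorded.
missing_years = toy.loc[toy["literacyrate"].isna(), "year"].tolist()
recorded_years = toy.loc[toy["literacyrate"].notna(), "year"].tolist()

print("missing:", missing_years)
print("recorded:", recorded_years)
```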
Managing the missing data in the literacyrate variable
I realised that pandas already handles these missing values by representing all of the missing data as “NaN”.
However, when I run the frequency table for the literacyrate variable, Python does not display this “NaN” value, because value_counts() leaves out missing values by default. So to make “NaN” visible, I have used the “dropna=False” argument in my value_counts() call in order to display the missing data in the frequency table.
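The effect of dropna=False can be seen on a small example. This is a minimal sketch with made-up literacy values, not the real dataset; it compares the default value_counts() with the dropna=False version.

```python
import pandas

# Made-up literacy rates with three missing entries.
rates = pandas.Series([float("nan"), float("nan"), 57.9, float("nan"), 71.5])

default_counts = rates.value_counts(sort=False)               # NaN left out
with_missing = rates.value_counts(sort=False, dropna=False)   # NaN gets its own row

print(default_counts)
print(with_missing)
print("missing entries:", int(rates.isna().sum()))
```

The second table has one extra row for NaN, which is what makes the missing years visible in the literacyrate frequency table.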