Making Data Management Decisions

 

Background of the Dataset CSV file Used:

The background to the dataset CSV file was explained at length in the week 2 assignment, so I will not repeat it all here. Please check the background information in my previous assignment, which can be accessed at this link: http://adabadata.tumblr.com/ On Tumblr, it is the post IMMEDIATELY before this one, with the title

(PYTHON PROGRAM For The Research Topic Association Of The Literacy Rate And Life Expectancy & Association Of The Literacy Rate And Income Per Person:  The Case of Ghana)

For easy access, though, here again is the link to the actual dataset CSV used for this project.

The gapminder_ghana_updated.csv dataset for this project can be viewed and downloaded here:

 

https://drive.google.com/file/d/0B2KfPRxy4ootbzl5N0g1dUtIVzA/view?pref=2&pli=1

 

See the screenshot here for a guide (http://prntscr.com/9gctxn)

 

Data Management Decisions

As instructed in the assignment, I will be continuing with the program I have successfully run. That will be the program written in week 2.

 

PYTHON PROGRAM CODE: 

# -*- coding: utf-8 -*-
"""
Created on Sat Dec 26 14:26:14 2015

@author: Bernard
"""

# import statements
import pandas
import numpy

# load the gapminder_ghana_updated dataset csv into the program
data = pandas.read_csv('gapminder_ghana_updated.csv', low_memory=False)

# print the number of observations (rows), which is the number of years
# this data has been looked at
print("number of observations(rows) which is the number of years this data has been looked at: ")
print(len(data))

# print the number of variables (columns)
print("number of variables (columns) available in the dataset: ")
print(len(data.columns))

print("data index: ")
print(len(data.index))

# convert the data to numeric; values that cannot be parsed become NaN
# (pandas.to_numeric is used because convert_objects is no longer
# available in recent pandas versions)
data["incomeperperson"] = pandas.to_numeric(data["incomeperperson"], errors="coerce")
data["lifeexpectancy"] = pandas.to_numeric(data["lifeexpectancy"], errors="coerce")
data["literacyrate"] = pandas.to_numeric(data["literacyrate"], errors="coerce")

# displaying rows or observations in the DataFrame as frequency tables.
# inc_pp_count is the name that will hold the result from the incomeperperson count
# sort=False: the table is not re-ordered by frequency
print("counts for incomeperperson - 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana. ")
inc_pp_count = data["incomeperperson"].value_counts(sort=False)
# print the count of incomeperperson
print(inc_pp_count)

# normalize=True returns proportions of the total (0 to 1) instead of counts
print("percentages for incomeperperson - 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana. ")
inc_pp_percent = data["incomeperperson"].value_counts(sort=False, normalize=True)
# print the proportions of incomeperperson
print(inc_pp_percent)

print("counts for lifeexpectancy- 2011 life expectancy at birth (years) of Ghana")
life_exp_count = data["lifeexpectancy"].value_counts(sort=False)
# print the count of lifeexpectancy
print(life_exp_count)

print("percentages for lifeexpectancy- 2011 life expectancy at birth (years) of Ghana ")
life_exp_percent = data["lifeexpectancy"].value_counts(sort=False, normalize=True)
# print the proportions of lifeexpectancy
print(life_exp_percent)

print("counts for literacyrate - 2010, Literacy rate, adult total (% of people ages 15 and above) of Ghana")
# dropna=False includes the missing values (NaN) in the frequency table
lit_rate_count = data["literacyrate"].value_counts(sort=False, dropna=False)
# print the count of literacyrate
print(lit_rate_count)

print("percentages literacyrate - 2010, Literacy rate, adult total (% of people ages 15 and above) of Ghana ")
lit_rate_percent = data["literacyrate"].value_counts(sort=False, normalize=True)
# print the proportions of literacyrate
print(lit_rate_percent)
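The numeric conversion step can be tried on its own, without the Ghana CSV. The following is only a minimal sketch: the small `sample` frame and its values are made up for illustration, and it assumes a pandas version that provides `pandas.to_numeric` (the older `convert_objects` method is not available in recent pandas releases).

```python
import pandas

# Made-up sample (not the Ghana dataset): values arrive as text,
# and a blank string stands in for a year with no recorded value.
sample = pandas.DataFrame({
    "incomeperperson": ["768", "769", "2306"],
    "literacyrate": ["57.897473", " ", "71.497075"],
})

for column in ["incomeperperson", "literacyrate"]:
    # errors="coerce" turns anything that cannot be parsed into NaN
    sample[column] = pandas.to_numeric(sample[column], errors="coerce")

print(sample)
# number of missing literacyrate values after the conversion
print(sample["literacyrate"].isnull().sum())
```

After the conversion, the blank entry shows up as NaN, which is what later appears in the frequency table when `dropna=False` is used.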

 

 

The Output Of My Python Program That Displays Three Of My Data
Variables As Frequency Tables:

These variables are:

a. incomeperperson

b. lifeexpectancy

c. literacyrate

 

 

OUTPUT

<<<<<<BEGINNING OF OUTPUT>>>>>>>>

number of observations(rows) which is the number of years this data has been looked at:

216

number of variables (columns) available in the dataset:

5

data index:

216

counts for incomeperperson – 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana.

768     3
769     1
2306    1
771     1
773     1
1861    1
1287    1
1628    1
2240    1
1804    1
3873    1
2036    1
2322    1
3091    1
3685    1
789     1
1816    1
1305    1
1818    1
1778    1
4099    1
2751    1
1822    1
2079    1
1232    1
1570    1
2130    1
2072    1
2085    1
808     1
..
733     1
734     2
735     2
736     1
2273    1
738     2
739     1
741     1
742     2
743     1
744     1
745     1
746     1
747     2
749     1
2015    1
751     1
752     1
1906    1
754     1
756     2
757     1
759     1
761     1
762     1
763     1
764     1
2244    1
766     1
2559    1
Name: incomeperperson, dtype: int64

percentages for incomeperperson – 2010 Gross Domestic Product per capita in constant 2000 US$ of Ghana.

768     0.013889
769     0.004630
2306    0.004630
771     0.004630
773     0.004630
1861    0.004630
1287    0.004630
1628    0.004630
2240    0.004630
1804    0.004630
3873    0.004630
2036    0.004630
2322    0.004630
3091    0.004630
3685    0.004630
789     0.004630
1816    0.004630
1305    0.004630
1818    0.004630
1778    0.004630
4099    0.004630
2751    0.004630
1822    0.004630
2079    0.004630
1232    0.004630
1570    0.004630
2130    0.004630
2072    0.004630
2085    0.004630
808     0.004630
733     0.004630
734     0.009259
735     0.009259
736     0.004630
2273    0.004630
738     0.009259
739     0.004630
741     0.004630
742     0.009259
743     0.004630
744     0.004630
745     0.004630
746     0.004630
747     0.009259
749     0.004630
2015    0.004630
751     0.004630
752     0.004630
1906    0.004630
754     0.004630
756     0.009259
757     0.004630
759     0.004630
761     0.004630
762     0.004630
763     0.004630
764     0.004630
2244    0.004630
766     0.004630
2559    0.004630
Name: incomeperperson, dtype: float64

counts for lifeexpectancy- 2011 life expectancy at birth (years) of Ghana

29.884300    1
9.437626     1
60.300000    4
60.500000    2
58.100000    2
63.500000    1
31.816600    1
61.500000    1
28.000000    120
29.240200    1
30.528400    1
31.172500    1
32.460700    1
33.104800    1
34.393000    1
35.037100    1
36.325300    1
37.613500    1
38.257600    1
39.545800    1
40.189900    1
41.478100    1
42.122200    1
43.410400    1
44.054500    1
45.083820    1
33.748900    1
47.005480    1
48.247920    1
49.454360    1
56.700000    1
44.698600    1
61.800000    1
45.732040    1
46.372260    1
54.897780    1
57.400000    1
28.476880    1
47.630700    1
55.800000    1
48.856140    1
58.700000    1
62.700000    1
28.357660    1
56.400000    1
50.612800    1
51.704240    1
57.200000    1
59.700000    1
52.715680    1
60.200000    1
28.238440    1
53.640120    1
58.300000    1
54.490560    1
60.600000    2
55.500000    1
60.400000    1
59.400000    1
28.119220    1
Name: lifeexpectancy, dtype: int64

percentages for lifeexpectancy- 2011 life expectancy at birth (years) of Ghana

29.884300    0.004630
9.437626     0.004630
60.300000    0.018519
60.500000    0.009259
58.100000    0.009259
63.500000    0.004630
31.816600    0.004630
61.500000    0.004630
28.000000    0.555556
29.240200    0.004630
30.528400    0.004630
31.172500    0.004630
32.460700    0.004630
33.104800    0.004630
34.393000    0.004630
35.037100    0.004630
36.325300    0.004630
37.613500    0.004630
38.257600    0.004630
39.545800    0.004630
40.189900    0.004630
41.478100    0.004630
42.122200    0.004630
43.410400    0.004630
44.054500    0.004630
45.083820    0.004630
33.748900    0.004630
47.005480    0.004630
48.247920    0.004630
49.454360    0.004630
56.700000    0.004630
44.698600    0.004630
61.800000    0.004630
45.732040    0.004630
46.372260    0.004630
54.897780    0.004630
57.400000    0.004630
28.476880    0.004630
47.630700    0.004630
55.800000    0.004630
48.856140    0.004630
58.700000    0.004630
62.700000    0.004630
28.357660    0.004630
56.400000    0.004630
50.612800    0.004630
51.704240    0.004630
57.200000    0.004630
59.700000    0.004630
52.715680    0.004630
60.200000    0.004630
28.238440    0.004630
53.640120    0.004630
58.300000    0.004630
54.490560    0.004630
60.600000    0.009259
55.500000    0.004630
60.400000    0.004630
59.400000    0.004630
28.119220    0.004630
Name: lifeexpectancy, dtype: float64

counts for literacyrate – 2010, Literacy rate, adult total (% of people ages 15 and above) of Ghana

NaN          214
57.897473    1
71.497075    1
Name: literacyrate, dtype: int64

percentages literacyrate – 2010, Literacy rate, adult total (% of people ages 15 and above) of Ghana

57.897473    0.00463
71.497075    0.00463
Name: literacyrate, dtype: float64

<<<<<<END OF OUTPUT >>>>>>>>

 

VARIABLES AS FREQUENCY TABLES CAN BE FOUND AT THIS LINK:

https://drive.google.com/file/d/0B2KfPRxy4ootS0IxWHk0LVJZNXc/view?usp=sharing

 

SUMMARY:

This summary covers the 3 main variables that my program works with:

a. incomeperperson

b. lifeexpectancy

c. literacyrate

The summary looks at the values the variables take, how often they take them, and the presence of missing data.

The data focus only on Ghana and cover a period of 216 years, from 1800 to 2015.

incomeperperson variable

In the dataset, incomeperperson is the Gross Domestic Product per capita of Ghana in constant 2000 US$. In the frequency table for the years 1800 to 2015, there are 3 years in which the incomeperperson of the people of Ghana was 768 USD.

These 3 occurrences make up a proportion of 0.013889, or about 1.39%, of the 216 yearly incomeperperson figures.

The incomeperperson values 734, 735, 738, 742, 747 and 756 USD each occurred twice over the period.

Each of them represents a proportion of 0.009259, or about 0.93%, of the 216 yearly incomeperperson figures.

The rest of the incomeperperson figures shown are distinct values, each occurring only once within the period.

Put together, these single occurrences form over 80% of the 216 yearly incomeperperson figures. This means that incomeperperson changed almost every year. That is where my research question comes in: to find out the association between the income per person and the literacy rate of the people of Ghana.
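Note that `value_counts(normalize=True)` returns proportions between 0 and 1, so a value such as 0.013889 has to be multiplied by 100 to be read as a percentage. A minimal sketch of that arithmetic follows; the `income` series is a made-up sample rather than the Ghana data, and only the final calculation uses the figures quoted above (3 occurrences in 216 years).

```python
import pandas

# Made-up sample (not the Ghana dataset): 768 appears 3 times out of 8 rows.
income = pandas.Series([768, 768, 768, 734, 734, 769, 771, 773])

# normalize=True gives proportions of the total, not percentages
proportions = income.value_counts(sort=False, normalize=True)

# multiply by 100 to express each proportion as a percentage
percentages = proportions * 100
print(percentages)

# The same arithmetic for the value 768 in the frequency table:
# 3 occurrences out of 216 yearly observations.
share_768 = 3 / 216 * 100
print(round(share_768, 4))  # 1.3889
```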

lifeexpectancy variable

The second variable, lifeexpectancy, is the life expectancy at birth (in years) of the population of Ghana.

In the frequency table for the years 1800 to 2015, it is striking that the life expectancy of the people of Ghana was recorded as 28 years in 120 separate years! This is a proportion of 0.555556, or about 55.6%, of the 216 yearly lifeexpectancy values.

Over 30% of the lifeexpectancy values are distinct: each has a frequency of 1 and a proportion of 0.00463 (about 0.46%).

 

literacyrate variable

The third and final variable in focus in this program is literacyrate, the adult literacy rate (% of people ages 15 and above) of Ghana.

In the frequency table this variable is recorded only twice, with the values 57.897473 and 71.497075.

Missing Data

There is missing data in the variables, most notably in literacyrate, as the literacyrate frequency table shows. In the years where the literacy rate was not recorded, the literacy rate of the people of Ghana is missing; this is the case for 214 of the 216 years.

This is represented by a “NaN” value in the frequency distribution table.

 

Decision on how I will manage my variables.

In the first place, there are three variables I am working with. And as indicated above they are:

a. incomeperperson

b. lifeexpectancy

c. literacyrate

 

A closer look at these 3 variables reveals a great deal of missing data. This is especially apparent with literacyrate: the literacy rate of the Ghanaian populace is missing for 1800 to 1999, 2001 to 2009 and 2011 to 2015.
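As a quick check, the year ranges listed above can be added up to confirm that they match the 214 NaN entries in the literacyrate frequency table. This is only a small arithmetic sketch; the variable names are my own and it does not read the dataset.

```python
# Year ranges (inclusive) with no literacy rate, as listed above.
missing_ranges = [(1800, 1999), (2001, 2009), (2011, 2015)]

# each range contributes end - start + 1 years
missing_years = sum(end - start + 1 for start, end in missing_ranges)

# the dataset covers 1800 to 2015 inclusive
total_years = 2015 - 1800 + 1
recorded_years = total_years - missing_years

print(missing_years, total_years, recorded_years)  # 214 216 2
```

The two remaining years, 2000 and 2010, are the ones with a recorded literacy rate.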

 

Managing the missing data in the literacyrate variable

 

I realised that pandas already handles these missing values by representing each of them as “NaN”.

However, when I run the frequency table for the literacyrate variable, value_counts() leaves this “NaN” value out by default. So to make the “NaN” visible, I have used the “dropna=False” argument in my value_counts() function so that the missing data is displayed in the frequency table.
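The effect of “dropna=False” can be seen on a small example. This is a minimal sketch using a made-up series with three missing entries, not the actual Ghana dataset; only the two literacy values are taken from the output above.

```python
import numpy
import pandas

# Made-up sample: three missing years and the two recorded literacy rates.
literacy = pandas.Series([numpy.nan, numpy.nan, 57.897473, numpy.nan, 71.497075])

# Default behaviour: NaN is left out of the frequency table.
default_counts = literacy.value_counts(sort=False)
print(default_counts)

# dropna=False: NaN gets its own row, so the missing data is visible.
with_missing = literacy.value_counts(sort=False, dropna=False)
print(with_missing)

# The number of missing values can also be counted directly.
print(literacy.isnull().sum())
```

With the default settings the table has two rows; with “dropna=False” it has three, and the counts add up to the full number of observations.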
