This is a quick walk through Bokeh data exploration and visualization and also python pivot_tables (credit to pbpython on the pivot_tables).Visualization Dashboard

In [49]:
import sqlalchemy
import pyodbc
from pandas import DataFrame
from bokeh.plotting import figure, output_file, show
import pandas as pd

 

In [50]:
df = DataFrame


## connect to SqlServer Database and get information. 
try:

    import urllib
    params = urllib.quote_plus("DRIVER={SQL Server Native Client 11.0};SERVER=.\MSSQLSERVER_ENT;DATABASE=Antrak;Trusted_Connection=yes;")

    engine = sqlalchemy.create_engine('mssql+pyodbc:///?odbc_connect=%s"' % params)
    resoverall = engine.execute("SELECT top 10 * FROM FlightBookingsMain")


    df = df(resoverall.fetchall())
    df.columns = resoverall.keys()

    
except (RuntimeError, TypeError, NameError):
    print('Error in Conneccting')
    print(RuntimeError, TypeError, NameError)
finally:
    print("connected")

 

connected
In [51]:
#lets look at the head of the dataframe
df.head()

 

Out[51]:
First Name Last Name Sex Phone Fax Email Address Booking Date Departure Date Arrival Date Address1 Number of Passengers Total Cost Currency Numeric Code Company website Flight ID Number Flight Booking ID Insurance AFP Flight Name Age
0 Ellis Harris m 077 5665 3370 077 5665 3373 Ellis.E.Harris@mailinator.com 2012-12-18 2011-07-16 2000-08-23 63 Hull Road 2 983 710 http://qvc.com 4111113236285560 1 16 62 ZAR 62
1 Aaron Field m 079 6068 9016 079 6068 9015 Aaron.A.Field@mailinator.com 2014-10-28 2005-02-26 2000-07-30 42 Russell Rd 9 2452 710 http://msnbc.com 4111110402728960 2 83 38 ZAR 44
2 Archie Kaur m 070 3056 9804 070 3056 9809 Archie.M.Kaur@spambob.com 2003-11-17 2000-12-05 2007-03-07 10 Southend Avenue 8 1710 710 http://ap.org 4111110879255550 3 54 58 ZAR 60
3 Bradley Middleton m 070 0765 7814 070 0765 7816 Bradley.A.Middleton@dodgeit.com 2013-07-18 2011-02-26 2015-10-25 32 Pier Road 6 518 710 http://tripadvisor.com 4111111855573320 4 72 43 ZAR 66
4 Evan Reeves m 070 8266 4451 070 8266 4458 Evan.A.Reeves@dodgeit.com 2006-07-20 2006-06-20 2001-02-25 75 Moulton Road 7 1943 710 http://aliexpress.com 4111113022178850 5 68 52 ZAR 38

5 rows × 25 columns

In [52]:
#lets define the bokeh figure, and declare the x axis as type datetime

p = figure(width=800, height=350, x_axis_type="datetime")

 

In [53]:
#lots plot both a circle and line graph
p.circle(df['Booking Date'], df['AFP'], size=4, color='darkgrey', alpha=0.2, legend='close')
p.line(df['Booking Date'], df['Total Cost'], color='navy', legend='avg')

 

Out[53]:
<bokeh.models.renderers.GlyphRenderer at 0xc431ac8>
In [54]:
# show the results
output_file('linegraph.html')
show(p)

 

In [55]:
# bokeh server plotting
#sample server plotting code from bokeh

from bokeh.plotting import figure, show, output_server

p = figure(title="Server Plot")
p.circle([1, 2, 3], [4, 5, 6])

output_server("hover")

show(p)

 

In [ ]:

I came across this brilliant analyis by pbpython. and decided to play around with it on this data. you can see the full code and explanation on his website: http://pbpython.com/pandas-pivot-table-explained.html. Though I have added just a little bit of some extra exploration, the key and major analysis are from pbpython so you can check him out.

For convenience sake, let’s define the SEX column as a category and set the order we want to view.

This isn’t strictly required but helps us keep the order we want as we work through analyzing the data.

In [56]:
df['Sex'] = df['Sex'].astype('category')
df['Sex'].cat.set_categories(['m','f'], inplace=True)

 

The simplest pivot table must have a dataframe and an index . In this case, let’s use the Name as our index.

In [57]:
df['Booking Date'] = pd.to_datetime(df['Booking Date'])

 

In [58]:
#get year from date

df['year'] = df['Booking Date'].apply(lambda x: x.year)

 

In [59]:
#get month from date

df['month'] = df['Booking Date'].apply(lambda x: x.month)

 

In [60]:
df.head()

 

Out[60]:
First Name Last Name Sex Phone Fax Email Address Booking Date Departure Date Arrival Date Address1 Currency Numeric Code Company website Flight ID Number Flight Booking ID Insurance AFP Flight Name Age year month
0 Ellis Harris NaN 077 5665 3370 077 5665 3373 Ellis.E.Harris@mailinator.com 2012-12-18 2011-07-16 2000-08-23 63 Hull Road 710 http://qvc.com 4111113236285560 1 16 62 ZAR 62 2012 12
1 Aaron Field NaN 079 6068 9016 079 6068 9015 Aaron.A.Field@mailinator.com 2014-10-28 2005-02-26 2000-07-30 42 Russell Rd 710 http://msnbc.com 4111110402728960 2 83 38 ZAR 44 2014 10
2 Archie Kaur NaN 070 3056 9804 070 3056 9809 Archie.M.Kaur@spambob.com 2003-11-17 2000-12-05 2007-03-07 10 Southend Avenue 710 http://ap.org 4111110879255550 3 54 58 ZAR 60 2003 11
3 Bradley Middleton NaN 070 0765 7814 070 0765 7816 Bradley.A.Middleton@dodgeit.com 2013-07-18 2011-02-26 2015-10-25 32 Pier Road 710 http://tripadvisor.com 4111111855573320 4 72 43 ZAR 66 2013 7
4 Evan Reeves NaN 070 8266 4451 070 8266 4458 Evan.A.Reeves@dodgeit.com 2006-07-20 2006-06-20 2001-02-25 75 Moulton Road 710 http://aliexpress.com 4111113022178850 5 68 52 ZAR 38 2006 7

5 rows × 27 columns

In [61]:
#get all data less than a particular date
df[df['Booking Date']<'2014-10-28']

 

Out[61]:
First Name Last Name Sex Phone Fax Email Address Booking Date Departure Date Arrival Date Address1 Currency Numeric Code Company website Flight ID Number Flight Booking ID Insurance AFP Flight Name Age year month
0 Ellis Harris NaN 077 5665 3370 077 5665 3373 Ellis.E.Harris@mailinator.com 2012-12-18 2011-07-16 2000-08-23 63 Hull Road 710 http://qvc.com 4111113236285560 1 16 62 ZAR 62 2012 12
2 Archie Kaur NaN 070 3056 9804 070 3056 9809 Archie.M.Kaur@spambob.com 2003-11-17 2000-12-05 2007-03-07 10 Southend Avenue 710 http://ap.org 4111110879255550 3 54 58 ZAR 60 2003 11
3 Bradley Middleton NaN 070 0765 7814 070 0765 7816 Bradley.A.Middleton@dodgeit.com 2013-07-18 2011-02-26 2015-10-25 32 Pier Road 710 http://tripadvisor.com 4111111855573320 4 72 43 ZAR 66 2013 7
4 Evan Reeves NaN 070 8266 4451 070 8266 4458 Evan.A.Reeves@dodgeit.com 2006-07-20 2006-06-20 2001-02-25 75 Moulton Road 710 http://aliexpress.com 4111113022178850 5 68 52 ZAR 38 2006 7
6 Harley Carroll NaN 077 5239 3650 077 5239 3659 Harley.F.Carroll@dodgeit.com 2008-12-18 2007-09-28 2004-10-05 71 Stone St 710 http://fixya.com 4111111662317450 7 83 79 ZAR 60 2008 12
7 Christopher Gibson NaN 077 3549 4164 077 3549 4169 Christopher.M.Gibson@mailinator.com 2008-01-28 2011-05-04 2006-01-19 78 Bootham Terrace 710 http://vertitechnologygroup.com 4111112798242170 8 75 57 ZAR 60 2008 1
8 Isobel Holden NaN 079 2946 6947 079 2946 6946 Isobel.J.Holden@mailinator.com 2001-12-15 2009-05-25 2007-04-20 71 Front St 710 http://rr.com 4111114862347230 9 91 99 ZAR 27 2001 12
9 Alfie Bolton NaN 079 4864 1114 079 4864 1112 Alfie.K.Bolton@mailinator.com 2006-05-05 2002-11-29 2012-08-06 57 Ockham Road 710 http://nfl.com 4111111081803270 10 83 31 ZAR 30 2006 5

8 rows × 27 columns

In [62]:
year_plot = pd.pivot_table(df, index=['year'])
year_plot

 

Out[62]:
AFP Age Flight Booking ID Insurance Number of Passengers Postcode Total Cost month
year
2001 99.0 27.0 9.0 91.0 6.0 0.0 978.0 12.0
2003 58.0 60.0 3.0 54.0 8.0 0.0 1710.0 11.0
2006 41.5 34.0 7.5 75.5 7.0 0.0 2346.5 6.0
2008 68.0 60.0 7.5 79.0 8.5 0.0 2751.5 6.5
2012 62.0 62.0 1.0 16.0 2.0 0.0 983.0 12.0
2013 43.0 66.0 4.0 72.0 6.0 0.0 518.0 7.0
2014 38.0 44.0 2.0 83.0 9.0 0.0 2452.0 10.0
2015 14.0 46.0 6.0 55.0 6.0 0.0 1271.0 7.0
In [63]:
p.line(df['year'], df['Total Cost'], color='darkgrey', alpha=0.2, legend='year plot')
output_file('linegraph.html')
show(p)

 

In [64]:
pv_result = pd.pivot_table(df, index=['Last Name'])

 

In [65]:
type(pv_result)
pv_result

 

Out[65]:
AFP Age Flight Booking ID Insurance Number of Passengers Postcode Total Cost month year
Last Name
Bolton 31.0 30.0 10.0 83.0 7.0 0.0 2750.0 5.0 2006.0
Carroll 79.0 60.0 7.0 83.0 10.0 0.0 3869.0 12.0 2008.0
Field 38.0 44.0 2.0 83.0 9.0 0.0 2452.0 10.0 2014.0
Gibson 57.0 60.0 8.0 75.0 7.0 0.0 1634.0 1.0 2008.0
Harris 62.0 62.0 1.0 16.0 2.0 0.0 983.0 12.0 2012.0
Holden 99.0 27.0 9.0 91.0 6.0 0.0 978.0 12.0 2001.0
Kaur 36.0 53.0 4.5 54.5 7.0 0.0 1490.5 9.0 2009.0
Middleton 43.0 66.0 4.0 72.0 6.0 0.0 518.0 7.0 2013.0
Reeves 52.0 38.0 5.0 68.0 7.0 0.0 1943.0 7.0 2006.0

You can have multiple indexes as well. In fact, most of the pivot_table args can take multiple values via a list.

In [66]:
pd.pivot_table(df, index=['Last Name','Booking Date','Departure Date'])

 

Out[66]:
AFP Age Flight Booking ID Insurance Number of Passengers Postcode Total Cost month year
Last Name Booking Date Departure Date
Bolton 2006-05-05 2002-11-29 31 30 10 83 7 0 2750 5 2006
Carroll 2008-12-18 2007-09-28 79 60 7 83 10 0 3869 12 2008
Field 2014-10-28 2005-02-26 38 44 2 83 9 0 2452 10 2014
Gibson 2008-01-28 2011-05-04 57 60 8 75 7 0 1634 1 2008
Harris 2012-12-18 2011-07-16 62 62 1 16 2 0 983 12 2012
Holden 2001-12-15 2009-05-25 99 27 9 91 6 0 978 12 2001
Kaur 2003-11-17 2000-12-05 58 60 3 54 8 0 1710 11 2003
2015-07-01 2008-01-05 14 46 6 55 6 0 1271 7 2015
Middleton 2013-07-18 2011-02-26 43 66 4 72 6 0 518 7 2013
Reeves 2006-07-20 2006-06-20 52 38 5 68 7 0 1943 7 2006

For this purpose, the Flight Booking ID, Insurance and Postcode columns aren’t really useful. Let’s remove it by explicitly defining the columns we care about using the values field.

In [67]:
pd.pivot_table(df, index=['Last Name'], values=['Total Cost'])

 

Out[67]:
Total Cost
Last Name
Bolton 2750.0
Carroll 3869.0
Field 2452.0
Gibson 1634.0
Harris 983.0
Holden 978.0
Kaur 1490.5
Middleton 518.0
Reeves 1943.0

The price column automatically averages the data but we can do a count or a sum. Adding them is simple using aggfunc and np.sum .

In [68]:
pd.pivot_table(df, index=['Last Name'], values=['Total Cost'], aggfunc=np.sum)

 

Out[68]:
Total Cost
Last Name
Bolton 2750
Carroll 3869
Field 2452
Gibson 1634
Harris 983
Holden 978
Kaur 2981
Middleton 518
Reeves 1943

aggfunc can take a list of functions. Let’s try a mean using the numpy mean function and len to get a count.

In [69]:
pd.pivot_table(df, index=['Last Name'], values=['Total Cost'], aggfunc =[np.sum,len])

 

Out[69]:
sum len
Total Cost Total Cost
Last Name
Bolton 2750 1
Carroll 3869 1
Field 2452 1
Gibson 1634 1
Harris 983 1
Holden 978 1
Kaur 2981 2
Middleton 518 1
Reeves 1943 1

I think one of the confusing points with the pivot_table is the use of columns and values . Remember, columns are optional – they provide an additional way to segment the actual values you care about. The aggregation functions are applied to the values you list.

 

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *