This is a quick walk through Bokeh data exploration and visualization and also python pivot_tables (credit to pbpython on the pivot_tables).Visualization Dashboard
import sqlalchemy
import pyodbc
from pandas import DataFrame
from bokeh.plotting import figure, output_file, show
import pandas as pd
df = DataFrame
## connect to SqlServer Database and get information.
try:
import urllib
params = urllib.quote_plus("DRIVER={SQL Server Native Client 11.0};SERVER=.\MSSQLSERVER_ENT;DATABASE=Antrak;Trusted_Connection=yes;")
engine = sqlalchemy.create_engine('mssql+pyodbc:///?odbc_connect=%s"' % params)
resoverall = engine.execute("SELECT top 10 * FROM FlightBookingsMain")
df = df(resoverall.fetchall())
df.columns = resoverall.keys()
except (RuntimeError, TypeError, NameError):
print('Error in Conneccting')
print(RuntimeError, TypeError, NameError)
finally:
print("connected")
#lets look at the head of the dataframe
df.head()
#lets define the bokeh figure, and declare the x axis as type datetime
p = figure(width=800, height=350, x_axis_type="datetime")
#lots plot both a circle and line graph
p.circle(df['Booking Date'], df['AFP'], size=4, color='darkgrey', alpha=0.2, legend='close')
p.line(df['Booking Date'], df['Total Cost'], color='navy', legend='avg')
# show the results
output_file('linegraph.html')
show(p)
# bokeh server plotting
#sample server plotting code from bokeh
from bokeh.plotting import figure, show, output_server
p = figure(title="Server Plot")
p.circle([1, 2, 3], [4, 5, 6])
output_server("hover")
show(p)
I came across this brilliant analyis by pbpython. and decided to play around with it on this data. you can see the full code and explanation on his website: http://pbpython.com/pandas-pivot-table-explained.html. Though I have added just a little bit of some extra exploration, the key and major analysis are from pbpython so you can check him out.
For convenience sake, let’s define the SEX column as a category and set the order we want to view.
This isn’t strictly required but helps us keep the order we want as we work through analyzing the data.
df['Sex'] = df['Sex'].astype('category')
df['Sex'].cat.set_categories(['m','f'], inplace=True)
The simplest pivot table must have a dataframe and an index . In this case, let’s use the Name as our index.
df['Booking Date'] = pd.to_datetime(df['Booking Date'])
#get year from date
df['year'] = df['Booking Date'].apply(lambda x: x.year)
#get month from date
df['month'] = df['Booking Date'].apply(lambda x: x.month)
df.head()
#get all data less than a particular date
df[df['Booking Date']<'2014-10-28']
year_plot = pd.pivot_table(df, index=['year'])
year_plot
p.line(df['year'], df['Total Cost'], color='darkgrey', alpha=0.2, legend='year plot')
output_file('linegraph.html')
show(p)
pv_result = pd.pivot_table(df, index=['Last Name'])
type(pv_result)
pv_result
You can have multiple indexes as well. In fact, most of the pivot_table args can take multiple values via a list.
pd.pivot_table(df, index=['Last Name','Booking Date','Departure Date'])
For this purpose, the Flight Booking ID, Insurance and Postcode columns aren’t really useful. Let’s remove it by explicitly defining the columns we care about using the values field.
pd.pivot_table(df, index=['Last Name'], values=['Total Cost'])
The price column automatically averages the data but we can do a count or a sum. Adding them is simple using aggfunc and np.sum .
pd.pivot_table(df, index=['Last Name'], values=['Total Cost'], aggfunc=np.sum)
aggfunc can take a list of functions. Let’s try a mean using the numpy mean function and len to get a count.
pd.pivot_table(df, index=['Last Name'], values=['Total Cost'], aggfunc =[np.sum,len])
I think one of the confusing points with the pivot_table is the use of columns and values . Remember, columns are optional – they provide an additional way to segment the actual values you care about. The aggregation functions are applied to the values you list.