Category : Data Science

Pandas, NumPy and Matplotlib

I have created demo code with Jupyter Notebook, which can be viewed here: https://github.com/marcuspaget/pythonDSFromScratch/blob/master/PandasDemo.md

Pandas – Python Data Analysis Library

Quick install with:

pip install pandas

Python’s answer to R’s data frames for data manipulation.

It provides tools to read and write data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format.

Data is easy to manipulate, slice and dice, with integrated indexing.

It is also possible to convert HDF5 data for storage in HDFS, for ingestion in Hadoop.
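To make the read, slice and write points concrete, here is a minimal sketch. The CSV content and column names are made up for illustration and are inlined so that no file is needed; it is one way to do it, not the only one.

```python
import io
import pandas as pd

# Hypothetical CSV content, inlined so the example needs no file on disk.
csv_text = """Item,Amount,Cat
pens,12.50,stationery
laptop,650.00,equipment
paper,8.00,stationery
"""

# Read the CSV text into a DataFrame (pass a file path to read from disk).
df = pd.read_csv(io.StringIO(csv_text))

# Integrated indexing: use the Item column as the index, then look up a cell.
by_item = df.set_index("Item")
laptop_amount = by_item.loc["laptop", "Amount"]

# Slice and dice: keep only the stationery rows with a boolean mask.
stationery = df[df["Cat"] == "stationery"]

# Write the slice back out as CSV text (pass a path to write a file instead).
out_text = stationery.to_csv(index=False)
print(out_text)
```

The same DataFrame could be written to Excel, SQL or HDF5 with the matching to_excel, to_sql or to_hdf methods, each of which needs an extra package installed.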

Time-series functionality:

  • Date range generation & modification
  • Frequency conversion
  • Moving window statistics
  • Join time series without losing data
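The four time-series features listed above can be sketched in a few lines. The dates and values below are invented for illustration, and the outer join is just one way to combine series without dropping dates.

```python
import pandas as pd

# Date range generation: six consecutive days (hypothetical dates and values).
days = pd.date_range("2014-04-01", periods=6, freq="D")
daily = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=days)

# Frequency conversion: downsample the daily values into 2-day sums.
two_day = daily.resample("2D").sum()

# Moving window statistics: 3-day rolling mean (the first two entries are NaN
# because the window is not yet full).
rolling_mean = daily.rolling(window=3).mean()

# Join time series without losing data: an outer join keeps every date from
# both series and fills NaN where one side has no observation.
other = pd.Series([10.0, 20.0],
                  index=pd.date_range("2014-04-06", periods=2, freq="D"))
joined = pd.concat([daily.rename("a"), other.rename("b")], axis=1, join="outer")

print(two_day)
print(rolling_mean)
print(joined)
```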

NumPy – Python Numerical Library

Quick install with:

pip install numpy

Create, manipulate and slice arrays, and run operations such as std, mean, min and max.
Please see the link at the top for examples.
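As a quick taste of those operations, here is a small sketch on a made-up 3x4 array. The variable names are only for illustration.

```python
import numpy as np

# Create: a 3x4 array holding the numbers 1..12.
a = np.arange(1, 13).reshape(3, 4)

# Slice: first row, last column, and all even values via a boolean mask.
first_row = a[0]
last_col = a[:, -1]
evens = a[a % 2 == 0]

# Manipulate: element-wise arithmetic works on the whole array at once.
doubled = a * 2

# Run ops: summary statistics over the whole array or along one axis.
print("mean:", a.mean(), "std:", a.std(), "min:", a.min(), "max:", a.max())
col_means = a.mean(axis=0)   # one mean per column
print("column means:", col_means)
```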

Matplotlib – Python plotting and figures

Quick install with:

pip install matplotlib

Graph data from lists, DataFrames, etc.

For example:

[Image: Matplotlib output example]
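A plot like this can be produced from plain lists in a few lines. The sketch below uses made-up monthly figures and the non-interactive Agg backend so that it runs without a display; it is not the code behind the image above.

```python
import io

import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Hypothetical data held in plain Python lists.
months = ["Jan", "Feb", "Mar", "Apr"]
amounts = [120, 95, 140, 110]

fig, ax = plt.subplots()
ax.plot(months, amounts, marker="o", label="spend")
ax.set_xlabel("Month")
ax.set_ylabel("Amount")
ax.set_title("Monthly spend (made-up data)")
ax.legend()

# Render to an in-memory PNG; pass a filename such as "plot.png" to save a file.
buf = io.BytesIO()
fig.savefig(buf, format="png")
```

In a Jupyter Notebook the figure is shown inline, so the savefig step is optional there.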


Using tapply to sum by category

A good use of R’s tapply function is to summarise data by category.

## read a csv file into a table called x – the first row contains column names

x<-read.table("2014-tax.csv",sep=",",header=T)

## In my instance column names are Item,Amount,Cat,Month,Who
## split out by Who

bob<-x[x$Who == "bob",]
jane<-x[x$Who == "jane",]

## Group the rows (obs) for bob by Cat and sum the Amount in each group

print(tapply(bob$Amount,bob$Cat,sum))

## Typical output for bob

#      books  equipment   licences stationery   supplies  telephone
#     303.00     694.27     132.00     345.50      96.00      30.00


## Then for jane

print(tapply(jane$Amount,jane$Cat,sum))

#      books  equipment   licences stationery   supplies  telephone
#      163.0      583.0      348.0      678.4    11543.0         NA

## NA means jane has no rows in that category
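For comparison, the same sum-by-category idea can be sketched in Python with pandas. The rows below are invented and only reuse the column names from the tax CSV; groupby plus sum plays the role of tapply, and pivot_table is one way to get every person at once.

```python
import pandas as pd

# Hypothetical rows using the same column names as the tax CSV above.
x = pd.DataFrame({
    "Item": ["novel", "manual", "printer", "pens", "phone bill"],
    "Amount": [20.0, 35.0, 250.0, 12.5, 30.0],
    "Cat": ["books", "books", "equipment", "stationery", "telephone"],
    "Who": ["bob", "bob", "bob", "jane", "bob"],
})

# Split out by Who, then sum Amount within each Cat (like tapply with sum).
bob = x[x["Who"] == "bob"]
bob_totals = bob.groupby("Cat")["Amount"].sum()
print(bob_totals)

# All people in one table: rows are Who, columns are Cat. Combinations with
# no rows show up as NaN, much like the NA in the R output.
all_totals = x.pivot_table(index="Who", columns="Cat",
                           values="Amount", aggfunc="sum")
print(all_totals)
```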

Python Dict

Sample code for working with Python Dicts

# init users list of dicts, print out Bob's id, then init friends list of tuples

users = [
    { "id": 0, "name": "Bob" },
    { "id": 1, "name": "Dunn" },
    { "id": 2, "name": "Sue" },
    { "id": 3, "name": "Chi" },
    { "id": 4, "name": "Thor" },
    { "id": 5, "name": "Clive" },
    { "id": 6, "name": "Hicks" },
    { "id": 7, "name": "Devin" },
    { "id": 8, "name": "Kate" },
    { "id": 9, "name": "Klein" },
    { "id": 10, "name": "Jen" }
]

# loop over the user dicts directly; no manual index counter is needed
for user in users:
    if user["name"] == "Bob":
        print("Bob ID:", user["id"])

friends= [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# spin through all users and give each an empty friends list, to be populated next

for user in users:
    user["friends"] = []

# for each (i, j) pair, add user j to user i's friends list and vice versa
# (this relies on each user's id matching its position in the users list)

for i, j in friends:
    users[i]["friends"].append(users[j])
    users[j]["friends"].append(users[i])

# function to return the number of friends of the passed-in user

def number_of_friends(user):
    """how many friends does _user_ have?"""
    return len(user["friends"])  # length of the user's friends list

# total up all friends

total_connection = sum(number_of_friends(user)
                       for user in users)  # 24

# grab number of users

num_users = len(users)

avg_connections = total_connection / num_users # 24 / 11, about 2.18

# create a list (user_id, number_of_friends)

num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users] 

print(sorted(num_friends_by_id,key=lambda pair: pair[1], reverse=True))

# Output – largest to smallest

[(1, 3), (2, 3), (3, 3), (5, 3), (8, 3), (0, 2), (4, 2), (6, 2), (7, 2), (9, 1), (10, 0)]
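One thing to be aware of: the code above stores whole user dicts inside each friends list, so the structure refers back to itself and json.dumps(users) would fail with a circular reference error. A possible alternative, sketched below on a shortened, made-up user list, is to keep a separate dict mapping each id to a list of friend ids. The names friend_ids and name_by_id are just illustrative.

```python
import json

# Shortened, hypothetical version of the users list and friendship pairs.
users = [
    {"id": 0, "name": "Bob"},
    {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Sue"},
]
friend_pairs = [(0, 1), (0, 2), (1, 2)]

# Map each user id to a list of friend ids (ids only, not whole user dicts).
friend_ids = {user["id"]: [] for user in users}
for i, j in friend_pairs:
    friend_ids[i].append(j)
    friend_ids[j].append(i)

# A second dict gives a direct id -> name lookup, with no loop needed.
name_by_id = {user["id"]: user["name"] for user in users}
bob_friends = [name_by_id[fid] for fid in friend_ids[0]]
print(bob_friends)   # ['Dunn', 'Sue']

# Because nothing refers back to itself, the structure serialises cleanly.
print(json.dumps(friend_ids))
```

Counting friends then becomes len(friend_ids[user_id]) instead of len(user["friends"]).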

Shiny DashBoard

 

“Having studied data science since April 2014, I felt it was a good time to get to know RStudio’s Shiny Server! So I sourced data.gov.au’s Disaster Events data and built an interactive dashboard. I am writing up my experiences by way of introduction, in the hope it might help others to learn Shiny Server.”

If you just want to see the demo first » CLICK HERE «

  1. The first requirement is a fair understanding of R. There are a ton of courses online, such as on RStudio, edX and Coursera. My favourite course is MIT’s Analytics Edge. As it is currently archived, you can complete it self-paced, although assessments are disabled.
  2. Next, complete the Shiny Tutorial. It is very comprehensive and the only requirement is RStudio, naturally. The tutorial previously comprised 3 parts, but is now presented as a single video lasting about 2 hours 25 minutes. I highly recommend it.
  3. Study the Shiny Dashboard Instructions – my source code might help too
  4. References:

Now onto my demo!

The Australian government provides public access to numerous data sets and encourages re-use. I therefore chose data from data.gov.au to build my demo, as it is freely available.

This is what the data looked like: [Image: raw-data]

Quick preview of what I achieved thanks to Shiny Dashboards:

[Images: ausgov-homes-destroyed, ausgov-world]

The example I created converts a CSV file into data tables, graphs and a map of Australia using RStudio’s Shiny Dashboard =>

Australia Government Disaster Events Dashboard

And the Source code

Lastly, this tutorial helped with the maps.

Enjoy and feel free to comment below – would be good to receive feedback.