Here, we continue our exploration of the Pandas package, focusing on the various ways data stored in dataframes can be analyzed. Specifically, we look at creating new columns, selecting subsets of data, updating values, grouping and aggregating, and pivot tables.
#Import the pandas package
import pandas as pd
#Initial import of surveys dataset
survey_df = pd.read_csv('../data/surveys.csv',
index_col='record_id',
dtype={'plot_id':'str'})
survey_df.head()
A typical operation in dataframes is creating new fields from values in other fields. We'll show this by converting hindfoot_length (currently in mm) to inches (25.4 mm per inch).
#Create a field of hindfoot length in inches
survey_df['hindfoot_length_in'] = survey_df['hindfoot_length'] / 25.4
#Review the output
survey_df.head()
► YOU TRY IT:
#Create a field of weight in ounces
#Review the output
survey_df.tail()
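One possible solution sketch for the exercise, shown on a tiny inline frame so it runs without the CSV. It assumes the weight column is recorded in grams (28.3495 g per ounce); the values below are made up:

```python
import pandas as pd

# Tiny stand-in for survey_df (hypothetical values; weight assumed to be in grams)
df = pd.DataFrame({'weight': [40.0, 48.0, 29.0]})

# Create a field of weight in ounces (28.3495 g per ounce)
df['weight_oz'] = df['weight'] / 28.3495

# Review the output
df.tail()
```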
We often want to isolate specific rows and/or columns of data from our dataframe for further analysis. Pandas offers many ways to do this. Here we'll review these techniques.
iloc
Recall that dataframes can be considered lists (rows) of lists (columns). As such, we can select specific rows by their integer index, either individually or in slices. This is done using the dataframe's iloc indexer. Refer to the Pandas help on iloc for more info on this functionality: iloc is short for integer location.
#Fetch the 4th row of data
survey_df.iloc[3]
#Fetch the 101st through 110th rows of data
survey_df.iloc[100:110]
#Fetch the 3rd, 5th, and 10th rows
survey_df.iloc[[2,4,9]]
iloc
The iloc indexer can also be used to select columns. Which columns to select is passed as the second parameter to iloc, so we need to specify which rows we want first. If we want to retrieve the selected columns for all rows, we can just pass ":" as the first parameter.
#Select all rows and the 4th column of the dataset
survey_df.iloc[:,3]
#Select the first 5 rows and the 4th column of the dataset
survey_df.iloc[:5,3]
→ So you see, selecting subsets is just a matter of working with the row/column integer indices and passing them to the iloc indexer.
loc
While iloc selects rows and columns by their absolute position, loc selects rows and columns by their labels. Yes, each row and column has a label. A column's label is its column name; a row's label is its label index (not to be confused with its integer index, which is simply its row number).
When Pandas reads in a CSV file, the CSV file determines the column names, but row labels are assigned a sequential numeric value by default. When we read the data into our survey_df, we specified the values in the record_id column to become our index, i.e., our row labels. We can now use these labels to select rows, and combine them with column names to select row/column combinations:
#Fetch the row with row label of 10
survey_df.loc[10]
#Fetch value in the species_id column in the row with row label of 10
survey_df.loc[10,'species_id']
Because our dataframe's row index (i.e. its row labels) happens to be numeric, we can fetch slices of values. We can also fetch multiple columns by passing them in as a list.
#Fetch the species_id and sex columns for rows with labels between 10 and 20
survey_df.loc[10:20,['species_id','sex']]
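One detail worth noting: a loc slice includes both endpoints, while an iloc slice excludes the stop position. A quick sketch on a toy frame (labels and values are made up):

```python
import pandas as pd

# Toy frame with a numeric label index starting at 10
df = pd.DataFrame({'x': [10, 20, 30, 40, 50]}, index=[10, 11, 12, 13, 14])

# iloc slices exclude the stop position: positions 0 and 1 only
print(len(df.iloc[0:2]))   # 2 rows

# loc slices include the stop label: labels 10, 11, and 12
print(len(df.loc[10:12]))  # 3 rows
```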
Instead of selecting data by position or by index, we can also query our dataframe for all records meeting specific criteria. In this first example, we'll use the dataframe's query() function.
#Fetch all records from the year 1997 into a new variable
records_1997 = survey_df.query('year == 1997')
records_1997.shape
#Fetch all records from 1997 for female individuals
females_1997 = survey_df.query('year == 1997 & sex == "F"')
females_1997.shape
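query() can also reference Python variables by prefixing them with @, which keeps the criteria easy to change. A small self-contained sketch, with toy data standing in for the surveys dataset:

```python
import pandas as pd

# Toy stand-in for survey_df (hypothetical values)
df = pd.DataFrame({'year': [1996, 1997, 1997, 1998],
                   'sex':  ['F', 'F', 'M', 'F']})

# Reference a Python variable inside the query string with @
target_year = 1997
subset = df.query('year == @target_year & sex == "F"')
print(subset.shape)  # (1, 2)
```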
Queries are handy, but as more criteria are used, the query statement can get cumbersome. Masks provide a somewhat simpler, or at least more modular, approach.
A "mask" is a Pandas series (i.e., one column of data) with a Boolean value for each row:
survey_df
#Create a mask for all records with hindfeet smaller than 35mm
mask_smallfeet = survey_df['hindfoot_length'] < 35
mask_smallfeet
We can now apply this mask to select records where the mask value is True by using it with the loc indexer.
survey_df.loc[mask_smallfeet]
What's nice about masks is that we can combine multiple masks. Think of each Boolean value as an on/off switch that determines whether or not the row is added to the final set. We use & (the "and" switch) to combine switches such that each must be "on", and | (the "or" switch) to require only one switch to be on.
#Create a mask of males
mask_male = survey_df['sex'] == 'M'
mask_male
#Create a mask for records collected in April
mask_april = survey_df['month'] == 4
#Filter records meeting all criteria
data_subset = survey_df.loc[mask_smallfeet & mask_male & mask_april]
data_subset.shape
#Filter records meeting any criteria
data_subset = survey_df.loc[mask_smallfeet | mask_male | mask_april]
data_subset.shape
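Pandas also supports ~ to invert a mask, a "not" switch to go along with & and |. A toy example, independent of the surveys data (values are made up):

```python
import pandas as pd

# Toy stand-in for survey_df (hypothetical values)
df = pd.DataFrame({'hindfoot_length': [30, 36, 32, 40],
                   'sex': ['M', 'F', 'M', 'M']})

mask_small = df['hindfoot_length'] < 35
mask_male = df['sex'] == 'M'

# Small-footed males
both = df.loc[mask_small & mask_male]
# Everything except small-footed males
rest = df.loc[~(mask_small & mask_male)]
print(both.shape[0], rest.shape[0])  # 2 2
```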
► YOU TRY IT: Use the query() function to select all male individuals with weight greater than 41 g.
#Extract males weighing more than 41 g - using query
subset = survey_df.query("sex == 'M' and weight > 41")
subset['hindfoot_length'].median()
#Extract males weighing more than 41 g - using masks
mask_weight = survey_df['weight'] > 41
subset_2 = survey_df.loc[mask_male & mask_weight]
subset_2['hindfoot_length'].median()
We've learned how to select data in specific rows and columns. Now we'll update those values.
Our dataset has some erroneous data: records occurring on the "31st" day of April and September, months that have only 30 days. We'll replace those dates with the 30th day of the month. To do this, we select rows where the month is 4 or 9 and the day is 31, and then, with those rows selected, update the values in the day column to 30.
#Create masks for day = 31 and for months 4 and 9
mask_day = survey_df['day'] == 31
mask_april = survey_df['month'] == 4
mask_sept = survey_df['month'] == 9
Applying the masks returns a dataframe of the selected rows:
#Apply the masks
survey_df.loc[mask_day & (mask_april | mask_sept)]
We can add a column designation to just get one column with the selected rows:
#Fetch just the day column of the filtered rows
survey_df.loc[mask_day & (mask_april | mask_sept), 'day']
Rather than just display to the screen, we can assign a new value to this selection:
#Update the values
survey_df.loc[mask_day & (mask_april | mask_sept), 'day'] = 30
Check our results...
#What's the largest value in the day column for records where month = 4 or 9?
survey_df.query('month == 4 or month == 9')['day'].max()
With the dates fixed, we can now create a date attribute
survey_df['date'] = pd.to_datetime(survey_df[['year','month','day']])
survey_df.head()
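For reference, to_datetime assembles one datetime per row from year/month/day columns. A minimal self-contained illustration (made-up dates):

```python
import pandas as pd

# Toy frame with year/month/day columns (hypothetical dates)
df = pd.DataFrame({'year': [1997, 1997], 'month': [4, 9], 'day': [30, 15]})

# Assemble a datetime from the three component columns
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
print(df['date'].tolist())
```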
We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals per some variable like species_id, plot_id, or sex. This is done with Pandas' groupby() function.
So let's go through the process of grouping data by sex (a variable without a lot of unique values) and computing mean weight and mean hindfoot length of males vs females.
First, some review:
survey_df['weight'].describe()
print(" Min: ", survey_df['weight'].min())
print(" Max: ", survey_df['weight'].max())
print(" Mean: ", survey_df['weight'].mean())
print(" Std Dev: ", survey_df['weight'].std())
print(" Count: ", survey_df['weight'].count())
#Group the data by unique values in the `sex` field
grouped_data = survey_df.groupby('sex')
#This creates a Pandas "grouped dataframe" object
type(grouped_data)
We can apply aggregate functions to the grouped object: min(), median(), count(), etc.
#Compute the median value of all numeric fields for each group
#(numeric_only=True skips non-numeric columns like species_id)
grouped_data.median(numeric_only=True)
# Provide a list of fields if you want to restrict what columns are reported
grouped_data[['hindfoot_length','weight']].median()
# Or, use the describe function to reveal all summary stats for the grouped data
grouped_data['weight'].describe()
# Challenge 1: Show the *count* of records, grouped by sex
#grouped_data.█()
grouped_data.count()
► Note the count of the weight field in the resulting dataframe.
# Challenge 2: Just show the 'month' column in the above statement
#grouped_data.█()['█']
grouped_data.count()['month']
grouped_data2 = survey_df.groupby(['plot_id','sex'])
grouped_data2.mean(numeric_only=True)
# Challenge 3
grouped_data2 = survey_df.groupby(['plot_id','sex'])
grouped_data2.mean(numeric_only=True)
► YOU TRY IT: Use the describe function on that result to compute summary statistics...
# Challenge 4
We can also supply a dictionary of aggregating functions so that each column in the grouped result is aggregated exactly how we want (i.e., instead of computing just the sum or mean of all columns). This dictionary is built by specifying the column name as the key and the aggregate function(s) as the values. Below is an example that aggregates the data by sex, computing the minimum and maximum of the year, the median of the hindfoot_length, and the mean weight.
#Create the aggregate function dictionary
aggFuncs = {
"year": ['count','min','max'],#compute count, min, and max of the year attribute
"hindfoot_length": 'median', #compute the median hindfoot length
"weight": 'mean'} #compute the mean weight
#Apply the dictionary
survey_df.groupby('sex').agg(aggFuncs)
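The same agg pattern on a tiny inline frame, just to show the shape of the result (values are made up; the columns of the output form a MultiIndex of column name and function):

```python
import pandas as pd

# Toy stand-in for survey_df (hypothetical values)
df = pd.DataFrame({'sex': ['F', 'M', 'F', 'M'],
                   'year': [1996, 1997, 1998, 1997],
                   'weight': [30.0, 40.0, 34.0, 44.0]})

# Column name -> aggregate function(s)
aggFuncs = {'year': ['min', 'max'], 'weight': 'mean'}
result = df.groupby('sex').agg(aggFuncs)
print(result)
```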
And finally, we can transform our data via pivot tables. In this operation, we take two categorical variables in our dataset, creating row labels from one and column names from the other.
survey_df.pivot_table(
values = 'weight',
index = 'species_id',
columns= 'year',
aggfunc ='mean'
).fillna(-1).T.style.background_gradient(cmap = 'YlGnBu')
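The mechanics of pivot_table are easier to see on a toy frame (hypothetical species and weights). Cells with no matching records come back as NaN, which is why the styled example above fills them with -1:

```python
import pandas as pd

# Toy stand-in for survey_df (hypothetical values)
df = pd.DataFrame({'species_id': ['NL', 'NL', 'DM', 'DM'],
                   'year': [1996, 1997, 1996, 1996],
                   'weight': [150.0, 160.0, 40.0, 44.0]})

# Rows labeled by species_id, columns by year, cells = mean weight
pivot = df.pivot_table(values='weight', index='species_id',
                       columns='year', aggfunc='mean')
print(pivot)
```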