
Problem Set 5: Using Pandas¶

In this set of exercises, we explore actual elephant tracking data provided by Dr. Amelia Meier (PhD '20). These data, stored in the "EleTaiOneWeek.csv" file in the data folder, include tracking locations for a set of elephants in Gabon. The fields included are described as follows:

Field      Description                              Format
MovDataID  Name of the elephant                     Text
CalcID     Numeric ID of the elephant               Text
StartDate  Start date/time of the tracking effort   DateTime
EndDate    End date/time of the tracking effort     DateTime
Fixtime    Time that the location was recorded      DateTime
X          Tracking longitude                       Float
Y          Tracking latitude                        Float
Z          Tracking elevation (not recorded)        Float

The steps below lead you through the process of examining these data using Pandas.

Importing the package¶

In [1]:
#Import pandas
import pandas as pd

1. Loading and exploring the data¶

The first step in the analysis is to import the data into our coding environment, ensuring that the data are read as the correct data types. For example, Pandas will interpret the CalcID value as an integer, but as it represents a nominal value, we want to import it as a string.

Once imported successfully, we want to do a quick exploration of the data. Here, this will involve looking at a few records, checking the data types of each column, and examining the unique values held in a few fields.
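As a minimal, self-contained sketch of a type-controlled import (using a tiny inline CSV with invented values rather than the course data file; the column names simply mimic a few of ours):

```python
import io
import pandas as pd

# A tiny stand-in CSV (invented values, not the elephant data)
csv_text = ("CalcID,Fixtime,X\n"
            "587,2017-04-09 08:14:57,-7.62665\n"
            "588,2017-04-09 10:13:55,-7.628533\n")

# dtype= forces CalcID to import as a string;
# parse_dates= converts Fixtime to a datetime64 column
demo = pd.read_csv(io.StringIO(csv_text),
                   dtype={'CalcID': 'str'},
                   parse_dates=['Fixtime'])

print(demo.dtypes)
```

Without the dtype= argument, CalcID would come in as an integer, and nominal-value operations (string comparisons, preserving leading zeros) would misbehave.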

► Q1.1 Import the EleTaiOneWeek.csv file (located in the data folder) into a pandas dataframe.

  • Modify the code below so that the CalcID field is imported as a string
  • Note how parse_dates imports the StartDate, EndDate, and Fixtime fields as date fields
In [2]:
#Load the data as instructed
df = pd.read_csv('../data/EleTaiOneWeek.csv',
                 dtype={'CalcID':'str'},
                 parse_dates=['StartDate','EndDate','Fixtime'],
                )

► Q1.2 Display the first 5 records in the dataframe

In [3]:
#Display the "head" (i.e. the first 5 records) of the dataframe
df.head()
Out[3]:
MovDataID CalcID StartDate EndDate Fixtime X Y Z
0 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 08:14:57 -7.626650 6.242050 NaN
1 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 10:13:55 -7.628533 6.239200 NaN
2 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 11:14:33 -7.628367 6.238933 NaN
3 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 12:14:07 -7.629200 6.237500 NaN
4 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 13:13:56 -7.629800 6.238400 NaN

► Q1.3 Display the data types of each column

In [4]:
#Display the data types of each column
df.dtypes
Out[4]:
MovDataID            object
CalcID               object
StartDate    datetime64[ns]
EndDate      datetime64[ns]
Fixtime      datetime64[ns]
X                   float64
Y                   float64
Z                   float64
dtype: object

► Q1.4 List each unique value in the MovDataID column

In [5]:
#List each unique value in the MovDataID column
df['MovDataID'].unique()
Out[5]:
array(['Cami', 'George', 'Frank', 'Yves', 'Tapa', 'Claudine', 'DjeDje',
       'Mahmahdou', 'Blanche', 'Sylvie', 'Zamba', 'Laura', 'Omar', 'Dave',
       'Anika', 'Marie'], dtype=object)

► Q1.5 Display how many unique values are in the Fixtime field

In [6]:
#Display the number of unique values in the `Fixtime` column
df['Fixtime'].nunique()
Out[6]:
1644

2. Subsetting data¶

Now that we are feeling good that our data were imported correctly, we'll want to wrangle the data in preparation for some analysis. Here we will isolate a few columns into new variables and create a Boolean mask that will allow us to easily subset records linked to one elephant from the data frame of all elephants.

Ok, so some of these aren't exactly relevant to our analysis, but rather are for me to see that you can successfully isolate columns and rows...
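As a quick sketch of the column and row selections used below (on an invented three-row frame, not the tracking data):

```python
import pandas as pd

# A toy frame standing in for the tracking data
toy = pd.DataFrame({'MovDataID': ['Cami', 'George', 'George'],
                    'X': [-7.6266, -7.6228, -7.6225]})

# A single column name selects a Series
xSeries = toy['X']

# Comparing a column to a value yields a Boolean Series (the "mask")
mask = toy['MovDataID'] == 'George'

# Indexing the frame with the mask keeps only the rows where it is True
subset = toy[mask]
print(subset)
```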

► Q2.1 Extract values in the X column into a variable called xCoords

In [7]:
#Extract the X values into a series called xCoords
xCoords = df['X']

► Q2.2 Display the last 10 values in the xCoords series

In [8]:
#Display the last 10 values in the series
xCoords.tail(10)
Out[8]:
1997   -7.881133
1998   -7.877033
1999   -7.873417
2000   -7.869183
2001   -7.867383
2002   -7.866850
2003   -7.866783
2004   -7.861517
2005   -7.855983
2006   -7.860733
Name: X, dtype: float64

► Q2.3 Create a dataframe of just the X and Y columns and show the head (first 5 rows) of the dataframe

In [9]:
#Create a dataframe of just the X and Y columns and show the first 5 rows of data
dfXY = df[['X','Y']]
dfXY.head()
Out[9]:
X Y
0 -7.626650 6.242050
1 -7.628533 6.239200
2 -7.628367 6.238933
3 -7.629200 6.237500
4 -7.629800 6.238400

► Q2.4 Create a Boolean mask (mask_George) for all records where MovDataID is "George"

In [10]:
#Create Boolean mask where MovDataID has the value "George"
mask_George = df['MovDataID'] == "George"

► Q2.5 Select all records where the MovDataID equals "George" into a dataframe called dfGeorge

  • Feel free to use the Boolean mask above or use other techniques.
In [11]:
#Create a new dataframe of just records where MovDataID is "George"
dfGeorge = df[mask_George]
dfGeorge.head()
Out[11]:
MovDataID CalcID StartDate EndDate Fixtime X Y Z
139 George 588 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 08:12:27 -7.622767 6.205850 NaN
140 George 588 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 10:12:33 -7.622517 6.204767 NaN
141 George 588 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 11:14:27 -7.622500 6.205317 NaN
142 George 588 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 12:12:51 -7.623750 6.207250 NaN
143 George 588 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 14:12:02 -7.626583 6.208233 NaN

3. Answering questions with the data¶

Now that we have command of our imported dataset, we're ready to begin our analysis. In this example, we are going to explore the wanderings of George the elephant. In particular, we want to use the tracking data to see how often George travels near a particular [imaginary] village located at latitude 6.23°N, 7.62°W.

Tip: Don't forget that longitude values west of the prime meridian are negative...
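Combining comparisons with & is the pattern these questions call for; here is a sketch with invented coordinates:

```python
import pandas as pd

# Made-up track points for illustration
pts = pd.DataFrame({'X': [-7.63, -7.61, -7.65],
                    'Y': [6.24, 6.25, 6.21]})

# Wrap each comparison in parentheses and join them with &;
# "west of 7.62°W" translates to X < -7.62 because western longitudes are negative
near = (pts['Y'] > 6.23) & (pts['X'] < -7.62)

# Summing a Boolean Series counts the True values
print(near.sum())
```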

► Q3.1 If the 'Y' column represents latitude, what's the southernmost location at which George has been seen?

In [12]:
#Display the southernmost location of the "George" subset
dfGeorge['Y'].min()
Out[12]:
6.2032

► Q3.2 Somehow demonstrate that George has been seen north of 6.23°N 28 times. (In other words, show how many records meet the criterion...)

In [13]:
#Count the records where George was seen north of 6.23°N
(dfGeorge['Y'] > 6.23).sum()
Out[13]:
28

► Q3.3 How many times has George been seen north of 6.23°N and west of 7.62°W?

In [14]:
#Count the records where George was seen north of 6.23°N and west of 7.62°W
((dfGeorge['Y'] > 6.23) & (dfGeorge['X'] < -7.62)).sum()
Out[14]:
18

► Q3.4 What's the earliest "Fixtime" date George has been seen north of 6.23°N and west of 7.62°W?

In [15]:
#Find the earliest Fixtime among records north of 6.23°N and west of 7.62°W
dfGeorge.loc[(dfGeorge['Y'] > 6.23) & (dfGeorge['X'] < -7.62), 'Fixtime'].min()
Out[15]:
Timestamp('2017-04-10 04:12:32')

A cool plot!¶

If we set the index of our table to a time column, then plot the XY coordinates, we can visualize the track of our elephant...

In [16]:
#Plot the path of George!
dfGeorge.set_index('Fixtime').plot(x='X',y='Y');

See if you can do it with another elephant. You'll have to subset the table for a different individual - we'll choose "Cami" - then repeat the steps above...

In [17]:
#Plot the path of Cami
df[df['MovDataID'] == 'Cami'].set_index('Fixtime').plot(x='X',y='Y');

4. Grouping/aggregating data¶

Now we want to generate some quick movement stats on each individual from our dataset. First, we'll use Pandas' value_counts() to generate a quick list of how many times each individual has been tracked. Then we'll use Pandas' groupby function to create a grouped object from which we can easily generate aggregate statistics like the median X and Y coordinates taken from all records for a given individual.
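Sketched on a toy frame (invented IDs and values), the two techniques look like this:

```python
import pandas as pd

# Toy records: two individuals with a few fixes each
toy = pd.DataFrame({'MovDataID': ['A', 'A', 'A', 'B', 'B'],
                    'X': [1.0, 2.0, 3.0, 10.0, 20.0]})

# value_counts() tallies the records per individual
counts = toy['MovDataID'].value_counts()

# groupby() groups once; aggregate functions then apply per group
grp = toy.groupby('MovDataID')
medians = grp['X'].median()
print(counts)
print(medians)
```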

► Q4.1 Use the value_counts() function to list how many records exist for each elephant.

In [18]:
df['MovDataID'].value_counts()
Out[18]:
MovDataID
Blanche      146
Zamba        146
Sylvie       144
Omar         144
Laura        143
Cami         139
DjeDje       138
Dave         138
Frank        134
Yves         134
Tapa         134
Claudine     134
George       132
Mahmahdou    132
Anika         35
Marie         34
Name: count, dtype: int64

► Q4.2 Group the dataframe of all elephants on the MovDataID attribute into a "groupby" object called grpName.

In [19]:
grpName = df.groupby('MovDataID')

► Q4.3 Apply the appropriate aggregate function to the grpName object to report the median X and Y coordinate for each elephant.
Only partial credit will be given if you include more than just the X and Y columns...

In [20]:
#Report the median X and Y values for each elephant using the grpName object created above
grpName[['X','Y']].median()
Out[20]:
X Y
MovDataID
Anika -7.894817 6.004683
Blanche -7.871358 5.861492
Cami -7.625967 6.234633
Claudine -7.631233 6.168733
Dave -7.665100 6.206167
DjeDje -8.027500 6.265792
Frank -7.708358 6.113575
George -7.625758 6.210975
Laura -7.709083 6.172267
Mahmahdou -7.683300 6.280983
Marie -7.836750 5.869792
Omar -7.754417 6.216633
Sylvie -7.732183 6.331867
Tapa -7.709117 6.137833
Yves -7.667825 6.131133
Zamba -7.671250 6.117942

Demonstrations: Transforming/pivoting data¶

►This portion is ungraded◄
We may also want to transform our data. Below are some examples of how that is done using Pandas.

In this example, we first want to examine how many tracking records occur across the hours of the day. To do this, we first need to create a new column listing just the hour from the Fixtime timestamp. Pandas has a convenient accessor for this: .dt.hour.

In [21]:
#Add an "hour" column to the data 
df['Hour'] = df['Fixtime'].dt.hour
#Have a look at the result
df.head()
Out[21]:
MovDataID CalcID StartDate EndDate Fixtime X Y Z Hour
0 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 08:14:57 -7.626650 6.242050 NaN 8
1 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 10:13:55 -7.628533 6.239200 NaN 10
2 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 11:14:33 -7.628367 6.238933 NaN 11
3 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 12:14:07 -7.629200 6.237500 NaN 12
4 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 13:13:56 -7.629800 6.238400 NaN 13

With the hour field created, we can compute how many tracks were recorded at each hour (using value_counts()) and then plot these as a line plot. For this to work, we need to sort our value_counts result on the index, which is the hour recorded.

→ Are there any trends in when elephants have been tracked?
→ It's quite easy to repeat this analysis displaying track records across years. Could you do that?

In [22]:
#Plot the number of tracks recorded each hour
(df['Hour']       # Extract just the hour records
  .value_counts() # Tally how many records of each hour
  .sort_index()   # Sort on the index (i.e. the hour)
  .plot()         # Create and display a line plot
);
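The tallying step above (minus the plot) can be sketched with a few invented timestamps:

```python
import pandas as pd

# A few invented fix times
times = pd.Series(pd.to_datetime(['2017-04-09 08:14:57',
                                  '2017-04-09 08:59:01',
                                  '2017-04-09 10:13:55']))

# .dt.hour extracts the hour; value_counts() orders by frequency,
# so sort_index() re-orders the tally by hour of day
per_hour = times.dt.hour.value_counts().sort_index()
print(per_hour)
```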

Now we are going to move on to something a bit more sophisticated: we are going to add a new field to our tracking data frame that represents the distance each record is from a given location.

To do this, we first are going to create a Python function that computes the distance between supplied X and Y values and a provided coordinate tuple. (This doesn't work accurately with geographic coordinates, but we'll use it anyway...)

In [23]:
#Create a function that calculates the distance from a coordinate
def get_dist(theX, theY, thePoint):
    #Extract the coordinates from the provided point into separate variables
    center_x, center_y = thePoint
    #Get distances along X and Y axes
    x_dist = theX - center_x
    y_dist = theY - center_y
    #Compute the Euclidean distance between the points
    the_dist = (x_dist ** 2 + y_dist ** 2) ** 0.5
    #Return the distance
    return the_dist

Next we need a point (just a tuple of X and Y coordinates) from which to measure distances. We'll use the centroid of our data (mean X and Y values).

In [24]:
#Get the centroid of all track records
the_centroid = (df['X'].mean(),df['Y'].mean())
the_centroid
Out[24]:
(-7.877294295067264, 6.15885599715994)

And here is the syntax for applying our get_dist() function to the X and Y series in our dataframe to compute each record's distance from the_centroid.

In [25]:
#Apply the function above to add a distance column
df['distance'] = get_dist(df['X'],df['Y'],the_centroid)
#Have a look
df.head()
Out[25]:
MovDataID CalcID StartDate EndDate Fixtime X Y Z Hour distance
0 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 08:14:57 -7.626650 6.242050 NaN 8 0.264091
1 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 10:13:55 -7.628533 6.239200 NaN 10 0.261414
2 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 11:14:33 -7.628367 6.238933 NaN 11 0.261490
3 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 12:14:07 -7.629200 6.237500 NaN 12 0.260261
4 Cami 587 2017-04-09 07:33:40 2017-04-15 16:31:42 2017-04-09 13:13:56 -7.629800 6.238400 NaN 13 0.259963

Next we want to transform our data so that, for each hour of the day, we show the mean distance each animal was recorded away from the overall centroid.

To do this we want to transform or "pivot" our data such that:

  • Each unique MovDataID value becomes a row
  • Each unique Hour value becomes a column
  • The table shows mean distance values for each combination of elephant/hour of day

We'll do this and then add some styling such that dark blue values indicate animals furthest from the centroid.
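A miniature version of the pivot (four invented records, two elephants × two hours) shows the reshaping:

```python
import pandas as pd

# Invented long-format records: individual, hour, and a distance value
long_df = pd.DataFrame({'MovDataID': ['A', 'A', 'B', 'B'],
                        'Hour': [0, 1, 0, 1],
                        'distance': [0.1, 0.3, 0.2, 0.4]})

# Rows become individuals, columns become hours,
# and each cell holds the mean distance for that combination
wide = long_df.pivot_table(index='MovDataID', columns='Hour',
                           values='distance', aggfunc='mean')
print(wide)
```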

In [26]:
#Pivot data
df.pivot_table(
    index='MovDataID', # <- The index becomes the row name
    columns='Hour',    # <- Each unique value in the Hour column becomes a new separate column
    values=['distance'],      # <- The value assigned to each elephant/hour combination is taken from the "distance" column
    aggfunc='mean'   # <- If more than one elephant/hour value exists, we will insert the mean value
              ).style.background_gradient(cmap = 'YlGnBu') #Style the output
Out[26]:
[A styled table with one row per elephant, one column per hour of day (0-23), and each cell holding the mean distance from the centroid for that elephant/hour combination, shaded with the 'YlGnBu' color gradient]
In [27]:
#Or, using the untransformed data, we can plot
df.hist(column='distance',by='MovDataID',figsize=(15,15));