In this set of exercises, we explore actual elephant tracking data provided by Dr. Amelia Meier (PhD '20). These data, stored in the "EleTaiOneWeek.csv" file in the Data folder, include tracking locations for a set of elephants in Gabon. The fields included are described as follows:
Field | Description | Format |
---|---|---|
MovDataID | Name of the elephant | Text |
CalcID | Numeric ID of the elephant | Text |
StartDate | Start date/time of the tracking effort | DateTime |
EndDate | End date/time of the tracking effort | DateTime |
Fixtime | Time that the location was recorded | DateTime |
X | Tracking longitude | Float |
Y | Tracking latitude | Float |
Z | Tracking elevation (not recorded) | Float |
The steps below lead you through the process of examining these data using Pandas.
#Import pandas
import pandas as pd
The first step in the analysis is to import the data into our coding environment, ensuring that the data are read as the correct data types. For example, Pandas will interpret the CalcID value as an integer, but as it represents a nominal value, we want to import it as a string.
Once imported successfully, we want to do a quick exploration of the data. Here, this will involve looking at a few records, checking the data types of each column, and examining the unique values held in a few fields.
► Q1.1 Import the EleTaiOneWeek.csv file (located in the data folder) into a pandas dataframe, ensuring that:
* the CalcID field is imported as a string
* parse_dates imports the StartDate, EndDate, and Fixtime fields as date fields

#Load the data as instructed
df = pd.read_csv('../data/EleTaiOneWeek.csv',
                 dtype={'CalcID':'str'}, #Read CalcID as a string, not an integer
                 parse_dates=['StartDate','EndDate','Fixtime'],
                 )
► Q1.2 Display the first 5 records in the dataframe
#Display the "head" (i.e. the first 5 records) of the dataframe
df.head()
 | MovDataID | CalcID | StartDate | EndDate | Fixtime | X | Y | Z |
---|---|---|---|---|---|---|---|---|
0 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 08:14:57 | -7.626650 | 6.242050 | NaN |
1 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 10:13:55 | -7.628533 | 6.239200 | NaN |
2 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 11:14:33 | -7.628367 | 6.238933 | NaN |
3 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 12:14:07 | -7.629200 | 6.237500 | NaN |
4 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 13:13:56 | -7.629800 | 6.238400 | NaN |
► Q1.3 Display the data types of each column
#Display the data types of each column
df.dtypes
MovDataID            object
CalcID               object
StartDate    datetime64[ns]
EndDate      datetime64[ns]
Fixtime      datetime64[ns]
X                   float64
Y                   float64
Z                   float64
dtype: object
► Q1.4 List each unique value in the MovDataID column
#List each unique value in the MovDataID column
df['MovDataID'].unique()
array(['Cami', 'George', 'Frank', 'Yves', 'Tapa', 'Claudine', 'DjeDje', 'Mahmahdou', 'Blanche', 'Sylvie', 'Zamba', 'Laura', 'Omar', 'Dave', 'Anika', 'Marie'], dtype=object)
► Q1.5 Reveal how many unique dates are in the Fixtime field.
#Display the number of unique values in the `Fixtime` column
df['Fixtime'].nunique()
1644
Now that we are feeling good that our data were imported correctly, we'll want to wrangle the data in preparation for some analysis. Here we will isolate a few columns into new variables and create a Boolean mask that will allow us to easily subset records linked to one elephant from the data frame of all elephants.
Ok, so some of these aren't exactly relevant to our analysis, but rather are for me to see that you can successfully isolate columns and rows...
► Q2.1 Extract values in the X column into a variable called xCoords
#Extract the X values into a series called xCoords
xCoords = df['X']
► Q2.2 Display the last 10 values in the xCoords series
#Display the last 10 values in the series
xCoords.tail(10)
1997   -7.881133
1998   -7.877033
1999   -7.873417
2000   -7.869183
2001   -7.867383
2002   -7.866850
2003   -7.866783
2004   -7.861517
2005   -7.855983
2006   -7.860733
Name: X, dtype: float64
► Q2.3 Create a dataframe of just the X and Y columns and show the head (first 5 rows) of the dataframe
#Create a dataframe of just the X and Y columns and show the first 5 rows of data
dfXY = df[['X','Y']]
dfXY.head()
 | X | Y |
---|---|---|
0 | -7.626650 | 6.242050 |
1 | -7.628533 | 6.239200 |
2 | -7.628367 | 6.238933 |
3 | -7.629200 | 6.237500 |
4 | -7.629800 | 6.238400 |
► Q2.4 Create a Boolean mask (dfGeorgeMask) for all records where MovDataID is "George"
#Create Boolean mask where MovDataID has the value "George"
dfGeorgeMask = df['MovDataID'] == "George"
► Q2.5 Select all records where the MovDataID equals "George" into a dataframe called dfGeorge
#Create a new dataframe of just records where MovDataID is "George"
dfGeorge = df[dfGeorgeMask]
dfGeorge.head()
 | MovDataID | CalcID | StartDate | EndDate | Fixtime | X | Y | Z |
---|---|---|---|---|---|---|---|---|
139 | George | 588 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 08:12:27 | -7.622767 | 6.205850 | NaN |
140 | George | 588 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 10:12:33 | -7.622517 | 6.204767 | NaN |
141 | George | 588 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 11:14:27 | -7.622500 | 6.205317 | NaN |
142 | George | 588 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 12:12:51 | -7.623750 | 6.207250 | NaN |
143 | George | 588 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 14:12:02 | -7.626583 | 6.208233 | NaN |
Now that we have command of our imported data sets, we're ready to begin our analysis. In this example, we are going to explore the wanderings of George the elephant. In particular, we want to use the tracking data to see how often George travels near a particular [imaginary] village located at 6.23°N, 7.62°W.
Tip: Don't forget that longitude values west of the prime meridian are negative...
► Q3.1 If the 'Y' column represents latitude, what's the southernmost location George has been seen?
#Display the southernmost location of the "George" subset
dfGeorge['Y'].min()
6.2032
► Q3.2 Somehow demonstrate that George has been seen north of 6.23°N 28 times. (In other words, show how many records meet the criteria...)
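One possible solution (a sketch; it assumes the dfGeorge subset from Q2.5 and uses the fact that summing a Boolean series counts its True values):
#Count the records where George was north of 6.23°N
(dfGeorge['Y'] > 6.23).sum()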
28
► Q3.3 How many times has George been seen north of 6.23°N and west of 7.62°W?
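A similar approach works here, combining the two conditions with the & operator (a sketch; remember that "west of 7.62°W" means X < -7.62):
#Count the records where George was north of 6.23°N and west of 7.62°W
((dfGeorge['Y'] > 6.23) & (dfGeorge['X'] < -7.62)).sum()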
18
► Q3.4 What's the earliest "Fixtime" date George has been seen north of 6.23°N and west of 7.62°W?
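One way is to filter on both conditions and take the minimum of the remaining Fixtime values (a sketch, again assuming dfGeorge):
#Find the earliest fix meeting both criteria
dfGeorge.loc[(dfGeorge['Y'] > 6.23) & (dfGeorge['X'] < -7.62), 'Fixtime'].min()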
Timestamp('2017-04-10 04:12:32')
If we set the index of our table to a time column, then plot the XY coordinates, we can visualize the track of our elephant...
#Plot the path of George!
dfGeorge.set_index('Fixtime').plot(x='X',y='Y');
See if you can do it with another elephant. You'll have to subset the table for a different individual - we'll choose "Cami" - then repeat the steps above...
#Plot the path of Cami
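One possible solution, mirroring the George example above (a sketch; it assumes the df loaded in Q1.1):
dfCami = df[df['MovDataID'] == 'Cami']
dfCami.set_index('Fixtime').plot(x='X',y='Y');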
Now we want to generate some quick movement stats on each individual from our dataset. First, we'll use Pandas' value_counts() function to generate a quick list of how many times each individual has been tracked. Then we'll use Pandas' groupby function to create a grouped object from which we can easily generate aggregate statistics like the mean X and Y coordinates taken from all records for a given individual.
► Q4.1 Use the value_counts() function to list how many records exist for each elephant.
df['MovDataID'].value_counts()
MovDataID
Blanche      146
Zamba        146
Sylvie       144
Omar         144
Laura        143
Cami         139
DjeDje       138
Dave         138
Frank        134
Yves         134
Tapa         134
Claudine     134
George       132
Mahmahdou    132
Anika         35
Marie         34
Name: count, dtype: int64
► Q4.2 Group the dataframe of all elephants on the MovDataID attribute into a "groupBy" object called grpName.
grpName = df.groupby('MovDataID')
► Q4.3 Apply the appropriate aggregate function to the grpName object to report the median X and Y coordinate for each elephant.
Partial credit will be given if you include more than just the X and Y columns...
#Report the median X and Y values for each elephant using the grpName object created above
grpName[['X','Y']].median()
MovDataID | X | Y |
---|---|---|
Anika | -7.894817 | 6.004683 |
Blanche | -7.871358 | 5.861492 |
Cami | -7.625967 | 6.234633 |
Claudine | -7.631233 | 6.168733 |
Dave | -7.665100 | 6.206167 |
DjeDje | -8.027500 | 6.265792 |
Frank | -7.708358 | 6.113575 |
George | -7.625758 | 6.210975 |
Laura | -7.709083 | 6.172267 |
Mahmahdou | -7.683300 | 6.280983 |
Marie | -7.836750 | 5.869792 |
Omar | -7.754417 | 6.216633 |
Sylvie | -7.732183 | 6.331867 |
Tapa | -7.709117 | 6.137833 |
Yves | -7.667825 | 6.131133 |
Zamba | -7.671250 | 6.117942 |
►This portion is ungraded◄
We may also want to transform our data. Below are some examples of how that is done using Pandas.
In this example, we first want to examine how many tracking records occur across the hours of the day. To do this, we first need to create a new column listing just the hour from the Fixtime timestamp. Pandas has an accessor to do exactly this: .dt.hour.
#Add an "hour" column to the data
df['Hour'] = df['Fixtime'].dt.hour
#Have a look at the result
df.head()
 | MovDataID | CalcID | StartDate | EndDate | Fixtime | X | Y | Z | Hour |
---|---|---|---|---|---|---|---|---|---|
0 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 08:14:57 | -7.626650 | 6.242050 | NaN | 8 |
1 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 10:13:55 | -7.628533 | 6.239200 | NaN | 10 |
2 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 11:14:33 | -7.628367 | 6.238933 | NaN | 11 |
3 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 12:14:07 | -7.629200 | 6.237500 | NaN | 12 |
4 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 13:13:56 | -7.629800 | 6.238400 | NaN | 13 |
With the hour field created, we can compute how many tracks were recorded at each hour (using value_counts()) and then plot these as a line plot. For this to work, we need to sort our value_counts result on the index, which is the hour recorded.
→ Are there any trends in when elephants have been tracked?
→ It's quite easy to repeat this analysis displaying track records across years. Could you do that?
#Plot the number of tracks recorded each hour
(df['Hour'] # Extract just the hour records
.value_counts() # Tally how many records of each hour
.sort_index() # Sort on the index (i.e. the hour)
.plot() # Create and display a line plot
);
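If you want to try the by-year variant suggested above, the same pipeline works with a different datetime accessor (a sketch; note that since this dataset spans only one week, the result will contain just a single year):
#Plot the number of tracks recorded each year
(df['Fixtime'].dt.year # Extract the year of each fix
 .value_counts() # Tally how many records in each year
 .sort_index() # Sort on the index (i.e. the year)
 .plot() # Create and display a line plot
);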
Now we are going to move on to something a bit more sophisticated: we are going to add a new field to our tracking data frame that represents the distance each record is from a given location.
To do this, we first are going to create a Python function that computes how far a supplied pair of X and Y values is from a provided coordinate tuple. (This doesn't work accurately with geographic coordinates, and as noted in the code it returns a squared-distance proxy rather than a true distance, but we'll use it anyway...)
#Create a function that calculates the separation of points from a given coordinate
def get_dist(theX, theY, thePoint):
    #Extract the coordinates from the provided point into separate variables
    center_x, center_y = thePoint
    #Get distances along X and Y axes
    x_dist = theX - center_x
    y_dist = theY - center_y
    #Compute the separation between the points
    #(Note: as written, this is half the *squared* distance; a true Euclidean
    # distance would use ** 0.5 rather than * 0.5. The value still grows with
    # distance, though, so it works as a relative measure.)
    the_dist = (x_dist ** 2 + y_dist ** 2) * 0.5
    #Return the result
    return the_dist
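As a quick sanity check, we can call the function on an easy 3-4-5 triangle case. A true Euclidean distance here would be 5.0; our proxy returns half the squared distance instead:
#Test the function against a known case
get_dist(3, 4, (0, 0)) #Returns (3**2 + 4**2) * 0.5 = 12.5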
Next we need a point (just a tuple of X and Y coordinates) from which to measure distances. We'll use the centroid of our data (mean X and Y values).
#Get the centroid of all track records
the_centroid = (df['X'].mean(),df['Y'].mean())
the_centroid
(-7.877294295067264, 6.15885599715994)
And here is the syntax for applying our get_dist() function to the X and Y series in our dataframe to compute each record's distance from the_centroid.
#Apply the function above to add a distance column
df['distance'] = get_dist(df['X'],df['Y'],the_centroid)
#Have a look
df.head()
 | MovDataID | CalcID | StartDate | EndDate | Fixtime | X | Y | Z | Hour | distance |
---|---|---|---|---|---|---|---|---|---|---|
0 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 08:14:57 | -7.626650 | 6.242050 | NaN | 8 | 0.034872 |
1 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 10:13:55 | -7.628533 | 6.239200 | NaN | 10 | 0.034169 |
2 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 11:14:33 | -7.628367 | 6.238933 | NaN | 11 | 0.034189 |
3 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 12:14:07 | -7.629200 | 6.237500 | NaN | 12 | 0.033868 |
4 | Cami | 587 | 2017-04-09 07:33:40 | 2017-04-15 16:31:42 | 2017-04-09 13:13:56 | -7.629800 | 6.238400 | NaN | 13 | 0.033790 |
Next we want to transform our data so that, for each hour of the day, we show the mean distance each animal was recorded away from the overall centroid.
To do this we want to transform or "pivot" our data such that:
* each unique MovDataID value becomes a row
* each unique Hour value becomes a column

We'll do this and then add some styling such that dark blue values indicate animals furthest from the centroid.
#Pivot data
df.pivot_table(
index='MovDataID', # <- The index becomes the row name
    columns='Hour', # <- Each unique value in the Hour column becomes a new separate column
    values=['distance'], # <- The value assigned to each animal/hour combination is taken from the "distance" column
    aggfunc='mean' # <- If more than one animal/hour value exists, we will insert the mean value
).style.background_gradient(cmap = 'YlGnBu') #Style the output
Values shown are each animal's mean distance value for each hour of the day (0-23); darker blues indicate animals farther from the centroid:
MovDataID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Anika | 0.013238 | nan | nan | 0.011058 | 0.014051 | nan | nan | 0.010882 | 0.014839 | nan | nan | 0.011660 | 0.013748 | nan | nan | 0.010518 | 0.013247 | nan | nan | 0.010282 | 0.013277 | nan | nan | 0.010924 |
Blanche | 0.045959 | 0.044909 | 0.045882 | 0.045766 | 0.045700 | 0.045615 | 0.045640 | 7.298096 | 0.043922 | 0.044262 | 0.044125 | 0.043464 | 0.042343 | 0.042648 | 0.042308 | 0.043228 | 0.043711 | 0.044757 | 0.045383 | 0.045579 | 0.045414 | 0.045730 | 0.046187 | 0.046244 |
Cami | 0.035077 | 0.035185 | 0.035401 | 0.035366 | 0.035777 | 0.036025 | 0.036932 | 0.036527 | 0.035518 | 0.034601 | 0.035980 | 0.035889 | 0.036252 | 0.037009 | 0.036325 | 0.036395 | 0.035146 | 0.035746 | 0.036412 | 0.035296 | 0.035341 | 0.035374 | 0.035295 | 0.035124 |
Claudine | 0.028958 | 0.028978 | 0.029009 | 7.284461 | 0.029133 | 0.029067 | 0.029351 | 0.030818 | 0.031604 | 0.031092 | 0.030740 | 0.030738 | 0.030621 | 0.030633 | 0.030623 | 0.031732 | 0.030285 | 0.030549 | 0.032089 | 0.029922 | 0.030194 | 0.030398 | 0.029361 | 0.029016 |
Dave | 0.026432 | 0.026131 | 0.025935 | 0.025847 | 0.025819 | 0.026138 | 0.025096 | 7.281276 | 6.244370 | 8.732697 | 0.024292 | 0.023649 | 0.023474 | 0.024253 | 0.025362 | 10.908792 | 0.024164 | 0.023369 | 7.280588 | 0.023290 | 0.025442 | 0.025530 | 0.025877 | 0.026274 |
DjeDje | 0.017004 | 0.016382 | 0.016572 | 0.016491 | 0.016428 | 0.016319 | 0.016907 | 0.016134 | 0.015732 | 0.016180 | 0.015806 | 0.016051 | 0.015124 | 0.015788 | 0.016011 | 0.016048 | 0.016621 | 0.016571 | 0.017474 | 0.016614 | 7.275011 | 0.017780 | 0.016028 | 0.017160 |
Frank | 0.015910 | 0.015741 | 0.015396 | 0.015051 | 0.014927 | 0.014030 | 0.014051 | 0.013622 | 0.013952 | 0.013340 | 0.013455 | 0.013390 | 0.013613 | 0.013518 | 0.015651 | 0.014412 | 0.014335 | 0.015717 | 0.015704 | 0.015538 | 0.015737 | 0.015685 | 0.015639 | 0.015904 |
George | 0.033631 | 0.032593 | 0.034016 | 0.034815 | 0.035984 | 0.034382 | 0.036388 | 0.036413 | 0.034678 | 0.036088 | 0.034498 | 0.035835 | 0.034828 | 0.036663 | 0.034387 | 0.034734 | 0.034246 | 0.033035 | 0.032617 | 0.032627 | 0.032541 | 0.032865 | 0.032622 | 0.033147 |
Laura | 0.014487 | 0.014424 | 0.014299 | 7.272207 | 0.014145 | 0.014098 | 0.013903 | 7.271815 | 7.272540 | 0.014454 | 14.530194 | 7.272316 | 0.014251 | 0.014207 | 0.014367 | 6.235466 | 0.014393 | 0.014471 | 0.014204 | 0.014251 | 0.014443 | 0.014632 | 0.014658 | 0.014568 |
Mahmahdou | 0.027432 | 0.027199 | 0.027362 | 0.028441 | 0.027492 | 0.027561 | 0.027285 | 0.028201 | 6.248844 | 8.735662 | 0.028420 | 0.030341 | 0.027706 | 0.029416 | 0.029542 | 0.026545 | 0.028083 | 0.030600 | 0.026044 | 0.030579 | 0.029253 | 0.028573 | 0.028197 | 0.027707 |
Marie | nan | 0.043860 | nan | nan | nan | 0.043899 | nan | nan | nan | 8.746202 | nan | nan | nan | 0.044444 | nan | nan | nan | 0.044700 | nan | nan | nan | 0.043473 | nan | nan |
Omar | 7.267820 | 0.009155 | 0.009155 | 0.009244 | 0.009322 | 0.009533 | 0.009743 | 7.268262 | 0.009878 | 0.009873 | 6.231392 | 0.009678 | 0.009705 | 0.009687 | 0.009397 | 6.231224 | 8.720060 | 7.268417 | 0.010240 | 0.010248 | 0.010066 | 0.009738 | 0.009344 | 0.009148 |
Sylvie | 0.028166 | 0.028332 | 0.028410 | 0.028429 | 0.027657 | 0.027463 | 0.025541 | 0.023691 | 7.279292 | 0.023178 | 0.022454 | 0.022186 | 0.022659 | 0.022498 | 6.242445 | 0.022436 | 7.279287 | 0.023411 | 0.024226 | 0.025352 | 0.025956 | 0.026738 | 0.027354 | 0.027840 |
Tapa | 0.013914 | 0.014064 | 0.014101 | 0.014051 | 0.013837 | 0.013418 | 0.012858 | 0.013028 | 0.012606 | 0.012580 | 0.012608 | 0.013103 | 0.013771 | 0.013086 | 0.013958 | 0.013673 | 0.012802 | 0.013900 | 0.013880 | 0.013354 | 0.013948 | 0.014024 | 0.013966 | 0.013901 |
Yves | 0.021338 | 0.021749 | 0.022090 | 0.022437 | 0.022706 | 0.022607 | 0.022451 | 0.022724 | 0.023056 | 0.022779 | 0.022432 | 0.022364 | 0.022682 | 0.022574 | 0.022645 | 0.022226 | 0.021916 | 0.022218 | 0.020715 | 0.020570 | 0.020505 | 0.020609 | 0.020660 | 0.020957 |
Zamba | 14.534592 | 0.022195 | 0.022073 | 0.022074 | 0.022251 | 0.022036 | 0.021994 | 0.022410 | 0.021636 | 0.021317 | 7.278447 | 0.022131 | 0.022268 | 0.021754 | 0.022349 | 6.242183 | 6.241515 | 0.021531 | 0.020775 | 0.021390 | 0.021508 | 0.021858 | 0.021842 | 0.021788 |
#Or, using the untransformed data, we can plot a histogram of distances for each individual
df.hist(column='distance',by='MovDataID',figsize=(15,15));