Data Structures - Pandas Dataframes

ENV 859 - Geospatial Data Analytics | Fall 2023 | Instructor: John Fay

In this lesson, we continue to examine packages that enable scientific data analysis in Python. Moving on from NumPy’s n-dimensional array, we now focus on Pandas’ dataframe object. We introduce what the dataframe is and then dive into Pandas’ ability to create dataframes from our data, and into how dataframes and dataframe operations make many old analyses easier and enable quite a few new ones.

Like our NumPy lesson, the overall description of Pandas and dataframes is relatively short. The real learning comes from diving in and running through some Jupyter Notebooks, where we apply and tinker with code.

Topic	Learning Objective
What is Pandas?	• Explain Pandas’ importance in Python’s data science stack
Exploring Data using Pandas	• Describe the basic form of a Pandas dataframe • Load data from a CSV file into a dataframe • View and inspect data/dataframe properties • Select columns from a dataframe • Generate descriptive statistics from data in a dataframe • Create some basic plots in Pandas
Data Analysis in Pandas	• Calculating and updating fields • Selecting data in a dataframe - Selecting single rows, selected rows, or row slices using `iloc` - Selecting rows and columns using `iloc` - Selecting rows and columns using `loc` - Selecting rows based on criteria - using queries - Selecting rows based on criteria - using masks - Updating values in selected rows/columns • Grouping and aggregating data in a dataframe • Transforming data with Pivot Tables
Quick Plots with Pandas	• Brief overview of plotting using Pandas

What is Pandas?

Pandas has often been termed “the Swiss Army knife” of data wrangling in Python, an apt description because of its overall utility in many aspects of reading, exploring, summarizing, visualizing, and exporting a multitude of data formats. Like NumPy, Pandas brings an important new data structure to our coding environment. This is the dataframe.

What is a Data Frame?

We will have plenty of opportunity to create and tinker with dataframes, which will be much more instructive than my trying to describe what a dataframe is here. Briefly, however, a dataframe stores data in rows and columns. You may think this is a limitation compared with NumPy, whose arrays can store data in many more than two dimensions, but with proper structuring, a dataframe can actually hold many dimensions of data in its rows and columns.
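To make the "many dimensions in rows and columns" idea a bit more concrete, here is a minimal sketch that unrolls a small three-dimensional NumPy array into a dataframe, with one column per dimension. The site, year, and month labels are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# A hypothetical 3-D array: 2 sites x 2 years x 3 months
arr = np.arange(12).reshape(2, 2, 3)

# One label combination per array cell; from_product orders them the same
# way NumPy flattens the array (the last dimension varies fastest)
index = pd.MultiIndex.from_product(
    [["site_A", "site_B"], [2022, 2023], ["Jan", "Feb", "Mar"]],
    names=["site", "year", "month"],
)

# Each array cell becomes one row; the three dimensions become three columns
df = pd.DataFrame({"value": arr.ravel()}, index=index).reset_index()

print(df.shape)
print(df.head())
```

The dataframe is still only rows and columns, but every cell of the original array can be found again by filtering on the `site`, `year`, and `month` columns.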

Data frames are also central to the notion of tidy data, in that by organizing your data into a dataframe, you are likely to consider your data as a series of observed features (rows) with one or more observed attributes (columns). In a Pandas dataframe, values in a given column all share the same data type (string, integer, floating point number, geometry, etc.), and each row has an index value that identifies it (unique by default). Thus each value in our dataframe is referenced by its row index and column header.
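As a small sketch of this structure, the dataframe below has one row per observed feature, one column per attribute, and a labeled index. The column names, index labels, and values are all made up for illustration.

```python
import pandas as pd

# Made-up observations: one row per site, one column per attribute
df = pd.DataFrame(
    {
        "name": ["Site A", "Site B", "Site C"],
        "area_km2": [10, 25, 40],
        "mean_flow": [1.5, 3.0, 4.5],
    },
    index=["s01", "s02", "s03"],
)

# Each column has a single data type
print(df.dtypes)

# A single value is referenced by its row index label and column header
value = df.loc["s02", "mean_flow"]
print(value)
```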

This may seem fairly elementary, but this structure, together with Pandas’ ability to work with it, has powerful ramifications. Built to leverage NumPy’s speed and agility, Pandas and its dataframe allow us to quickly:

• compute summary stats for the entire dataframe or subsets
• subset/select/query specific rows and/or columns
• sort, transform, pivot, melt data
• aggregate and join
• plot data
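Here is a quick sketch that touches most of the operations listed above in a line or two each. The data are made up, and plotting is left out here since it needs a plotting backend.

```python
import pandas as pd

# Made-up sample data
df = pd.DataFrame(
    {
        "county": ["A", "A", "B", "B"],
        "year": [2020, 2021, 2020, 2021],
        "withdrawal": [10.0, 12.0, 7.0, 9.0],
    }
)
regions = pd.DataFrame({"county": ["A", "B"], "region": ["East", "West"]})

# Summary stats for one column
stats = df["withdrawal"].describe()

# Subset rows that meet a condition
recent = df[df["year"] == 2021]

# Sort by a column, largest first
ordered = df.sort_values("withdrawal", ascending=False)

# Pivot to a wide table (one column per year), then melt back to long form
wide = df.pivot(index="county", columns="year", values="withdrawal")
long = wide.reset_index().melt(id_vars="county", value_name="withdrawal")

# Aggregate by group, and join to a second table on a shared column
totals = df.groupby("county")["withdrawal"].sum()
joined = df.merge(regions, on="county")

print(stats)
print(wide)
print(joined)
```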

Pandas also offers commands to readily read data stored in a variety of formats. Comma-separated-value (CSV) format is one of the more popular formats, but Pandas can also easily read in any delimited text files, JSON, HTML, MS Excel, HDF, Stata, SAS, and some SQL server formats. (See full list from Pandas docs.)
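As a small illustration, the sketch below reads the same made-up records from comma-delimited, tab-delimited, and JSON text. In-memory strings stand in for files here so the example is self-contained; with real data you would pass a file path or URL instead.

```python
import io
import pandas as pd

# Made-up content standing in for files on disk
csv_text = "id,name,value\n1,a,0.5\n2,b,1.5\n"
tsv_text = "id\tname\tvalue\n1\ta\t0.5\n2\tb\t1.5\n"
json_text = '[{"id": 1, "name": "a", "value": 0.5}, {"id": 2, "name": "b", "value": 1.5}]'

# Comma-separated values
df_csv = pd.read_csv(io.StringIO(csv_text))

# Any other delimiter: tell read_csv what the separator is
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# JSON records
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json.dtypes)
```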

So yeah, Pandas is a “Swiss army knife” for data analysis in Python!

Lesson Plan

A series of notebooks is included in the Scientific Computing GitHub repository that you have already forked and cloned prior to working on your NumPy exercises. These notebooks, found in the Pandas folder, include the following, listed with their learning objectives:

Recording Highlights

5.3.1 Pandas - Intro to Data Frames

Time	Topic
0:45	What is a Data Frame?
3:10	Dataframe as a list of lists
5:20	Dataframe as a collection of dictionaries
10:20	Loading data into a dataframe (`read_csv()`)
14:30	Exploring your data - revealing column data types
16:35	Specifying data types when importing with `read_csv()`
18:06	Specifying a column to be the index when importing with `read_csv()`
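The topics in this recording can be sketched in a few lines. This is not the notebook's own code: the column names and the in-memory CSV text are made up so the example runs on its own.

```python
import io
import pandas as pd

# From a list of lists: each inner list is one row
df_lists = pd.DataFrame([["a", 1], ["b", 2]], columns=["name", "count"])

# From a dictionary: each key is a column name, each value holds that column's data
df_dict = pd.DataFrame({"name": ["a", "b"], "count": [1, 2]})

# From a list of dictionaries: each dictionary is one row
df_records = pd.DataFrame([{"name": "a", "count": 1}, {"name": "b", "count": 2}])

# Made-up CSV text standing in for a file
csv_text = "site_id,name,count\n00123,a,1\n00456,b,2\n"

# By default, read_csv guesses data types, so the IDs lose their leading zeros
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.dtypes)

# Specifying a data type keeps the IDs as text
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"site_id": str})

# Specifying a column to use as the index
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col="name")
print(df_indexed)
```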

5.3.2 Pandas - Exploring Data

Time	Topic
0:20	Inspecting the data with `head()`, `tail()` and `sample()`
1:45	Reading in raw text files stored on the internet; more on `read_csv()`
4:45	Revealing aspects of your dataframe: `len(df)`, `df.shape`, `df.size`
6:42	Listing columns of your dataframe with `df.columns`
8:05	Listing the index of your dataframe with `df.index`
8:55	Setting the index column when reading in your data with `read_csv()`
10:35	Listing data frame info with `df.info()`
13:15	Selecting specific columns from your data frame into a new dataframe
16:15	Selecting a single column into a Series object; what is a Series object?
17:20	Referring to columns: brackets (`df['column']`) vs dot notation (`df.column`)
19:21	Descriptive statistics for a column of data
20:39	Quantiles
21:25	Correlations among numeric columns
22:58	Styling your correlation output
24:47	Generating summary stats with `df.describe()`
25:25	Listing unique values and number of unique values with `df['column'].unique()` and `df.nunique()`
26:52	Listing number of records for each value with `value_counts()`
28:12	Basic plots in Pandas: histograms with `df.hist()`
30:32	Boxplots
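Most of the inspection commands covered in this recording fit into one short sketch. The data below are made up and read from an in-memory string rather than a real file; the plotting commands are left out.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file or URL
csv_text = """site,state,flow,depth
s1,NC,10.0,1.0
s2,NC,20.0,2.0
s3,VA,30.0,3.0
s4,VA,40.0,4.0
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="site")

# Peek at the data and reveal its size
print(df.head(2))
print(len(df), df.shape, df.size)

# List the columns and the index, then summarize the dataframe
print(df.columns.tolist())
print(df.index.tolist())
df.info()

# A list of columns gives a new dataframe; a single column gives a Series
subset = df[["state", "flow"]]
flow = df["flow"]

# Descriptive statistics, quantiles, and correlations
print(flow.mean(), flow.quantile(0.5))
corr = df[["flow", "depth"]].corr()
print(df.describe())

# Unique values, number of unique values, and records per value
print(df["state"].unique(), df["state"].nunique())
counts = df["state"].value_counts()
print(counts)
```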

3. Pandas - Analysis 1

Note that from 2:30-8:15 in the recording, I encountered some odd bug in Pandas that crashed my Jupyter session. Whatever issue that was has been fixed in the current version of Pandas. So, you can simply skip that section of the recording. Or you can watch and laugh at me (with me?) as I struggle to debug in real time…

Time	Topic
1:12	Calculating and updating fields
2:30	Debugging - not required viewing!
8:15	Resume calculating and updating fields
10:54	Selecting data: `iloc` and referring to cells by absolute position
16:48	Selecting data: `loc` and labeled rows (index) and columns
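A minimal sketch of the ideas in this recording, using a made-up dataframe: computing a new field, then selecting by position with `iloc` and by label with `loc`.

```python
import pandas as pd

# Made-up data with a labeled index
df = pd.DataFrame(
    {"population": [100, 200, 300], "area": [10.0, 20.0, 15.0]},
    index=["a", "b", "c"],
)

# Calculate a new field from existing ones, then update it in place
df["density"] = df["population"] / df["area"]
df["density"] = df["density"].round(1)

# iloc: select by absolute position (zero-based; slice end is excluded)
first_row = df.iloc[0]
some_rows = df.iloc[[0, 2]]
row_slice = df.iloc[0:2]
cell = df.iloc[1, 0]

# loc: select by index label and column name (label slices include the end)
area_b = df.loc["b", "area"]
label_slice = df.loc["a":"b", ["population", "density"]]

print(df)
print(label_slice)
```

Note the difference in slicing: `df.iloc[0:2]` stops before position 2, while `df.loc["a":"b"]` includes the row labeled "b".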

4. Pandas - Analysis 2

Time	Topic
1:02	Selecting data: Queries
9:22	Selecting data: Boolean masks
16:13	Updating values in selected rows
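The two selection styles in this recording, queries and Boolean masks, usually get you to the same rows. Here is a small sketch with made-up data, followed by an update of the selected rows.

```python
import pandas as pd

# Made-up data
df = pd.DataFrame(
    {
        "state": ["NC", "NC", "VA", "VA"],
        "flow": [10.0, 20.0, 30.0, 40.0],
    }
)

# Query: criteria written as a string; @ refers to a Python variable
high_query = df.query("flow > 15 and state == 'NC'")
threshold = 25
above = df.query("flow > @threshold")

# Mask: a Boolean Series, one True/False per row, used to filter the dataframe
mask = (df["flow"] > 15) & (df["state"] == "NC")
high_mask = df[mask]

# Update values in the selected rows and column using loc with a mask
df.loc[df["state"] == "VA", "flow"] = 0.0

print(high_query)
print(df)
```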

5. Pandas - Analysis 3

Time	Topic
0:00	Grouping data
3:10	The “`DataFrameGroupBy`” object
4:22	Computing summary stats on a `DataFrameGroupBy` object
8:00	Subsetting specific columns from the `DataFrameGroupBy` object summary stats
14:14	Specifying multiple aggregating functions on a `DataFrameGroupBy` object
16:50	Transforming/pivoting data
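As a rough sketch of grouping, aggregating, and pivoting, the example below uses a made-up dataframe. It is an illustration of the general pattern, not the notebook's own code.

```python
import pandas as pd

# Made-up data
df = pd.DataFrame(
    {
        "state": ["NC", "NC", "VA", "VA"],
        "year": [2020, 2021, 2020, 2021],
        "flow": [10.0, 20.0, 30.0, 40.0],
        "depth": [1.0, 2.0, 3.0, 4.0],
    }
)

# Grouping returns a DataFrameGroupBy object; nothing is computed yet
grouped = df.groupby("state")
print(type(grouped).__name__)

# Summary stats on selected columns of the grouped object
means = grouped[["flow", "depth"]].mean()
flow_sum = grouped["flow"].sum()

# Several aggregating functions at once
multi = grouped["flow"].agg(["min", "max", "mean"])

# Pivot: one row per state, one column per year
pivot = df.pivot_table(index="state", columns="year", values="flow", aggfunc="sum")

print(means)
print(multi)
print(pivot)
```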

6. Quick Plots with Pandas

Time	Topic
0:32	Jupyter’s “magic commands” (to allow inline plotting)
2:40	Types of plots
3:30	Creating a default line plot; when is a line plot appropriate
4:45	Altering the `kind` of plot: `bar`, `barh`, `pie`, …
5:45	Setting the figure size with the `figsize` parameter
6:45	Transforming axes with `logx` and `logy`
7:10	Adding a title
10:50	Creating and using the `plot` object
13:10	More advanced plotting: Stacked bar charts
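Here is a minimal plotting sketch along the lines of this recording, assuming matplotlib is installed (Pandas uses it for plotting). The data and title are made up. In a notebook you would typically run the `%matplotlib inline` magic command instead of selecting the "Agg" backend, which is used here only so the sketch runs outside Jupyter.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd

# Made-up data, indexed by year so the year becomes the x-axis
df = pd.DataFrame(
    {"year": [2019, 2020, 2021], "a": [1, 10, 100], "b": [2, 20, 200]}
).set_index("year")

# Default plot: one line per column
ax_line = df.plot()

# Change the kind of plot, set the figure size, log-transform the y-axis, add a title
ax_bar = df.plot(
    kind="bar",
    figsize=(6, 4),
    logy=True,
    title="Made-up values (log scale)",
)

# The returned plot object can be modified further
ax_bar.set_ylabel("value")

# Stacked bar chart
ax_stacked = df.plot(kind="bar", stacked=True)
```

Each call to `df.plot()` returns a matplotlib axes object, which is what lets you keep adjusting labels and other properties after the plot is created.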