Data Structures - Pandas Dataframes

ENV 859 - Geospatial Data Analytics   |   Fall 2024   |   Instructor: John Fay  

In this lesson, we continue to examine packages that enable scientific data analysis in Python. Moving on from NumPy’s n-dimensional array, we now focus on Pandas’ dataframe object. We introduce what the dataframe is and then dive into Pandas’ ability to create dataframes from our data, and into how dataframes and dataframe operations make many familiar analyses easier and enable quite a few new ones.

Like our NumPy lesson, the overall description of Pandas and dataframes is relatively short. The real learning comes from diving in and running through some Jupyter Notebooks, where we apply and tinker with code.

Topic and Learning Objectives

What is Pandas?
  • Explain Pandas’ importance in Python’s data science stack

Exploring Data using Pandas
  • Describe the basic form of a Pandas dataframe
  • Load data from a CSV file into a dataframe
  • View and inspect data/dataframe properties
  • Select columns from a dataframe
  • Generate descriptive statistics from data in a dataframe
  • Create some basic plots in Pandas

Data Analysis in Pandas
  • Calculating and updating fields
  • Selecting data in a dataframe
      - Selecting single rows, multiple rows, or row slices using iloc
      - Selecting rows and columns using iloc
      - Selecting rows and columns using loc
      - Selecting rows based on criteria - using queries
      - Selecting rows based on criteria - using masks
      - Updating values in selected rows/columns
  • Grouping and aggregating data in a dataframe
  • Transforming data with Pivot Tables

Quick Plots with Pandas
  • Brief overview of plotting using Pandas

What is Pandas?

Pandas has often been termed “the Swiss Army knife” of data wrangling in Python, an apt description because of its overall utility in many aspects of reading, exploring, summarizing, visualizing, and exporting a multitude of data formats. Like NumPy, Pandas brings an important new data structure to our coding environment. This is the dataframe.

What is a Data Frame?

We will have plenty of opportunity to create and tinker with dataframes, which will be much more explanatory than my trying to describe what a dataframe is here. Briefly, however, a dataframe stores data in rows and columns. You may think this is a limitation compared to NumPy, whose arrays can store data in many more than two dimensions, but with proper structuring, a dataframe can actually hold many dimensions of data in its rows and columns.
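
As a rough sketch of what "proper structuring" can look like, the example below stores a small site × year × variable cube in an ordinary two-dimensional dataframe by using a multi-level row index. The site names, variables, and values are made up for illustration, and this is only one of several ways to organize such data.

```python
import pandas as pd

# Made-up measurements: each tuple is (site, year, variable, value)
records = [
    ("A", 2020, "temp", 14.2),
    ("A", 2020, "precip", 101.0),
    ("A", 2021, "temp", 14.9),
    ("A", 2021, "precip", 95.5),
    ("B", 2020, "temp", 11.7),
    ("B", 2020, "precip", 130.2),
]
df_long = pd.DataFrame(records, columns=["site", "year", "variable", "value"])

# Use three columns as a multi-level row index: one level per "dimension"
df_cube = df_long.set_index(["site", "year", "variable"]).sort_index()

# Look up a single value by its three coordinates
temp_a_2021 = df_cube.loc[("A", 2021, "temp"), "value"]
print(temp_a_2021)

# Move one dimension from the rows into the columns
df_wide = df_cube["value"].unstack("variable")
print(df_wide)
```

Here the rows of `df_cube` carry three dimensions at once, and `unstack()` shows that a dimension can be shifted between rows and columns as needed.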

Data frames are also central to the notion of tidy data, in that by organizing your data into a dataframe, you are likely to consider your data as a series of observed features (rows) with one or more observed attributes (columns). In a Pandas dataframe, values in a given column all share the same data type (string, integer, floating point number, geometry, etc.), and each row has an index value, which is unique by default. Thus each value in our dataframe is referenced by its row index and column header.
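
To make this concrete, here is a minimal sketch using made-up stream data; the row labels and column names are hypothetical. It shows that each column has a single data type and that a value can be retrieved by its row index and column header.

```python
import pandas as pd

# Made-up observations: one row per feature, one column per attribute
df = pd.DataFrame(
    {
        "name": ["Eno", "Neuse", "Haw"],
        "flow_cfs": [52.5, 310.0, 120.25],
        "n_samples": [12, 8, 15],
    },
    index=["site_1", "site_2", "site_3"],
)

# Each column has one data type
print(df.dtypes)

# Reference a single value by row index and column header
value = df.loc["site_2", "n_samples"]
print(value)
```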

This may seem fairly elementary, but the capability of this structure, and Pandas’ ability to work with it, has powerful ramifications. Built to leverage NumPy’s speed and agility, Pandas and its dataframe allow us to quickly:

  • compute summary stats for the entire dataframe or subsets
  • subset/select/query specific rows and/or columns
  • sort, transform, pivot, melt data
  • aggregate and join
  • plot data
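
The sketch below runs through most of the operations in this list on a tiny made-up table (the column names and values are hypothetical). Plotting is left out here so the example needs nothing beyond Pandas itself.

```python
import pandas as pd

# Made-up annual values for three states
df = pd.DataFrame({
    "state": ["NC", "NC", "VA", "VA", "SC"],
    "year": [2020, 2021, 2020, 2021, 2020],
    "withdrawal": [10.0, 12.0, 7.0, 9.0, 4.0],
})
# A hypothetical lookup table to join against
lookup = pd.DataFrame({"state": ["NC", "VA", "SC"],
                       "capital": ["Raleigh", "Richmond", "Columbia"]})

summary = df["withdrawal"].describe()                    # summary stats
subset = df[df["year"] == 2020]                          # select rows
ranked = df.sort_values("withdrawal", ascending=False)   # sort
wide = df.pivot(index="state", columns="year", values="withdrawal")  # pivot
totals = df.groupby("state")["withdrawal"].sum()         # aggregate
joined = df.merge(lookup, on="state", how="left")        # join

print(summary, subset, ranked, wide, totals, joined, sep="\n\n")
```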

Pandas also offers commands to readily read data stored in a variety of formats. Comma-separated-value (CSV) format is one of the more popular formats, but Pandas can also easily read in any delimited text files, JSON, HTML, MS Excel, HDF, Stata, SAS, and some SQL server formats. (See the full list in the Pandas docs.)
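
As a small illustration, the sketch below reads CSV text with `read_csv()`. To keep it self-contained, the "file" is an in-memory string with made-up contents; in practice you would pass a file path or URL instead.

```python
import io
import pandas as pd

# Made-up CSV contents standing in for a file on disk
csv_text = """site_id,name,flow_cfs
001,Eno,52.5
002,Neuse,310.0
003,Haw,120.25
"""

# Default import: Pandas infers types, so "001" becomes the integer 1
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.dtypes)

# Keep site_id as text and use the name column as the row index
df = pd.read_csv(io.StringIO(csv_text), dtype={"site_id": str}, index_col="name")
print(df)

# Other delimiters are handled with the sep parameter
tsv_text = csv_text.replace(",", "\t")
df_tab = pd.read_csv(io.StringIO(tsv_text), sep="\t")
```

Note how specifying the data type preserves the leading zeros in `site_id`, which matters for codes such as FIPS or site identifiers.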

So yeah, Pandas is a “Swiss army knife” for data analysis in Python!


Lesson Plan

A series of notebooks is included in the Scientific Computing GitHub repository that you have already forked and cloned prior to working on your NumPy exercises. These notebooks, found in the Pandas folder, include the following, listed with their learning objectives:

More on Pandas


Recording Highlights

5.3.1 Pandas - Intro to Data Frames

Time Topic
0:45 What is a Data Frame?
3:10 Dataframe as a list of lists
5:20 Dataframe as a collection of dictionaries
10:20 Loading data into a dataframe (read_csv())
14:30 Exploring your data - revealing column data types
16:35 Specifying data types when importing with read_csv()
18:06 Specifying a column to be the index when importing with read_csv()
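
For reference, here is a minimal sketch of building the same small dataframe from a list of lists and from dictionaries. The values and column names are made up, and the notebooks may use different examples.

```python
import pandas as pd

# From a list of lists: each inner list is one row
rows = [["Eno", 52.5], ["Neuse", 310.0]]
df_from_lists = pd.DataFrame(rows, columns=["name", "flow_cfs"])

# From a dictionary of columns: each key is a column name
columns = {"name": ["Eno", "Neuse"], "flow_cfs": [52.5, 310.0]}
df_from_dict = pd.DataFrame(columns)

# From a list of dictionaries: each dictionary is one row
records = [{"name": "Eno", "flow_cfs": 52.5},
           {"name": "Neuse", "flow_cfs": 310.0}]
df_from_records = pd.DataFrame(records)

# All three approaches produce the same dataframe
print(df_from_lists.equals(df_from_dict))
print(df_from_dict.equals(df_from_records))
```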

5.3.2 Pandas - Exploring Data

Time Topic
0:20 Inspecting the data with head(), tail() and sample()
1:45 Reading in raw text files stored on the internet; more on read_csv()
4:45 Revealing aspects of your dataframe: len(df), df.shape, df.size
6:42 Listing columns of your dataframe with df.columns
8:05 Listing the index of your dataframe with df.index
8:55 Setting the index column when reading in your data with read_csv()
10:35 Listing data frame info with df.info()
13:15 Selecting specific columns from your data frame into a new dataframe
16:15 Selecting a single column into a Series object; what is a Series object?
17:20 Referring to columns: brackets (df['column']) vs dot notation (df.column)
19:21 Descriptive statistics for a column of data
20:39 Quantiles
21:25 Correlations among numeric columns
22:58 Styling your correlation output
24:47 Generating summary stats with df.describe()
25:25 Listing unique values and number of unique values with unique() and nunique()
26:52 Listing number of records for each value with value_counts()
28:12 Basic plots in Pandas: histograms with df.hist()
30:32 Boxplots
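
The sketch below strings together several of the inspection commands listed above on a tiny made-up dataframe (the column names and values are hypothetical). Plotting is omitted here.

```python
import pandas as pd

# Made-up data to explore
df = pd.DataFrame({
    "basin": ["Neuse", "Neuse", "Cape Fear", "Cape Fear", "Neuse"],
    "flow_cfs": [52.5, 310.0, 120.0, 80.0, 45.0],
    "depth_m": [0.5, 2.0, 1.1, 0.9, 0.4],
})

print(df.head(2))                  # first two rows
print(len(df), df.shape, df.size)  # row count, (rows, columns), cell count
print(list(df.columns))            # column names
df.info()                          # column types and non-null counts

flows = df["flow_cfs"]             # a single column is a Series
median_flow = flows.quantile(0.5)  # the 50th percentile

basins = df["basin"].unique()        # array of distinct values
n_basins = df["basin"].nunique()     # number of distinct values
counts = df["basin"].value_counts()  # records per value

stats = df.describe()                      # summary stats, numeric columns
corr = df[["flow_cfs", "depth_m"]].corr()  # correlation matrix
print(stats, corr, sep="\n\n")
```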

3. Pandas - Analysis 1

Note that from 2:30-8:15 in the recording, I encountered some odd bug in Pandas that crashed my Jupyter session. Whatever issue that was has been fixed in the current version of Pandas. So, you can simply skip that section of the recording. Or you can watch and laugh at me (with me?) as I struggle to debug in real time…

Time Topic
1:12 Calculating and updating fields
2:30 Debugging - not required viewing!
8:15 Resume calculating and updating fields
10:54 Selecting data: iloc and referring to cells by absolute position
16:48 Selecting data: loc and labeled rows (index) and columns
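
As a quick reference for these topics, here is a minimal sketch on made-up data (hypothetical row labels and column names). It computes a new field, then selects by position with `iloc` and by label with `loc`.

```python
import pandas as pd

# Made-up data with labeled rows
df = pd.DataFrame(
    {"flow_cfs": [52.5, 310.0, 120.0], "area_km2": [10.0, 62.0, 30.0]},
    index=["Eno", "Neuse", "Haw"],
)

# Calculate a new field from existing columns
df["flow_per_km2"] = df["flow_cfs"] / df["area_km2"]

# iloc: select by integer position (end of a slice is excluded)
first_row = df.iloc[0]     # first row, returned as a Series
first_two = df.iloc[0:2]   # rows at positions 0 and 1
cell = df.iloc[1, 0]       # second row, first column

# loc: select by label (end of a slice is included)
by_label = df.loc["Haw", "flow_cfs"]
label_slice = df.loc["Eno":"Neuse", ["flow_cfs", "area_km2"]]

print(first_two, label_slice, sep="\n\n")
```

The key difference to watch for is that position slices with `iloc` exclude the end point, while label slices with `loc` include it.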

4. Pandas - Analysis 2

Time Topic
1:02 Selecting data: Queries
9:22 Selecting data: Boolean masks
16:13 Updating values in selected rows
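
The sketch below shows the same row selection done with a query and with a Boolean mask, followed by an update of selected rows. The data are made up, and treating -999 as a "no data" code is an assumption for the sake of the example.

```python
import pandas as pd

# Made-up data; -999 stands in for a hypothetical "no data" code
df = pd.DataFrame({
    "basin": ["Neuse", "Neuse", "Cape Fear", "Cape Fear"],
    "flow_cfs": [52.5, 310.0, 120.0, -999.0],
})

# Select rows with a query string
high_q = df.query("flow_cfs > 100")

# Select the same rows with a Boolean mask
mask = df["flow_cfs"] > 100
high_m = df[mask]

# Combine criteria with & (and) or | (or); parentheses are required
neuse_high = df[(df["basin"] == "Neuse") & (df["flow_cfs"] > 100)]

# Update values in the selected rows and column using loc
df.loc[df["flow_cfs"] < 0, "flow_cfs"] = float("nan")

print(high_q, neuse_high, df, sep="\n\n")
```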

5. Pandas - Analysis 3

Time Topic
0:00 Grouping data
3:10 The “DataFrameGroupBy” object
4:22 Computing summary stats on a DataFrameGroupBy object
8:00 Subsetting specific columns from the DataFrameGroupBy object summary stats
14:14 Specifying multiple aggregating functions on a DataFrameGroupBy object
16:50 Transforming/pivoting data
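
Here is a minimal sketch of grouping, aggregating, and pivoting, again on made-up data with hypothetical column names.

```python
import pandas as pd

# Made-up records: several values per state and year
df = pd.DataFrame({
    "state": ["NC", "NC", "NC", "VA", "VA", "VA"],
    "year": [2020, 2020, 2021, 2020, 2021, 2021],
    "withdrawal": [10.0, 2.0, 12.0, 7.0, 9.0, 1.0],
})

# Grouping returns a DataFrameGroupBy object; nothing is computed yet
grouped = df.groupby("state")
print(type(grouped).__name__)

# Summary stat for one column of the grouped object
totals = grouped["withdrawal"].sum()

# Several aggregating functions at once
stats = grouped["withdrawal"].agg(["sum", "mean", "max"])

# Pivot: states as rows, years as columns, summed values in the cells
wide = df.pivot_table(index="state", columns="year",
                      values="withdrawal", aggfunc="sum")

print(totals, stats, wide, sep="\n\n")
```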

6. Quick Plots with Pandas

Time Topic
0:32 Jupyter’s “magic commands” (to allow inline plotting)
2:40 Types of plots
3:30 Creating a default line plot; when is a line plot appropriate
4:45 Altering the kind of plot: bar, barh, pie, …
5:45 Setting the figure size with the figsize parameter
6:45 Transforming axes with logx and logy
7:10 Adding a title
10:50 Creating and using the plot object
13:10 More advanced plotting: Stacked bar charts