Data Structures - Pandas Dataframes
In this lesson, we continue to examine packages that enable scientific data analysis in Python. Moving on from NumPy’s n-dimensional array, we now focus on Pandas’ dataframe object. We introduce what the dataframe is, then dive into how Pandas creates dataframes from our data and how dataframes and dataframe operations make many familiar analyses easier and enable quite a few new ones.
Like our NumPy lesson, the overall description of Pandas and dataframes is relatively short. The real learning comes from diving in and running through some Jupyter Notebooks, where we apply and tinker with code.
Topic | Learning Objective |
---|---|
What is Pandas? | • Explain Pandas’ importance in Python’s data science stack |
Exploring Data using Pandas | • Describe the basic form of a Pandas dataframe • Load data from a CSV file into a dataframe • View and inspect data/dataframe properties • Select columns from a dataframe • Generate descriptive statistics from data in a dataframe • Create some basic plots in Pandas |
Data Analysis in Pandas | • Calculating and updating fields • Selecting data in a dataframe - Selecting single rows, selected rows, or row slices using iloc - Selecting rows and columns using iloc - Selecting rows and columns using loc - Selecting rows based on criteria - using queries - Selecting rows based on criteria - using masks - Updating values in selected rows/columns • Grouping and aggregating data in a dataframe • Transforming data with Pivot Tables |
Quick Plots with Pandas | • Brief overview of plotting using Pandas |
What is Pandas?
Pandas has often been termed “the Swiss Army knife” of data wrangling in Python, an apt description because of its overall utility in many aspects of reading, exploring, summarizing, visualizing, and exporting a multitude of data formats. Like NumPy, Pandas brings an important new data structure to our coding environment. This is the dataframe.
What is a Data Frame?
We will have plenty of opportunity to create and tinker with dataframes, which will be much more explanatory than my trying to describe what a dataframe is here. Briefly, however, a dataframe stores data in rows and columns. You may think this is a limitation compared to NumPy, whose arrays can store data in many more than two dimensions, but with proper structuring, a dataframe can actually hold many dimensions of data in its rows and columns.
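To make the "many dimensions in rows and columns" idea concrete, here is a minimal sketch, assuming a made-up 3-D NumPy array of measurements. The site, year, and month labels are hypothetical and chosen only for illustration; the approach shown (a multi-level row index) is one possible structuring, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-D array: 2 sites x 2 years x 3 months
values = np.arange(12).reshape(2, 2, 3)

# Each array axis becomes one level of the row index,
# so every row is one (site, year, month) combination
index = pd.MultiIndex.from_product(
    [["A", "B"], [2020, 2021], [1, 2, 3]],
    names=["site", "year", "month"],
)
df = pd.DataFrame({"value": values.ravel()}, index=index)

# Look up one value by its three labels; this matches values[1, 1, 1]
print(df.loc[("B", 2021, 2), "value"])
```

The dataframe still has only rows and columns, but the three original dimensions survive as index levels.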
Data frames are also central to the notion of tidy data, in that by organizing your data into a dataframe, you are likely to consider your data as a series of observed features (rows) with one or more observed attributes (columns). In a Pandas dataframe, values in a given column all share the same data type (string, integer, floating point number, geometry, etc.), and each row has a unique index value. Thus each value in our dataframe is referenced by its row index and column header.
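As a small sketch of that structure, the example below builds a dataframe by hand. The column names, row labels, and values are invented for illustration and do not come from the lesson's notebooks.

```python
import pandas as pd

# Hypothetical observations: one row per feature, one column per attribute
df = pd.DataFrame(
    {
        "name": ["site_a", "site_b", "site_c"],
        "depth_m": [1.5, 3.0, 4.5],
        "samples": [2, 5, 9],
    },
    index=["r1", "r2", "r3"],  # a unique index value for each row
)

# Each column has a single data type
print(df.dtypes)

# A value is referenced by its row index and column header
print(df.loc["r2", "depth_m"])
```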
This may seem fairly elementary, but this structure, together with Pandas’ ability to work with it, has powerful ramifications. Built to leverage NumPy’s speed and agility, Pandas and its dataframe allow us to quickly:
- compute summary stats for the entire dataframe or subsets
- subset/select/query specific rows and/or columns
- sort, transform, pivot, melt data
- aggregate and join
- plot data
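A few of the operations listed above can be sketched in a handful of lines. The data here are made up, and the column names (`county`, `year`, `use`, `region`) are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical records
df = pd.DataFrame({
    "county": ["A", "A", "B", "B"],
    "year": [2019, 2020, 2019, 2020],
    "use": [10.0, 12.0, 7.0, 9.0],
})
# Hypothetical lookup table to join against
lookup = pd.DataFrame({"county": ["A", "B"], "region": ["East", "West"]})

mean_use = df["use"].mean()                      # summary statistic
recent = df[df["year"] == 2020]                  # subset rows
ranked = df.sort_values("use", ascending=False)  # sort
totals = df.groupby("county")["use"].sum()       # aggregate
joined = df.merge(lookup, on="county")           # join

print(mean_use)
print(totals)
print(joined)
```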
Pandas also offers commands to readily read data stored in a variety of formats. Comma-separated-value (CSV) format is one of the more popular formats, but Pandas can also easily read in any delimited text files, JSON, HTML, MS Excel, HDF, Stata, SAS, and some SQL server formats. (See full list from Pandas docs.)
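As a rough sketch of two of these readers, the example below parses a tab-delimited string and a JSON string. The text is wrapped in `io.StringIO` so the example runs without any files; in practice you would usually pass a file path or URL. The sample content is made up.

```python
import io
import pandas as pd

# Tab-delimited text: read_csv() handles other delimiters via sep
tab_text = "site\tflow\nA\t1.5\nB\t2.0\n"
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

# JSON text stored as a list of records
json_text = '[{"site": "A", "flow": 1.5}, {"site": "B", "flow": 2.0}]'
df_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_tab)
print(df_json)
```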
So yeah, Pandas is a “Swiss army knife” for data analysis in Python!
Lesson Plan
A series of notebooks is included in the Scientific Computing GitHub repository that you have already forked and cloned prior to working on your NumPy exercises. These notebooks, found in the Panda Folder, include the following, listed with their learning objectives:
More on Pandas
- https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html
- https://www.datacamp.com/courses/pandas-foundations
- http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb
Recording Highlights
5.3.1 Pandas - Intro to Data Frames
Time | Topic |
---|---|
0:45 | What is a Data Frame? |
3:10 | Dataframe as a list of lists |
5:20 | Dataframe as a collection of dictionaries |
10:20 | Loading data into a dataframe (read_csv()) |
14:30 | Exploring your data - revealing column data types |
16:35 | Specifying data types when importing with read_csv() |
18:06 | Specifying a column to be the index when importing with read_csv() |
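The topics in this recording can be previewed with a short sketch. This is not the code from the recording; the data are made up, and "collection of dictionaries" is interpreted here as a list of dictionaries, which is an assumption.

```python
import io
import pandas as pd

# From a list of lists: each inner list is one row
df_rows = pd.DataFrame([[1, "a"], [2, "b"]], columns=["id", "label"])

# From a list of dictionaries: the keys become column names
df_dicts = pd.DataFrame([{"id": 1, "label": "a"}, {"id": 2, "label": "b"}])

# From CSV text (hypothetical content, wrapped in StringIO to avoid a file)
csv_text = "site_id,zip,pop\ns1,01001,100\ns2,01003,250\n"
df_csv = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"zip": str},   # keep leading zeros by reading zip as text
    index_col="site_id",  # use this column as the row index
)

# Reveal the column data types
print(df_csv.dtypes)
```

Without the `dtype` argument, the `zip` column would be read as integers and lose its leading zeros.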
5.3.2 Pandas - Exploring Data
Time | Topic |
---|---|
0:20 | Inspecting the data with head(), tail() and sample() |
1:45 | Reading in raw text files stored on the internet; more on read_csv() |
4:45 | Revealing aspects of your dataframe: len(df), df.shape, df.size |
6:42 | Listing columns of your dataframe with df.columns |
8:05 | Listing the index of your dataframe with df.index |
8:55 | Setting the index column when reading in your data with read_csv() |
10:35 | Listing data frame info with df.info() |
13:15 | Selecting specific columns from your data frame into a new dataframe |
16:15 | Selecting a single column into a Series object; what is a Series object? |
17:20 | Referring to columns: brackets (df['column']) vs dot notation (df.column) |
19:21 | Descriptive statistics for a column of data |
20:39 | Quantiles |
21:25 | Correlations among numeric columns |
22:58 | Styling your correlation output |
24:47 | Generating summary stats with df.describe() |
25:25 | Listing unique values and number of unique values with df.unique() and df.nunique() |
26:52 | Listing number of records for each value with value_counts() |
28:12 | Basic plots in Pandas: histograms with df.hist() |
30:32 | Boxplots |
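Several of the inspection commands in this recording can be tried on a tiny dataframe. The sketch below uses made-up tree measurements; the column names are hypothetical and the code is not taken from the recording. Plotting is left out here.

```python
import pandas as pd

# Hypothetical measurements
df = pd.DataFrame({
    "species": ["oak", "pine", "oak", "maple", "pine", "oak"],
    "height": [10.0, 12.0, 14.0, 8.0, 16.0, 12.0],
    "dbh": [20.0, 24.0, 28.0, 16.0, 32.0, 24.0],
})

print(df.head(3))                  # first three rows
print(len(df), df.shape, df.size)  # rows, (rows, columns), total cells
print(df.columns.tolist())
df.info()                          # column types and non-null counts

heights = df["height"]              # single brackets return a Series
subset = df[["species", "height"]]  # a list of names returns a dataframe

print(heights.mean(), heights.quantile(0.5))
print(df[["height", "dbh"]].corr())  # correlations among numeric columns
print(df.describe())

# unique() is called on a single column (a Series)
print(df["species"].unique(), df["species"].nunique())
print(df["species"].value_counts())
```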
3. Pandas - Analysis 1
Note that from 2:30-8:15 in the recording, I encountered an odd bug in Pandas that crashed my Jupyter session. Whatever that issue was, it has been fixed in the current version of Pandas, so you can simply skip that section of the recording. Or you can watch and laugh at me (with me?) as I struggle to debug in real time…
Time | Topic |
---|---|
1:12 | Calculating and updating fields |
2:30 | Debugging - not required viewing! |
8:15 | Resume calculating and updating fields |
10:54 | Selecting data: iloc and referring to cells by absolute position |
16:48 | Selecting data: loc and labeled rows (index) and columns |
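The difference between position-based and label-based selection is easiest to see side by side. This is a minimal sketch with invented data and hypothetical column names, not the recording's code.

```python
import pandas as pd

df = pd.DataFrame(
    {"area_km2": [10.0, 20.0, 40.0], "population": [100, 400, 200]},
    index=["a", "b", "c"],
)

# Calculate a new field from existing columns
df["density"] = df["population"] / df["area_km2"]

# iloc: absolute integer positions (end of a slice is excluded)
first_row = df.iloc[0]       # first row, returned as a Series
row_slice = df.iloc[0:2]     # rows at positions 0 and 1
cell_by_pos = df.iloc[1, 2]  # second row, third column

# loc: index labels and column names (end of a slice is included)
cell_by_label = df.loc["b", "density"]
block = df.loc["a":"b", ["population", "density"]]

print(cell_by_pos, cell_by_label)
print(block)
```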
4. Pandas - Analysis 2
Time | Topic |
---|---|
1:02 | Selecting data: Queries |
9:22 | Selecting data: Boolean masks |
16:13 | Updating values in selected rows |
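Queries and Boolean masks are two routes to the same selection, and a mask can also be used to update values in place. The sketch below assumes made-up flow data and a hypothetical threshold of 20.

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["a", "b", "c", "d"],
    "flow": [5.0, 15.0, 25.0, 35.0],
    "flag": ["ok", "ok", "ok", "ok"],
})

# Query: the criterion is written as a string
high_q = df.query("flow > 20")

# Boolean mask: a Series of True/False values, one per row
mask = df["flow"] > 20
high_m = df[mask]

# Update values in the selected rows of one column
df.loc[mask, "flag"] = "high"

print(high_q)
print(df)
```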
5. Pandas - Analysis 3
Time | Topic |
---|---|
0:00 | Grouping data |
3:10 | The “DataFrameGroupBy” object |
4:22 | Computing summary stats on a DataFrameGroupBy object |
8:00 | Subsetting specific columns from the DataFrameGroupBy object summary stats |
14:14 | Specifying multiple aggregating functions on a DataFrameGroupBy object |
16:50 | Transforming/pivoting data |
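Grouping, aggregating, and pivoting can be sketched on a four-row dataframe. The rainfall values and column names below are hypothetical, and the aggregating functions are picked only as examples.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "quarter": [1, 2, 1, 2],
    "rain_mm": [3.0, 5.0, 4.0, 8.0],
})

# groupby() returns a DataFrameGroupBy object; nothing is computed yet
grouped = df.groupby("region")
print(type(grouped).__name__)

# Summary stat for one column of the grouped object
means = grouped["rain_mm"].mean()

# Several aggregating functions at once
stats = grouped["rain_mm"].agg(["min", "max", "sum"])

# Pivot: one row per region, one column per quarter
wide = df.pivot_table(index="region", columns="quarter",
                      values="rain_mm", aggfunc="sum")

print(means)
print(stats)
print(wide)
```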
6. Quick Plots with Pandas
Time | Topic |
---|---|
0:32 | Jupyter’s “magic commands” (to allow inline plotting) |
2:40 | Types of plots |
3:30 | Creating a default line plot; when is a line plot appropriate |
4:45 | Altering the kind of plot: bar, barh, pie, … |
5:45 | Setting the figure size with the figsize parameter |
6:45 | Transforming axes with logx and logy |
7:10 | Adding a title |
10:50 | Creating and using the plot object |
13:10 | More advanced plotting: Stacked bar charts |
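The plotting options covered in this recording can be combined in a short sketch. It assumes matplotlib is installed (Pandas plotting depends on it) and selects the non-interactive "Agg" backend so the code also runs outside Jupyter; in a notebook you would typically rely on inline plotting instead. The data, titles, and labels are made up.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import pandas as pd

# Hypothetical yearly values, indexed by year
df = pd.DataFrame(
    {"surface": [1.0, 10.0, 100.0], "ground": [2.0, 4.0, 8.0]},
    index=[2018, 2019, 2020],
)

# Default line plot, with a figure size, a log-scaled y axis, and a title
ax_line = df.plot(figsize=(6, 4), logy=True, title="Hypothetical withdrawals")

# plot() returns a plot object (matplotlib Axes) we can keep modifying
ax_line.set_ylabel("volume")

# Change the kind of plot: a stacked bar chart
ax_bar = df.plot(kind="bar", stacked=True, title="Stacked by source")
```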