Fetching Data into Python

ENV 859 - Geospatial Data Analytics   |   Fall 2024   |   Instructor: John Fay  

To analyze data with Python, we first need to bring the data into our scripting environment. We’ve already seen how to read text files using Python’s built-in open function and how to read GIS tables using ArcPy’s cursor objects, but Python offers several other, more effective means of accessing external data. In this session, we examine a number of helpful Python packages and how they are used to access, fetch, unpack, and manage data in various formats and from various sources.


Lab Prep

  • Fork and clone the repository found here: https://github.com/ENV859/GettingData
  • To run these exercises, you will need your “gis” environment created in the Spatial DataFrames section. A shortcut to open this environment is provided in the repository above.

Session Notebooks & Learning Objectives

The specific exercise notebooks are fairly self-explanatory and review an array of methods used to access and download data from the internet. They also touch on a few concepts that we will dig deeper into in upcoming sessions.

Specific learning objectives include:

0-Importing-Local-Files
• Review how to load CSV file data into Python (a quick sketch of each approach follows):
  - Pure Python
  - The csv module
  - NumPy
  - Pandas
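As a condensed preview, here is a minimal sketch of the four approaches, assuming a local file named data.csv with a header row (the file name is a placeholder):

```python
import csv
import numpy as np
import pandas as pd

# Pure Python: read lines and split on commas (no quoting support)
with open('data.csv') as f:
    header = f.readline().strip().split(',')
    rows = [line.strip().split(',') for line in f]

# The csv module: handles quoting and alternate delimiters for us
with open('data.csv') as f:
    rows = list(csv.reader(f))

# NumPy: parse numeric columns into an array, skipping the header row
arr = np.genfromtxt('data.csv', delimiter=',', skip_header=1)

# Pandas: read everything into a DataFrame, with the header inferred
df = pd.read_csv('data.csv')
```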
1a-Getting-data-with-Pandas
1b-DEMO-Bulk-Download-with-Pandas
• Grab static online files with Pandas’ read_csv() function
• Bulk download data with Pandas (see the sketch below)
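Pandas’ read_csv() accepts a URL in place of a local file path, which also makes bulk downloads a simple loop. A minimal sketch, using a made-up URL pattern:

```python
import pandas as pd

# read_csv() treats a URL just like a local path
# (hypothetical example URL)
df = pd.read_csv('https://example.com/data/observations.csv')

# Bulk download: build the URL dynamically for each state and stack the results
# (the URL pattern and state list are placeholders)
frames = []
for state in ['NC', 'SC', 'VA']:
    frames.append(pd.read_csv(f'https://example.com/data/{state}.csv'))
combined = pd.concat(frames, ignore_index=True)
```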
2a-Fetching-Data-with-urllib
2b-Extract-Statewide-HUCs-with-urllib
• Use the urllib library to send web requests and handle responses
• Use the zipfile package to uncompress zipped files
• Form URLs interactively and handle them with urllib (sketched below)
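A minimal sketch of the fetch-and-unzip pattern, assuming a hypothetical URL for a zipped dataset:

```python
import urllib.request
import zipfile

# Fetch the remote file and save it locally; urlretrieve returns the local path
# (hypothetical example URL)
url = 'https://example.com/downloads/huc8.zip'
local_file, _ = urllib.request.urlretrieve(url, 'huc8.zip')

# Open the downloaded archive and extract everything into a folder
with zipfile.ZipFile(local_file) as z:
    z.extractall('huc8_data')
```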
The recordings below are optional…  
3-Fetching-files-with-ftplib [deprecated]
• Use the ftplib package to fetch data from FTP servers (see the sketch after this list):
  - Create a link to the FTP server
  - Log into the server (anonymously)
  - Navigate the server’s file structure
  - Create a list of files to fetch
  - Iterate through each file, fetch & unzip it
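Although the notebook is deprecated, the general pattern with Python’s built-in ftplib looks like this (the server address and folder are placeholders):

```python
from ftplib import FTP

# Connect to the server and log in anonymously
ftp = FTP('ftp.example.com')   # hypothetical server address
ftp.login()                    # no arguments = anonymous login

# Navigate to the data folder and list the zip files it holds
ftp.cwd('/pub/data')
zip_files = [f for f in ftp.nlst() if f.endswith('.zip')]

# Fetch each file in binary mode, writing it to a local copy
for fname in zip_files:
    with open(fname, 'wb') as out:
        ftp.retrbinary(f'RETR {fname}', out.write)

ftp.quit()
```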
4-Grabbing-HTML-tables-with-Pandas
• Fetch data from formatted HTML pages with Pandas’ read_html() (sketched below)
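read_html() scans a page for <table> elements and returns a list of DataFrames, one per table found. A minimal sketch with a placeholder URL (note that read_html() requires an HTML parser such as lxml to be installed):

```python
import pandas as pd

# Returns a list of DataFrames, one for each <table> on the page
# (hypothetical example URL)
tables = pd.read_html('https://example.com/stations.html')
df = tables[0]  # the first table found on the page
```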
5a-Scraping-Data-With-BeautifulSoup
• Use the requests library to build URLs programmatically
• Send URL requests and handle responses using requests.get()
• Parse raw HTML into searchable components with BeautifulSoup (sketched below)
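A minimal sketch of the request-then-parse workflow; the URL and query parameters are placeholders:

```python
import requests
from bs4 import BeautifulSoup

# requests assembles the query string from the params dict for us
# (hypothetical URL and parameters)
response = requests.get('https://example.com/search', params={'q': 'streamflow'})
response.raise_for_status()   # raise an error if the request failed

# Parse the raw HTML into a searchable object
soup = BeautifulSoup(response.text, 'html.parser')

# Example search: pull the href from every link on the page
links = [a.get('href') for a in soup.find_all('a')]
```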
6-Using-specialized-packages-to-grab-data
• Use the census package to download US Census data (example below)
• Explain the use of “keys” in download packages
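Many data services issue an API “key”: a token that identifies you to the service, which you pass when connecting. A minimal sketch with the census package, assuming you have requested a (free) Census API key:

```python
from census import Census

# Connect using your personal API key (placeholder shown here)
c = Census('YOUR_CENSUS_API_KEY')

# Fetch total population (ACS 5-year variable B01003_001E) for every state
records = c.acs5.get(('NAME', 'B01003_001E'), {'for': 'state:*'})
```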

→ Click on the link to fire up Jupyter in your cloned workspace and let’s go!


Video Highlights

6.3.2 Importing data from static text files

Time Topic
0:00 The data we will be importing
0:44 Reading data with base Python’s open() function
1:34 Using the csv module to read CSV files
2:47 Reading CSV files using NumPy’s genfromtxt() function
4:09 Reading CSV files using Pandas’ read_csv() function
6:23 Retrieving tab-delimited data from websites into a Pandas DataFrame, skipping commented lines with comment='#'
10:44 - Dropping rows (and columns) from DataFrames with drop()
11:09 - Using the inplace=True modifier in Pandas
11:38 - Skipping lines when reading text files with read_csv()'s skiprows parameter
13:45 Saving data to local files with to_csv()
16:20 Bulk downloading files with Pandas and Python
16:40 - Installing packages with pip inside a notebook
18:00 - Introducing the us package
18:15 - Iterating read_csv() with dynamic URLs to pull multiple datasets
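A sketch tying several of these steps together (the URL and column name are placeholders):

```python
import pandas as pd
import us   # the us package supplies state names and FIPS codes for building URLs

# Read tab-delimited web data, skipping comment lines that start with '#'
# (hypothetical example URL)
df = pd.read_csv('https://example.com/data/gauges.txt',
                 delimiter='\t', comment='#', skiprows=[1])

# Drop an unneeded column in place (no new DataFrame returned)
df.drop(columns=['site_notes'], inplace=True)   # hypothetical column name

# Save the cleaned table locally, omitting the row index
df.to_csv('gauges.csv', index=False)

# The us package makes dynamic URLs easy: one read_csv() per state
for state in us.states.STATES[:3]:
    print(state.abbr, state.fips)   # e.g., build a URL from state.fips
```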

6.3.3 Getting data with urllib and zipfile

Time Topic
0:50 Introducing the urllib and zipfile packages
1:12 The data we’ll be fetching: Census data
2:20 Using urllib.request.urlretrieve() to fetch and save web files
4:02 - Running local commands in Jupyter with the ! character
5:18 Unzipping a local zip file with the zipfile package
7:40 Remainder of the video is outdated (the remote server has been decommissioned)
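One handy notebook trick from these videos: prefixing a line with ! in a Jupyter cell hands it to the operating system’s shell, which is also how packages can be installed from inside a notebook:

```python
# In a Jupyter cell, ! runs the line as a shell command
!dir                  # list files on Windows (use !ls on Mac/Linux)
!pip install census   # install a package without leaving the notebook
```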