Problem Set 1, Data Engineering in ArcGIS Pro

ENV 859 - Geospatial Data Analytics | Fall 2025 | Instructor: John Fay

Overview

Scenario

Your research team wants to explore trends in sulfur dioxide (SO2) concentrations across the US, and your supervisor recommended starting your exploration with the EPA’s daily SO2 dataset. You will apply the techniques learned in our in-class exercise exploring EPA’s PM2.5 data to generate some summaries and exploratory visualizations of the SO2 dataset.

More about sulfur dioxide: https://www.epa.gov/so2-pollution.

Assignment

♦ Task 1: Prepare your workspace

You are welcome to use the same workspace you created for the in-class exercise using the PM2.5 data.

Ensure your workspace has the appropriate structure and components (e.g. folders and subfolders, ReadMe file, etc.) as described in the in-class exercise document.
Add the MS Word document linked here to your workspace (in the docs folder) and rename it <netID>_Notes.docx, replacing <netID> with your Duke NetID. This document will include various notes and figures supporting your observations as you complete the tasks listed below.

♦ Task 2: Obtain the 2021 Daily SO2 Data and import into ArcGIS Pro

2.1 Fetch the data

The data you’ll want to use are located in the same location as the PM25 data used in the class exercises: within the “pre-generated data files” link found on the EPA’s Outdoor Air Quality site. Navigate to the “tables of daily and daily summary data”, to the section on “criteria gases”, and download the 2021 data file under the “SO2 (42401)” column. (Note: We are not combining these data with the site dataset as we did in the in-class exercises.)

Store the [unzipped] file in the appropriate location in your workspace.
Add an appropriate metadata file for the downloaded dataset.

2.2 Import the data into ArcGIS Pro

Import the data into ArcGIS Pro and convert to a feature class, using the latitude and longitude columns for locating the features.

Take care to ensure the columns are assigned the appropriate data types.
You can ignore the fact that some records are referenced to WGS 84 and other to NAD 83, treating all records as if they were referenced to WGS 84.
Document your procedure for importing and converting the data in a geoprocessing model.

2.3 Explore your data for possible inconsistencies

Examine the spatial distribution of points and address sites that appear incorrectly located, if any. Summarize your observations and any actions you decided to take in the Notes.docx document created in Task 1. If you did perform any operations to address incorrectly imported data, include those steps in a geoprocessing model.
Examine the variety of collection techniques used in the dataset.
- How may unique Parameter Codes are included?
- What about Sample Duration?
- Method Name?
From the above, you might note that there is a bit of nuance and redundancy in the the SO2 values reported in the Arithmetic Mean column. First of all, these daily mean concentrations are computed from several concentration values collected throughout the day: both in 1-hour and 3-hour blocks, and thus each site actually reports two mean daily concentrations. Also, SO2 concentrations are collected using a variety of instruments.

If you were to use these data in an actual analysis, you would likely want to separate the 1-hour from the 3-hour data and at least note that different instruments may result in slightly different values or have different sensitivities. However, for the remainder of this exercise, we'll treat all daily arithmetic mean SO2 concentrations and independent measurements.
Examine the range and distribution of mean daily SO2 concentrations in your dataset. Report the overall mean, minimum, and maximum average daily SO2 concentration for all records in the dataset as well as the number of records with missing values (if any) in the Notes.docx document.
→ Round values to the nearest two decimal places, and include units

You may observe negative values in the dataset. How can there be a negative concentration??

Measurements are imprecise! Negative values are indeed incorrect. The EPA could omit these records or “fix” them by setting them to zero. But how might doing either of those actions affect the overall average value for a given site?
The EPA reports that the 99% of all records for a year should typically have a daily max below 75 parts per billion (ppb) [source]. Any values an order of magnitude above this (i.e. 750 ppb) are either extreme events or possibly errors.

Query your dataset to reveal how many records have a daily maximum (1st Max Value) above 750 ppb, and document this query in your geoprocessing model. (Tip: the Select Layer by Attribute tool reveals the count of records meeting the criteria.) Make a note of your findings in the Notes.docx file in your workspace. Feel free to include any plots that support your description.
Zoom your map to the selected record(s) to observe how many unique locations have SO2 concentrations exceeding this threshold.
You should find that two site had the three extremely high records.
- Zooming in and showing labels reveals one site is in the town of Wood River, Illinois. A web search on “Wood River SO2 2021” and you will find an explanation. (Specifically this article…)
- A search on the other site, which falls in Detroit, MI could lead to another explaining event: see this article.
So, these records appear to be real values, not errors; we should keep them! Make a note of this in your Notes.docx file.

♦ Task 3: Transform your data

At present, we have multiple overlapping features at each sampling location, one for each daily sampling event. To allow us to examine spatial patterns in air quality, we need to group our data by location and compute aggregate values of SO2 concentration. We want to calculate annual summaries: both the annual median (not mean!) of the daily averages and the highest recorded level for the year (i.e., the max of the 1st Max Value values). We also want to calculate monthly medians of the average daily values to explore values at different times in the year. (Again, we are using medians, not means in this step!)

Compute a SiteID field, comprised of the State Code, County Code, and Site Code fields (as we did in the in-class exercise).
Create a Month field, extracting the month component from the Date Local field (again, as we did in the in-class exercise).
From your dataset of daily SO2, compute a new dataset of the the median annual SO2 for each site (median of the daily arithmetic mean values) and the max of the 1st Max Value for each site.
Construct an additional feature class that holds the median monthly SO2 concentration for each site.
Pivot the median monthly SO2 result so that months appear as columns, not as values in a single column. The result should thus have more columns and fewer rows.
Join the pivoted table of monthly median concentrations to the feature class of annual median and annual max concentrations.
Document all these processes in a new or existing geoprocessing model.

♦ Task 4: Enrich your SO2 sample locations with additional data

Enrich your SO2 sample locations with the following variables, setting the search distance to 10km (straight line):
- ESRI’s 2023 Dominant Life Mode Group Name (found within the Tapestry category, LifeMode group).
- 2023 Median Household Income (#) (found within the Income category, Common Income Variables group).
- 2023 Median Home Value (#) (found within the Housing category, 2023 Home Value (ESRI) group).
Use the procedures outlined in the in-class exercise to compute the distance (in meters) to the nearest coal plant for each SO2 monitoring site.
Tag each SO2 sample site with the EPA Region in which it occurs:
- Search for “b2ee0c5d62db4d1e98e2e7bd0e2c37e5” in the Living Atlas, and add the feature layer to you map.
- Spatially join the EPA Region (“EPAREGION”) values to your SO2 monitoring points.
Save the enriched SO2 concentration feature attribute table as a CSV file in your workspace.
Document these steps in a new or existing geoprocessing model.
Include in your Notes.docx file, the following summary information of your transformed and enriched dataset:
- Number of rows in the resulting dataset
- Median distance to coal plant, to the nearest km (Note, the near tool always reports in meters!)
- Mean SO2 concentration for the month of January, to two decimal places

♦ Task 5: Explore and visualize patterns in your data

Add the figures described below to your Notes.docx document, and beneath each figure, briefly interpret what you see in a paragraph or two.
- Note: It’s ok if the visualizations show no pattern! Just interpret what you see, even if what you see is no relationship.
- Note: Figures can just be screen grabs (e.g. using the Snipping tool). They should be legible, but you need not spend a great deal of time making them beautiful.

- Figure 5.1: Distribution of Median Annual SO2 Concentrations

Zoom your map to the lower 48 United States (i.e. don’t display Alaska, Hawaii, or any of the US territories). And set your base map to be the Light Grey Canvas.
Display your sampling locations symbolized by the median of daily mean SO2 concentrations, using the default symbology generated by the Data Engineering interface.
Create a histogram showing the distribution of median annual SO2 concentrations (median of the daily means for 2021) for all sites, including those not shown in the map’s extent. Set the histogram to display data in 25 bins and to show the normal distribution line as well as the overall mean and median values.
Make a screen grab (Shift-Win-S)of the map and histogram and add it to your Word document. Be sure the legend for the SO2 concentration layer is visible. (It doesn’t have to be pretty; this is to check your work…)
Include a sentence or two interpreting what you see with respect to any spatial patterns in the SO2 concentration values.

- Figure 5.2: Distribution of Median Annual SO2 Concentrations, North Carolina Area

In the Map pane, use the Locate tool (in the Inquiry section of the menu bar) to open the Locate tool. Enter “NC, USA” to zoom your map to North Carolina.
Adjust your histogram of median annual SO2 concentrations so it shows a histogram for features in the map's extent.
Make a screen grab of the map and histogram and add it to your Word document.
Include a sentence or two interpreting what you see with respect to any spatial patterns in the SO2 concentration values in this figure.

- Figure 5.3: Average Median Annual SO2 Concentrations By EPA Regions

Create a bar plot showing the mean of the median annual SO2 concentrations by EPA Region (for all records, not just those in the NC region).
Create a box plot showing the same.
Copy both plots and paste them into your Notes.docx document.
( Note: Under the Export dropdown menu in the plot window is an option to copy the graphic…)
Describe how the two plots show the same data differently. Also mention which one you prefer and why.

- Figure 5.4: Create a scatterplot matrix of the monthly median SO2 concentrations

Create a scatterplot matrix showing the scatterplots of the different monthly median SO2 concentrations on the lower left side, the R2 values on the upper right, and the month names/numbers along the diagonal.
Add a screen grab of this plot to your Notes.docx document.
Add a sentence or two indicating the pair of consecutive months that are most correlated and the pair that are least, according to their R2 values.

- Figure 5.5 - Plot median home value as a function of median annual income

Here we start with an assumption that people with more income have higher priced homes. If this were 100% true, a scatterplot of median home value (y-axis) vs median annual income (x-axis) would fall tightly along a diagonal line. Of course, there will very likely be some deviation from this line, and we want to explore patterns in the deviation: points falling above the diagonal represent cases where people have more valuable houses per unit income. Here we create a visualization allowing us to explore multiple attributes regarding this relationship.

Symbolize the SO2 site features using the distance to the nearest coal plant field in the data engineering interface.
Create a scatter plot with 2023 Median Household Income on the X axis and 2023 Median Home Value on the Y axis.
Show the linear trend line in the plot and set the points to be proportional to the median annual SO2.
Add this plot and your interpretation to your results Word document. In particular, mention whether you think sites far from coal plants or sites with low SO2 concentrations fall on one side of the trend line or not.

- Figure 5.6 - Regional or categorical exploration of home value vs median annual income

Repeat the same plot as above, but have it show only data either for one EPA region,or one “Dominant Life Mode” group.
Describe your findings, being sure to include the subset of data you chose in your visualization.

♦ Task 6: Clean up and submit

Tidy your workspace by removing any temporary or otherwise unneeded files.
Be sure all geoprocessing models have appropriate names, i.e., names that allow your future self or others (graders!) to easily identify what that model does.
Zip up your entire workspace and submit it to Canvas.
Submit your <NetID>_Notes.docx to Canvas.

RUBRIC

Task#	Pts	Description
1	10	Workspace
	2.5	Submit a properly formatted workspace.
	2.5	Workspace is free of “clutter”: clean out any clearly unnecessary files ensure all files are in appropriate subfolders.
	2.5	Datasets and geoprocessing models have useful and intuitive names (e.g. not “Model1”).
	2.5	Folders and filenames do not have any spaces or odd characters other than an underscore (“_”).
2	16	Obtain and import data
	2	Correct dataset downloaded.
	2	Metadata file included with appropriate contents.
	2.5	Columns use the correct data type.
	2.5	Import process is correct and is documented in geoprocessing model.
	2.5	Comments on the spatial distribution of the imported SO2 sites are found in the “_Notes.txt" file in your workspace, specifically whether you needed to address any inconsistencies in the imported dataset.
	2.5	Comments (and supporting visualizations, if desired) regarding the range and distribution of median and max daily SO2 concentrations in the context of EPA’s stated normal ranges.
	2	Comments on how you decided to deal with extreme daily maximum outliers.
3	12.5	Data Transformation
	2.5	Appropriate grouping variables are added to your SO2 daily concentration features.
	2.5	Annual median and annual max values are computed correctly.
	2.5	Monthly median values are computed correctly.
	2.5	Monthly median values are transformed and added to the annual average and max values.
	2.5	All Task 3 steps are properly documented in a geoprocessing model.
4	12.5	Data Enrichment
	2.5	SO2 sample locations enriched with the variables specified.
	2.5	Correct distances assigned to nearest coal plant to each SO2 sample location.
	2.5	Correct EPA Regions assigned to each SO2 sample location.
	2.5	Data saved as a CSV file in your workspace.
	2.5	All Task 4 steps documented in geoprocessing model.
5	49	Data Exploration and Visualization
5.1	2.5	Map is correctly zoomed and shows SO2 sites using requested symbology and base map.
	2.5	Histogram shows SO2 concentrations in 25 bins with normal distribution line shown.
	2.5	Short interpretation included.
5.2	2.5	Map zoomed to North Carolina.
	2.5	Histogram reflect only values shown in map.
	2.5	Short interpretation included.
5.3	2.5	Bar plot of SO2 concentrations by EPA Region is shown.
	2.5	Box plot of SO2 concentrations by EPA Region is shown.
	2.5	Short interpretation included.
5.4	2.5	Scatterplot matrix shows all monthly results.
	2.5	Scatterplots are shown in lower left side of figure.
	2.5	All R2 values shown in upper right side of figure.
	2.5	Month names shown along diagonal.
	2.5	Short interpretation included, with proper identification of most and least correlated monthly pairs.
5.5	2.5	Scatterplot shows median home value and median income, with values on the correct axes.
	2.5	Points are shaded by distance to coal plant.
	2.5	Points are sized by median annual SO2 concentration and trend line is shown.
	2.5	Short interpretation included.
5.6	2	Plot shown for subset of data.
	2	Short interpretation included, with subsetting variable specified.