Data Engineering in ArcGIS Pro

ENV 859 - Geospatial Data Analytics   |   Fall 2024   |   Instructor: John Fay  

Introduction

What is “Data Engineering”?

When presented with a lab exercise, you are typically given a tidy dataset that you can seamlessly bring into your analytical environment and start working with immediately. That’s useful for learning the task at hand, but in reality, the data you need to drive your analysis are seldom in a workable format and rarely free of peculiarities: missing and/or duplicate values, improperly coded or entered values, etc. In fact, the process of cleaning and preparing data for analysis - the process ESRI refers to as data engineering - can easily consume a significant amount of time on a given project.

Data engineering actually refers to two key functions in preparing data for analysis. The first focuses on simply making raw data operational. This step, sometimes referred to as ETL (extract, transform, load), pulls together data from multiple sources, often in multiple formats, and transforms the data into a format that can be analyzed most easily and efficiently. When coded correctly, e.g., in ModelBuilder or with Python, this step can be automated to enable continuous ingestion of data that are constantly being updated.

The second role of data engineering is to allow you to explore and better understand raw data. Looking over a spreadsheet of raw values isn’t likely to reveal meaningful patterns or relationships in your raw data. However, generating summary statistics from your data and plotting the distribution of the values in your data, both as a whole and within groups, often reveals interesting patterns - and sometimes key errors - in your data that you might wish to explore further.

Data Engineering in practice

The good news is that vast amounts of new data are constantly being made available. The bad news is that these datasets seldom follow any consistent format and come with varying levels of background randomness or “noise”. As such, the process of operationalizing and exploring data is necessarily hands-on, iterative, and non-systematic; the phrase “your mileage may vary” is quite appropriate here. Fortunately, the data engineering tools in ArcGIS Pro (as well as in other applications and coding languages) facilitate open-ended exploration both within and among feature attributes, which, in turn, reveals new questions you might want to ask of your data and new hypotheses to test via further analysis.

In this exercise, we walk through one example of data engineering. Specifically, we look at EPA Air Quality Data for the United States. These data are not as messy as some other data you might encounter, yet you’ll see that there are still some key steps required to get these data into our analytical environment (ArcGIS Pro), and a broad swath of questions you can explore with these data.

Workflow & Learning outcomes

The set of exercises below will reveal a number of best practices associated with developing workflows in the ArcGIS ModelBuilder as well as some more advanced techniques you may not have learned in previous courses. By the end of these exercises, you should have competency with the following:

In completing each section, you will learn to…

1. Preparing for analysis
• Recognize the importance of organization, documentation, and reproducibility in data analysis
• Create a robust ArcGIS Pro workspace for geospatial data analysis

2. Fetching the data
• Navigate web sites to obtain data and metadata
• Document downloaded datasets for reproducibility and transparency

3. Importing & Exploring the data
• Import CSV files into ArcGIS Pro
• Use the Data Engineering interface to identify the data types of the imported file
• Edit the schema.ini file to override default data types in an imported CSV file
• Understand the fundamental importance of data types in data analysis
• Use the Data Engineering interface to generate data summaries for the fields in a data table
• Use the Data Engineering interface to generate simple visualizations of data distributions
• Understand the differences among histograms, bar plots, box plots, and scatterplots
• Critically examine data for missing values, miscoded values, and odd values

4. Converting data to features & more exploring
• Convert a CSV file with coordinates to a feature class
• Document your work within the ArcGIS Pro ModelBuilder
• Identify and address conversion issues
• Create a unique record ID field and describe the importance of unique record IDs
• Explore spatial patterns among attribute values in a feature class
• Identify and address records with missing or incorrect coordinate data
• Identify and address records with missing or incorrect attribute data

5. Bringing more data to your features
• Use ArcGIS Pro’s Enrich tool to add demographic and other data to your features
• Find, load, and use data from the Living Atlas in your ArcGIS Pro project
• Use ArcGIS Pro’s Near tool to associate features with the values of records in another feature class
• Save results as reproducible products and recognize the importance of doing so

6. Data Transformation
• Work with datetime data types
• Group and summarize data
• Pivot tables to enable joins and other analyses
• Join tables

7. Deeper Explorations
• Create bar plots to show data breakdowns by categories
• Create box plots to show distributions across categories
• Create scatter plots to show relationships across continuous variables
• Explore other plotting options available within ArcGIS Pro

8. Statistical Explorations
• Create mean center features from point subsets to evaluate change over time
• Apply different techniques to define neighborhoods around features
• Compare neighborhood aggregate values to point values to find local outliers

Step 1. Preparing for Analysis

Data Engineering is a messy process. You’ll likely pursue some ideas and then find out they are dead ends, and along the way, you’ll likely accumulate a number of intermediate files. Keeping track of what you’ve done, which files are safe to delete, and which are essential to your overall analysis is super important, and the best way to do this is to maintain an organized, properly structured workspace.

I use the following steps each time I begin a new data analysis project in ArcGIS Pro. You may eventually wish to customize the steps to your own liking, but for now, let’s follow this procedure to get an idea of what’s important and what has more flexibility in generating a robust and logical project workspace.

1.1 Create a new ArcGIS Pro project

  • Open ArcGIS Pro and create a new project in your class drive. For consistency’s sake, let’s name the project “Data_Exploration”, and be sure the “Create a new folder for this project” option is checked.

    ArcGIS Pro is project-oriented: every time you start the application, you are asked to open an existing project or create a new one. Each project workspace includes a project file (.aprx), a toolbox file (.atbx), and a geodatabase folder (.gdb), each with the same prefix. This is helpful in establishing our project’s root folder, which is where we’ll keep [nearly] everything associated with our project. The idea is that you can either remotely access or move this entire folder from machine to machine and everything you need for your analysis will be available.

1.2 Add subfolders

  • Within your new project root folder, add the following three folders: data, docs, scratch.

  • Within the data folder, add two new folders: raw, and processed

    We’ve just created default locations to organize our files. Here’s the role each folder/subfolder plays:

    • data - holds all our data (except data that are stored in the project geodatabase).
    • data/raw - holds any data we obtain and associated metadata. These files should remain exactly as we obtained them.
    • data/processed - holds any data that we created or modified.
    • docs - holds documentation about our project: anything related to our project that’s useful for others or our future self
    • scratch - a parking place for any intermediate or temporary files, i.e. files we know we can delete later and be ok

1.3 Document your workspace

  • Add a new text file called ReadMe.txt in your project workspace.
  • In this text file, include enough information to identify what this workspace is about, who created it (i.e., you), and the date. It’s often useful to include your email as well, as that uniquely identifies you in case someone else has the same name.

We now have a nearly perfect workspace. It’s perfect except for the fact that our geodatabase lives outside our data folder and can contain “keeper” as well as “scratch” datasets. In the past, I’ve created a second geodatabase called “Scratch.gdb” (in the scratch folder) and sent all intermediate files there, but I’m finding that letting the default project geodatabase hold everything, and then being sure to save outputs I want to keep to the data/processed folder in a more stable, non-proprietary format (e.g., CSV, GeoJSON, GeoTIFF), actually works best.
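
By the way, this folder-and-ReadMe setup can also be scripted, which is handy if you want to recreate the same structure for future projects. Below is a minimal Python sketch; the project root path, name, and email are placeholders you would replace with your own.

    import datetime
    from pathlib import Path

    # Assumed project root - the folder ArcGIS Pro created for your project
    root = Path(r"C:\ENV859\Data_Exploration")

    # Create the data/raw, data/processed, docs, and scratch subfolders
    for sub in ("data/raw", "data/processed", "docs", "scratch"):
        (root / sub).mkdir(parents=True, exist_ok=True)

    # Write a minimal ReadMe documenting the workspace
    (root / "ReadMe.txt").write_text(
        "Project: Data Exploration - EPA Air Quality (PM2.5)\n"
        "Author: Your Name (your_email@duke.edu)\n"
        f"Created: {datetime.date.today()}\n"
    )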


Step 2. Fetching the Data

The scenario for our exercise is to evaluate trends in air quality across the United States, specifically particulates in the air that are less than 2.5 microns in diameter, or “PM2.5”. (Read more about PM2.5 here.) Are there areas where air quality is particularly good or bad? Is there a temporal component to this? Are trends related to demographic variables? To begin answering these questions, we need to get our hands on air quality data, and we’ve received a tip that the US Environmental Protection Agency has a good dataset.

2.1 Explore the data source

Many useful datasets can be found online, but the web sites offering them follow no set format. Thus, finding the exact download locations and even the precise files you want will require a bit of exploration. Some sites are easier to navigate than others. The EPA’s site, which hosts the data we want, is one of the better organized ones, but you’ll still need to understand how the various datasets are structured to determine which files you need to download and how to interpret the data.

Use these steps to get the data you’ll need for this exercise, but do spend a bit of time exploring the site and come to class with any unique observations or questions.

  • Navigate to the EPA’s Air Data website: https://www.epa.gov/outdoor-air-quality-data and explore the site.

  • Navigate to the Pre-generated data files page and familiarize yourself with the datasets available. It’s a bit overwhelming at first, so take a moment to understand how these data are organized:

    • The link to the Description of data and formats leads to useful descriptions of the data included in, and the format of, each dataset.

    • You’ll also notice that there are two spatial datasets (Site Listing and Monitor Listing) and then a variety of temporal datasets containing various air quality measurements over time.

    :point_right: Why wouldn’t they just merge the spatial and temporal datasets rather than expect us to download both and merge them?

2.2 Fetch the datasets

For our study, we’re going to narrow in on PM2.5 data. To examine these values we’ll want to grab the Site Listing dataset as well as the PM2.5 FRM/FEM Mass records for 2021. (The 2022 data won’t be complete until November.)

  • Download the SiteListing.zip file to your data/raw folder and unzip.
  • Scroll down to the Particulates section, to the PM2.5 FRM/FEM Mass (88101) daily records dataset for 2021. Download the zip file to your data/raw folder and unzip.
  • Review in the Description of data and formats page what these datasets include and how they are formatted.

2.3 Document your raw datasets

For the sake of transparency and reproducibility, you need to document how you obtained your raw datasets.

  • For each dataset downloaded, create a README file. For example, for the aqs_sites.csv file, create an aqs_sites_README.txt file.
  • Include in this file any information relevant to how and when it was obtained, enough so that someone else could easily repeat the process. Also include relevant information on the data contained in the file. This can be a direct copy from information found on-line, or if the data has a separate metadata document, you can download that and store it in this folder as well, being sure that it’s somehow obvious that the separate file is related to the appropriate data file. (Merely linking to the metadata document is insufficient in case the link breaks…)
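
If you want the fetch itself to be reproducible, the download-and-document step can also be scripted. The sketch below assumes the pre-generated files are served from https://aqs.epa.gov/aqsweb/airdata/ (verify the file names and URL against the EPA site) and that you created the folder structure from Step 1.

    import datetime
    import urllib.request
    import zipfile
    from pathlib import Path

    raw_dir = Path(r"C:\ENV859\Data_Exploration\data\raw")   # assumed project path
    base_url = "https://aqs.epa.gov/aqsweb/airdata/"          # verify against the EPA site

    for name in ("aqs_sites.zip", "daily_88101_2021.zip"):
        zip_path = raw_dir / name
        urllib.request.urlretrieve(base_url + name, zip_path)   # download the zip
        zipfile.ZipFile(zip_path).extractall(raw_dir)            # unzip in place
        # Record the source and retrieval date in a simple README
        (raw_dir / f"{zip_path.stem}_README.txt").write_text(
            f"Source: {base_url + name}\nDownloaded: {datetime.date.today()}\n"
        )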

Step 3. Importing & Exploring the Data

Before jumping into analyzing a new dataset, we should first do a quick check, both to familiarize ourselves with its structure and to confirm that it will serve our purposes, looking for any issues we might have to address before using it.

3.1 Load and view the data

The data we just obtained are in CSV format, which ArcGIS Pro can read directly (though not totally free of issues, as we’ll see). So our next step will be to import these files into ArcGIS Pro and then to see what we are dealing with.

  • Add the aqs_sites.csv data to your ArcGIS Pro map.

  • Open and view the attribute table.

    You can glean some useful information from the attribute table. You can see what columns there are and some of the values in these columns. You might even see cells with missing data. You can sort columns and get an idea of the range of values. However, it’s not a very efficient way of looking at the data, and it’s quite unlikely you’ll see any patterns or trends in the data.

    One important characteristic of the attribute table to note, however, is its overall structure: each row in the table represents a feature, i.e., an entity that we are describing, and each column represents a measurement of that feature. This is referred to as a “tidy” data format, which will be discussed later.

  • Open the dataset in the Data Engineering tab. You’ll see a list of the fields with some icons next to them as you hover over the field names. You’ll also see the number of records (rows) and fields (columns) the dataset has.

  • The icon to the left of each field name indicates the data type of the field.

    :mega: Data types are a very important concept in data analysis as they define how the values in a given column are treated. ESRI provides a useful description of the different data types here: https://pro.arcgis.com/en/pro-app/latest/help/data/geodatabases/overview/arcgis-field-data-types.htm.

  • Examine the data types assigned to each field. Which types are included in this dataset? Are they all correct?

    The first three columns actually have the incorrect data type assigned. While they indeed contain numeric values, the values themselves have no arithmetic meaning. Rather, the numbers are actually names; these are nominal values and we should code them as such. Otherwise the software will allow us to do things we shouldn’t with the data, like take the sum of 06 (the State Code of California) and 37 (the State Code of North Carolina) - the result, 43, isn’t even a valid state code!

3.2 Override default datatypes with imported CSV files

ArcGIS Pro assigns a data type based on the values in a column. If we want to override these assignments, we need to edit the “schema.ini” file found in the same folder as the CSV file. This is a standard file that many data management applications recognize for importing CSV data correctly. You can read more about it and its structure here.

  • Create the schema.ini file.
    If you don’t see a schema.ini file in the folder where the CSV file is, you can create one simply by viewing the attribute table in ArcGIS Pro and sorting the data on any of the fields.

  • Edit the schema.ini file.
    Open the schema.ini file in any text editor and edit it so that it appears as below.

    [aqs_sites.csv]
    Col1="State Code" Text
    Col2="County Code" Text
    Col3="Site Number" Text
    Col22="Zip Code" Text
    
  • Remove and then re-add the CSV file to your ArcGIS Pro map and inspect the data types. The first three columns (and the Zip Code column) should now be shown as text fields.
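
Because schema.ini is just a text file, you can also generate it with a short script rather than editing it by hand - useful if you re-download the data and need to reapply the overrides. A minimal sketch (the folder path is an assumption):

    from pathlib import Path

    csv_folder = Path(r"C:\ENV859\Data_Exploration\data\raw")   # folder holding aqs_sites.csv

    # Columns whose default (numeric) type we want to override to Text
    overrides = {1: "State Code", 2: "County Code", 3: "Site Number", 22: "Zip Code"}

    lines = ["[aqs_sites.csv]"]
    lines += [f'Col{n}="{name}" Text' for n, name in overrides.items()]

    # Note: this overwrites any existing schema.ini in the folder
    (csv_folder / "schema.ini").write_text("\n".join(lines) + "\n")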

3.3 Explore summaries of the dataset

One of the ways the Data Engineering interface facilitates data exploration is by quickly generating summaries of the data found in each field. Let’s have a look at this feature, keeping track of how data types dictate which summary statistics and plots are created.

  • Click the Add all Fields and Calculate button in the Data Engineering interface.
    ArcGIS Pro computes typical summary statistics for each field. Notice different stats are provided for different data types, and that you can filter fields based on data type via buttons at the top.

  • Show just the numeric fields and make a note of the stats shown.

    • What are the most common values for each numeric field? Does this suggest anything about the data?
    • What kind of summary plots are provided for numeric data?

  • Explore just the text fields.

    • Which have many unique values? Which have few?
    • What kind of summary plots are provided for text data?

  • Explore just the date fields.

    • What are the date ranges for each of the date fields?
    • What kind of summary plots are provided for date data?

  • Explore the summary plots generated.

    What does each type of plot tell you about the distribution of values in that field?

    • Numeric - histogram: normally distributed, # of peaks, location of peaks
    • Text - bar chart: distribution of values
    • Dates - line plot: count per bin of time, periodicity, trends

3.4 Critically Examine the Data

Summary statistics and plots allow us to perform quick reality checks on our data. This can be as easy as looking over the number of rows with missing data in a particular column and postulating why. Or you may be familiar enough with some of the values to know what a reasonable high or low range might be and can compare those values to the extremes found in your data.

  • Which fields have missing values? Are any missing most of their values? Why might this be?
  • Look at the histogram of elevation data. Does the distribution make sense? Is the median elevation reasonable? The maximum elevation?
  • Are the date ranges within reason? Could any dates have been miscoded? How might you determine this?
  • Can you think of other ways to look critically at the data?

As part of the last bullet, you may have thought about looking at the distribution of elevations in a subset of data, e.g. within North Carolina. And if you tried to perform that check, you’ll see that ArcGIS Pro can’t query CSV data. So to move on, we’ll have to convert our CSV to a binary format that ArcGIS Pro can work with.


Step 4. Converting Data to Features & More Exploring

We can explore many facets of our data while it remains in CSV format, but ArcGIS Pro requires data to be stored in a binary format (e.g. geodatabase table or feature class) to unleash its full power. We could simply export our table to the project geodatabase, but since our features have geographic coordinates, we’ll convert our data to a point feature class.

4.1 Convert the table to a feature class

The Display XY Data tool is accessible from many locations, including from the Data Engineering ribbon. We’ll use that to convert our table to a feature class.

  • Use Display XY Data to convert our aqs_sites.csv to a feature class named “aqs_sites” in the project geodatabase. Use the default WGS 1984 coordinate reference system.

    You may have noticed that the records in the dataset are tagged with the datum to which their coordinates are referenced, and that not all records use WGS 84. For simplicity’s sake, we’ll just assume all are WGS 84, but how might you deal with this correctly?
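
For reference, the same conversion can be scripted with the XY Table To Point geoprocessing tool. A minimal sketch, assuming the coordinate columns imported as Longitude and Latitude and that the paths below match your project:

    import arcpy

    arcpy.env.workspace = r"C:\ENV859\Data_Exploration\Data_Exploration.gdb"   # assumed path

    # Convert the CSV to a point feature class using WGS 1984 (WKID 4326)
    arcpy.management.XYTableToPoint(
        in_table=r"C:\ENV859\Data_Exploration\data\raw\aqs_sites.csv",
        out_feature_class="aqs_sites",
        x_field="Longitude",
        y_field="Latitude",
        coordinate_system=arcpy.SpatialReference(4326),
    )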

4.2 Investigate any warnings raised

When the tool ran, many warnings appeared. Let’s investigate.

  • Note import errors: why might these have occurred?

  • How many records are in the resulting feature class? Did we lose any records?

  • Have a look at the error message again (you can resurrect it via the History panel). Examine some of the records that had issues converting. Does that explain the errors?

    It seems that some of the records have placeholders of “0” for missing coordinate data. We should deal with those…

4.3 Document your work

As we derive new datasets from our raw data, we want to document each step so that our work is fully transparent and reproducible. The easiest way to do this in ArcGIS Pro is to add our processes to a geoprocessing model.

  • Create a new geoprocessing model. We’ll call this “Prep AQS sites”.
  • Open the History tab in ArcGIS Pro (from the Analysis tab). There you should see the instance of “Display XY Data” tool we just ran.
  • Drag that tool into your new model and save the model. Not only is it now documented, but if you had to, you could just re-run this tool to regenerate the step!

4.4 Examine the spatial distribution of points

We’ll continue to review the success of our import by looking at the spatial distribution of points.

  • Have a look at the distribution of points. Do all point locations make sense, given what you know of the dataset? Why are there some points off the west coast of Africa? (Hint: it has everything to do with the errors found above.)

  • Do the points have any pattern that strikes you as peculiar?

    • Does any state have an unusually high density of sites compared to others?
    • How do sites seem to be distributed in relation to urban/rural divides? What about physiographic regions (e.g. mountains)?

4.5 Examine spatial patterns in the attribute values

Going beyond just the geographic distribution of points, let’s look at how various attribute values vary spatially, cross-referencing them with any other knowledge or understanding of the dataset we might have. The Data Engineering interface facilitates this: hovering over a field reveals an icon to its right that updates the map’s symbology using that field.

  • Explore the points using the symbology of various fields. Try it with the State Name field to get a feel for it. Do you see any unexpected values either in the legend or on the map?

  • Explore some other fields. In particular, explore the Elevation field values and zoom into Iowa.

    Iowa is a relatively flat state, meaning elevation should not differ much among points that are close to each other. If you do see places where the values differ, inspect those values.
    What you are likely to find is that many points have an elevation of zero. That may be true for some sites, but for many, it seems zero was entered as a placeholder for missing data, which is incorrect. We’ll have to deal with that as well.

  • Continuing our examination of elevation data, create a box plot of elevation values. (Right click the item in the Data Engineering interface, select Create Chart>Box Plot).

    Learn how to read a box plot here: https://pro.arcgis.com/en/pro-app/latest/help/analysis/geoprocessing/charts/box-plot.htm

    Note the value of this outlier: it’s greater than 14,000 m in elevation. Mt. Everest is 8,849 m tall…

  • Click the show outliers option so you can see the actual point that’s the extreme upper outlier.

  • Select that point, and the feature with that value is selected in the map. Zoom to that feature and look at the elevation value of that point and its neighbors. It seems safe to say that this elevation was coded incorrectly.

4.6 Address known data issues

We saw that some sites have incorrect values associated with them, in particular records with coordinate values and elevation values incorrectly coded as zero. We may find other records with issues later on, but let’s take care of these issues first.

4.6.1 Addressing records with missing or incorrect coordinate data

As we are interested in looking at spatial patterns, records missing spatial information have no value to us. Even worse are the records with incorrect spatial information, as they would still factor into our results, but incorrectly. So, we are going to remove these records.

  • Use the Select Layer by Attributes tool to select all records where the latitude or longitude is null or has a value of zero.

  • Use the Delete Rows tool to delete these selected rows.

  • Add these processes to your “Prep AQS sites” geoprocessing model.

    Can you think of any way of attaching location to the points with missing or incorrect coordinate data?
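
Scripted, these two steps might look like the sketch below (the field names are assumptions based on how the CSV columns import; adjust them to match your feature class):

    import arcpy

    sites = "aqs_sites"   # feature class created in Step 4.1 (assumed name)

    # Select records with missing or zero-valued coordinates...
    bad_coords = arcpy.management.SelectLayerByAttribute(
        sites,
        "NEW_SELECTION",
        "Latitude IS NULL OR Longitude IS NULL OR Latitude = 0 OR Longitude = 0",
    )

    # ...and delete them from the feature class
    arcpy.management.DeleteRows(bad_coords)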

4.6.2 Addressing records with incorrect elevation data

We have a choice what to do with the records with missing or incorrect elevation data: we could just leave them as is, we could remove them, or we could update the values by sampling a terrain dataset. The Data Engineering tools, in fact, include a Sample Data from Raster tool. If we had a DEM covering the extent of our points, we could do this fairly easily, but we are going to skip that approach for now. We don’t, however, want to remove sites with elevations incorrectly tagged as “0”, but we also don’t want to keep that incorrect value. So instead, we’ll update the dataset so that all elevation values of zero are set as null values.

  • Use the Select Layer by Attributes tool to select all records where the elevation value is equal to 0 or greater than 9999.

  • Use the Calculate Field tool to set the elevation field for these selected rows to None (null).

  • Add these processes to your “Prep AQS sites” geoprocessing model.

    Are there any potential issues with this approach to dealing with the elevation data?
    Also, have a look at the distribution of the remaining elevation values. Are there still unusually high values? From previous work with spatial datasets, you may have observed that values like -9999 and 9999 are sometimes used as placeholders for missing data. Check for this in your datasets.
    In the end, should we really trust the elevation values in the datasets?
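
A scripted version of this fix might look like the following sketch (again, the Elevation field name is an assumption):

    import arcpy

    sites = "aqs_sites"   # assumed feature class name

    # Select the suspect elevations: zeros and implausibly large placeholder values
    bad_elev = arcpy.management.SelectLayerByAttribute(
        sites, "NEW_SELECTION", "Elevation = 0 OR Elevation > 9999"
    )

    # Overwrite the selected elevation values with null (None)
    arcpy.management.CalculateField(bad_elev, "Elevation", "None", "PYTHON3")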

4.7 Add a unique SiteID column

Currently each site is uniquely identified by a combination of its State Code, County Code, and Site Number attributes. Combining these into a single column allows us to join our data to other datasets. Let’s do that.

  • Calculate a new text field as the concatenation of the State Code, County Code, and Site Number fields, calling this field “SiteID”.
  • Add this process to your “Prep AQS sites” geoprocessing model.
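
In Python, the concatenation is a single Calculate Field call. The sketch below assumes the fields imported as State_Code, County_Code, and Site_Number (ArcGIS replaces spaces with underscores) and that all three are text fields:

    import arcpy

    # Build SiteID by concatenating the three (text) identifier fields
    arcpy.management.CalculateField(
        in_table="aqs_sites",
        field="SiteID",
        expression="!State_Code! + !County_Code! + !Site_Number!",
        expression_type="PYTHON3",
        field_type="TEXT",   # creates the field if it doesn't already exist
    )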

Step 5. Bringing more data to your features

In previous courses, you learned how to spatially join features, e.g., tagging each point with the county in which it falls. Data Engineering highlights a few additional means of assigning more attributes to your features via their spatial associations.

Prepping for enrichment

Before executing this next task, we have to tweak the configuration of our ArcGIS Pro application to look for a local dataset rather than using AGOL. We will cover this in class, but the steps are provided here as well:

  1. Download the license file from this link to your local machine (you’ll need to authenticate via Canvas).
  2. In ArcGIS Pro, select the Project menu to open the Project Settings page.
  3. Click on the Licensing tab.
  4. Click on the Manage your data licensing button.
  5. Add the license file you downloaded in Step 1 and close the data licensing window.
  6. Click on the Options tab back in the Project Settings window.
  7. Click on the Business Analyst tab in the Options window.
  8. Click on the Add Dataset button.
  9. Navigate to \\ns-gis\ESRI_DATA\US_2023 and select dataset_description.xml.
  10. Close the window and resume your ArcGIS Pro session.

5.1 Data Enrichment

Data enrichment is a powerful ArcGIS Pro feature that allows us to easily access and join vast amounts of demographic and other data to our spatial features. Here, we’ll explore how we can “enrich” our AQS sites with measures of health, income, race, and education so that we may later explore whether air quality has any relation to these variables.

  • From the Data Engineering ribbon, select Enrich from the Integrate toolbar.

  • From the Enrich tool, open the Variable explorer to see a menu of all the variables you can extract for your point locations.

    The enrichment variables to include are:

    • Health>2023 Health Care (Consumer Spending)>2023 Health Care [Avg]
    • Health>2023 Health Care (Consumer Spending)>2023 Medical Services [Avg]
    • Income>Household Income>2023 Income (ESRI)>2023 Median Household Income [#]
    • Race>Common Race Variables>2023 Diversity Index
    • Education>Educational Attainment>2023 Educational Attainment>2023 Pop 25+: HS Diploma [%]
    • Population>Common Population Variables>2023 Total Population

    :point_right: Documentation on the different calculated values associated with enrichment variables is available here.

  • When enriching data, you also specify a search radius around each feature that defines the area from which data will be summarized. Set yours to a 2.5 km straight-line distance.

    By default, ArcGIS Pro is set to tap into datasets provided through ArcGIS Online, and using these datasets in the Enrich tool consumes AGOL credits. NSOE machines are set up to tap into local versions of these same datasets, avoiding the use of credits. However, you should be aware that the Enrich tool requires special settings.

  • Add the enriched dataset to your map and explore the new variables in the Data Engineering Tab.

  • Add this process to a new geoprocessing model called “Add data to AQS Sites”.

5.2 Pulling data from the nearest feature

We may want to explore how air quality measures relate to proximity to coal-fueled power plants. We can use the Near tool to add attributes to our AQS sites feature class indicating the distance to the nearest coal plant.

  • Locate coal plant data by searching the Living Atlas for “power plants”. The first result should be “Power Plants in the US”, owned by the “Federal_User_Community”. If not, search for “b063316fac7345dba4bae96eaa813b2f”. Add this feature to your map.

  • Use the Select tool to create a new feature class of just the coal plants from the PowerPlants_US_EIA feature layer.

    :point_right: Why use Select instead of Select by Attribute?

    The Select and Select by Attribute tools accomplish similar tasks, with the key difference being that the Select tool writes the selected features to a new file while Select by Attributes simply creates a virtual selection of records. I find the Select tool to be more robust and trustworthy, as you can be certain that processes run later have only the selected features; it’s possible the virtual selection from Select by Attributes might have been cleared or overwritten by the time a later process is run.

  • With the Enriched_AQS_Sites feature class active in the Data Engineering tab, select Near from the Integrate toolbar in the Data Engineering ribbon.

  • In the Near tool, select the coal plants feature layer as the Near Features, leave the search radius empty, set the Method to Geodesic, and for the field names include:
    • FeatureID as “CoalPlant_ID”
    • Distance as “DistanceToCoalPlant”
    • Set the distance unit to Kilometer.
  • Run the tool and add it to your geoprocessing model.
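
For reference, a scripted sketch of the Near step is below. Note that arcpy's Near tool writes fields named NEAR_FID and NEAR_DIST by default; the custom field names and distance units offered in the dialog are parameters only in recent Pro releases, so this sketch simply renames the default fields afterward (layer names are assumptions):

    import arcpy

    sites = "Enriched_AQS_Sites"   # assumed enriched feature class name
    coal = "CoalPlants"            # coal plants selected from the power plants layer

    # Compute the geodesic distance from each AQS site to its nearest coal plant
    arcpy.analysis.Near(sites, coal, method="GEODESIC")

    # Rename the default output fields to something more descriptive
    arcpy.management.AlterField(sites, "NEAR_FID", "CoalPlant_ID", "CoalPlant_ID")
    arcpy.management.AlterField(sites, "NEAR_DIST", "DistanceToCoalPlant", "DistanceToCoalPlant")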

5.3 Keeping a copy of your results

It’s good practice to periodically write results in a non-proprietary format. Our attribute table has latitude and longitude coordinates, so we could easily recreate the feature class from a CSV; let’s export the table to a CSV file.

  • Use the Export Table tool to write the attribute table of your Enriched AQS Sites feature class to a CSV file.

  • Set the output file to be in your data/processed folder and be sure it has the .csv extension.

  • Run the tool and add it to your geoprocessing model.

    Ideally, you should also take a moment to write a short metadata file explaining how this file was derived. This file could simply allude to the geoprocessing tools used to create it. You can skip this step for this demonstration exercise.
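
A scripted sketch of the export is below. The Export Table tool is arcpy.conversion.ExportTable in ArcGIS Pro 3.x (earlier versions use Table To Table instead); the names and path are assumptions:

    import arcpy

    # Write the enriched attribute table out as a non-proprietary CSV copy
    arcpy.conversion.ExportTable(
        in_table="Enriched_AQS_Sites",
        out_table=r"C:\ENV859\Data_Exploration\data\processed\Enriched_AQS_Sites.csv",
    )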


Step 6. Data Transformation

We are now going to bring in air quality data and associate these data with our enriched site attributes. As you’ll see, however, the air quality data has many records for each site (one for each day of the year), so we’ll have to either summarize our air quality data so that only one record occurs for each site or join our site data to each of the air quality records. We’ll focus on the former for now, and in doing so explore useful data transformation skills - as well as revisit some skills we learned above.

6.1 Add and examine the air quality dataset

  • Add the daily_88101_2021.csv file (downloaded earlier) to your map.
  • Explore the dataset as you did with the aqs_sites.csv file.
    • Do any of the imported data types need correcting? If so, correct them.
    • Are any fields missing values? Any missing a lot of values?
    • Do the range and distribution of values seem logical, given what you know of the dataset?
  • Convert these data to features as you did with the AQS sites CSV file. Call this file “PM25_2021”.
  • Add these processes to a new geoprocessing model called “Prep Daily PM25 Site Data”.

6.2 Create a unique site ID column

As we did with the AQS Site data, we want to create a new field containing a unique site identifier, derived from the state code, county code, and site number of each feature.

  • Calculate a new text field as the concatenation of the State Code, County Code, and Site Num fields, calling this field “SiteID”.
  • Add this process to your geoprocessing model.

6.3 Create a Month column

We next want to extract the month from the sample date column so we can compute monthly summary values. Date-time fields can easily be decomposed into specific time components (e.g. month) using Arcade or Python functions. For example, the ISOMonth() Arcade function will extract the month from the date-time value.

  • Calculate a new numeric field as the month from the Date_Local field, naming this field “Month”.
    • Use the Arcade expression ISOMonth($feature.Date_Local).
  • Add this process to your geoprocessing model.
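
Scripted, this is a single Calculate Field call using the Arcade expression above (the feature class and date field names are assumptions):

    import arcpy

    # Add a numeric Month field populated from the sample date via Arcade's ISOMonth()
    arcpy.management.CalculateField(
        in_table="PM25_2021",
        field="Month",
        expression="ISOMonth($feature.Date_Local)",
        expression_type="ARCADE",
        field_type="SHORT",   # creates the field if it doesn't already exist
    )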

6.4 Compute summary annual values from daily values

In many cases, we’ll want to group and summarize our data. ArcGIS provides tools to do this. Specifically, if we want to group and summarize our data on attributes alone, we can use the Summary Statistics tool. Or, if we want to summarize our data on geometry as well as attributes, we use the Dissolve tool. We can use either in our case (since each location is a unique feature), but we’ll use the Dissolve tool so that our output remains a feature class instead of a table.

  • Use the Dissolve tool to group data and compute summary statistics on grouped values.

    We could have also used the Summary Statistics tool to accomplish the same task, setting the SiteID as the case (grouping) field. The key difference would be that this tool produces a table, not a feature class, which is why I chose the Dissolve tool. :star:And new to the scene is the Pairwise dissolve tool. Functionally, these tools are identical; the Pairwise Dissolve tool is able to use parallel processing, which makes it faster (link).

  • The Dissolve fields are our grouping variables. Set this to the SiteID field. This means that the output feature class will have one and only one row (feature) for each unique SiteID value.
  • The Statistics fields are the columns for which we’ll calculate summaries. If a single SiteID occurs 10 times in our table, all 10 original rows will be collapsed into a single output row. We set here how that single value will be calculated: a mean? a median? a maximum?
    Set the tool to compute the mean of the Arithmetic_mean field and also the median of the Arithmetic_mean.
  • Set the output as “PM25_Annual_Mean”.
  • Set “Create Multipart Features” to False. This will create an output with Point rather than MultiPoint geometries. This shouldn’t make a significant difference, but some tools run better on Point vs. MultiPoint geometries.
  • Add this process to your geoprocessing model.
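
For reference, the equivalent geoprocessing call might look like the sketch below, here using Pairwise Dissolve (swap in arcpy.management.Dissolve if you prefer; field and dataset names are assumptions based on how the CSV columns imported):

    import arcpy

    # Collapse the daily records to one feature per site, computing annual summary stats
    arcpy.analysis.PairwiseDissolve(
        in_features="PM25_2021",
        out_feature_class="PM25_Annual_Mean",
        dissolve_field="SiteID",
        statistics_fields=[["Arithmetic_Mean", "MEAN"], ["Arithmetic_Mean", "MEDIAN"]],
        multi_part="SINGLE_PART",   # Point rather than MultiPoint output
    )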

6.5 Compute monthly summaries for each site

By repeating the above steps, but adding the month in which each sample was collected as a grouping field, we can also examine air quality values by month.

  • Use the Dissolve (or Pairwise Dissolve) tool to compute monthly stats for each site.

    • Set the Dissolve fields to group the data on SiteID and Month.
    • Set the Statistics field to compute the mean and median of the Arithmetic Mean.
    • Name the output “PM25_2021_Monthly”.
  • Add this process to your geoprocessing model.

6.6 Pivot to Monthly tables

If you have a look at the table generated above, we still have multiple rows for each site. This can cause issues if we wanted to join these data to our AQS Site features since we’d have multiple records for each site. The solution is to pivot the table so that each month becomes its own column, with the values in that column being the mean monthly PM2.5 value we just computed.

  • From the Format menu in the Data Engineering ribbon, select Pivot Table.

  • Set the input fields to be our site specific field (SiteID).

  • Set the Pivot field to be the Month field. This means the output will have a new column for each unique value in the Month field.

  • Set the Value field to be the MEAN_Arithmetic_Mean. This means the values within each of the new month columns will be taken from the MEAN_Arithmetic_Mean column. (We would have to repeat this tool to create another table if we wanted to show the monthly medians.)

  • Name the output “PM25_2021_Monthly_Mean”.

  • Add the process to your geoprocessing model.

Examine the structure of the PM25_2021_Monthly_Mean table you created. Did pivoting the table compute any new values? Do we have more or fewer rows? What about columns? (Be sure you understand what it means to “pivot” a table and why we would want to do it – and perhaps why we might NOT want to use pivoted data.)
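
The pivot can also be scripted with the Pivot Table geoprocessing tool; a minimal sketch (field names are assumptions based on the Dissolve output):

    import arcpy

    # Reshape the monthly summaries so each month becomes its own column
    arcpy.management.PivotTable(
        in_table="PM25_2021_Monthly",
        fields="SiteID",                      # one output row per SiteID
        pivot_field="Month",                  # each unique month becomes a column
        value_field="MEAN_Arithmetic_Mean",   # values placed in the new month columns
        out_table="PM25_2021_Monthly_Mean",
    )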

6.7 Combine the datasets

Next, we’ll want to combine our annual and monthly air quality with our enriched AQS site data through a series of joins.

  • Create a new copy of the enriched AQS feature class (so we have a fallback if our next steps don’t work as we wanted). Call this AQS_PM25.
  • Join the MEAN_Arithmetic_Mean field in the PM25_Annual_Mean dataset to the AQS_PM25 feature class.
  • Join the 12 monthly mean fields in the PM25_2021_Monthly_Mean dataset to the AQS_PM25 feature class.
  • Export the AQS_PM25 attribute table to a CSV file in your data/processed folder as a file named AQS_PM25_2021.csv.
  • Add these processes to a new geoprocessing model called “Step 7. Join AQS datasets”
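
Scripted, these joins could use the Join Field tool, which permanently appends fields from one table to another. The sketch below carries over the names used above; the pivoted month field names (Month1 … Month12) are assumptions, so check your table first:

    import arcpy

    target = "AQS_PM25"   # copy of the enriched AQS sites feature class

    # Append the annual mean PM2.5 value to each site
    arcpy.management.JoinField(
        in_data=target, in_field="SiteID",
        join_table="PM25_Annual_Mean", join_field="SiteID",
        fields=["MEAN_Arithmetic_Mean"],
    )

    # Append the 12 pivoted monthly mean columns
    arcpy.management.JoinField(
        in_data=target, in_field="SiteID",
        join_table="PM25_2021_Monthly_Mean", join_field="SiteID",
        fields=[f"Month{m}" for m in range(1, 13)],
    )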

Step 7. Deeper Explorations

We now have many attributes associated with our AQS sites and can really explore various relationships among the variables in our data. In this section we’ll examine some of the more advanced plotting and visualization capabilities of ArcGIS Pro.

7.1. Visualizing grouped data - bar plots

Bar plots show data broken down by categories. The default is the count of features in each class, but we can alter that to show the breakdown of continuous values by category. We’ll use this to explore the notion that annual air quality might be different across different land cover types.

  • Create a bar plot of the Land Use field.
  • Alter the plot to show the mean annual average PM2.5 concentration.
  • Sort the bars on the Y-axis from largest to smallest. Do the values shown seem surprising?
  • Zoom your map to the southeast US, then activate the Extent filter on the plot. The plot now reflects only the points shown in the map.
  • Explore a few different regions and how the relationship changes.
  • Add the Location Setting attribute as the Split By field. Does this provide any more insight into your data?

7.2 Visualizing grouped distributions - box plots

Box plots offer more information than bar plots in that you can see the full distribution of values within each category.

  • Add the PM25_Monthly feature class to your map
  • Open the dataset in the Data Engineering tab
  • Create a box plot of the Mean_Arithmetic_Mean column
  • Set the category to be Month: How do PM2.5 values change over time? Which month sees the most variability in PM2.5 measurements?
  • Explore different geographies. Do the monthly trends vary across regions?
  • Can you create a box plot so that it shows the monthly distribution for only North Carolina sites (State code = 37)?

7.3 Visualizing the interaction between two continuous variables - scatter plots

Are high values in one continuous variable coincident with high values in another? How tight is this relationship? Are there unique aspects to individual features that don’t follow the relationship (i.e., outliers)? We explore these questions with scatterplots.

  • Open the enriched AQS data in the Data Engineering tab again.

  • Select both the Month7 and Month8 variables, then right click and select Create Chart>Scatterplot
    • Examine the distribution.
    • Inspect points off the line.
    • Interpret the statistics shown: the formula and the R2 value.
    • Explore different geographies and see how the relationship changes.
    • Select outliers and explore where they are geographically
  • Set the split by field to Land Use.
    • Examine the R2 among the different groups of data.
  • Look at other combinations of variables.

    A great resource on choosing an appropriate visualization: https://www.data-to-viz.com/ .


Step 8. Statistical Explorations

Finally, we’ll examine a few statistical exploration techniques, searching for patterns in our data over space and time.

8.1: How does the spatial distribution of air quality change over the months of the year?

Box plots and bar plots of average air quality grouped by month reveal which months have particularly high or low values. But are there spatial trends in air quality among months? We’ll have a look, but we’ll confine our examination to a smaller geography - California - since spatial patterns computed over too broad a geography tend to overgeneralize the results. We’ll also need to project our data, since measurements of distribution are better suited to planar coordinate systems.

  • Subset the PM25_2021 dataset for California records (State Code = '06') where the Sample Duration is “1 HOUR”.
  • Project to WGS84 UTM 10N.
  • Compute Mean Center, using Arithmetic Mean as weight field and dimension field, Month as case field.
  • Compute Directional Distribution, using 1 standard deviation for the ellipse size, setting the weight and case fields the same as for computing mean center.
  • Add these processes to a new geoprocessing model named “Explore monthly air quality distribution”.
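
If you’d rather script these steps, a sketch follows (dataset and field names are assumptions; WKID 32610 is WGS 1984 UTM Zone 10N):

    import arcpy

    # Subset the California, 1-hour duration records (assumes State Code imported as text)
    ca = arcpy.analysis.Select(
        "PM25_2021", "PM25_2021_CA",
        "State_Code = '06' And Sample_Duration = '1 HOUR'")

    # Project to WGS 1984 UTM Zone 10N for planar distance measures
    arcpy.management.Project(ca, "PM25_2021_CA_utm", arcpy.SpatialReference(32610))

    # Mean center per month, weighted by the PM2.5 values
    arcpy.stats.MeanCenter("PM25_2021_CA_utm", "PM25_Monthly_Centers",
                           "Arithmetic_Mean",   # weight field
                           "Month",             # case field
                           "Arithmetic_Mean")   # dimension field

    # One standard-deviational ellipse per month
    arcpy.stats.DirectionalDistribution("PM25_2021_CA_utm", "PM25_Monthly_Ellipses",
                                        "1_STANDARD_DEVIATION",
                                        "Arithmetic_Mean",   # weight field
                                        "Month")             # case field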

What can we infer from the visualizations? Do the patterns coincide with anything else that may have happened in California in 2021?

8.2: Exploring neighborhood outliers

Now we’ll explore characteristics of specific air quality monitoring sites using spatial analysis. The first law of geography states that features closer together are likely more related than those farther apart. So we’ll compare the air quality of each AQS site with that of its neighbors to explore which are closely aligned and, more interestingly, which are not.

  • Select features where the MEAN_Arithmetic_Mean value is not null. (Records with null values will cause issues…)
  • Compute Neighborhood Summary Statistics on your Enriched AQS site feature class
    • Compute mean for the MEAN_Arithmetic_Mean field.
    • Start with a neighborhood defined as the closest 10 neighbors, unweighted.
  • Create scatterplot where the X axis is the Mean of Arithmetic mean and the Y axis is the [neighborhood] mean of Mean of Arithmetic Mean
    • Sites falling well off the line are AQS sites whose values differ markedly from their neighbors’: why might that be?
    • Symbolize on other layers: any pattern?
  • Select an outlier and investigate
    • What’s its annual mean? Its neighbors’ annual mean?
    • View the raw data (monthly values)
      • Zoom to the area, set the extent filter and view distributions
  • Try different neighborhood definitions, e.g., a distance band of 50 km. How do these alter the results?