1 Background

Geographic Information Systems (GIS) provide versatile tools for uncovering spatial patterns of public health and environmental justice (EJ) issues at multiple scales within the exposome. There are many open-source GIS platforms that allow for cost-effective and equitable integration of GIS tools into data analysis workflows. With open-source GIS, public health researchers can effectively collaborate and easily share reproducible methodologies and findings, helping to foster more inclusive and innovative approaches to cross-cutting health challenges.

In this workshop, you will be introduced to open-source GIS using R, as well as techniques for leveraging GIS tools and spatial data to advance public health research related to the plural environmental health effects of climate change in EJ communities. By participating in this workshop, you will learn the basics of GIS data-structures, loading and visualizing spatial data, and spatial relationships in regression analyses.

1.1 Modeling

Modeling, at its most simple, is to use of mathematics and statistical assumptions on sample data to make predictions about the real world. Geospatial modeling adds the assumption that spatial patterns in the data and locations of observations can uncover relationships that may otherwise go missed. Geospatial modeling has applications in diverse topics from the analysis of geomorphic features and physical processes to logistics and supply chains. One domain of study that can be enhanced through geospatial modeling is public health and epidemology, in which associations between health outcomes and predictor variables can be viewed as inherently spatial processes. To understand the nuance of geospatial modeling as it relates to public health, we must first go over the distinction between global and local models.

1.1.1 Global Models

Modeling approaches that don’t explicitly account for spatial relationships, but rather weight every observation the same regardless of its spatial location are referred to here as global. On the other hand, local models also consider spatial relationships between observations as an important explanatory variable. A global model considers the relationships between all dependent and independent variables across the entire study extent, with the assumption that observed relationships hold true at all locations within the extent. The implications of these assumptions are that data from different locations within a model can be used with equal weighting, allowing for the estimation of a single parameter for each relationship being modeled. Geographers have pointed out for some time that the assumptions of global models have flaws, especially when examining complex relationships. In certain cases, spatial relationships act as lurking variables that drastically change the direction of \(\beta\) coefficients - this is an example of Simpson’s paradox in spatial relationships. Sachdeva and Fotheringham (2023) explore this in detail here.

1.1.2 Local Models

In the context of spatial statistics, a local model considers the relationships between dependent and independent variables (or a sub-set of these variables) across a localized area of the study extent. We refer to these localized areas as ‘neighborhoods’. Local models assume that observed relationships only hold within the neighborhood, and may vary across neighborhoods. This allows for local models to potentially model processes that vary over space with greater accuracy. This is valuable in the realm of public health, in which associations between environmental exposure, human behavior, and health outcomes are complex and often present as non-linear, spatially heterogeneous relationships.

1.1.2.1 Geographically Weighted Regression (GWR)

Geographically weighted regression (GWR) is a spatial regression technique that evaluates a local model of the variable or process you are trying to understand or predict by fitting a regression equation to every feature within a predetermined neighborhood in the dataset. While GWR is a local model, it does not assess the spatial scale at which processes may vary. In other words it assumes variance across an extent, but that all variances exist at the same spatial scale for every model term.

1.1.2.2 Multiscale Geographically Weighted Regression (MGWR)

Multiscale geographically weighted regression (MGWR) builds upon the GWR technique by optimizing neighborhood sizes for each explanatory variable. This allows for the MGWR framework to examine both spatial variability and scalar effects simultaneously in local modeling.

1.1.2.3 Other Modeling Approaches

Spatial autoregressive models (SAR) are one of the most common approaches for handling spatial autocorrelation in datasets. Generally, they utilize one of two techniques: 1) a spatial error model computes an error coefficient (\(\lambda\)) to test whether spatial error between terms is correlated, and 2) a spatial lag model introduces a spatial lag term \(\rho\) to test if th dependent variable is affected by multiple independent observations at different locations. These methods are most commonly conducted in the open source GIS package, GeoDa. Various other local spatial models include Bayesian Spatially Varying Coefficients Models, Spatial Filtering methods, and Spatially Clustered Coefficient models. Additionally, other kinds of modeling approaches, such as machine learning based models, can be used to assess spatial relationships in data. This workshop will focus on the MGWR framework, given its computational efficiency, robust support in R, and widespread use.

1.1.2.4 Spatial Autocorrelation

Spatial Autocorrelation is when data from locations near one another in space are more likely to be similar (i.e. are more correlated) than data from locations remote from one another. Spatial autocorrelation can be thought of as a double-edged sword. From a geographic perspective, we expect that raw data will show spatial autocorrelation, and use this to our advantage. For example, there is a notion in geography known as Tobler’s First Law of Geography, in which it is assumed that objects in space closer to each other are more likely to be related than objects that are farther apart. When examining processes that vary over space, we assume that data representing these processes will be non-randomly distributed (i.e. show spatial autocorrelation), and that in assessing these distributions through spatial statistics, we can infer otherwise hidden properties of the process.

In this workshop, we will measure the spatial autocorrelation of model residuals. If residuals are spatially autocorrelated then we can conclude that the given model is poorly specified by not attending to all spatial relationships between model terms. Local models do not necessarily account for all spatial autocorrelation in model residuals.

2 Example: Low Birthweight Rates

In this workshop, you will examine how rates of low-weight births (LWB) in Colorado relate to environmental pollution and socioeconomic covariates. Our main objective is to demonstrate how spatial relationships may be lurking variables in regression models.

Using this data, we will model the relationship between LWB and environmental pollution, corrected for social vulnerability variables using:

  1. An ordinary least squares (linear) model,
  2. A geographically weighted regression, and, optionally
  3. A multiscale geographically weighted regression.

For each model we will examine model diagnostics, including:

  1. Overall model strength and predictive capability,
  2. Distribution (normality) of model residuals, and
  3. Spatial autocorrelation of model residuals.

2.1 Getting Started

Like most scripting languages, base R offers an array of basic commands for data manipulation and analysis. However, if you want to do anything more technical or esoteric, you have to download and install specialized packages.

\(\textbf{A note on R:}\) R is an extensive coding environment. It would be impossible to explain every command in this workshop. The intent of this workshop is to introduce you to performing spatial data-analysis in R, and give you to the background to learn and apply these techniques in your own research.

The open source community has designed “cheat sheets” for different R topics. Please refer to the Base R cheat sheet for descriptions of common commands and operators. In R Studio, you can navigate to Help>Cheatsheets to access all of them. There is also a wealth of community information online. A quick Google search can point toward solutions for many problems. Below we have provided basic information on key ideas and commands for this workshop:

  • variable - In R we define data-objects, such as a spreadsheet of SVI data per census tract, as a variable. Variables allow us to easily reference complex data objects in code functions.

  • $ (extractor operator) - The $ is the extractor operator. It is used to reference specific variables in a data table or object.

  • <- (assignment operator) - The <- is the assignment operator. This can be thought of as an ‘=’ (equals). We use the assignment operator to define variables.

  • %>% (pipe operator) - The %>% is the pipe operator. This is a way to chain operations together (i.e. ‘pipe’ information). It takes the output from the left of the pipe and passes it to an expression on the right. For example, we can use the pipe to take the data-object represented by a variable, and input (‘pipe’) this into a function. Pipes are not a part of base R, which is why we call in the dplyr library.

  • Data Frame - Tables of data in R are referred to as ‘data frames’. This is basically a data-object that acts like a spreadsheet, in which columns are variables/fields and rows are observations.

2.1.1 Install packages

Both code chunks below load the necessary packages to complete the workshop. If you have installed all of the required packages, skip the first chunk and run the second one by clicking on the green arrow in the top right corner of the chunk. If you need to install any libraries, un-comment any or all of the lines below by removing the #. (*Hint: you can comment or un-comment lines by selecting them and pressing Ctrl+Shift+C.)

## Installing libraries
## If you don't have any of the required packages installed, un-comment and run the necessary lines
# install.packages("GWmodel")
# install.packages("reshape2")
# install.packages("ggplot2")
# install.packages("GGally")
# install.packages("gridExtra")
# install.packages("tmap")
# install.packages("dplyr")
# install.packages("sf")
# install.packages("stringr")
# install.packages("tidyverse")
# install.packages("Hmisc")
# install.packages('car')
# install.packages('Rtools')
# install.packages('nortest')
# install.packages('spdep')

2.1.2 Load Libraries into R Environment (STEP 0)

Even if you’ve downloaded and installed the necessary libraries, they haven’t been called directly into the current R environment. This step is required for running the functions throughout the workshop.

# Import libraries
library(GWmodel)
library(reshape2)
library(ggplot2)
library(GGally)
library(gridExtra)
library(tmap)
library(dplyr)
library(sf)
library(stringr)
library(tidyverse)
library(Hmisc)
library(car)
library(nortest)
library(spdep)

source("plotting_functions.R")

2.2 Data

This workshop uses the datasets below, all of which contain data aggregated to census tracts in Colorado:

\(\textbf{Outcome:}\) 2018 rate of low weight births (LWB) provided in the Colorado Department of Public Health and Environment (CDPHE) Open Data portal.

\(\textbf{Main Exposure:}\) 10-year average (2008 - 2018) of total releases in pounds from Toxic Release Inventory (TRI) facilities within 50 kilometers of population-weighted census tract centroids in Colorado.

\(\textbf{Covariates:}\) Estimated proportions of census tract-level social vulnerability from the 2018 ATSDR-CDC Social Vulnerability Index.

Here is a summary of all variables we will use for this analysis:

  1. LWB_ADJRATE - 2018 Adjusted rate of low weight births (LWB) by census tract in Colorado
  2. avgReleasesLb_10yr - 10-year average (2008-2018) of total toxic releases in lbs. within 50-km of population-weighted census tract centroids
  3. RPL_THEME1 - Percentile ranking by census tract of SVI socioeconomic status theme (described below)
  4. RPL_THEME2 - SVI household characteristics theme
  5. RPL_THEME3 - SVI minority status and language theme
  6. RPL_THEME4 - SVI housing type and transportation theme

Raw data is imported in the form of spreadsheets, in which observations have location identifiers. We then put this data into a GIS environment and spatialize it by joining the data to spatial objects. For this workshop we will use vector data. Vector data can simply be thought of as polygons (shapes) with location information. In other words, areas projected onto the surface of the Earth. For this workshop we will use vector data in shapefile format. In the case of this workshop, these shapes represent the census tracts of the state of Colorado. By joining the raw data to census tracts, we can use spatial statistics to analyze the spatial relationships within the data.

Shapefiles are a data format for storing geometry and attribute information of a spatial dataset. Notice that we only read in one file with a .shp extension, but that the file directory ./Tracts contains 7 items with the same name but different extensions. All of these files are necessary for GIS software to read and display a .shp.

tl_2010_08_tract10.shp is a product of the U.S. Census Bureau called the TIGER/Line Shapefiles (TIGER stands for Topologically Integrated Geographic Encoding and Referencing system.)

2.2.1 Import Shapefile (STEP 1)

We will use the sf, or Simple Features, library to import the shapefile of census tracts and name it dat_tracts. We then draw a very basic map just to ensure that it is rendering correctly. Run the code chunk below by clicking on the green arrow in the top right of the chunk.

# Import shapefile
dat_tracts <- st_read(dsn = "./Tracts/tl_2010_08_tract10.shp")
Reading layer `tl_2010_08_tract10' from data source 
  `C:\Users\jacks\Documents\UNM\P30\Workshop\INSPIRES-SpatialReg-workshop\Tracts\tl_2010_08_tract10.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 1249 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -109.0603 ymin: 36.99242 xmax: -102.0409 ymax: 41.00344
Geodetic CRS:  NAD83
# Plot shapefile with base R - this is just to check that it renders correctly
par(mar=c(0,0,0,0))
plot(dat_tracts, col="#333333", lwd=0.25, border=0, max.plot = 1)

2.2.2 Import Exposure Data (STEP 2)

Next, we will use base R to read in the CSV of exposure data and name it dat_tri. The first thing we need to do after importing the CSV data table is ensure that the census tract ID (GEOID) is the correct length. Some states have a GEOID that begins with 0, Colorado being one of them (08). The problem is that CSV files will often remove the leading 0 in numbers. Since all GEOIDs are standardized by the US Census to a length of 11 characters \(^1\), here we make sure that every GEOID in the table is 11 characters long by padding them with a 0 on the left side.

dat_tri <- read.csv('./Exposure/avgReleases_tract_popCenter_50km_CO_2008-2018.csv')
# Pad GEOID to 11 characters
dat_tri$GEOID10 <- str_pad(dat_tri$GEOID10, width = 11, side = "left", pad = 0)

unique(nchar(dat_tri$GEOID10)) # Un-comment this line if you want to check if padding worked
[1] 11

All GEOIDs are 11 characters long. Note: if this says something other than 11 make sure the chunk above is running correctly.

2.2.3 Import Confounders (STEP 3)

This data table contains 2018 SVI data aggregated to census tracts. We first read in the data table and define it as a variable svi, and format to pad GEOIDs (called ‘FIPS’ in the SVI dataset)\(^1\) as done in Step 2. Next, we pipe the svi variable into a filtering function ‘filter()’, to filter out data that is not located with Colorado census tracts, and define this filtered data as a new variable svi_sub. Next, we convert missing values (-999) to NA; this will make it easier to exclude missing values from statistical analysis later on. Lastly, we pipe the svi_sub variable into the select() function to select only fields that start with “RPL_THEME”, and overwrite the svi_sub variable with this selection.

There are four RPL_THEME variables in the 2018 SVI. These are the summary percentile rankings by census tract of variables included in four themes generally related to social vulnerability:

  1. \(\textbf{Theme 1}\): Socioeconomic status
    • Below poverty
    • Unemployed
    • Income
    • No high school diploma
  2. \(\textbf{Theme 2}\): Household characteristics
    • Aged 65 or older
    • Aged 17 or younger
    • Civilian with a disability
    • Single-parent households
  3. \(\textbf{Theme 3}\): Minority status and language
    • Minority
    • Aged 5 or older who speaks English “less than well”
  4. \(\textbf{Theme 4}\): Housing type and transportation
    • Multi-unit structures
    • Mobile homes
    • Crowding
    • No vehicle
    • Group quarters

\(^1\) You will often see FIPS and GEOID used interchangeably. This is technically incorrect. Per the US Census Bureau, Federal Information Processing Series (FIPS) codes are 5-digit county subdivisions. A GEOID is a concatenation of codes for state, county, and census tract, totaling 11 characters.

svi <- read.csv('./Confounders/SVI_2018_US.csv') # SVI data for all census tracts in US
svi$FIPS <- str_pad(svi$FIPS, width = 11, side = "left", pad = 0) # Pad GEOIDs so that they are all 11 characters long
unique(nchar(svi$FIPS))
[1] 11
svi_sub <- svi %>%
  filter(ST_ABBR %in% c("CO")) # Subset SVI to only include tracts in Colorado

svi_sub[svi_sub == -999] <- NA # Missing values are coded as -999, here we replace -999 with NA

# Keep percentage, percentile, theme ranking, flag variables, and total population
svi_sub <- svi_sub %>%
  dplyr::select(FIPS, starts_with(c("RPL_THEME")))

unique(nchar(svi_sub$FIPS))
[1] 11

All GEOIDs in the SVI dataset are 11 characters long.

Note: if this says something other than 11 make sure the chunk above is running correctly.

2.2.4 Import Outcome (STEP 4)

This data table contains health outcome data aggregated to census tracts. We first read in the data table and define it as a variable outcome. Next, we select the fields we need (GEOID and any variables related to LWB) and define this selection as a new variable lwb. Lastly, we again format to pad the GEOIDs as done in steps 2 and 3.

outcome <- read.csv('./Outcome/CDPHE_Composite_Selected_Health_Outcome_Dataset_(Census_Tract).csv')
lwb <- outcome[, c(2, 26:32)] # Subset to only include low birthweight variables and GEOID

lwb$TRACT_FIPS <- str_pad(lwb$TRACT_FIPS, width = 11, side = "left", pad = 0) # Pad GEOIDs so that they are all 11 characters long

unique(nchar(lwb$TRACT_FIPS))
[1] 11

All GEOIDs in the LWB dataset are 11 characters long.

Note: if this says something other than 11 make sure the chunk above is running correctly.

2.2.5 Combine Data Tables (STEP 5)

Join the formatted data tables of TRI releases (exposure), SVI confounders, and LWB (outcome) to the census tract shapefile, creating a combined spatial dataset. This combined dataset will then be ready to be fed into statistical models. However, since attribute joins can’t be performed on a large SpatialPolygonsDataframe directly, we first convert dat_tracts into a normal data frame, allowing us to perform join operations.

We define the converted shapefile as new variable dat_full, then we pipe the census tracts into a series of join functions right_join(), joining the TRI, SVI, and LWB data tables to create a combined table. We overwrite dat_full with the new combined data table. Next, we create a subset of the dataset and define as a new variable global_dat for use in the global linear regression model.

Optional: after you finish the workshop, come back to this step and follow instructions in the code comments to investigate how removing potential outliers affects results.

Next, we convert dat_full back into spatial format for use in spatial statistical models. Lastly, we return a summary of dat_full to make sure data was joined properly (reference the finished HTML document to make sure your summary looks correct).

# Combine data tables

# Join data using the dplyr library - this doesn't work on SpatialPolygonDataFrame data types, so convert first
dat_full <- st_as_sf(dat_tracts)

dat_full <- dat_full %>%
  right_join(dat_tri, by = c("GEOID10" = "GEOID10")) %>% # Exposure
  right_join(lwb, by = c("GEOID10" = "TRACT_FIPS")) %>% # Outcome
  right_join(svi_sub, by = c("GEOID10" = "FIPS")) # Confounders

# Create a subset for global models
global_dat <- dat_full %>%
  # select(GEOID10, avgReleasesLb_10yr, LWB_ADJRATE, E_TOTPOP, starts_with("EP_")) # Keep total population estimate (E_TOTPOT) and land area (ALAND10) for later reference
  select(GEOID10, avgReleasesLb_10yr, LWB_ADJRATE, starts_with("RPL_"))
global_dat <- 
  data.frame(global_dat[, 1:8]
             ) 
# %>% # To only keep certain observations, uncomment this line and move the pipe (%>%) up to the line above
#   # Only retain observations we're analyzing
#   slice(
#     -659
#   )

dat_full <- as(na.omit(dat_full), "Spatial") # Convert dat_full back to spatial - we will use this for the spatial models

summary(dat_full)
Object of class SpatialPolygonsDataFrame
Coordinates:
         min        max
x -109.06026 -102.04158
y   36.99242   41.00344
Is projected: FALSE 
proj4string : [+proj=longlat +datum=NAD83 +no_defs]
Data attributes:
  STATEFP10          COUNTYFP10         TRACTCE10           GEOID10         
 Length:1201        Length:1201        Length:1201        Length:1201       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    NAME10           NAMELSAD10          MTFCC10           FUNCSTAT10       
 Length:1201        Length:1201        Length:1201        Length:1201       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    ALAND10             AWATER10         INTPTLAT10         INTPTLON10       
 Min.   :1.556e+05   Min.   :       0   Length:1201        Length:1201       
 1st Qu.:1.932e+06   1st Qu.:       0   Class :character   Class :character  
 Median :3.449e+06   Median :    5172   Mode  :character   Mode  :character  
 Mean   :1.879e+08   Mean   :  858774                                        
 3rd Qu.:1.919e+07   3rd Qu.:  157285                                        
 Max.   :9.748e+09   Max.   :73093771                                        
 avgReleasesLb_10yr  LWB_ADJRATE       LWB_L95CI        LWB_U95CI    
 Min.   :     0     Min.   : 0.000   Min.   : 0.000   Min.   : 0.00  
 1st Qu.:  4893     1st Qu.: 5.800   1st Qu.: 2.710   1st Qu.: 8.89  
 Median :  5013     Median : 7.190   Median : 3.910   Median :10.73  
 Mean   : 25259     Mean   : 7.421   Mean   : 4.057   Mean   :11.31  
 3rd Qu.: 13070     3rd Qu.: 8.840   3rd Qu.: 5.240   3rd Qu.:13.14  
 Max.   :613759     Max.   :25.000   Max.   :12.680   Max.   :45.74  
 LWB_STATEADJRATE   LWB_SL95CI     LWB_SU95CI   LWB_DISPLAY       
 Min.   :7.43     Min.   :7.34   Min.   :7.52   Length:1201       
 1st Qu.:7.43     1st Qu.:7.34   1st Qu.:7.52   Class :character  
 Median :7.43     Median :7.34   Median :7.52   Mode  :character  
 Mean   :7.43     Mean   :7.34   Mean   :7.52                     
 3rd Qu.:7.43     3rd Qu.:7.34   3rd Qu.:7.52                     
 Max.   :7.43     Max.   :7.34   Max.   :7.52                     
   RPL_THEME1       RPL_THEME2       RPL_THEME3       RPL_THEME4    
 Min.   :0.0004   Min.   :0.0012   Min.   :0.0228   Min.   :0.0000  
 1st Qu.:0.1411   1st Qu.:0.1462   1st Qu.:0.2734   1st Qu.:0.1327  
 Median :0.3178   Median :0.3249   Median :0.4555   Median :0.3849  
 Mean   :0.3690   Mean   :0.3773   Mean   :0.4735   Mean   :0.4227  
 3rd Qu.:0.5792   3rd Qu.:0.5855   3rd Qu.:0.6658   3rd Qu.:0.6940  
 Max.   :0.9998   Max.   :0.9992   Max.   :0.9654   Max.   :0.9999  
   RPL_THEMES    
 Min.   :0.0001  
 1st Qu.:0.1159  
 Median :0.3017  
 Mean   :0.3724  
 3rd Qu.:0.6155  
 Max.   :0.9967  

2.2.5.1 Correlation Matrix

Before we pump the data into models, we’ll start by exploring pairwise relationships between them to assess whether there are any nonlinear relationships.

The code chunk below creates a correlation matrix plot of the variables from our combined dataset. We first define a new variable vars as a list of the variable names. Next, we create a subset data table by selecting from global_dat the list of variables contained in vars. Lastly, we pass this data to a plotting function ggpairs() and define the output plot as variable p_cor, and print p_cor to display it below the code chunk.

# # Optional: Log-transform exposure variable
# global_dat <- global_dat %>%
#   mutate(
#     avgReleasesLb_10yr = log10(avgReleasesLb_10yr + 0.00001)
#   )

# Plot correlation matrix of variables from the global model
vars <- c(
  "avgReleasesLb_10yr", "RPL_THEME1", "RPL_THEME2", "RPL_THEME3", "RPL_THEME4"
)

# Create a subset of 'global_dat' with selected variables
global_dat_sub <- global_dat[, c("LWB_ADJRATE", vars)]

# Plot the correlation matrix using ggpairs
p_cor <- ggpairs(
  global_dat_sub,
  upper = list(continuous = wrap("points", alpha = 0.2, size = 0.5)),
  lower = list(continuous = "cor")
)

print(p_cor)

In the correlation matrix above, we don’t see any strongly nonlinear relationships between explanatory variables, but we do see that the exposure variable, avgReleasesLb_10yr is very right-skewed, meaning that the majority of observations are at or close to 0. However, we won’t perform any transformations to the variables to make interpretation and comparison of the three models more straightforward.

If you want to try log-transforming the exposure variable, un-comment the top 5 lines in the chunk above and run the rest of the markdown file as normal. Be sure that you account for this in your interpretation.

2.3 Global Model

We’ll start by fitting a global linear model to predict LWB rates by census tract using our environmental exposure variable, avgReleasesLb\_10yr and 16 variables from the SVI.

2.3.1 Fit Full Linear Model (STEP 6)

In this code chunk we run an ordinary least squares linear regression, with LWB rates as the dependent variable, TRI releases as the exposure, and SVI variables as confounders. We first pass global_dat through the linear model function, lm(), and define the output of the model as lm_full. Then we return a summary of lm_full.

# Fit global model
lm_full <- lm(LWB_ADJRATE ~ avgReleasesLb_10yr + RPL_THEME1 + RPL_THEME2 + RPL_THEME3 + RPL_THEME4
              , data = global_dat)

summary(lm_full)

Call:
lm(formula = LWB_ADJRATE ~ avgReleasesLb_10yr + RPL_THEME1 + 
    RPL_THEME2 + RPL_THEME3 + RPL_THEME4, data = global_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6963 -1.4698 -0.1622  1.2064 18.2728 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         6.195e+00  1.815e-01  34.127  < 2e-16 ***
avgReleasesLb_10yr  5.240e-06  1.218e-06   4.303 1.82e-05 ***
RPL_THEME1          3.181e+00  4.672e-01   6.810 1.54e-11 ***
RPL_THEME2         -5.813e-01  3.310e-01  -1.756   0.0793 .  
RPL_THEME3          1.452e-01  3.877e-01   0.374   0.7082    
RPL_THEME4          1.668e-01  2.990e-01   0.558   0.5771    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.491 on 1195 degrees of freedom
  (69 observations deleted due to missingness)
Multiple R-squared:  0.1127,    Adjusted R-squared:  0.109 
F-statistic: 30.37 on 5 and 1195 DF,  p-value: < 2.2e-16

2.3.2 Global Model Interpretation (STEP 7)

2.3.2.1 Model Diagnostics

Next, we will conduct a few diagnostics of the global model. There are many more possible diagnostics that aren’t presented here. For example, examining a) the leverage of individual observations and b) the individual contribution of variables with added-variable plots.

2.3.2.1.1 Normality of Residuals

First, we plot the distribution of model residuals against a normal probabilistic distribution (a quantile-quantile, or QQ, plot) to see if the residuals demonstrate any unexpected patterns, and to identify any data points that exert increased influence on overall model fit.

In this chunk, we pass the residuals from lm_full into the car library function, qqPlot.

# QQ Plot 
qq_lm_full <- qqPlot(residuals(lm_full), main = "QQ Plot", xlab = "Normal Quantiles", ylab = "Residuals")

print(qq_lm_full)
 641 1216 
 616 1169 

The QQ plot does suggest that the model is somewhat poorly specified because a large proportion of residuals vs. normal values fall outside the standard error range at both the lower and upper bounds, suggesting some heteroskedasticity or kurtosis of model residuals.

If we were to proceed only with a linear model, we might consider taking measures including transforming data or excluding individual observations to account for this nonlinear structure.

2.3.2.1.2 Spatial Autocorrelation of Residuals

Run the code chunk below to test for spatial autocorrelation in the residuals of our linear model.

We do this by first selecting the residuals from the model output and defining the residuals as new variable lm_resid_col. Next, using row names as a key, we join the column of residuals to dat_full.

The function below performs a Anselin Local Moran’s \(I\), which differs from the global Moran’s \(I\) test for spatial autocorrelation by computing the \(I\) statistic for every location, \(i\), as well as the local variance and \(z\) value, as opposed to a single \(I\) for the whole datset. Like the global approach, the local test tells us the degree of similarity between every observation and its neighbors, however it allows us to identify more localized clusters of (dis)similarity.

Once the data are joined we prepare for the test by constructing a list of neighbors for every observation, nb. Next we compute spatial weights lw for every neighbor to pass into the Moran’s \(I\) test by summing all residuals in each neighborhood divided by the number of observations per neighborhood. Then we pass these parameters, along with dat_full$residuals, to the moran.mc function to test for spatial autocorrelation over 999 permutations. Each permutation randomly rearranges the neighborhood values around each observation to account for the inevitable random clustering of values.

# Spatial autocorrelation of global linear model residuals
lm_resid_col <- lm_full$residuals # Make data table of residuals from lm_red_AIC
dat_full$residuals <- lm_resid_col[as.character(rownames(dat_full@data))] # Append residuals to dat_full by row name

nb <- poly2nb(dat_full, queen = TRUE) # Define neighbors for each polygon
lw <- nb2listw(nb, style = "U", zero.policy = TRUE) # Assign weights to neighbors
moran_lm <- moran.mc(dat_full$residuals
                    , lw
                    , nsim = 999
                    , alternative = "two.sided")
print(rbind("Moran's I statistic: " ,moran_lm$statistic))
     statistic              
[1,] "Moran's I statistic: "
[2,] "0.107918388173201"    
print(rbind("P-value of Moran's I statistic: ", moran_lm$p.value)) 
     [,1]                              
[1,] "P-value of Moran's I statistic: "
[2,] "0"                               

The Moran’s \(I\) statistic of the linear model is 0.108 and the p-value is 0. Because the \(p-value\) is \(<0.05\), we can conclude at a 95% CI that the residuals are significantly spatially autocorrelated, meaning that there are significant spatial relationships in the data that are not reconciled by the current global model.

2.4 Local Models

2.4.1 Calculate Distance Matrix (STEP 8)

A key element of geographically weighted regressions is the spatial distance between observations. This is often measured as a straight line (Euclidean distance), but there are other approaches. For example, geodesic distance considers the curvature of the earth, which is appropriate when observations are very far apart

Run the below code chunk to calculate a distance matrix, DM, between all observations in dat_full. A distance matrix is a matrix in which elements correspond to estimates of a pairwise distance between the sequences in a set. This can be simply though of as a measure of the distances between the locations of all of our observations in our data table.

# Calculate distance matrix
DM <- gw.dist(dp.locat = coordinates(dat_full))

2.4.2 Compute Bandwidths (STEP 9)

Bandwidth, or neighborhood size, is the most influential parameter in GWR because it can significantly change parameter estimates. Simply put, larger bandwidths have smaller parameter weights thus increasing variation, and shorter bandwidths have larger weights to reduce local variation of parameters. Here, we use an automated cross-validation (CV) approach to optimize weights for each model term.

# Define optimal bandwidths for GWR
bw <- bw.gwr(LWB_ADJRATE ~ avgReleasesLb_10yr 
                         + RPL_THEME1
                         + RPL_THEME2
                         + RPL_THEME3
                         + RPL_THEME4
           , data = dat_full
           , approach = "CV"
           , kernel = "gaussian"
           , adaptive = TRUE
           , longlat = TRUE
           , dMat = DM)
Adaptive bandwidth: 749 CV score: 7413.456 
Adaptive bandwidth: 471 CV score: 7386.594 
Adaptive bandwidth: 297 CV score: 7378.946 
Adaptive bandwidth: 192 CV score: 7343.997 
Adaptive bandwidth: 124 CV score: 7320.2 
Adaptive bandwidth: 85 CV score: 7307.044 
Adaptive bandwidth: 58 CV score: 7317.09 
Adaptive bandwidth: 99 CV score: 7319.052 
Adaptive bandwidth: 73 CV score: 7303.862 
Adaptive bandwidth: 69 CV score: 7311.845 
Adaptive bandwidth: 79 CV score: 7302.908 
Adaptive bandwidth: 79 CV score: 7302.908 

2.4.3 Run GWR (STEP 10)

Run to code chunk below to perform a basic GWR. We do this by passing our distance matrix DM and bandwidth bw variables, along with dat_full, to a GWR function gwr.basic() and defining its output as a new variable gwr_results_dat. We then print a summary of model results.

# Basic GWR
gwr_results_dat <- gwr.basic(LWB_ADJRATE ~ avgReleasesLb_10yr 
                                         + RPL_THEME1
                                         + RPL_THEME2
                                         + RPL_THEME3
                                         + RPL_THEME4
                           , data = dat_full
                           , bw = bw
                           , dMat = DM
                           , kernel = "gaussian"
                           , adaptive = TRUE
                           , longlat = TRUE)
gwr_results_dat
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2024-08-19 16:50:31.451545 
   Call:
   gwr.basic(formula = LWB_ADJRATE ~ avgReleasesLb_10yr + RPL_THEME1 + 
    RPL_THEME2 + RPL_THEME3 + RPL_THEME4, data = dat_full, bw = bw, 
    kernel = "gaussian", adaptive = TRUE, longlat = TRUE, dMat = DM)

   Dependent (y) variable:  LWB_ADJRATE
   Independent variables:  avgReleasesLb_10yr RPL_THEME1 RPL_THEME2 RPL_THEME3 RPL_THEME4
   Number of data points: 1201
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-8.6963 -1.4698 -0.1622  1.2064 18.2728 

   Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
   (Intercept)         6.195e+00  1.815e-01  34.127  < 2e-16 ***
   avgReleasesLb_10yr  5.240e-06  1.218e-06   4.303 1.82e-05 ***
   RPL_THEME1          3.181e+00  4.672e-01   6.810 1.54e-11 ***
   RPL_THEME2         -5.813e-01  3.310e-01  -1.756   0.0793 .  
   RPL_THEME3          1.452e-01  3.877e-01   0.374   0.7082    
   RPL_THEME4          1.668e-01  2.990e-01   0.558   0.5771    

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 2.491 on 1195 degrees of freedom
   Multiple R-squared: 0.1127
   Adjusted R-squared: 0.109 
   F-statistic: 30.37 on 5 and 1195 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 7413.404
   Sigma(hat): 2.486562
   AIC:  5608.273
   AICc:  5608.367
   BIC:  4492.545
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 79 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ****************Summary of GWR coefficient estimates:******************
                             Min.     1st Qu.      Median     3rd Qu.    Max.
   Intercept           3.9654e+00  5.8843e+00  6.2531e+00  7.1628e+00 13.6423
   avgReleasesLb_10yr -1.5155e-03 -2.9393e-04  4.1393e-06  7.3558e-06  0.0004
   RPL_THEME1          9.2644e-01  1.8952e+00  2.2256e+00  2.7404e+00  4.4528
   RPL_THEME2         -2.1869e+00 -1.0989e+00 -7.0184e-01 -2.6987e-01  1.4044
   RPL_THEME3         -2.1105e+00  3.5157e-01  7.9077e-01  1.2518e+00  2.9512
   RPL_THEME4         -4.6958e-01  1.7138e-01  4.1072e-01  6.5197e-01  1.3664
   ************************Diagnostic information*************************
   Number of data points: 1201 
   Effective number of parameters (2trace(S) - trace(S'S)): 57.17895 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 1143.821 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 5581.704 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 5535.158 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 4586.145 
   Residual sum of squares: 6818.376 
   R-square value:  0.1839553 
   Adjusted R-square value:  0.143126 

   ***********************************************************************
   Program stops at: 2024-08-19 16:50:31.711461 

2.4.4 GWR Model Interpretation (STEP 11)

2.4.4.1 Model Diagnostics

Run the chunk below to view basic model diagnostics of gwr_results_dat.

# Model diagnostics
gwr_results_dat$GW.diagnostic
$RSS.gw
[1] 6818.376

$AIC
[1] 5535.158

$AICc
[1] 5581.704

$enp
[1] 57.17895

$edf
[1] 1143.821

$gw.R2
[1] 0.1839553

$gwR2.adj
[1] 0.143126

$BIC
[1] 4586.145
2.4.4.1.1 Normality of Residuals
qq_gwr_results_dat <- qqPlot(gwr_results_dat$SDF$residual, main = "QQ Plot", xlab = "Normal Quantiles", ylab = "Residuals")

print(qq_gwr_results_dat)
[1]  616 1169

Again, we see that the residuals aren’t normally distributed in the GWR model. There are still some lurking relationships.

2.4.4.1.2 Spatial Autocorrelation of Residuals

Just like we did for the linear model, we’ll test the spatial autocorrelation of model residuals. First, we define the neighborhood kernel function nb and weighting scheme lw.

# Spatial autocorrelation of basic GWR residuals
nb <- poly2nb(gwr_results_dat$SDF, queen = TRUE) # Define neighbors for each polygon
lw <- nb2listw(nb, style = "U", zero.policy = TRUE) # Assign weights to neighbors
moran_gwr <- moran.mc(gwr_results_dat$SDF$residual
                    , lw
                    , nsim = 999
                    , alternative = "two.sided")
print(rbind("Moran's I statistic: " ,moran_gwr$statistic))
     statistic              
[1,] "Moran's I statistic: "
[2,] "0.0551949767251689"   
print(rbind("P-value of Moran's I statistic: ", moran_gwr$p.value)) 
     [,1]                              
[1,] "P-value of Moran's I statistic: "
[2,] "0.002"                           

The Moran’s \(I\) statistic of the GWR model is 0.0552 and the p-value is 0.002. Because the \(p-value\) is \(<0.05\), we can conclude at a 95% CI that the residuals are significantly spatially autocorrelated, meaning that there are significant spatial relationships in the data that are not reconciled by the current local model.

2.4.4.2 Plot Distribution of Beta Coefficients

Similar to global linear models, GWR returns \(\beta\) coefficients for each explanatory variable. However, there are beta coefficients for each explanatory variable at each observation as defined by the model fit in the local neighborhood. A helpful way to interpret this is to look at the overall distribution of each \(\beta\).

We first reshape the coefficient data table using the melt() function to reshape the data from wide to long format and define a new variable melt_coefficients. We then use GGplot2 to draw boxplots of the distributions with geom_boxplot() and draw a red dashed line at y=0 with geom_hline to help with our interpretation. The plot is named p_gwr_bp.

# Boxplots of coefficients
# coefficients <- data.frame(mgwr_results_dat_full$SDF[, 2:11])
coefficients <- data.frame(gwr_results_dat$SDF[, 2:6])
melt_coefficients <- reshape2::melt(coefficients)

p_gwr_bp <- ggplot(melt_coefficients, aes(x = fct_reorder(variable, value, .fun = mean), y = value))
p_gwr_bp <- p_gwr_bp + geom_boxplot()
p_gwr_bp <- p_gwr_bp + labs(x = "", y = "Coefficient") + ggtitle("Boxplots of GWR Coefficients")
p_gwr_bp <- p_gwr_bp + theme(axis.text.x = element_text(angle = 40, hjust = 1))
p_gwr_bp <- p_gwr_bp + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

print(p_gwr_bp)

Generally, group means >0 suggest a positive relationship between the explanatory variable and outcome and vice versa. Sometimes \(\beta\) coefficients are both positive and negative – this can make interpretation difficult, requiring deeper interrogation of how local relationships play out. In the plots above, we see that avgReleasesLb_10yr, RPL_THEME2, RPL_THEM3 and RPL_TMEME4 Next, we will map these coefficients to see where this is occurring.

2.4.4.3 Mapping GWR Beta Coefficients

Run the code chunk below to generate maps. The code will generate a map of the census tracts of Colorado colored by the \(\beta\) coefficients of each explanatory variable in gwr_results_dat. We further shade each census tract where the observation is insignificant at a 95% CI based on the local \(T\) value.

We do this by creating a series of sub-plots, using a custom plotting function plot.gwr.coefs. We pass MGWR results for each variable to this function, generating a sub-plot that we define as its own variable p1_gwr - p6_gwr. Next, we pass these sub-plot variables to a plot arrangement function tmap_arrange that will organize the sub-plots into a neat, combined plot. We define this combined plot as p_gwr. Note: depending on your computer, the maps may take a while to generate.

# Map coefficients
p1_gwr <- plot.gwr.coefs(gwr_results_dat$SDF, "Intercept", gwr_results_dat$SDF$Intercept_TV)
p2_gwr <- plot.gwr.coefs(gwr_results_dat$SDF, "avgReleasesLb_10yr", gwr_results_dat$SDF$avgReleasesLb_10yr_TV)
p3_gwr <- plot.gwr.coefs(gwr_results_dat$SDF, "RPL_THEME1", gwr_results_dat$SDF$RPL_THEME1_TV)
p4_gwr <- plot.gwr.coefs(gwr_results_dat$SDF, "RPL_THEME2", gwr_results_dat$SDF$RPL_THEME2_TV)
p5_gwr <- plot.gwr.coefs(gwr_results_dat$SDF, "RPL_THEME3", gwr_results_dat$SDF$RPL_THEME3_TV)
p6_gwr <- plot.gwr.coefs(gwr_results_dat$SDF, "RPL_THEME4", gwr_results_dat$SDF$RPL_THEME4_TV)


p_gwr <- tmap_arrange(p1_gwr
                    , p2_gwr
                    , p3_gwr
                    , p4_gwr
                    , p5_gwr
                    , p6_gwr
                    , ncol = 2)
print(p_gwr)

In the maps above, we see that avgReleasesLb_10yr is significant (95% CI) and positive (though only slightly) across most of the state, but that it is consistently insignificant throughout the central portion of Colorado where most of the population is concentrated and where census tracts tend to be considerably smaller. What other explanatory variables might help improve these results?

2.4.4.4 Map Local R2

Another important output from GWR models is the local \(R^2\) – the \(R^2\) value at every observation.

# Plot local R2
p_gwr <- tm_shape(gwr_results_dat$SDF) +
  tm_polygons("Local_R2", palette = "RdYlBu", title = "Local R2") +
  tm_layout(legend.position = c("right", "bottom"), 
            legend.outside = TRUE 
            )

print(p_gwr)

In the map above, we see that the predictive power of the GWR model is spatially heterogeneous, with the strongest predictive power in census tracts throughout the south-central portion of Colorado. Further, we see that the model is poorly specified in most census tracts along the eastern portion of the state. Beyond including spatial relationships as model dependencies, GWR can also help us identify specific locations or regions where our models fail to predict outcomes, possibly narrowing our search for other potential covariates.

3 Conclusion

Congratulations! You have made it to the last step of the workshop. So far, you have:

  1. Formatted data
  2. Ran global and local statistical tests
  3. Analyzed model diagnostics, and
  4. Plotted simple maps of local model results.

The main takeaway from this workshop is that there are sophisticated techniques for exploring the inherently spatial nature of environmental exposures and health outcomes. With some minimal data formatting, you should be able to modify the code contained here to explore these types of relationships in your own research.

When you run the last code chunk below you will see that the overall predictive power of the local regression (GWR) is better than the global linear model – this is apparent when comparing the \(R^2\) and adjusted \(R^2\) values. Also, the GWR model strength is better, as is shown by lower \(AIC\) and \(BIC\) values. However, we haven’t accounted for spatial autocorrelation, as you can see in the row for Moran’s \(I\) and the associated \(p\)-value.

This is a crucial takeaway: spatial regression models don’t automatically resolve lurking spatial relationships, but they do account for some portion of them. Aside from identifying more variables to explain the spatial nature of exposure-outcome models, recall that there are other approaches like spatial autoregressive (SAR) models, which introduce spatial lag or spatial error terms to the regression. It’s worth noting that SAR models are inherently global because they cannot spatially locate outliers.

# Data frame for summary table
table_data <- data.frame(
  Metric = c("R2", "Adj. R2", "AIC", "BIC", "Moran's I (p-value)"),
  LM = c(summary(lm_full)$r.squared %>% signif(3)
       , summary(lm_full)$adj.r.squared %>% signif(3)
       , AIC(lm_full) %>% signif(7)
       , BIC(lm_full) %>% signif(7)
       , paste0(moran_lm$statistic %>% signif(3), " (", moran_lm$p.value %>% signif(3), ")")),
  GWR = c(gwr_results_dat$GW.diagnostic$gw.R2 %>% signif(3)
       , gwr_results_dat$GW.diagnostic$gwR2.adj %>% signif(3)
       , gwr_results_dat$GW.diagnostic$AIC %>% signif(7)
       , gwr_results_dat$GW.diagnostic$BIC %>% signif(7)
       , paste0(moran_gwr$statistic %>% signif(3), " (", moran_gwr$p.value %>% signif(3), ")"))
)

# Print the table
knitr::kable(table_data, format = "markdown")
Metric LM GWR
R2 0.113 0.184
Adj. R2 0.109 0.143
AIC 5608.273 5535.158
BIC 5643.909 4586.145
Moran’s I (p-value) 0.108 (0) 0.0552 (0.002)

If you have time, consider running the remaining code to conduct a multiscale GWR. As long as everything before this point has been run, the MGWR will work with all of the data and objects in the current R environment.

3.1 Optional Step: MGWR

3.1.1 Run MGWR (OPTIONAL STEP 1)

Run the below code chunk to perform a multiscale geographically weighted regression (MGWR). The model is set up the same was as the GWR, using the gwr.multiscale() function. The major difference here is that we do not pre-define a distance matrix and bandwidth. The MGWR does this automatically, optimizing distance and bandwidth based on scalar relationships in the data. To do this, MGWR runs though many iterations, eventually discovering an optimized model based on a criterion. This is very computationally intensive and can take several hours to run depending on the data and number of iterations you specify. For the purposes of this workshop, we will only do 1 iteration. Running more iterations allows the model to converge on a pre-specified criterion - here we use the changing values of the residual sum of squares (CVR) backfitting procedure to reach convergence.

To see what an optimized model looks like, refer to the HTML document in which the model was run with a maximum of 100 iterations. If you happen to have a powerful laptop, try running this step again after you complete the workshop with more iterations to see how the model improves. This will take a at least a couple of hours with the current dataset.

Note: even when using 1 iteration, the model can take several minutes to run. Start the code chunk below and sit back. If it takes to long, do not fret! Feel free to follow along with us for the final steps and reference to pdf

# Multiscale geographically weighted regression
# Documentation: https://search.r-project.org/CRAN/refmans/GWmodel/html/gwr.multiscale.html
mgwr_results_dat_full <- gwr.multiscale(LWB_ADJRATE ~ avgReleasesLb_10yr 
                                               + RPL_THEME1
                                               + RPL_THEME2
                                               + RPL_THEME3
                                               + RPL_THEME4
                                      , data = dat_full
                                      , max.iterations = 1 # Set max iterations higher to optimize model - it's at 2 now to minimize computational burden
                                      , kernel = "gaussian" # wgt = exp(-.5*(vdist/bw)^2);
                                      # , kernel = "exponential" # wgt = exp(-vdist/bw);
                                      # , kernel = "bisquare" # wgt = (1-(vdist/bw)^2)^2 if vdist < bw, wgt=0 otherwise;
                                      # , kernel = "tricube" # wgt = (1-(vdist/bw)^3)^3 if vdist < bw, wgt=0 otherwise;
                                      # , kernel = "boxcar" # wgt=1 if dist < bw, wgt=0 otherwise
                                      , adaptive = TRUE
                                      , criterion = "CVR"
)
------ Calculate the initial bandwidths for each independent variable ------
Now select an optimum bandwidth for the model:  LWB_ADJRATE~1 
[1] "bws0[i]<-bw.gwr(LWB_ADJRATE~1,data=regression.points,kernel=kernel,approach=approach,adaptive=adaptive,dMat=dMats[[var.dMat.indx[i]]], parallel.method=parallel.method,parallel.arg=parallel.arg)"
Adaptive bandwidth (number of nearest neighbours): 749 AICc value: 5725.029 
Adaptive bandwidth (number of nearest neighbours): 471 AICc value: 5719.045 
Adaptive bandwidth (number of nearest neighbours): 297 AICc value: 5712.573 
Adaptive bandwidth (number of nearest neighbours): 192 AICc value: 5697.876 
Adaptive bandwidth (number of nearest neighbours): 124 AICc value: 5681.93 
Adaptive bandwidth (number of nearest neighbours): 85 AICc value: 5673.174 
Adaptive bandwidth (number of nearest neighbours): 58 AICc value: 5660.343 
Adaptive bandwidth (number of nearest neighbours): 44 AICc value: 5645.808 
Adaptive bandwidth (number of nearest neighbours): 32 AICc value: 5640.273 
Adaptive bandwidth (number of nearest neighbours): 28 AICc value: 5638.047 
Adaptive bandwidth (number of nearest neighbours): 22 AICc value: 5634.44 
Adaptive bandwidth (number of nearest neighbours): 22 AICc value: 5634.44 
Now select an optimum bandwidth for the model:  LWB_ADJRATE~avgReleasesLb_10yr 
[1] "bws0[i]<-bw.gwr(LWB_ADJRATE~avgReleasesLb_10yr,data=regression.points,kernel=kernel,approach=approach,adaptive=adaptive,dMat=dMats[[var.dMat.indx[i]]], parallel.method=parallel.method,parallel.arg=parallel.arg)"
Adaptive bandwidth (number of nearest neighbours): 749 AICc value: 5714.007 
Adaptive bandwidth (number of nearest neighbours): 471 AICc value: 5708.703 
Adaptive bandwidth (number of nearest neighbours): 297 AICc value: 5703.638 
Adaptive bandwidth (number of nearest neighbours): 192 AICc value: 5692.62 
Adaptive bandwidth (number of nearest neighbours): 124 AICc value: 5680.436 
Adaptive bandwidth (number of nearest neighbours): 85 AICc value: 5671.101 
Adaptive bandwidth (number of nearest neighbours): 58 AICc value: 5657.442 
Adaptive bandwidth (number of nearest neighbours): 44 AICc value: 5646.502 
Adaptive bandwidth (number of nearest neighbours): 32 AICc value: 5641.187 
Adaptive bandwidth (number of nearest neighbours): 28 AICc value: 5639.247 
Adaptive bandwidth (number of nearest neighbours): 22 AICc value: 5635.307 
Adaptive bandwidth (number of nearest neighbours): 22 AICc value: 5635.307 
Now select an optimum bandwidth for the model:  LWB_ADJRATE~RPL_THEME1 
[1] "bws0[i]<-bw.gwr(LWB_ADJRATE~RPL_THEME1,data=regression.points,kernel=kernel,approach=approach,adaptive=adaptive,dMat=dMats[[var.dMat.indx[i]]], parallel.method=parallel.method,parallel.arg=parallel.arg)"
Adaptive bandwidth (number of nearest neighbours): 749 AICc value: 5606.055 
Adaptive bandwidth (number of nearest neighbours): 471 AICc value: 5602.52 
Adaptive bandwidth (number of nearest neighbours): 297 AICc value: 5599.378 
Adaptive bandwidth (number of nearest neighbours): 192 AICc value: 5592.723 
Adaptive bandwidth (number of nearest neighbours): 124 AICc value: 5588.039 
Adaptive bandwidth (number of nearest neighbours): 85 AICc value: 5583.093 
Adaptive bandwidth (number of nearest neighbours): 58 AICc value: 5570.616 
Adaptive bandwidth (number of nearest neighbours): 44 AICc value: 5563.951 
Adaptive bandwidth (number of nearest neighbours): 32 AICc value: 5565.543 
Adaptive bandwidth (number of nearest neighbours): 48 AICc value: 5563.519 
Adaptive bandwidth (number of nearest neighbours): 54 AICc value: 5568.482 
Adaptive bandwidth (number of nearest neighbours): 48 AICc value: 5563.519 
Now select an optimum bandwidth for the model:  LWB_ADJRATE~RPL_THEME2 
[1] "bws0[i]<-bw.gwr(LWB_ADJRATE~RPL_THEME2,data=regression.points,kernel=kernel,approach=approach,adaptive=adaptive,dMat=dMats[[var.dMat.indx[i]]], parallel.method=parallel.method,parallel.arg=parallel.arg)"
Adaptive bandwidth (number of nearest neighbours): 749 AICc value: 5694.799 
Adaptive bandwidth (number of nearest neighbours): 471 AICc value: 5687.316 
Adaptive bandwidth (number of nearest neighbours): 297 AICc value: 5680.086 
Adaptive bandwidth (number of nearest neighbours): 192 AICc value: 5669.326 
Adaptive bandwidth (number of nearest neighbours): 124 AICc value: 5660.7 
Adaptive bandwidth (number of nearest neighbours): 85 AICc value: 5652.042 
Adaptive bandwidth (number of nearest neighbours): 58 AICc value: 5643.764 
Adaptive bandwidth (number of nearest neighbours): 44 AICc value: 5642.186 
Adaptive bandwidth (number of nearest neighbours): 32 AICc value: 5643.137 
Adaptive bandwidth (number of nearest neighbours): 48 AICc value: 5642.4 
Adaptive bandwidth (number of nearest neighbours): 38 AICc value: 5642.996 
Adaptive bandwidth (number of nearest neighbours): 44 AICc value: 5642.186 
Now select an optimum bandwidth for the model:  LWB_ADJRATE~RPL_THEME3 
[1] "bws0[i]<-bw.gwr(LWB_ADJRATE~RPL_THEME3,data=regression.points,kernel=kernel,approach=approach,adaptive=adaptive,dMat=dMats[[var.dMat.indx[i]]], parallel.method=parallel.method,parallel.arg=parallel.arg)"
Adaptive bandwidth (number of nearest neighbours): 749 AICc value: 5673.152 
Adaptive bandwidth (number of nearest neighbours): 471 AICc value: 5665.005 
Adaptive bandwidth (number of nearest neighbours): 297 AICc value: 5661.156 
Adaptive bandwidth (number of nearest neighbours): 192 AICc value: 5646.028 
Adaptive bandwidth (number of nearest neighbours): 124 AICc value: 5631.212 
Adaptive bandwidth (number of nearest neighbours): 85 AICc value: 5627.872 
Adaptive bandwidth (number of nearest neighbours): 58 AICc value: 5619.362 
Adaptive bandwidth (number of nearest neighbours): 44 AICc value: 5614.68 
Adaptive bandwidth (number of nearest neighbours): 32 AICc value: 5612.499 
Adaptive bandwidth (number of nearest neighbours): 28 AICc value: 5612.058 
Adaptive bandwidth (number of nearest neighbours): 22 AICc value: 5612.188 
Adaptive bandwidth (number of nearest neighbours): 28 AICc value: 5612.058 
Now select an optimum bandwidth for the model:  LWB_ADJRATE~RPL_THEME4 
[1] "bws0[i]<-bw.gwr(LWB_ADJRATE~RPL_THEME4,data=regression.points,kernel=kernel,approach=approach,adaptive=adaptive,dMat=dMats[[var.dMat.indx[i]]], parallel.method=parallel.method,parallel.arg=parallel.arg)"
Adaptive bandwidth (number of nearest neighbours): 749 AICc value: 5672.334 
Adaptive bandwidth (number of nearest neighbours): 471 AICc value: 5667.349 
Adaptive bandwidth (number of nearest neighbours): 297 AICc value: 5662.794 
Adaptive bandwidth (number of nearest neighbours): 192 AICc value: 5649.303 
Adaptive bandwidth (number of nearest neighbours): 124 AICc value: 5637.043 
Adaptive bandwidth (number of nearest neighbours): 85 AICc value: 5632.113 
Adaptive bandwidth (number of nearest neighbours): 58 AICc value: 5624.241 
Adaptive bandwidth (number of nearest neighbours): 44 AICc value: 5619.521 
Adaptive bandwidth (number of nearest neighbours): 32 AICc value: 5619.662 
Adaptive bandwidth (number of nearest neighbours): 48 AICc value: 5620.12 
Adaptive bandwidth (number of nearest neighbours): 38 AICc value: 5620.974 
Adaptive bandwidth (number of nearest neighbours): 44 AICc value: 5619.521 
------            The end for the initial selections              ------
Calculate the initial beta0 from the above bandwidths.
End of calculating the inital beta0.
Find the optimum bandwidths for each independent variable.
    Iteration 1
Now select an optimum bandwidth for the variable: Intercept
The newly selected bandwidth for variable: Intercept is 8
The bandwidth used in the last iteration is: 22 and the difference in bandwidths is: 14
The bandwidth for variable Intercept will be continually selected in the next iteration.
Now select an optimum bandwidth for the variable: avgReleasesLb_10yr
The newly selected bandwidth for variable: avgReleasesLb_10yr is 13
The bandwidth used in the last iteration is: 22 and the difference in bandwidths is: 9
The bandwidth for variable avgReleasesLb_10yr will be continually selected in the next iteration.
Now select an optimum bandwidth for the variable: RPL_THEME1
The newly selected bandwidth for variable: RPL_THEME1 is 136
The bandwidth used in the last iteration is: 48 and the difference in bandwidths is: 88
The bandwidth for variable RPL_THEME1 will be continually selected in the next iteration.
Now select an optimum bandwidth for the variable: RPL_THEME2
The newly selected bandwidth for variable: RPL_THEME2 is 198
The bandwidth used in the last iteration is: 44 and the difference in bandwidths is: 154
The bandwidth for variable RPL_THEME2 will be continually selected in the next iteration.
Now select an optimum bandwidth for the variable: RPL_THEME3
The newly selected bandwidth for variable: RPL_THEME3 is 137
The bandwidth used in the last iteration is: 28 and the difference in bandwidths is: 109
The bandwidth for variable RPL_THEME3 will be continually selected in the next iteration.
Now select an optimum bandwidth for the variable: RPL_THEME4
The newly selected bandwidth for variable: RPL_THEME4 is 914
The bandwidth used in the last iteration is: 44 and the difference in bandwidths is: 870
The bandwidth for variable RPL_THEME4 will be continually selected in the next iteration.
Iteration 1 change of RSS (CVR) is 1044.67.
# Model outputs
print(mgwr_results_dat_full)
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2024-08-19 16:51:16.73084 
   Call:
   gwr.multiscale(formula = LWB_ADJRATE ~ avgReleasesLb_10yr + RPL_THEME1 + 
    RPL_THEME2 + RPL_THEME3 + RPL_THEME4, data = dat_full, kernel = "gaussian", 
    adaptive = TRUE, criterion = "CVR", max.iterations = 1)

   Dependent (y) variable:  LWB_ADJRATE
   Independent variables:  avgReleasesLb_10yr RPL_THEME1 RPL_THEME2 RPL_THEME3 RPL_THEME4
   Number of data points: 1201
   ***********************************************************************
   *                       Multiscale (PSDM) GWR                          *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidths for each coefficient(number of nearest neighbours): 
              (Intercept) avgReleasesLb_10yr RPL_THEME1 RPL_THEME2 RPL_THEME3
   Bandwidth            8                 13        136        198        137
              RPL_THEME4
   Bandwidth         914

   ****************Summary of GWR coefficient estimates:******************
                             Min.     1st Qu.      Median     3rd Qu.   Max.
   Intercept           3.3097e+00  5.5051e+00  6.0570e+00  6.5932e+00 8.7950
   avgReleasesLb_10yr -3.2364e-04 -4.1579e-05  2.1347e-06  1.1319e-05 0.0002
   RPL_THEME1          9.5182e-01  1.7539e+00  2.1220e+00  2.4468e+00 3.2744
   RPL_THEME2         -9.8605e-01 -7.7168e-01 -4.8491e-01 -2.7664e-01 0.2710
   RPL_THEME3          7.3008e-02  7.5191e-01  9.3871e-01  1.3215e+00 2.2504
   RPL_THEME4          2.6014e-01  2.7397e-01  2.9280e-01  3.6921e-01 0.4155
   ************************Diagnostic information*************************
   Number of data points: 1201 
   Effective number of parameters (2trace(S) - trace(S'S)): -17.42154 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 1218.422 
   AICc value:  5580.073 
   AIC value:  5443.191 
   BIC value:  4920.9 
   Residual sum of squares:  5957.864 
   R-square value:  0.2869441 
   Adjusted R-square value:  0.297148 

   ***********************************************************************
   Program stops at: 2024-08-19 16:51:27.282775 

3.1.2 MGWR Model Interpretation (OPTIONAL STEP 2)

3.1.2.1 Model Diagnostics

# Model diagnostics
print(mgwr_results_dat_full$GW.diagnostic)
$RSS.gw
[1] 5957.864

$AICc
[1] 5580.073

$AIC
[1] 5443.191

$BIC
[1] 4920.9

$R2.val
[1] 0.2869441

$R2adj
[1] 0.297148

$edf
[1] 1218.422

$enp
[1] -17.42154
3.1.2.1.1 Normality of Residuals
qq_gwr_results_dat <- qqPlot(mgwr_results_dat_full$SDF$residual, main = "QQ Plot", xlab = "Normal Quantiles", ylab = "Residuals")

print(qq_gwr_results_dat)
[1]  616 1169

What do you see in the model residuals? What does it mean if they’re not normally distributed?

3.1.2.1.2 Spatial Autocorrelation
# Spatial autocorrelation of MGWR residuals
nb <- poly2nb(mgwr_results_dat_full$SDF, queen = TRUE) # Define neighbors for each polygon
lw <- nb2listw(nb, style = "U", zero.policy = TRUE) # Assign weights to neighbors
moran_mgwr <- moran.mc(mgwr_results_dat_full$SDF$residual
                       , lw
                       , nsim = 999
                       , alternative = "two.sided")
print(rbind("Moran's I statistic: " ,moran_mgwr$statistic))
     statistic              
[1,] "Moran's I statistic: "
[2,] "-0.0272170785540848"  
print(rbind("P-value of Moran's I statistic: ", moran_mgwr$p.value)) 
     [,1]                              
[1,] "P-value of Moran's I statistic: "
[2,] "0.106"                           

The Moran’s \(I\) statistic of the multiscale GWR model is -0.0272 and the p-value is 0.106. Because the \(p-value\) is \(>0.05\), we can conclude at a 95% CI that the residuals are not significantly spatially autocorrelated, meaning that the model is correctly specified in terms of lurking spatial relationships.

3.1.2.2 Plot Distribution of Beta Coefficients

Exactly as with GWR models, MGWR returns \(\beta\) coefficients for each explanatory variable. Run the code chunk below to plot boxplots of MGWR coefficients, p_mgwr_bp.

# Boxplots of coefficients
# coefficients <- data.frame(mgwr_results_dat_full$SDF[, 2:11])
coefficients <- data.frame(mgwr_results_dat_full$SDF[, 2:6])
melt_coefficients <- reshape2::melt(coefficients)

p_mgwr_bp <- ggplot(melt_coefficients, aes(x = fct_reorder(variable, value, .fun = mean), y = value))
p_mgwr_bp <- p_mgwr_bp + geom_boxplot()
p_mgwr_bp <- p_mgwr_bp + labs(x = "", y = "Coefficient") + ggtitle("Boxplots of MGWR Coefficients")
p_mgwr_bp <- p_mgwr_bp + theme(axis.text.x = element_text(angle = 40, hjust = 1))
p_mgwr_bp <- p_mgwr_bp + geom_hline(yintercept = 0, linetype = "dashed", color = "red")

print(p_mgwr_bp)

In the plots above, we see that RPL_THEME1 and RPL_TMEME3 (to an extent) are both positive and negative. Again, we will map these coefficients to see where this is occurring.

\(\textbf{A note on MGWR iterations:}\) In the HTML document you will see that we ran the MGWR with a maximum of 100 iterations in hopes that the model will converge using the CVR backfitting criterion. Even though it doesn’t converge with this adjustment, we do see different results for multiple model diagnostics, holding everything else constant. Most notably, \(\beta\) coefficients for RPL_THEME1 are now solidly positive. While you can make meaningful inference about model strength with one or a few iterations, it’s good practice to let the model approach convergence.

3.2 Mapping Results (OPTIONAL STEP 3)

Run the code chunk below to generate maps of MGWR \(\beta\) coefficients. Note: depending on your computer, the maps may take a while to generate.

# Map coefficients
p1_mgwr <- plot.gwr.coefs(mgwr_results_dat_full$SDF, "Intercept", mgwr_results_dat_full$SDF$Intercept_TV)
p2_mgwr <- plot.gwr.coefs(mgwr_results_dat_full$SDF, "avgReleasesLb_10yr", mgwr_results_dat_full$SDF$avgReleasesLb_10yr_TV)
p3_mgwr <- plot.gwr.coefs(mgwr_results_dat_full$SDF, "RPL_THEME1", mgwr_results_dat_full$SDF$RPL_THEME1_TV)
p4_mgwr <- plot.gwr.coefs(mgwr_results_dat_full$SDF, "RPL_THEME2", mgwr_results_dat_full$SDF$RPL_THEME2_TV)
p5_mgwr <- plot.gwr.coefs(mgwr_results_dat_full$SDF, "RPL_THEME3", mgwr_results_dat_full$SDF$RPL_THEME3_TV)
p6_mgwr <- plot.gwr.coefs(mgwr_results_dat_full$SDF, "RPL_THEME4", mgwr_results_dat_full$SDF$RPL_THEME4_TV)


p_mgwr <- tmap_arrange(p1_mgwr
                     , p2_mgwr
                     , p3_mgwr
                     , p4_mgwr
                     , p5_mgwr
                     , p6_mgwr
                     , ncol = 2)
print(p_mgwr)

In the map above, we see that the main exposure variable, avgReleasesLb_10yr is only significant in about half of the census tracts, and that the linear relationship between it and LWB_ADJRATE is generally quite small, albeit positive. This suggests that while the model is generally well-specified, future work ought to explore more influential exposure variables on LWB.

Finally, let’s compare selected diagnostics across all three models:

# Data frame for summary table
table_data <- data.frame(
  Metric = c("R2", "Adj. R2", "AIC", "BIC", "Moran's I (p-value)"),
  LM = c(summary(lm_full)$r.squared %>% signif(3)
       , summary(lm_full)$adj.r.squared %>% signif(3)
       , AIC(lm_full) %>% signif(7)
       , BIC(lm_full) %>% signif(7)
       , paste0(moran_lm$statistic %>% signif(3), " (", moran_lm$p.value %>% signif(3), ")")),
  GWR = c(gwr_results_dat$GW.diagnostic$gw.R2 %>% signif(3)
       , gwr_results_dat$GW.diagnostic$gwR2.adj %>% signif(3)
       , gwr_results_dat$GW.diagnostic$AIC %>% signif(7)
       , gwr_results_dat$GW.diagnostic$BIC %>% signif(7)
       , paste0(moran_gwr$statistic %>% signif(3), " (", moran_gwr$p.value %>% signif(3), ")")),
  MGWR = c(mgwr_results_dat_full$GW.diagnostic$R2.val %>% signif(3)
       , mgwr_results_dat_full$GW.diagnostic$R2adj %>% signif(3)
       , mgwr_results_dat_full$GW.diagnostic$AIC %>% signif(7)
       , mgwr_results_dat_full$GW.diagnostic$BIC %>% signif(7)
       , paste0(moran_mgwr$statistic %>% signif(3), " (", moran_mgwr$p.value %>% signif(3), ")"))
)

# Print the table
knitr::kable(table_data, format = "markdown")
Metric LM GWR MGWR
R2 0.113 0.184 0.287
Adj. R2 0.109 0.143 0.297
AIC 5608.273 5535.158 5443.191
BIC 5643.909 4586.145 4920.9
Moran’s I (p-value) 0.108 (0) 0.0552 (0.002) -0.0272 (0.106)

This workshop is supported by New Mexico Integrative Science Program Incorporating Research in Environmental Sciences (NM-INSPIRES) Center at the University of New Mexico (1P30ES032755). Additional support comes from the University of New Mexico METALS Superfund Research Program (1P42ES025589), the the UNM Center for Native Environmental Health Equity Research – A Center of Excellence In Environmental Health Disparities Research (P50MD015706), and the UNM Department of Communication and Journalism. The material presented here has not been formally reviewed by the funding agencies. The views expressed herein are solely those of the authors and do not necessarily reflect those of the agencies.

4 Selected References

Brunsdon, C., S. Fotheringham, and M. Charlton. 1998. Geographically Weighted Regression. Journal of the Royal Statistical Society: Series D (The Statistician) 47 (3):431–443.

Bureau, U. C. 2021. Centers of Population. Census.gov. https://www.census.gov/geographies/reference-files/time-series/geo/centers-population.html (last accessed 7 February 2024).

———. 2023. American National Standards Institute (ANSI), Federal Information Processing Series (FIPS), and Other Standardized Geographic Codes. Census.gov. https://www.census.gov/library/reference/code-lists/ansi.html (last accessed 7 February 2024).

———. 2024. TIGER/Line Shapefiles. Census.gov. https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html (last accessed 7 February 2024).

CDPHE Composite Selected Health Outcome Dataset (Census Tract). 2022. https://data-cdphe.opendata.arcgis.com/datasets/cdphe-composite-selected-health-outcome-dataset-census-tract/explore (last accessed 7 February 2024).

Fotheringham, A. S., W. Yang, and W. Kang. 2017. Multiscale Geographically Weighted Regression (MGWR). Annals of the American Association of Geographers 107 (6):1247–1265.

GISGeography. 2017. What is Tobler’s First Law of Geography? GIS Geography. https://gisgeography.com/tobler-first-law-of-geography/ (last accessed 7 February 2024).

Kögel, T. 2017. Simpson’s paradox and the ecological fallacy are not essentially the same: The example of the fertility and female employment puzzle. https://www.researchgate.net/publication/301300241.

Sachdeva, M., and A. S. Fotheringham. 2023. A Geographical Perspective on Simpson’s Paradox. Journal of Spatial Information Science (26):1–25.

US EPA. 2023. TRI Basic Data Files: Calendar Years 1987-Present. https://www.epa.gov/toxics-release-inventory-tri-program/tri-basic-data-files-calendar-years-1987-present (last accessed 7 February 2024).