Structural Equation Modeling in R
https://www.azandisresearch.com/2024/07/24/structural-equation-modeling-in-r/
Wed, 24 Jul 2024

Introduction

Today we are talking about structural equation models (SEM). There are lots of synonyms, sub-categories, and adjacent techniques that you may have heard before: covariance structure analysis, linear structural relations, path analysis, latent variable modeling, causal modeling, confirmatory factor analysis, exploratory factor analysis, latent growth modeling, mixture/multigroup/hierarchical/multilevel structural modeling, construct analysis, etc., etc…

So many names exist because there are a LOT of things that you can do with these types of models. It turns out that allowing for and testing structure in modeling can help solve lots of problems in research.

I think that the best way to understand SEMs is to start with simple regressions. Let’s consider a regression predicting a rooster crowing from the rising of the sun:

y = b0 + b1x + ε

y ~ Rooster Crowing,
x ~ Sun Rising,
b0 = 0,
ε ~ N(0, 1)

This is a silly regression because we all understand the relationship: the rooster goes cock-a-doodle-do when the sun crests the horizon. But in the language of mathematics, there is no reason I can’t rewrite this equation as:

x = ( y − ε ) / b1

This formulation makes no sense. Basically, we are saying that the rising of the sun is a function of the rooster crowing! Even though this is totally mathematically viable, it defies our common sense of causation.

The language of structural equation modeling offers a way to impose directional structure on mathematical equations. We usually visualize that language as directed graphical models like the one below.

In a graphical model, observed variables are displayed as boxes. Unobserved or latent variables are displayed as circles. Constants are triangles. Functional relationships between variables are displayed as directional arrows. Non-directional (double-headed) arrows indicate a variance or covariance.

The graphical model above is the same model as the regression in the equation above. Our independent variable x has a linear relationship with the dependent variable y with slope parameter b1. y has a constant intercept b0 of 0. Finally, the residual variation in y not caused by x is assumed to come from some other unobserved cause with its own variance. Rather than thinking of variables as independent or dependent, we use the terms exogenous and endogenous. Exogenous variables are those (like x) that have no incoming paths; these are the most ‘upstream’ variables in the causal paths. Endogenous variables are those that receive causal paths. We’ll see later that some endogenous variables can also be causes of other endogenous variables, but they are still considered endogenous.

In practice, we generally ignore the peripherals and intercepts in structural models, yielding a simplified graph:

Now that we have a shared language, let’s take a look at a toy example to understand why SEMs can be so useful in research.

Motivating example

You will need a handful of packages for this tutorial:

packages_for_sem_workshop <-
  c(
    'tidyverse', # basic data wrangling
    'tidygraph', # graph visualization
    'ggraph', # graph visualization
    'lavaan', # sem tools
    'piecewiseSEM', # sem tools
    'mgcv', # nonlinear modeling
    'lme4', # random effect modeling
    'cvsem' # cross-validating sems
    )

install_and_load_packages <-
  function(x){
    for( i in x ){
      if( ! require( i , character.only = TRUE ) ){
        install.packages( i , dependencies = TRUE )
        library( i , character.only = TRUE )
      }
    }
  }

install_and_load_packages(packages_for_sem_workshop)

Fitting a simple regression

I find that, for most applied researchers, the language of code is more intuitive than the language of math. So, let’s simulate a toy dataset and see how we can fit the same system as a linear model or an SEM.


simple_ex <-
  data.frame(
    x = runif(n = 100, min = 0, max = 10),
    e = rnorm(n = 100, 0, 1)
  )

simple_ex <- simple_ex %>%
  mutate(
    y = 1 + 2.5*x + e
  )

Now, let’s fit a good old-fashioned linear regression model:

fit_simple_ex_lm <- 
  lm(y ~ x, data = simple_ex)

summary(fit_simple_ex_lm)
> summary(fit_simple_ex_lm)

Call:
lm(formula = y ~ x, data = simple_ex)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.61653 -0.49110 -0.01622  0.51680  2.76976 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.08641    0.17916   6.064 2.49e-08 ***
x            2.52441    0.02871  87.913  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8972 on 98 degrees of freedom
Multiple R-squared:  0.9875,	Adjusted R-squared:  0.9874 
F-statistic:  7729 on 1 and 98 DF,  p-value: < 2.2e-16

We can fit the same model using the lavaan package. The syntax for lavaan is to pass the list of model formulas as a single character string, one formula per line.

simple_ex_sem <-
  '
  y ~ x
  y ~ 1
'

fit_simple_ex_sem <-
  sem(model = simple_ex_sem,
      data = simple_ex)

summary(fit_simple_ex_sem)
> summary(fit_simple_ex_sem)
lavaan 0.6-18 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         3

  Number of observations                           100

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  y ~                                                 
    x                 2.524    0.028   88.805    0.000

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .y                 1.086    0.177    6.125    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .y                 0.789    0.112    7.071    0.000

lavaan gives us an informative output. We see that the parameters are estimated with ‘ML’ (maximum likelihood), which differs from lm, which estimates via ordinary least squares (OLS). This means that estimates might differ slightly. We will trust the default optimizer ‘NLMINB‘.

We see that the model is estimating 3 parameters–the regression slope, the intercept, and the residual variance. Note that we typically do not estimate the intercepts in SEMs (I’m doing so here for continuity). To fit the SEM without the intercept, simply remove the 'y ~ 1' from the model list.

We will pass over the Model Tests and focus on the parameter estimates. The slope estimate of 2.524 and the intercept estimate of 1.086 perfectly match our lm estimates. We don’t get an estimate of the residual variance from lm, but we can extract the residuals from the lm model and calculate the variance ourselves.

fit_simple_ex_lm %>%
  resid() %>%
  var()
> fit_simple_ex_lm %>%
+   resid() %>%
+   var()
[1] 0.7968917

That value is slightly different because the two estimators use different denominators: lavaan’s ML estimate divides the residual sum of squares by n, while var() divides by n − 1.
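We can reconcile the two estimates by hand. Here is a quick sketch (re-simulating the toy data with an arbitrary seed, so the exact numbers differ from the output above): rescaling the n − 1 denominator used by var() to the n denominator used by ML reproduces the ML-style estimate exactly.

```r
# Re-simulate the toy data (arbitrary seed; values will differ from the output above)
set.seed(1)
simple_ex <- data.frame(x = runif(100, 0, 10), e = rnorm(100))
simple_ex$y <- 1 + 2.5 * simple_ex$x + simple_ex$e

fit <- lm(y ~ x, data = simple_ex)
n <- nrow(simple_ex)

# ML residual variance (what lavaan reports): RSS / n
ml_var <- sum(resid(fit)^2) / n

# var() divides by n - 1; rescaling makes the two match exactly
all.equal(ml_var, var(resid(fit)) * (n - 1) / n)
```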

Fitting complex systems

Now that we’ve seen how linear regression models can be fit as a special case of SEM, let’s take a look at an example that shows where SEM surpasses linear regression.

Again, we will use simulated data, but we’ll download the data so that we can’t see the underlying structure used to generate it. Instead, we’ll use SEMs to determine the structure.

source("https://raw.githubusercontent.com/andisa01/202407_SEM_turorial/main/scripts/SEM_tutorial_example_source.R")

The traditional approach

First, let’s analyze this data using linear regression with stepwise variable selection, the technique used by almost half of all ecological and animal behavior researchers (including my past self). In this methodology, we start with a ‘full’ model including all variables and their interactions. Then, we drop non-significant parameters, starting with the least parsimonious, and refit the model until only significant parameters remain.

# Fit the full or 'global model'
mod_ex01_full <- lm(Y ~ X1 + X2 + X1:X2, data = example01_data_anon)
summary(mod_ex01_full)
> summary(mod_ex01_full)

Call:
lm(formula = Y ~ X1 + X2 + X1:X2, data = example01_data_anon)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5935  -3.7755   0.1861   3.6929  13.7593 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.163110   7.944663  -0.146    0.884
X1          -0.264413   0.387296  -0.683    0.496
X2           0.155742   0.109397   1.424    0.156
X1:X2        0.002022   0.003950   0.512    0.609

Residual standard error: 5.066 on 196 degrees of freedom
Multiple R-squared:  0.144,	Adjusted R-squared:  0.1309 
F-statistic: 10.99 on 3 and 196 DF,  p-value: 1.06e-06

None of our parameters are significant, so we drop the interaction term.

# Drop the interaction
lm(Y ~ X1 + X2, data = example01_data_anon) %>% 
  summary()
> lm(Y ~ X1 + X2, data = example01_data_anon) %>% 
+   summary()

Call:
lm(formula = Y ~ X1 + X2, data = example01_data_anon)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5962  -3.8684   0.1564   3.6213  13.8291 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -4.73378    3.79270  -1.248  0.21346   
X1          -0.08449    0.16216  -0.521  0.60295   
X2           0.19734    0.07308   2.701  0.00753 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.057 on 197 degrees of freedom
Multiple R-squared:  0.1428,	Adjusted R-squared:  0.1341 
F-statistic: 16.41 on 2 and 197 DF,  p-value: 2.555e-07

Now, x2 is significant, but x1 is not. So, we drop x1.

lm(Y ~ X2, data = example01_data_anon) %>% 
  summary()
> lm(Y ~ X2, data = example01_data_anon) %>% 
+   summary() # Drop X1

Call:
lm(formula = Y ~ X2, data = example01_data_anon)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.4915  -3.8082  -0.0172   3.6272  13.5921 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.28441    2.57317  -1.276    0.203    
X2           0.16227    0.02839   5.716 3.97e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.047 on 198 degrees of freedom
Multiple R-squared:  0.1416,	Adjusted R-squared:  0.1373 
F-statistic: 32.67 on 1 and 198 DF,  p-value: 3.966e-08

And, now we have a minimum adequate model to explain the system! We can explain 14% of the variation of y with x2. For every 1 unit increase in x2, we expect y to increase by 0.16. And look at that p-value! Let’s go publish a paper!!!

In fact, this method of stepwise variable selection is so common that you can do it all in one command with dredge() from the MuMIn package.

MuMIn::dredge(mod_ex01_full)
> MuMIn::dredge(mod_ex01_full)
Fixed term is "(Intercept)"
Global model call: lm(formula = Y ~ X1 + X2 + X1:X2, data = example01_data_anon)
---
Model selection table 
   (Int)       X1     X2    X1:X2 df   logLik   AICc delta weight
3 -3.284          0.1623           3 -606.551 1219.2  0.00  0.626
4 -4.734 -0.08449 0.1973           4 -606.413 1221.0  1.81  0.254
8 -1.163 -0.26440 0.1557 0.002022  5 -606.279 1222.9  3.64  0.101
2  4.870  0.31890                  3 -610.048 1226.2  6.99  0.019
1 11.280                           2 -621.824 1247.7 28.49  0.000
Models ranked by AICc(x) 

The complexity of complex systems

The problem with the linear regression approach is that there is an inherent assumption that the independent variables are, well… independent. But, in natural systems, there are almost always relationships between the independent variables. For instance, here are just a few structures that could underlie the relationships between y, x1, and x2.

The model structure we derived in our stepwise regression model is the single effect graph in the top left. The model implied by the multiple regression model is a ‘common effect’ structure where the exogenous variables are uncorrelated–in other words x1 and x2 have independent effects on y.

But other structures could exist that we cannot easily capture in regression. For instance, x1 and x2 might have independent effects on y but remain correlated with each other (common effect with correlation (bottom right)). Or x1 may have no direct effect on y, but may affect x2, which affects y in turn (this is called a chain or fully mediated effect (top middle)). Or x1 might directly affect y in addition to the indirect effect (partial mediation (top right)).

x2 may not have any effect on y at all, but may still covary because both are directly affected by x1 (common cause (bottom left)).

Using SEMs to compare structural hypotheses

Given all the possible structures, how can we ever know which governs our study system? Well, the first pass is to use common sense. If x1 is height and x2 is weight, the principle of allometry should have us exclude any model without a relationship between them.

The second pass is to use knowledge of your system from the literature. For example, if x1 is maternal genotype and x2 is F1 genotype (of the progeny), there can be no direct effect of the maternal genotype on y if the species has an annual lifecycle, but the effect might be partially mediated for perennial species.

Once you have a set of plausible structural hypotheses, we can use the mechanics of SEM to ask which structure best fits the data.

For now, we will assume that all six hypothesized structures above are plausible. We’ll fit each in turn. To do so, I’ll introduce a new operator, ~~, which denotes the undirected covariation between two variables (i.e., covariance) or of a variable with itself (i.e., variance).

# single effect
ex01_formula_x2effect <- '
Y ~ X2
X1 ~~ X1
'

ex01_sem_x2effect <- sem(ex01_formula_x2effect, data = example01_data_anon)

summary(ex01_sem_x2effect)
> summary(ex01_sem_x2effect)
lavaan 0.6-18 ended normally after 8 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         3

  Number of observations                           200

Model Test User Model:
                                                      
  Test statistic                               377.733
  Degrees of freedom                                 2
  P-value (Chi-square)                           0.000

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  Y ~                                                 
    X2                0.162    0.028    5.745    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    X1               32.093    3.209   10.000    0.000
   .Y                25.220    2.522   10.000    0.000

We won’t use the summary output for now, so I will exclude it when fitting the rest of the models.

# chain
ex01_formula_chain <- '
X2 ~ X1
Y ~ X2
'

ex01_sem_chain <- sem(ex01_formula_chain, data = example01_data_anon)
# partially mediated
ex01_formula_mediator <- '
X2 ~ X1
Y ~ X2
Y ~ X1
'

ex01_sem_mediator <- sem(ex01_formula_mediator, data = example01_data_anon)
# common cause
ex01_formula_commoncause <- '
X2 ~ X1
Y ~ X1
'

ex01_sem_commoncause <- sem(ex01_formula_commoncause, data = example01_data_anon)
# common effect (uncorrelated)
ex01_formula_commoneffect <- '
Y ~ X1
Y ~ X2
'

ex01_sem_commoneffect <- sem(ex01_formula_commoneffect, data = example01_data_anon)
# common effect (correlated)
ex01_formula_commoneffect2 <- '
Y ~ X1
Y ~ X2
X1 ~~ X2
'

ex01_sem_commoneffect2 <- sem(ex01_formula_commoneffect2, data = example01_data_anon)

We’ll use the anova() command to get some summary statistics on all of the models.

anova(
  ex01_sem_chain, 
  ex01_sem_commoneffect, 
  ex01_sem_commoncause, 
  ex01_sem_x2effect, 
  ex01_sem_commoneffect2, 
  ex01_sem_mediator
)
Chi-Squared Difference Test

                       Df    AIC    BIC    Chisq Chisq diff RMSEA Df diff Pr(>Chisq)    
ex01_sem_commoneffect   0 1218.8 1228.7   0.0000                                        
ex01_sem_commoncause    0 2425.5 2442.0   0.0000       0.00 0.000       0               
ex01_sem_commoneffect2  0 3688.8 3708.6   0.0000       0.00 0.000       0               
ex01_sem_mediator       0 2425.5 2442.0   0.0000       0.00 0.000       0               
ex01_sem_chain          1 2423.8 2437.0   0.2754       0.28 0.000       1     0.5997    
ex01_sem_x2effect       2 2480.4 2490.3 377.7330     377.46 1.372       1     <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
lavaan->lavTestLRT():  
   some models have the same degrees of freedom 

The models are not mutually nested, so we can’t interpret the chi-squared test. However, we can use the two information criteria (AIC and BIC) since those are agnostic to the particular models in comparison. Both AIC and BIC indicate that the uncorrelated common effect structure is the most likely model.

That’s not the end of the story, though…

Cross-validating SEMs

As an applied data scientist, I’ve become very skeptical of parametric goodness-of-fit metrics. After all, I can fit a neural net that perfectly fits any dataset by memorizing it. What I am really interested in is how well a model can make accurate predictions.

Ideally, we would fit our model on one dataset and then collect separate data to validate its predictive accuracy. This is called cross-validation. Given the limitations of field research, however, a more common approach is to split your single dataset into two groups: one for fitting and one for validation.

The simple form of cross-validation only gives us one chance to measure accuracy (or lack thereof, i.e., error). To maximize the use of our data and get better estimates, it is standard practice to split the dataset into a number of evenly sized, randomly sampled subsets (i.e., folds). Then we can perform the cross-validation as many times as we have folds. This is called k-fold cross-validation.
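Before handing things over to cvsem, the mechanics of a k-fold split are worth seeing on their own. This is a generic sketch with a hypothetical data frame df, not the example data:

```r
# Generic k-fold split: assign each row to one of k folds at random
set.seed(123)
df <- data.frame(x = rnorm(50))
k  <- 10
folds <- sample(rep(1:k, length.out = nrow(df)))

# Each fold serves once as the held-out validation set
for (i in 1:k) {
  train <- df[folds != i, , drop = FALSE]
  valid <- df[folds == i, , drop = FALSE]
  # ...fit the model on `train`, score its predictions on `valid`...
}
```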

We can use the library cvsem to easily conduct a 10-fold cross validation on all of our models.

# Cross validating
models_to_cv <- 
  cvgather(
    ex01_sem_chain, 
    ex01_sem_commoneffect, 
    ex01_sem_commoncause, 
    ex01_sem_x2effect, 
    ex01_sem_commoneffect2, 
    ex01_sem_mediator
    )

cvsem(
  data = example01_data_anon,
  Models = models_to_cv,
  k = 10
)
Cross-Validation Results of 6 models 
based on  k =  10 folds. 

                   Model E(KL-D)   SE
1         ex01_sem_chain    0.32 0.16
3   ex01_sem_commoncause    0.33 0.16
6      ex01_sem_mediator    0.33 0.16
4      ex01_sem_x2effect    5.90 1.32
5 ex01_sem_commoneffect2    7.16 1.23
2  ex01_sem_commoneffect    7.20 1.23

Both of the ‘common effect’ structures exhibit the worst predictive performance. They have the highest errors (by default, the error is calculated via KL-divergence: the difference between the covariance matrix of the validation set and the covariance matrix implied by the fitted model). To be fair, a very quick look at the relationship between x1 and x2 could have told us that any model that fails to account for the relationship between these two variables is not realistic.

The next highest error comes from the model suggested by our initial stepwise regression method. Thus, based on predictive performance, we can conclusively exclude these hypothesized structures.

The chain, common cause, and mediator structures all have similar predictive accuracy. How can we decide which is the true structure of our system? Unfortunately, there is no statistical method to differentiate them with our data. However, we can still use the structural model to help us design further experiments to select the final model. For example, we could envision conducting an experiment where we held x2 constant in one set of plots and let it vary in another.

If the system were a common cause structure, holding x2 constant would cause no difference in the y values between plots. However, if the system were a fully mediated chain through x2, holding x2 would completely decouple the association between x1 and y, whereas controlling x2 in a partially mediated system would only attenuate the covariation between x1 and y.
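We can preview the logic of that hypothetical experiment with a quick simulation. This sketch uses made-up effect sizes and plain correlations, not the tutorial’s data: under a chain structure, fixing x2 severs the x1–y association.

```r
set.seed(7)
n  <- 1000
x1 <- rnorm(n)

# Chain structure: x1 -> x2 -> y (made-up coefficients)
x2_free <- 0.8 * x1 + rnorm(n)   # x2 allowed to vary
y_free  <- 2 * x2_free + rnorm(n)

x2_held <- 0                     # x2 held constant across plots
y_held  <- 2 * x2_held + rnorm(n)

cor(x1, y_free)   # substantial association, carried by the mediator
cor(x1, y_held)   # the association vanishes once x2 is fixed
```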

In our toy example, the true causal structure will be easy to diagnose when I tell you what the variables really are:

y ~ Shark attacks,
x1 ~ Weather,
x2 ~ Ice cream sales  

Estimating SEM coefficients

You might be tempted to take the SEM for the common cause structure and use that in your paper. I’d advise against it. Since we used all of our data in estimating the structure of the system, we don’t want to reuse the same data to estimate parameter coefficients. That would be double-dipping on our data. Having our cake and eating it too.

Instead, we need to either collect additional data (ideal) or reserve a split of our original data to use as the final fitting data. Since this is a simulated dataset, we can ‘collect’ all of the new data we want.

set.seed(666)

n <- 200

weather <- rnorm(n, mean = 20, sd = 5)

ice_cream_sales <- 50 + 2 * weather + rnorm(n, mean = 0, sd = 5)

shark_attacks <- 5 + 0.3 * weather + rnorm(n, mean = 0, sd = 5)

example01_newdata <- data.frame(shark_attacks, weather, ice_cream_sales)
# common cause
ex01_new_commoncause <- '
ice_cream_sales ~ weather
shark_attacks ~ weather
# shark_attacks ~ 1
'

ex01_sem_new_commoncause <- sem(ex01_new_commoncause, data = example01_newdata)

summary(ex01_sem_new_commoncause)
> summary(ex01_sem_new_commoncause)
lavaan 0.6-18 ended normally after 10 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

  Number of observations                           200

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                    Estimate  Std.Err  z-value  P(>|z|)
  ice_cream_sales ~                                    
    weather            2.131    0.065   32.853    0.000
  shark_attacks ~                                      
    weather            0.419    0.061    6.852    0.000

Covariances:
                     Estimate  Std.Err  z-value  P(>|z|)
 .ice_cream_sales ~~                                    
   .shark_attacks       3.026    1.644    1.841    0.066

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .ice_cream_sals   24.452    2.445   10.000    0.000
   .shark_attacks    21.727    2.173   10.000    0.000

Our coefficient estimates are close, but not spot on the true parameter values we simulated. The reason is that the simulated errors are pretty high (sd = 5). Rather than fitting the final model once, we can borrow another technique from machine learning and repeatedly fit bootstrap replicates. This will both stabilize our estimates and provide a convenient, non-parametric confidence interval.
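As a preview, here is a minimal sketch of nonparametric bootstrapping using a plain regression; the same resampling loop applies to a lavaan sem() fit (I use lm() and simulated data here only to keep the example self-contained).

```r
# Simulate data with a known slope of 2 (arbitrary seed)
set.seed(42)
d <- data.frame(x = rnorm(200))
d$y <- 2 * d$x + rnorm(200)

# Refit on 500 resamples (drawn with replacement) and keep the slope
boot_slopes <- replicate(500, {
  idx <- sample(nrow(d), replace = TRUE)
  coef(lm(y ~ x, data = d[idx, ]))[["x"]]
})

# Percentile confidence interval from the bootstrap distribution
quantile(boot_slopes, c(0.025, 0.975))
```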

Bootstrapping SEM coefficients

Content to come!

Extensions to SEM

Content to come!

Latent variables

Content to come!

Composite variables

Content to come!

Non-linear relationships

Content to come!

Hierarchical relationships

Content to come!

Background

This tutorial was originally developed for a graduate workshop at Yale University on 2024 07 25.

Text analysis using AI in R
https://www.azandisresearch.com/2023/10/05/text-analysis-using-ai-in-r/
Fri, 06 Oct 2023

Introduction

Analyzing qualitative data is challenging. Such analyses are even more difficult when the topic is controversial and the results will drive important policy decisions. This post explores AI methods for qualitative research, using chatGPT for categorization, embeddings to find hidden topics, and long-context summarization with Claude2 on a case study analyzing free-text public comments to a controversial Environmental Impact decision.

Background

Quite a while ago, I detailed why replacing wolves on Isle Royale National Park was a bad policy decision backed by even worse science. Since then, the National Park Service (NPS) decided to commit to wolf replacement anyway, dropping 19 new wolves on the island in 2018 and 2019. The results were expected. The new wolves killed the last original male wolf in 2019, almost certainly ensuring that the new wolf population will be genetically disconnected from the prior population. Of the 20 wolves that NPS attempted to relocate, one died before making it to the island, one voluntarily crossed the ice back to the mainland*, and four others died by the end of 2019. The surviving 14 wolves successfully bred and the population now stands at 31. So, in the end, we have a new, synthetic wolf population that is entirely disjunct from a genetic and ecological perspective. As I predicted in my original post: “in reality, this is not a genetic rescue project, it is a genetic replacement project,” which violates both the scientific and management purpose of the Park.

* This contradicts one of the primary justifications for replacing the wolves. Proponents argued that the lack of ice due to climate change would make natural repopulation impossible.

But neither science nor policy drove NPS’s decision. Management of charismatic mammals, especially in a well-known National Park, is largely a matter of public sentiment. In fact, it is a codified part of the decision process. Federal managers are required to seek public comments as part of the NEPA process.

In general, I am a huge supporter of public voices in important conservation decisions (I’ve even written papers advocating for it). But, sometimes I worry about how advocacy groups can skew the perception of organic public sentiment. That’s what I’d like to analyze in this post.

All of the public comments submitted to NPS on the Isle Royale wolf-moose management plan are public record. You can download and read all 1117 pages of comments.

But 1117 pages is a lot of text to read and digest. In this post, I want to show how you can easily process lots of text using AI (both generative large-language models (LLM), like chatGPT, and LLM embeddings) to make quantitative (or semi-quantitative) analyses.

Basic analyses

Visit my GitHub repo for this project for a fully reproducible analysis.

First, we’ll set up the environment and load in necessary packages.

# Load libraries
library(pdftools) # We will use 'pdftools' to convert the pdf to plain text
library(tidyverse)
library(stringr)
library(RColorBrewer)

# Set up the directory structure:
make_new_dir <- 
     function(DIR_TO_MAKE){
          if(dir.exists(DIR_TO_MAKE) == FALSE){
               dir.create(DIR_TO_MAKE)
          }else{
               print("Directory exists")
          }
     }

make_new_dir("./data/")
make_new_dir("./figs/")

We can download the comments from the NPS website.

download.file(
     url = "https://parkplanning.nps.gov/showFile.cfm?projectID=59316&MIMEType=application%252Fpdf&filename=ISRO%5FMWVPlan%5FAllCorrespondence%5FPEPC%2Epdf&sfid=232552",
     destfile = "./data/ISRO_MWVPlan_AllCorrespondence_PEPC.pdf",
     mode = "wb"
)

The first step to analyze the public comments is to parse the pdf into text. This is a tedious process. I won’t show it here, but you can follow all of the steps on my GitHub repo for this project.

Example public comment from the downloaded PDF.

You can download my pre-processed dataset to short-cut the PDF parsing steps.

download.file(
     url = "https://www.azandisresearch.com/wp-content/uploads/2023/09/EIS_comments.csv",
     destfile = "./data/EIS_comments.csv"
)

EIS_comments <- read.csv("./data/EIS_comments.csv")

The formatting follows the same structure for every comment. I’ve extracted the ‘Comment ID’, ‘Received’ datetime, ‘Correspondence Type’, and ‘Correspondence’ text into a dataframe. I’ve also truncated the longest comments (…comment 68 looks like someone copied and pasted their term paper) to 12,000 characters. This will be important later because the context window for chatGPT is 4,000 tokens.

EIS_comments %>% glimpse()
Rows: 2,776
Columns: 4
$ ID             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,…
$ Received       <dttm> 2015-07-12 20:45:30, 2015-07-14 23:18:34, 2015-07-15 12:03:55, 2015-07-15 13:14:52, 2015-07-15 13:35:47, …
$ Correspondence <chr> "Web Form Correspondence", "Web Form Correspondence", "Web Form Correspondence", "Web Form Correspondence"…
$ Content        <chr> "The alternatives are complete enough as a starting point. The issues will be related to the details. The …
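The truncation step mentioned above is simple string surgery. Here is a sketch with dummy text (the column name Content matches the dataframe above; the data are made up):

```r
# Truncate any comment longer than 12,000 characters
dummy <- data.frame(
  Content = c(strrep("a", 15000), "short comment"),
  stringsAsFactors = FALSE
)
dummy$Content <- substr(dummy$Content, 1, 12000)
nchar(dummy$Content)  # 12000 13
```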

We can do some basic summary analysis on these initial variables. Most comments were submitted in the week before the comment deadline on Sept 1. The vast majority of comments were received through the web form. Fewer than 10% of comments were physical letters, and 51 of the 2,777 comments were form cards given to Park visitors.

Area plot of the cumulative total comments over time for each correspondence type.

Often, large influxes of web and email comments are the product of advocacy groups encouraging their members to submit pre-written comments. I’ve used this tactic myself in conservation campaigns, so I won’t cast aspersions. But, I’ll also be the first to admit that a copy-and-pasted form letter is far less sincere than a uniquely crafted opinion.

After checking for matches among the comments, it is clear that there were two archetypal pre-written texts. These include 733 near-identical comments in favor of wolf replacement (i.e., Alternative B), likely from the National Parks Conservation Association:

EIS_comments %>%
  filter(grepl("I care about the wildlife at our national parks, including the wolves and moose at Isle Royale. Right now there are only three", Content)) %>%
  group_by(Content) %>%
  tally() %>%
  arrange(desc(n)) %>%
  ungroup() %>%
  filter(row_number() == 1) %>%
  .$Content %>% 
  cat()
Dear Superintendent Green, I care about the wildlife at our national parks, including the wolves and moose at Isle Royale. Right now there are only three wolves left at the park- -the lowest number of wolves in more than 50 years- -threatening the overall ecosystem health of this iconic national park. I support management Alternative B to bring new wolves to the island, but urge the Park Service to do this as needed, rather than one time only. Without wolves, the moose population on the island will continue to increase, eating until the food sources are gone. If we bring new wolves to the island, they will help keep the moose population from rapidly expanding and minimize impacts to the native vegetation. This option is much less intrusive in this wilderness park than culling moose, removing moose from the island, or having to replant native vegetation once the moose consume it. As stewards of this park, the National Park Service should take the least intrusive action that results in the biggest benefit to the island's wildlife and ecosystem. I support the Park Service taking action to bring new wolves to the park immediately, before the population vanishes altogether. Thank you for considering my concerns. Sincerely,

And 55 nearly identical comments in favor of Wilderness (i.e. Alternative A), likely from Wilderness Watch:

EIS_comments %>%
     filter(grepl("Isle Royale's wilderness designation requires that we protect the area's unmanipulated, untrammeled wilderness character. Wild", Content)) %>%
     group_by(Content) %>%
     tally() %>%
     arrange(desc(n)) %>%
     ungroup() %>%
     filter(row_number() == 1) %>%
     .$Content %>%
     cat()
Isle Royale's wilderness designation requires that we protect the area's unmanipulated, untrammeled wilderness character. Wilderness designation means we let Nature call the shots. Transplanting wolves from the mainland to Isle Royale is a major manipulation of the Isle Royale Wilderness and must not be done. Alternative Concept A, the No Action Alternative, is the best alternative to protect Isle Royale's unmanipulated, untrammeled wilderness character.

It is important to flag these duplicated comments because the methods that we will use later on will not behave correctly with nearly identical strings.

EIS_comments_deduplicated <- 
     EIS_comments %>%
     # Remove comments with no content
     filter(!is.na(Content)) %>%
     # Flag the web form duplicates
     mutate(form_duplicate = ifelse(grepl("I care about the wildlife at our national parks, including the wolves and moose at Isle Royale. Right now there are only three", Content), "for Alt B", NA)) %>%
     mutate(form_duplicate = ifelse(grepl("Isle Royale's wilderness designation requires that we protect the area's unmanipulated, untrammeled wilderness character. Wild", Content), "for Alt A", form_duplicate)) %>%
     # Form duplicates are not exact matches
     mutate(Content_dup = ifelse(is.na(form_duplicate), Content, form_duplicate)) %>%
     group_by(Content_dup) %>%
     # Retain one of the duplicate sets
     slice_sample(n = 1)

After removing the duplicates and cleaning the data, we are left with 1970 unique comments.
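As a quick sanity check, we can confirm that each form letter survives as just one representative row. This is a sketch using the flag column defined above:

```r
# Tally the remaining rows by duplicate class: each form letter should
# appear once, and everything else should be flagged NA (unique comments)
EIS_comments_deduplicated %>%
     ungroup() %>%
     count(form_duplicate)
```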

Text analysis with chatGPT

Now, we can start analyzing the content. There are many ways that we could do this, depending on the question we want to answer. For instance, maybe we want to see which comments naturally group together so we can find common themes. Traditionally, a common way to do this type of natural language processing would be to use an approach like Latent Dirichlet allocation (LDA) topic analysis, which groups comments by the tf-idf values of the stems of the words they contain. (I cover tf-idf in a previous post.) But one problem with this approach is that the context of words is lost.

If we want to capture the context of the text, we might try using word embeddings from a LLM like GPT. We’ll try this approach later.

In our case, maybe we just want to know how many comments support a given policy. It would be hard to answer that from the embeddings ourselves, but we can treat GPT as an agent who reads and categorizes comments by preferred policy alternative.

We’ll use two packages: httr helps us interact with the chatGPT API, and since the API speaks JSON, jsonlite helps us parse formatted prompts and responses.

library(httr)
library(jsonlite)

Working with chatGPT is a lot like working with a new intern. Like a new intern, it has no prior contextual understanding of our specific task–we have to be very explicit with our directions. On the bright side, our chatGPT intern has endless patience and never sleeps!

We will be interacting with chatGPT through the API. This differs from the conversational way that most people interact with chatGPT. We need to engineer our prompt to get a robust response in exactly the same format, every time. We can do that by passing quite a bit of context into our prompt and giving specific directions for the output, with examples. Here is the prompt we’ll use:

You are a federal employee tasked with reading the following comment submitted by a member of the public in response to the The Isle Royale National Park Moose-Wolf-Vegetation Management Plan/EIS. The Plan/EIS is a document that evaluates management alternatives for the moose and wolf populations on the island National Park land.
Management alternatives include:

- Alternative A: No Action. Continue the current management of letting nature take its course, without any intervention or manipulation of the moose or wolf populations or their habitats.
- Alternative B: Immediate Wolf Introduction. Introduce 20-30 wolves over a three-year period, starting as soon as possible to reduce the moose population and its impacts on vegetation.
- Alternative C: Wolf Introduction after Thresholds are Met. Introduce wolves if certain thresholds are met, such as the extirpation of wolves, the overabundance of moose, or the degradation of vegetation. The number and timing of wolf introductions would depend on the conditions at the time.
- Alternative D: Moose Reduction and Wolf Assessment. Reduce the moose population by lethal and non-lethal means, such as hunting, contraception, or relocation. The goal would be to lower the moose density to a level that would allow vegetation recovery and assessing introducing wolves to the island in the future.

Here is the text of the public comment: '[INSERT COMMENT TEXT]'.

State which alternative the commenter is most likely to favor (A, B, C, D).
State if the comment is 'For', 'Against', or 'Neutral' on wolf introductions.
State if the strength of the commenter's opinion on a scale from 'Extremely strong', 'Very strong', 'Strong', 'Somewhat strong', or 'Mild'.

Produce the output in json format like this:
{
"favored_alternative": "",
"wolf_opinion": "",
"opinion_strength": ""
}

ChatGPT 3.5 costs $0.002 per 1,000 tokens. We can use the OpenAI tokenizer to estimate the number of tokens constituting our input prompt.

Example output from OpenAI's tokenizer for our prompt.

Our input is 420 tokens. The output should be less than 50 tokens, so we can round up and assume 500 tokens per query. That means it will cost us about $1 to process 1,000 comments. Much cheaper than paying a human!
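The arithmetic behind that estimate is simple enough to sketch in R (the rate and token counts are the assumptions stated above):

```r
# Back-of-the-envelope cost estimate (assumed gpt-3.5-turbo pricing at the time)
cost_per_1k_tokens <- 0.002  # USD per 1,000 tokens
tokens_per_query   <- 500    # ~420 prompt tokens + <50 response tokens, rounded up
n_comments         <- 1000

total_cost <- n_comments * (tokens_per_query / 1000) * cost_per_1k_tokens
total_cost
# 1, i.e. about $1 for 1,000 comments
```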

In the old days, you could pass a list of inputs into chatGPT’s ‘completions’ model all at once. This is no longer possible. Now, using the ‘chat/completions’ API requires looping through the inputs and making individual requests. Unfortunately, the API often fails or hits the request rate limit, so we need to be smart about staging and error handling within this larger loop. Each iteration defines the prompt, waits 18 seconds to avoid the rate limit, and runs a tryCatch block to test whether the API call fails. If it fails, the loop logs the offending record and skips to the next one; otherwise, it parses the response and appends the output to a file.

After getting initial responses, I also want to rerun 500 randomly selected comments in order to check chatGPT’s consistency. This is a critical part of using a generative model in quantitative analysis. I’ll talk more about this later.

Here’s the loop. It will take quite a while depending on your rate limit. I’d suggest either running it overnight or putting it on a remote server. Because we write each response out to file, there’s no problem if it fails. Just note the number of the last successful iteration (which will be printed to the screen) and start back up there.

set.seed(7097)

# Randomly select 500 records to resample
IDs_to_resample <- sample(unique(EIS_comments_deduplicated$ID), 500, replace = FALSE)
ID_list <- c(unique(EIS_comments_deduplicated$ID), IDs_to_resample)

# Create a vector to store failed IDs
failed_ids <- c()

# If restarting after a failure, overwrite ID_list with the vector of IDs
# that still need processing, e.g.: ID_list <- Still_need_IDs

for (i in 1:length(ID_list)) {
  ID_number = ID_list[i]
  # Define the prompt
  prompt_content <- paste0( "Here is the text of the public comment: '", EIS_comments_deduplicated %>%
        filter(ID == ID_number) %>%
        .$Content,
      "'.
    State which alternative the commenter is most likely to favor (A, B, C, D).
State if the comment is 'For', 'Against', or 'Neutral' on wolf introductions.
State if the strength of the commenter's opinion on a scale from 'Extremely strong', 'Very strong', 'Strong', 'Somewhat strong', or 'Mild'.
Produce the output in json format like this:\n{\n\"favored_alternative\": \"\",\n\"wolf_opinion\": \"\",\n\"opinion_strength\": \"\"\n}"
    )
  
  # Initialize gpt_response
  gpt_response <- NULL
  
  # With my account, I can make 3 requests per minute. To avoid denied API calls, I add a 18 second pause in each loop.
  Sys.sleep(18)
  
  tryCatch({
    # Call GPT for a response
    gpt_response <- 
      POST(
        url = "https://api.openai.com/v1/chat/completions", 
        add_headers(Authorization = paste0("Bearer ", read_lines("../credentials/openai.key"))),
        content_type_json(),
        encode = "json",
        body = list(
          model = "gpt-3.5-turbo",
          messages = list(
            list(
              "role" = "system",
              "content" = "You are a federal employee tasked with reading the following comment submitted by a member of the public in response to the The Isle Royale National Park Moose-Wolf-Vegetation Management Plan/EIS. The Plan/EIS is a document that evaluates management alternatives for the moose and wolf populations on the island National Park land.
Management alternatives include:
- Alternative A: No Action. Continue the current management of letting nature take its course, without any intervention or manipulation of the moose or wolf populations or their habitats.
- Alternative B: Immediate Wolf Introduction. Introduce 20-30 wolves over a three-year period, starting as soon as possible to reduce the moose population and its impacts on vegetation.
- Alternative C: Wolf Introduction after Thresholds are Met. Introduce wolves if certain thresholds are met, such as the extirpation of wolves, the overabundance of moose, or the degradation of vegetation. The number and timing of wolf introductions would depend on the conditions at the time.
- Alternative D: Moose Reduction and Wolf Assessment. Reduce the moose population by lethal and non-lethal means, such as hunting, contraception, or relocation. The goal would be to lower the moose density to a level that would allow vegetation recovery and assessing introducing wolves to the island in the future."
            ),
            list(
              "role" = "user",
              "content" = prompt_content
            )
          )
        )
      )
    print(paste0("API call successful for ID: ", ID_number, ", index: ", i))
  }, error = function(e) {
    # Handle API call errors. Use <<- so the assignment escapes the error
    # handler's local environment and updates the vector defined above.
    cat("API call failed for ID: ", ID_number, ", index: ", i, "\n")
    failed_ids <<- c(failed_ids, ID_number)
  })
  
  # If the API call was successful, proceed with data wrangling and output
  if (!is.null(gpt_response)) {
    # parse the response object as JSON
    content <- content(gpt_response, as = "parsed")
    
    # Assign the ID to the GPT response
    gpt_response_df <- data.frame(
      response_id = ID_number,
      gpt_response = content$choices[[1]]$message$content
    )
    
    # Convert the JSON to a dataframe and join to the record data
    output <- bind_cols( EIS_comments_deduplicated %>%
        filter(ID == ID_number),
      fromJSON(gpt_response_df$gpt_response) %>% 
        as.data.frame()
    ) %>%
      mutate(response_created_time = Sys.time())
    
    # Append the data to the extant records and write the output to a file. (Writing within the loop is a bit less memory efficient, but it means no progress is lost if the loop fails partway through.)
    if (!file.exists("./EIS_GPT_responses.csv")) {
      write.csv(output, "./EIS_GPT_responses.csv", row.names = FALSE)
    } else {
      read.csv("./EIS_GPT_responses.csv") %>%
        mutate(across(everything(), as.character)) %>%
        bind_rows(output %>%
                    mutate(across(everything(), as.character))
        ) %>%
        write.csv("./EIS_GPT_responses.csv", row.names = FALSE)
    }
    
    print(paste0("Completed response ", i))
  }
}

# Log the failed IDs to a file
if (length(failed_ids) > 0) {
  write.csv(data.frame(ID = failed_ids), "./failed_ids.csv", row.names = FALSE)
  cat("Failed IDs logged to 'failed_ids.csv'\n")
}

ChatGPT is nondeterministic, so your responses will differ. You can download the responses I got to follow along.

download.file(
     url = "https://www.azandisresearch.com/wp-content/uploads/2023/09/Final_GPT_Responses.csv",
     destfile = "./data/GPT_output.csv"
)

GPT_output <- read.csv("./data/GPT_output.csv")
GPT_output %>% glimpse()
Rows: 2,470
Columns: 13
$ ID                    <int> 93, 440, 2164, 636, 839, 2335, 36, 487, 1268, 2303, 1781, 60, 1033, 1948, 1826, 1538, 1685, 308, 22…
$ Received              <chr> "7/29/2015 9:09", "8/9/2015 5:14", "8/27/2015 14:36", "8/18/2015", "8/25/2015", "8/28/2015 12:30", …
$ Correspondence        <chr> "Web Form Correspondence", "Web Form Correspondence", "Web Form Correspondence", "Web Form Correspo…
$ Content               <chr> "\"100% o wolves examined since 1994...have spinal anomalies.\"- -Of the six alternatives put forth…
$ form_duplicate        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Content_dup           <chr> "\"100% o wolves examined since 1994...have spinal anomalies.\"- -Of the six alternatives put forth…
$ favored_alternative   <chr> "C", "C", "Alternative D", "C", "C", "B", "C", "C", "D", "C", "Unknown", "C", "B", "A", "B", "A", "…
$ wolf_opinion          <chr> "For", "Against", "Neutral", "For", "Neutral", "For", "For", "For", "Against", "For", "Neutral", "F…
$ opinion_strength      <chr> "Very strong", "Very strong", "Strong", "Strong", "Somewhat strong", "Very strong", "Strong", "Stro…
$ response_created_time <chr> "32:19.2", "33:11.7", "33:16.9", "33:19.5", "34:35.2", "34:54.2", "34:55.4", "36:15.1", "36:16.3", …
$ Favored_alternative   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Wolf_opinion          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Opinion_strength      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

A couple of interesting things to note here. First, I apparently was not specific enough in my instructions for classifying the favored alternative, because chatGPT sometimes returns “Alternative B” instead of just “B”. This is one of the struggles with using a generative model like chatGPT in this way: it strays from instructions just like human survey respondents do when entering free-text answers. For example, common responses to the survey question, “How are you feeling on a scale from 1 (bad) to 10 (good)?” might be “I’m good” or “Okay” or “nine” or “0”. None of those answers fit the instructions, so we have to clean them up.

In the case of chatGPT, we might be able to reduce these errors with more specific prompt engineering. For now, we’ll just clean up the responses on the backend.

# Fix erroneous column names
GPT_output <-
     GPT_output %>%
     mutate(
          favored_alternative = ifelse(is.na(favored_alternative), Favored_alternative, favored_alternative),
          wolf_opinion = ifelse(is.na(wolf_opinion), Wolf_opinion, wolf_opinion),
          opinion_strength = ifelse(is.na(opinion_strength), Opinion_strength, opinion_strength)
          ) %>%
     select(
          -Wolf_opinion,
          -Favored_alternative,
          -Opinion_strength
     )

# There are probably more elegant ways to write generalized rules to classify these responses, but this does the trick
GPT_output <-
     GPT_output %>%
     # Fix 'favored alternative' responses
     mutate(
          favored_alternative_edit = case_when(
               (grepl(" and ", favored_alternative) | grepl(" or ", favored_alternative) | grepl("/", favored_alternative) | grepl("&", favored_alternative) | favored_alternative == "B, D") & !grepl(" and Wolf ", favored_alternative) & !grepl("N/A", favored_alternative) ~ "Multiple",
               grepl("\\bAlternative A\\b", favored_alternative) | favored_alternative %in% c("A", "No Action (A)") ~ "A",
               grepl("\\bAlternative B\\b", favored_alternative) | favored_alternative == "B" ~ "B",
               grepl("\\bAlternative C\\b", favored_alternative) | favored_alternative %in% c("C", "Concept C") ~ "C",
               grepl("\\bAlternative D\\b", favored_alternative) | favored_alternative == "D" ~ "D",
               TRUE ~ "Other"
          )
     ) %>%
     # Fix 'opinion strength' responses
     mutate(opinion_strength = tolower(opinion_strength)) %>%
     mutate(
          opinion_strength_edit = case_when(
               opinion_strength %in% c("strong", "very strong", "mild", "somewhat strong", "extremely strong") ~ opinion_strength,
               TRUE ~ "other"
          )
     ) %>%
     # Fix 'wolf opinion' responses
     mutate(wolf_opinion = tolower(wolf_opinion)) %>%
     mutate(
          wolf_opinion_edit = case_when(
          wolf_opinion %in% c("for", "against", "neutral") ~ wolf_opinion,
          TRUE ~ "other"
          )
     )

Let’s take a look at the results.

Bar chart of the favored alternative expressed in comments as assessed by chatGPT.

We can see that the majority of comments favor Alternative B: immediate wolf introduction. However, if we exclude the duplicated comments, our conclusion shifts to a majority in favor of the more moderate Alternative C: introduce wolves only after certain thresholds are met. Almost no one supports Alternative D: moose reduction and wolf assessment.

Bar chart of opinion strength by favored alternative.

Comments that favored Alternative A were proportionally stronger. Alternative B supporters had mostly strong opinions but very few extremely strong or mild opinions. Supporters of Alternatives C and D were the least opinionated.

Validating chatGPT responses

It is worth asking ourselves how reliable chatGPT is at classifying these responses. One way to test this is to rerun a subset of comments, as we did above, and check for agreement. This is called inter-rater reliability* (IRR).

* Although, maybe it should be called intra-rater reliability in this case. I guess it depends on our definition of ‘individual’ with LLM queries, but that’s a very philosophical can of worms!

First, we need to subset our dataset to the responses that we scored twice.

IRR_comparisons <- 
     GPT_output %>%
     group_by(ID) %>%
     arrange(response_created_time) %>%
     mutate(ID_row_count = row_number()) %>%
     filter(ID_row_count <= 2) %>%
     mutate(n = n()) %>%
     filter(n > 1) %>%
     ungroup()

Then we can see how reliably the favored alternative was scored,

IRR_comparisons %>%
     select(ID, favored_alternative_edit, ID_row_count) %>%
     pivot_wider(
          id_cols = "ID",
          names_from = "ID_row_count",
          values_from = "favored_alternative_edit",
          names_prefix = "val"
     ) %>%
     group_by(val1 == val2) %>%
     tally() %>%
     mutate(
          total = sum(n),
          prop = n/total
     )
# A tibble: 2 × 4
  `val1 == val2`     n total  prop
  <lgl>          <int> <int> <dbl>
1 FALSE              2   500 0.004
2 TRUE             498   500 0.996

ChatGPT gave consistent responses in 498 out of 500 cases. That’s pretty good! Let’s look at the comments where it disagreed with itself.

IRR_comparisons %>%
     select(ID, favored_alternative_edit, ID_row_count) %>%
     pivot_wider(id_cols = "ID", names_from = "ID_row_count", values_from = "favored_alternative_edit", names_prefix = "val") %>%
     filter(val1 != val2)
# A tibble: 2 × 3
     ID val1  val2 
1   288 C     B    
2  1160 B     C    
 
EIS_comments_deduplicated %>%
     filter(ID == 288) %>%
     .$Content %>%
     cat()
There should be a balance between the wolf population and moose. When it is not balanced there is more harm than good done to the environment. Please introduce more wolves on this island instead of decreasing their population and this will keep the moose in check. Please add more wolves to contain the moose population. So many wolves are under attack in other states and decreasing their population is NOT the answer. It only creates more problems to the environment. There should be intense management of the wolf population to help it thrive and return the land back to it's natural state where there are enough moose and wolves. I think the public should be consulted as far as future plans for any culling. There should be intense management to monitor the effects of climate change as this will affect all aspects of wildlife and plant life on the island. I do not like the idea of a moose cull. I like the idea of introducing more wolves to the island so long as there is harmony with the existing wolves on the island. Maybe possibly try to introduce another type of animal that would be a good balance with the wolves and moose but only if it does not disrupt the balance and create new problems. Other states have adopted disastrous wolf culling plans that are only in the interests of farmers and ranchers. As the wolf population is dwindling, other problems will begin to develop as there is not a proper balance. Please keep wolves in mind and do your best to increase their population before it is too late and more animals will be needlessly killed without the proper balance of mother nature.
 
EIS_comments_deduplicated %>%
     filter(ID == 1160) %>%
     .$Content %>%
     cat()
I have heard both sides of this situation and I believe that new wolves should be introduced on Isle Royale. Climate change has made a large impact on the amount of ice that freezes in the Isle Royale region. Previously wolves from the mainland could cross the ice that formed and take up residence on the Isle. The ice hasn't been stable enough for these crossings in the last few years and the wolves are becoming inbred and dying off. If you will check a video that I have watched about the wolves being reintroduced to Yellowstone, you will see that the ecology of the region is benefited by the wolves being there. If enough wolves are transported to Isle Royale, the wolves will keep the moose in check and the ecology will improve. Allowing the pack to die off is really not a positive move. Introducing a new bloodline to the pack will help. I believe the wilderness designation of Isle Royale is a positive thing and that the wolves help to keep the ecosystem there in good order. Thank you for taking comments from the public.

In both cases, chatGPT vacillated between classifying the comment as favoring Alternative B or C. The difference between those alternatives is admittedly nuanced. Both alternatives propose replacing wolves; the only difference is in the timing. In Alternative B, wolves would be introduced immediately, while in Alternative C wolves would be introduced “if certain thresholds are met, such as the extirpation of wolves, the overabundance of moose, or the degradation of vegetation. The number and timing of wolf introductions would depend on the conditions at the time.”

Both of the comments that made chatGPT disagree with itself focus on the environmental conditions that wolf introductions might remedy. However, these comments seem to presuppose that those conditions have already been met and suggest that immediate introduction is necessary. So, I can see where chatGPT might have a hard time solidly classifying these comments.

Let’s also check the IRR for chatGPT’s classification of ‘opinion strength.’ Unlike the favored alternative, where most folks explicitly stated their preference, classifying the strength of an opinion is a far more subjective task.

IRR_comparisons %>%
     select(ID, opinion_strength_edit, ID_row_count) %>%
     pivot_wider(
          id_cols = "ID",
          names_from = "ID_row_count",
          values_from = "opinion_strength_edit",
          names_prefix = "val") %>%
     group_by(val1 == val2) %>%
     tally() %>%
     mutate(
          total = sum(n),
          prop = n/total
     )
# A tibble: 2 × 4
  `val1 == val2`     n total  prop
  <lgl>          <int> <int> <dbl>
1 FALSE              5   500  0.01
2 TRUE             495   500  0.99

ChatGPT disagreed with itself in 5 cases, but gave reliable classifications 99% of the time. That’s pretty good! However, simply assessing binary agreement or disagreement isn’t a strong metric for this variable. A switch from “extremely strong” to “very strong” is less of an issue than vacillating between “extremely strong” and “mild”.

Instead, we can use Krippendorff’s alpha. This metric provides a formal way to assess the amount of inter-rater disagreement. There are multiple metrics that we could use, but Krippendorff’s alpha is nice because it generalizes to any number of reviewers and can handle many types of data (i.e. binary, ordinal, interval, categorical, etc.). Here’s a great post for understanding Krippendorff’s alpha. We’ll use the irr package to estimate it.

library(irr)

The irr package needs the dataset as a wide-format matrix with one row per reviewer and each record as a column (the package calls records ‘subjects’ because this metric is traditionally used in social science research). For this analysis, we’ll treat the first and second responses from chatGPT as individual reviewers. We also need to enforce the order of our opinion strength levels; otherwise, R will order them alphabetically.

IRR_comparisons %>%
     mutate(opinion_strength_edit = fct_relevel(
          opinion_strength_edit,
          c(
               "other",
               "mild",
               "somewhat strong",
               "strong",
               "very strong",
               "extremely strong"
           )
     )) %>%
     select(
          ID,
          opinion_strength_edit,
          ID_row_count
     ) %>%
     pivot_wider(
          id_cols = "ID_row_count",
          names_from = "ID",
          values_from = "opinion_strength_edit",
          names_prefix = "ID_"
     ) %>%
     select(-ID_row_count) %>%
     as.matrix() %>%
     kripp.alpha(method = "ordinal")
  
Krippendorff's alpha

 Subjects = 500 
   Raters = 2 
    alpha = 0.996 

Krippendorff’s Alpha ranges from -1 to 1, where 1 means perfect concordance, 0 means random guesses among reviewers, and -1 is perfect negative concordance. At .996, we are pretty near perfect reliability.
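To build intuition for how the ordinal method penalizes large jumps more than near-misses, here is a toy sketch run through the same kripp.alpha() function. The ratings are fabricated for illustration and are not our data:

```r
library(irr)

# Two raters, six subjects, ordinal scale 1-5 (made-up ratings)
near_miss <- rbind(c(1, 2, 3, 4, 5, 5),
                   c(1, 2, 3, 5, 4, 5))  # disagreements are adjacent levels
big_jump  <- rbind(c(1, 2, 3, 4, 5, 5),
                   c(1, 2, 3, 1, 1, 5))  # disagreements span the scale

kripp.alpha(near_miss, method = "ordinal")  # penalized less
kripp.alpha(big_jump, method = "ordinal")   # penalized more (lower alpha)
```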

For many datasets, there will be a lower degree of IRR. But it is important to interpret the alpha value in context. Perfect concordance may not be realistic, especially in highly subjective classifications. In most cases, our goal is not perfect concordance, but simply greater reliability than we’d get if we hired a bunch of humans to do the annotating. Preliminary evidence seems to indicate that even version 3.5 of chatGPT is more reliable than humans (even domain experts!) in subjective classification tasks.

In most cases, you won’t have the resources to get human annotations for an entire dataset for comparison. Instead, you could 1.) get human annotations for a small subset, 2.) use a similar benchmark dataset, or 3.) spot-check responses yourself. If you choose to spot-check, I’d suggest rerunning chatGPT multiple times (> 3) in order to estimate the variance in responses. High-variance responses indicate especially difficult classifications that you should target for spot-checks. Another tip is to ask chatGPT to return its justification with each response. Ultimately, this process will help you diagnose problematic types of responses and enable you to engineer better prompts to deal with those edge cases.
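As a sketch of the rerun-and-compare idea using the objects we already have, we can flag any ID whose repeated classifications disagree and target those comments for manual review:

```r
# Find IDs that were scored more than once and received conflicting labels;
# these are the high-variance cases worth spot-checking by hand
GPT_output %>%
     group_by(ID) %>%
     filter(n() > 1) %>%
     summarise(n_labels = n_distinct(favored_alternative_edit)) %>%
     filter(n_labels > 1)
```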

The bottom line is that working with chatGPT is less like working with a model and more like working with human raters–and all of the validation tasks that entails.

Analysis with token embeddings

Up to this point, we’ve presupposed the classifications we wanted ChatGPT to identify in our data. But, what if we wanted to uncover hidden categories in the responses? Folks could advocate for the same Alternative but for different reasons. For example, among those who favor Alternative C, some might argue from the perspective of climate change and some from the perspective of moose populations.

We can use token embeddings to uncover hidden clusters of topics in our responses. Embeddings are the way that LLMs encode free text into numeric form. Each token, or ‘unit of language’, is numerically described as a position in multidimensional language space. This is a huge advantage over more traditional language clustering methods that simply count the occurrence of certain words. Embeddings retain the context of each token as it exists in the document.

Toy example of four sentences containing the word 'train' embedded in two dimensions.
Embeddings allow us to retain the context of text by expressing tokens in multidimensional language space.

As a toy example, the word “train” in the sentences “I train a model”, “I train for a marathon”, “I rode the train”, and “I’m on the Soul Train” could be described along two dimensions: how metaphorical the usage is, and whether the word acts as a noun or a verb. If we do this for all of the words in a document or chunk of text, we can then think of all the embeddings as a point cloud. Documents with highly overlapping point clouds are more similar than those that don’t overlap at all.
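A common way to quantify ‘closeness’ in embedding space is cosine similarity. Here is a toy sketch with hand-made two-dimensional vectors standing in for the ‘train’ examples above; the numbers are invented purely for illustration:

```r
# Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Invented 2-D "embeddings": (verb-ness, literalness)
train_model    <- c(0.9, 0.2)   # "I train a model"
train_marathon <- c(0.8, 0.9)   # "I train for a marathon"
rode_train     <- c(-0.9, 0.9)  # "I rode the train"

cosine_sim(train_model, train_marathon)  # high: both verb senses of "train"
cosine_sim(train_model, rode_train)      # much lower: different senses
```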

We call a different OpenAI model, text-embedding-ada-002, to return the embeddings. Unlike the chat model, we can pass all of the responses as a list in a single call, instead of looping through each response. This makes embeddings much faster and cheaper than using the chatGPT API.

Prior to embedding, I like to remove non-alpha numeric characters from the text.

# Clean up the text to remove non-alpha numeric characters
input_to_embed <- 
     EIS_comments_deduplicated %>%
     mutate(Content_cleaned = str_replace_all(Content, "[^[:alnum:]]", " "))

# Call OpenAI for the embeddings
embeddings_return <- 
     POST(
          "https://api.openai.com/v1/embeddings",
          add_headers(Authorization = paste0(
               "Bearer ", read_lines("../credentials/openai.key"))
          ),
          body = list(
               model = "text-embedding-ada-002",
               input = input_to_embed$Content_cleaned
               ),
          encode = "json"
     )

The returned object is a bit convoluted. We can use some purrr and jsonlite to extract the embeddings.

# Extract the embeddings from the API return
embeddings_list <-
     embeddings_return %>%
     content(as = "text", encoding = "UTF-8") %>%
     fromJSON(flatten = TRUE) %>%
     pluck("data", "embedding")

Then add the embeddings back into the dataframe.

# Combine the embeddings with the original data
EIS_GPT_embeddings <- 
     EIS_comments_deduplicated %>%
     as_tibble() %>%
     mutate(
          embeddings = embeddings_list,
          ID = as.character(ID)
     ) %>%
     left_join(
# We need to get only the first instance of the GPT response data, which also included the repeated reliability test responses, to know which alternative the comment favors
          GPT_output %>%
               group_by(ID) %>%
               arrange(response_created_time) %>%
               mutate(ID_row_count = row_number()) %>%
               filter(ID_row_count == 1) %>%
               ungroup() %>%
     select(
          ID,
          favored_alternative_edit,
          opinion_strength_edit
          )
     )

Topical clustering from text embeddings

The problem is that those point clouds exist in extremely high dimensions. OpenAI’s text-embedding-ada-002 model returns 1536 dimensions. We need a method to reduce that complexity into something useful.

As mentioned, the embeddings allow us to see how comments relate in high-dimensional language space. We want to figure out where there are denser clusters of point clouds in that space which indicate common themes in the comments.

A couple of common ways to do this are to use a clustering algorithm (e.g. K-means) or dimension reduction (e.g. PCA). For this tutorial I want to use a bit of a hybrid approach called t-SNE (t-distributed Stochastic Neighbor Embedding) that will allow us to easily visualize the clusters of common comments, which we can then explore.

We’ll use the Rtsne package, which requires that the data be in matrix form.

library(Rtsne)

# Rtsne requires the embeddings to be in matrix form, so we extract the lists of embeddings from the dataframe and convert them to a matrix.
openai_embeddings_mat <-
     matrix(
          unlist(
               EIS_GPT_embeddings %>%
               .$embeddings
               ),
          ncol = 1536,
          byrow = TRUE
     )

# Estimate tSNE coordinates
set.seed(7267158)
tsne_embeddings <-
     Rtsne(
          openai_embeddings_mat,
          pca = TRUE,
          theta = 0.5,
          perplexity = 50,
          dims = 2,
          max_iter = 10000
     )

Determining the proper theta (the speed/accuracy trade-off for the Barnes-Hut approximation, where 0 gives an exact tSNE) and perplexity (basically an estimate of the effective number of near neighbors each point has, relative to the expected groupings) is more of an art than a science. This post does a great job of exploring choices for these parameters. By setting pca = TRUE in this case, we are first reducing the dimensionality to 50 principal components and then using tSNE to do the final reduction to two visual dimensions.

# Extract the tSNE coordinates and add them to the main dataset
EIS_GPT_embeddings <- 
     EIS_GPT_embeddings %>%
     mutate(
          tsne_dim1 = tsne_embeddings$Y[,1],
          tsne_dim2 = tsne_embeddings$Y[,2]
     )

# Visualize the tSNE plot
EIS_GPT_embeddings %>%
     ggplot(aes(x = tsne_dim1, y = tsne_dim2)) +
     geom_point(alpha = 0.5, pch = 16)
tSNE plot of the openai embeddings
The tSNE plot uncovers some weak groupings, but there is no extremely clear delineation between most comments. This is likely a symptom of low diversity in comments and the fact that most of our comments are very short, so there is less signal in the content.

The first thing to note is that we are not seeing much discrete grouping of the points. This tells us that the comments share a lot more in common across the whole dataset than within local groups. The second thing to notice is that despite the spread, we do see a handful of groups budding off along the periphery. In fact, one group in the bottom right is very distinct. It is important to remember that, unlike PCA, the axis dimensions in tSNE are meaningless. In fact, I’ll remove them from the plots for the rest of the post. Position doesn’t matter in tSNE–only relative closeness.

At this point, we might want to manually delimit groups that we want to analyze further, like pulling out all of the comments from that cluster in the top left. To make this a bit easier, I’ve opted to cluster the two-dimensional tSNE coordinates with hierarchical clustering. It is important to realize that this is purely a convenience for visualization. If we really wanted to use clustering to directly define groups (hierarchical, KNN, etc.), it would make much more sense to cluster directly on the first 50 principal components.
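For the curious, a minimal sketch of that alternative (clustering on leading principal components rather than tSNE coordinates) might look like the following, using a small simulated matrix in place of the real 1536-dimension embeddings:

```r
# Sketch: cluster in PC space instead of on tSNE coordinates.
# `emb` is a simulated stand-in for the real embedding matrix.
set.seed(1)
emb <- matrix(rnorm(200 * 20), nrow = 200)

# Reduce to the leading principal components (the post uses 50;
# only 5 here because the toy matrix has just 20 columns)
pcs <- prcomp(emb, center = TRUE)$x[, 1:5]

# Hierarchical clustering on Euclidean distances in PC space
pc_tree <- hclust(dist(pcs), method = "average")
pc_clusters <- cutree(pc_tree, k = 7)

table(pc_clusters)  # cluster sizes
```

The same cutree call then defines the groups, but the distances being clustered are meaningful in the original embedding space.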

tsne_embedding_clusters <- 
     hclust(
          dist(tsne_embeddings$Y), 
          method = "average"
     )

EIS_embeddings_clustered <-
     EIS_GPT_embeddings %>%
     mutate(
          cluster = cutree(tsne_embedding_clusters, 7)
     )

Since we are clustering on tSNE dimensions where distance doesn’t really matter, deciding where to set our breakpoint is a personal choice. I’ve decided to use 7 clusters because it seemed a natural breakpoint and recovered the obvious clusters.
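If you want something slightly more principled than eyeballing, one option is to scan a few candidate cluster counts and inspect the resulting group sizes. A minimal sketch, with toy coordinates standing in for the real tSNE output:

```r
# Sketch: scan candidate cluster counts before committing to one.
# Toy 2-D coordinates stand in for the real tSNE coordinates.
set.seed(42)
toy_coords <- matrix(rnorm(300), ncol = 2)
toy_tree <- hclust(dist(toy_coords), method = "average")

# For each candidate k, look at the cluster sizes; a "natural"
# breakpoint is where added clusters stop splitting off large groups
cluster_sizes <- lapply(4:9, function(k) {
  sort(table(cutree(toy_tree, k)), decreasing = TRUE)
})
names(cluster_sizes) <- paste0("k", 4:9)
cluster_sizes
```

When new clusters only shave off singletons, you have probably gone past the useful breakpoint.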

tsne plot and hierarchical tree diagram displaying the data split into 7 clusters
Using hierarchical clustering, we can cluster on the tSNE coordinates. Since tSNE coordinates are mostly meaningless, deciding how many clusters to split the data into is a bit arbitrary.

Text analysis of topical clusters

Now that we have putative clusters of topics, we can perform some classic natural language processing (NLP) to illuminate the themes of those topics. We’ll use tidytext for this task.

library(tidytext)

First, we need to get the data into a long, tidy format where each word in every comment is its own row. We’ll also remove common stop words that are predefined in the tidytext library. Then, we can calculate the term frequency-inverse document frequency (TF-IDF) for the clusters. TF-IDF is basically a measure of how common a word is within a cluster, after accounting for how common that word is overall.
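To make the TF-IDF definition concrete, here is a bare-bones sketch of the quantity that tidytext’s bind_tf_idf computes, using toy word lists in place of the tokenized comment clusters:

```r
# Sketch of the TF-IDF calculation itself. The toy word lists stand
# in for the tokenized comment clusters ("documents").
docs <- list(
  c1 = c("wolves", "moose", "isle", "wolves"),
  c2 = c("wolves", "wilderness", "untrammeled")
)

tf_idf <- function(word, doc, docs) {
  # term frequency: share of the document made up of this word
  tf <- sum(docs[[doc]] == word) / length(docs[[doc]])
  # inverse document frequency: log of (documents / documents containing word)
  idf <- log(length(docs) / sum(sapply(docs, function(d) word %in% d)))
  tf * idf
}

tf_idf("wolves", "c1", docs)       # appears in every cluster, so idf = 0
tf_idf("untrammeled", "c2", docs)  # unique to c2, so positive
```

Words that appear in every cluster get an IDF of zero, which is exactly why “wolves” and “moose” dominate the raw counts but vanish from the TF-IDF rankings.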

For example, if we take a look at the most common words in each cluster, it is unsurprising that “wolves”, “moose”, “isle” and “royale” dominate. (Although it is interesting that the top words for clusters 4 and 7 are “wilderness” and “management”… more on that later).

word frequency bar plots for each cluster
Unsurprisingly, when considering the most common words, “wolves”, “moose”, and “isle” dominate.

However, TF-IDF tells us about the relatively unique words that define a cluster of comments. Some clusters, like 1 and 2, have very even TF-IDF distributions, and the important words are mostly filler or nonsense words. This happens when clusters are saturated with common words and there is no strong theme producing uniquely important words. We could have guessed from the tSNE plot of the embeddings that the bulk of comments in the center of the plot would fall in this lexical no-man’s-land. But! Clusters 3, 4, 5, and 7 show promisingly skewed distributions.

term frequency inverse document frequency bar plots for clusters
TF-IDF is a measure of uniquely important words in a ‘document’ (or cluster, in this case) relative to common words across all documents.

Cluster 3 seems to orient towards a topic of animal welfare, with words like “contraception”, “sterilization”, “lethal”, and “culls”. I suspect that these comments come from folks concerned less about the wolf population or wilderness management and more about the ethics of any proposed action involving animals. In a similar way, it looks like Cluster 7 is more concerned with the science and measurement behind the management decision than the decision itself, with words like “evaluating”, “approximately”, and “tools” showing high uniqueness and “management” as the most common word overall. These topics would have been completely lost if we had stopped at categorizing favored alternatives.

Meanwhile cluster 4 appears to be squarely concerned with Wilderness issues. “Wilderness” and “nature” are the most common words in this cluster and “untrammeled” and “unmanipulated” are the most uniquely important words. We might expect that most of the comments that chatGPT categorizes as favoring alternative A will fall into cluster 4.

We can also take a look at how the clusters map onto the chatGPT categorizations.

chatGPT categorized 'favored alternative' mapped to tSNE coordinates with bar plot showing favored alternative counts per cluster

Mapping the chatGPT categorized ‘favored alternative’ onto the tSNE coordinates, we can see that comments roughly sort by favored alternative. Cluster 6 is almost entirely defined by support for Alternative B – immediate wolf introduction. Cluster 4, which seemed to orient towards Wilderness values, is mostly comprised of comments in support of Alternative A – no action.

Cluster 7 and Cluster 3 are both mostly skewed toward Alternative C – more monitoring – and exhibit very similar distributions. This might be a great example where even folks who tend to agree on the same Alternative do so for different reasons–a pattern we would have totally missed without text analysis.

The remaining clusters which compose the bulk of the midland in the tSNE plot favor a mix of Alternatives.

Chain-of-density summarization

We can learn a lot from looking at common and important words and using our human judgement to piece together the topical theme of each cluster. Ideally, we would read all of the comments in a cluster to develop a topical summary. But that would take a long time. As an alternative, we can pass all of the comments in a given cluster to an LLM and have it summarize the theme.

Currently, only a handful of models support context windows large enough to digest the entirety of the comments in our clusters. Anthropic’s Claude2 has a context window of up to 100k tokens (roughly 75,000 words), although it isn’t quite as good as chatGPT 4. To get the most out of Claude2, we can use a special type of prompting developed for summarization called “chain-of-density”. Chain-of-density prompting forces the model to recurrently check its own output to maximize the density and quality of its summarization. Research shows that people tend to like chain-of-density summaries even better than human-written summaries of news articles.

For demonstration, we’ll use chain-of-density prompting to summarize the theme of cluster 3. Here is the prompt that we will pass to Claude2:

"You will generate increasingly concise entity-dense summaries of the semicolon separated comments included below.

The comments were submitted by members of the public in response to the Isle Royale National Park Moose-Wolf-Vegetation Management Plan/EIS. The Plan/EIS is a document that evaluates management alternatives for the moose and wolf populations on the island National Park land.

Now that you know the context, here are the semicolon separated survey responses:

[INSERT SEMICOLON SEPARATED COMMENTS]

Instructions: You will generate increasingly concise entity-dense summaries of the above semicolon separated comments. Repeat the following 2 steps 5 times.

Step 1: Identify 1-3 informative entities (";" delimited) from the comments which are missing from the previously generated summary.
Step 2: Write a new denser summary of identical length which covers every entity and detail from the previous summary plus the missing entities.

A missing entity is
- Relevant: to the main themes of the comments.
- Specific: descriptive yet concise (5 words or fewer).
- Novel: not in the previous summary.
- Faithful: present in the comments.
- Anywhere: located in the comments.

Guidelines:
- The first summary should be long (4-5 sentences, ~100 words), yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g., 'these comments discuss') to reach ~100 words.
- Make every word count. Rewrite the previous summary to improve flow and make space for additional entities.
- Make space with fusion, compression, and removal of uninformative phrases like 'these comments discuss'.
- The summaries should become highly dense and concise, yet self-contained, e.g., easily understood without having read the comments.
- Missing entities can appear anywhere in the new summary.
- Never drop entities from the previous summary. If space cannot be made, add fewer new entities.

Remember: Use the exact same number of words for each summary.

Answer in JSON. The JSON should be a list (length 5) of dictionaries whose keys are 'missing_entities' and 'denser_summary'."

I don’t have API access to Claude2, but unlike chatGPT, Claude2 accepts files as input. So, it is easy to output our prompt with the contained comments as a .txt file and submit it to Claude2 manually.
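A minimal sketch of assembling that .txt file is below; prompt_template and cluster_comments are hypothetical stand-ins for the full prompt above and the actual comments in the cluster.

```r
# Sketch: fill the prompt template with one cluster's comments and
# write the result to a .txt file for manual upload to Claude2.
# `prompt_template` and `cluster_comments` are illustrative stand-ins.
prompt_template <- "Here are the semicolon separated survey responses:\n\n[INSERT SEMICOLON SEPARATED COMMENTS]"
cluster_comments <- c("Reintroduce wolves now.", "Let nature take its course.")

# Substitute the placeholder with the semicolon-joined comments
prompt_filled <- sub(
  "[INSERT SEMICOLON SEPARATED COMMENTS]",
  paste(cluster_comments, collapse = "; "),
  prompt_template,
  fixed = TRUE
)

outfile <- tempfile(fileext = ".txt")
writeLines(prompt_filled, outfile)
```

In practice you would filter the clustered dataframe to one cluster, collapse its Content column, and write to a real path instead of a tempfile.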

Claude2 gives us five increasingly dense summaries:

Summary 1: These comments provide a range of perspectives on how the National Park Service should manage the declining wolf population and increasing moose population on Isle Royale. There is discussion of various preliminary alternative concepts, including taking no action, reintroducing wolves once, maintaining both moose and wolves, reducing the moose population, and intensively managing the moose population. Many commenters express support for alternative C, which involves reintroducing and augmenting the wolf population on Isle Royale more than once if necessary. There is also frequent mention of the possibility of relocating problem wolves from elsewhere to the island. Overall, most commenters emphasize the importance of restoring the natural wolf/moose balance on Isle Royale through wolf reintroduction rather than intensively managing the moose population.
Summary 2: These comments provide perspectives on managing the declining Isle Royale wolf and increasing moose populations. Many support alternative C to reintroduce and augment wolves more than once if needed. Wolf relocation from the mainland is also mentioned. Most emphasize restoring the wolf/moose balance through wolf reintroduction rather than intensively managing moose. Climate change reducing ice bridges for wolf migration and the wilderness designation of Isle Royale are frequently discussed. Many oppose lethal moose culling, instead supporting moose contraception/sterilization. Overall there is a focus on restoring the natural predator-prey dynamic through wolf reintroduction while minimizing intensive human management due to the wilderness designation.
Summary 3: These comments discuss managing the declining Isle Royale wolf and increasing moose populations. Many support alternative C to reintroduce and augment wolves more than once if needed to restore the wolf/moose balance. Wolf relocation and genetics are often mentioned given reduced migration from climate change. The wilderness designation and opposition to lethal moose culling are frequently discussed, with support for contraception/sterilization. There is concern about moose overbrowsing without sufficient wolves. Some mention wolf tourism benefits. Overall there is emphasis on restoring the natural predator-prey dynamic and wolf genetics through reintroduction while minimizing intensive human management due to the wilderness designation.
Summary 4: These comments discuss managing the declining Isle Royale wolf and increasing moose populations, often supporting alternative C to reintroduce and augment wolves more than once to restore balance. Wolf relocation and genetics are frequently mentioned to compensate for reduced migration from climate change. Many oppose lethal moose culling, instead supporting contraception/sterilization, expressing concerns about moose overbrowsing and ticks without wolves. Some mention wolf tourism benefits and preserving the wilderness designation. There is emphasis on restoring balance through wolf reintroduction to limit moose impacts on vegetation while minimizing intensive human management.
Summary 5: These comments frequently support alternative C to reintroduce and augment Isle Royale wolves to restore wolf/moose balance and predator-prey dynamics, concerned about moose overbrowsing, ticks, and vegetation without sufficient wolves. Wolf relocation and genetics are often proposed to compensate for reduced migration from climate change. Many oppose lethal moose culling, noting wilderness values. Some mention continuing the iconic predator-prey research, wolf tourism benefits, wolf-livestock issues, and the value of public engagement. Overall there is emphasis on reestablishing a healthy Isle Royale wolf population through reintroduction to minimize intensive human management and moose impacts.

The final two summaries start to get at the nuance particular to this cluster. Summary 4 mentions, “Many oppose lethal moose culling, instead supporting contraception/sterilization”. Summary 5 mentions: “Many oppose lethal moose culling, noting wilderness values.” As expected, much of the summary is likely common across all comments. So, a human interpretation of the summaries is still necessary.

Conclusion

In this post, we explored using chatGPT, text embeddings, and Claude2 to analyze public comments on a complex environmental management decision. We learned methods to responsibly validate chatGPT output. While not perfect, chatGPT showed promising reliability at categorizing free-form opinions. The text embeddings allowed us to uncover hidden topical clusters among comments that traditional methods would have missed. Claude2’s long context window allowed us to further interpret the topical clusters. Together, these tools enabled a nuanced quantitative analysis of subjective text data that would be infeasible for a single human analyst to perform manually.

 

Visualize mixed effect regressions in R with GGplot2 https://www.azandisresearch.com/2022/12/31/visualize-mixed-effect-regressions-in-r-with-ggplot2/ Sat, 31 Dec 2022 18:34:16 +0000 https://www.azandisresearch.com/?p=2224
Example of the final viz.

In this post, I will show some methods of displaying mixed effect regression models and associated uncertainty using non-parametric bootstrapping. This is kind of a follow-up to my previous post on visualizing custom main effect models.

Introduction

Mixed models have quickly become the model du jour in many observation-oriented fields because these models obviate many of the issues of pseudoreplication inherent in blocked or repeated measures experiments and structured data.

They do this by treating the levels of categorical variables not as unique instances to be parameterized individually, but as random samples from an infinite distribution of levels. Instead of wasting degrees of freedom estimating parameter values for each level, we only need to estimate that global distribution (which requires only a handful of parameters) and instead focus our statistical power on the variables of interest.

Thanks to packages like nlme and lme4, mixed models are simple to implement. For all of their virtues, mixed models can also be a pain to visualize and interpret. Although linear mixed models are conceptually similar to the plain old ordinary least-squares regression we know and love, they harbor a lot more math under the hood, which can be intimidating.

One of the reasons mixed models are difficult to intuitively visualize is that they allow us to manage many levels of uncertainty. Depending on the focus of our analyses, we usually want to highlight certain aspects of the trends and associated uncertainty. For instance, an ecologist might be interested in the effect of nutrient input across many plots, but not interested in the difference between plots (i.e. traditional random effect). Or, an educator might be interested in the effect of different curricula, but not the difference between specific classes within specific schools (i.e. nested random effects). Or, a physician might be interested in the effect of a long-term treatment on a patient after accounting for baseline differences between patients (i.e. repeated measures).
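For reference, the three scenarios above map onto lme4’s random effect syntax roughly like this (formulas only; all variable names are illustrative, not from the examples below):

```r
# Illustrative lme4 formulas for the three scenarios; the variable
# names are hypothetical placeholders.
f_plots    <- yield   ~ nutrient   + (1 | plot)           # random effect of plot
f_nested   <- score   ~ curriculum + (1 | school/class)   # classes nested in schools
f_repeated <- outcome ~ treatment  + (1 | patient)        # repeated measures per patient
```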

In this tutorial, I’m going to focus on how to visualize the results of mixed effect models from lme4 using ggplot2. You can also clone the annotated code from my Github.

First, load in the necessary libraries.

library(tidyverse)
library(lme4)
library(ggsci)
library(see)
library(cowplot)
theme_set(theme_classic()) # This sets the default ggplot theme

To begin, I am going to simulate an experiment with 10 experimental units each containing 100 observations. These could be 10 plots with 100 random samples, or 10 schools with 100 student test scores, or the records of 10 patients, each with 100 follow-up visits. Each of the experimental units will ultimately get its own intercept and slope effect coefficient. The rand_eff data frame holds the random effect coefficients (the b vector that pairs with the Z matrix in classic mixed model notation). For this example, I’ll assume that the intercepts come from a distribution with standard deviation of 20 and the slopes from a distribution with standard deviation of 0.5. The random effects define the variation of the experimental unit around the main effect, so the mean of these distributions is necessarily 0.

set.seed(666)
rand_eff <- data.frame(unit = as.factor(1:10),
                       b0 = rnorm(10, 0, 20),
                       b1 = rnorm(10, 0, 0.5))

We can now join our random effect matrix to the full dataset and define our y values as yij = (B0 + b0j) + (B1 + b1j)xij + εij.

X <- expand.grid(unit = as.factor(1:10), obs = as.factor(1:100)) %>%
  left_join(rand_eff,
            by = "unit") %>%
  mutate(x = runif(n = nrow(.), 0, 10),
         B0 = 20,
         B1 = 2,
         E = rnorm(n = nrow(.), 0, 10)) %>%
  mutate(y = B0 + b0 + x * (B1 + b1) + E)

Here’s a look at the data.

X %>%
  ggplot(aes(x = x, y = y, col = unit)) +
  geom_point() +
  facet_wrap(vars(unit))
Scatter plots of the simulated data. Each of 10 experimental units contains 100 observations.

Random intercept model

For demonstration, let’s first assume that we are primarily interested in the overall slope of the relationship. For instance, if these are 10 field plots, we might want to know the effect of adding 1 unit of nutrient fertilizer, regardless of the baseline level of nutrients in a given plot.

We can do this by fitting a random intercept model and then looking at the summary of the resulting model.

lmer1 <- lmer(y ~ x + (1|unit), data = X)

summary(lmer1)
> summary(lmer1)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | unit)
   Data: X

REML criterion at convergence: 7449.7

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.88156 -0.68745  0.01641  0.71022  2.84532 

Random effects:
 Groups   Name        Variance Std.Dev.
 unit     (Intercept) 811.59   28.488  
 Residual              94.63    9.728  
Number of obs: 1000, groups:  unit, 10

Fixed effects:
            Estimate Std. Error t value
(Intercept)  18.2652     9.0301   2.023
x             2.0091     0.1077  18.651

Correlation of Fixed Effects:
  (Intr)
x -0.060

We can see that the fitted model does a good job estimating the fixed effect slope (B1, which we simulated as 2) at 2.0091. However, the model underestimates the fixed effect intercept (B0, simulated as 20) at 18.3 and overestimates the standard deviation of the random intercepts (b0, simulated as 20) at 28.5.

If we think we could live with that fit, how would we go about visualizing our model?

Here is a look at our data with a linear regression fit to each experimental unit. It is clear that there is a wide spread in the intercepts, but the slopes are similar.

X %>%
  ggplot(aes(x = x, y = y, col = unit)) +
  geom_point() +
  geom_smooth(method = 'lm', se = F)
Scatter plot of the simulated data with an independent linear regression fit to each experimental unit.

Marginal (Fixed effect) versus Conditional (Fixed + Random effect)

We might be tempted to use this built-in regression by group from ggplot as a visualization of the mixed model. However, this would be WRONG!!! GGplot is fitting an ordinary least squares regression to each unit without accounting for the random effect. That means that the estimates and the confidence intervals do not reflect our model. In this case, the estimates might be pretty close since our sample sizes across units are pretty even, but they could be wildly off from, or even opposite of, the mixed model slope estimate.

In a prior post, I showed how we can use the predict function to display our custom models in ggplot. In the case of mixed effect models, you can predict both the marginal and conditional values. The marginal value is the fixed effect. The conditional value is the mixed effect of the fixed and random effects.

In other words, the marginal effect is asking “What would I expect y to be for a given x without knowing which experimental unit it came from?” whereas the conditional effect is asking “What would I expect y to be for a given x from a given experimental unit?”

We can specify which prediction we want with the random effect formula argument re.form:

X <- X %>% 
  mutate(fit.m = predict(lmer1, re.form = NA),
         fit.c = predict(lmer1, re.form = NULL))

The simplest visualization would be to display the marginal fit on the raw values.

X %>%
  ggplot(aes(x = x, y = y)) +
    geom_point(pch = 16, col = "grey") +
    geom_line(aes(y = fit.m), col = 1, size = 2) +
    coord_cartesian(ylim = c(-40, 100))
Linear fit of the marginal (fixed) effects (black line) shown with a scatterplot of the raw data (grey dots).

However, this is a bit misleading because it underrepresents our confidence in the slope by making it look like the residuals are huge.

But the residuals, and our confidence in the fit, are based on the conditional residual variance, which is much tighter. We can see that easily when we look at the conditional fits. This is one option for visualization, but it highlights the wrong element if our primary interest is the overall slope trend.

X %>%
  ggplot(aes(x = x, y = y, col = unit)) +
  geom_point(pch = 16) +
  geom_line(aes(y = fit.c, col = unit), size = 2) +
  facet_wrap(vars(unit)) +
  coord_cartesian(ylim = c(-40, 100))
Conditional fits from the random effect model with random intercepts on the raw data points, facetted by experimental unit.

Displaying the conditional fits on the same facet helps. Now we can see the variation in the conditional intercepts, but as a tradeoff it makes it difficult to get a sense of the residual variance because there are too many points.

X %>%
  ggplot(aes(x = x, y = y, col = unit)) +
  geom_point(pch = 16) +
  geom_line(aes(y = fit.c, col = unit), size = 2)  +
  coord_cartesian(ylim = c(-40, 100))
Conditional fits from the random effect model with random intercepts on the raw data points. All experimental units are displayed on the same facet, differentiated by color.

Instead, I think it makes more sense to display the conditional residuals around the marginal effect. You can kind of think of this as collapsing all of the conditional fits from the previous plot into the single marginal fit. We can do this by extracting the residuals (which are the conditional residuals) and then displaying the points as the marginal fit plus the residuals.

X <- X %>%
  mutate(resid = resid(lmer1))

X %>%
  ggplot(aes(x = x, y = fit.m + resid, col = unit)) +
  geom_point(pch = 16) +
  geom_line(aes(y = fit.m), col = 1, size = 2)  +
  coord_cartesian(ylim = c(-40, 100))
Marginal fit from the random effect model with random intercepts on the conditional residuals of the experimental units, differentiated by color.

In some cases, we might also want to give the reader a sense of the variation in the conditional intercepts. For instance, the fact that the slope is so consistent across a wide range of baselines might actually increase our confidence in the relationship even further.

There are a couple of ways to simultaneously display both our confidence in the fit of the marginal trend and the variance in the conditional fits.

Depending on the number of conditional units, one option is to display the conditional fits below the scatter plot of the conditional residuals.

X %>%
  ggplot(aes(x = x, y = fit.m + resid)) +
  geom_line(aes(y = fit.c, col = unit), size = 1) +
  geom_point(pch = 16, col = "grey") +
  geom_line(aes(y = fit.m), col = 1, size = 2) +
  coord_cartesian(ylim = c(-40, 100))
Marginal fit (heavy black line) from the random effect model with random intercepts with the conditional residuals (grey dots) and conditional fits (thin lines) for each experimental unit, differentiated by color.

Another option is to display a density plot or histogram of the estimated conditional intercepts (also known as the Best Linear Unbiased Predictors or BLUPs). In a random effect framework, we are assuming that the conditional intercepts are samples of some infinite distribution of intercepts, so this histogram from the BLUPs of our model is essentially an empirical representation of that idealized distribution. (Alternatively, we could also simply plot the idealized distribution as a normal distribution from the estimated variance of the random effect, but I like the empirical density plot because it also gives a sense of when our conditional intercepts do NOT conform to the assumption of being samples from a normal distribution.)

We can extract the BLUPs from the model object (b0_hat) and add those to the model estimate of the marginal intercept (B0_hat) to get the estimated conditional intercepts. This is our data frame of conditional estimates.

Cond_DF <- as.data.frame(ranef(lmer1)) %>%
  transmute(unit = grp,
            b0_hat = condval) %>%
  mutate(Intercept_cond = b0_hat + summary(lmer1)$coef[1, 1])

X %>%
  ggplot(aes(x = x, y = fit.m + resid)) +
  geom_point(pch = 16, col = "grey") +
  geom_violinhalf(data = Cond_DF, aes(x = 0, y = Intercept_cond), trim = FALSE, width = 3, fill = NA) +
  geom_line(aes(y = fit.m), col = 1, size = 2) +
  coord_cartesian(ylim = c(-40, 100))
Marginal fit (heavy black line) from the random effect model with random intercepts with the conditional residuals (grey dots) and a density plot of the distribution of conditional intercepts.

Random slope and intercept model

Now let’s imagine that we are not satisfied with the random intercept model and also want to fit a random slope parameter. In this case, we want to estimate the distribution of slopes for all experimental units across the values of x.

lmer2 <- lmer(y ~ x + (x|unit), data = X)

We can see that the values from the model are getting much closer to the known values that we simulated.

summary(lmer2)

> summary(lmer2)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (x | unit)
   Data: X

REML criterion at convergence: 7442.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2071 -0.6957  0.0268  0.7067  2.8482 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 unit     (Intercept) 846.3021 29.0913       
          x             0.2151  0.4638  -0.27
 Residual              93.0853  9.6481       
Number of obs: 1000, groups:  unit, 10

Fixed effects:
            Estimate Std. Error t value
(Intercept)  18.3474     9.2202    1.99
x             2.0075     0.1816   11.05

Correlation of Fixed Effects:
  (Intr)
x -0.250

Correlation of random effects

One important addition from the random intercept-only model is the estimate for the correlation between the distribution of the random slopes and random intercepts (which the model estimates as -0.268, see output below). Because we simulated these data, we know that there is no true correlation between the unit slopes and intercepts. But, because we have a small number of units, we just happened to have an emergent correlation.

summary(lmer2)$varcor

cor.test(rand_eff$b0, rand_eff$b1)
> summary(lmer2)$varcor
 Groups   Name        Std.Dev. Corr  
 unit     (Intercept) 29.09127       
          x            0.46378 -0.268
 Residual              9.64807       
> 
> cor.test(rand_eff$b0, rand_eff$b1)

	Pearson's product-moment correlation

data:  rand_eff$b0 and rand_eff$b1
t = -1.2158, df = 8, p-value = 0.2587
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8205150  0.3123993
sample estimates:
       cor 
-0.3949022 
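To see how easily such emergent correlations arise with only 10 units, here is a quick simulation sketch (not part of the original analysis): draw 10 independent intercepts and slopes many times and look at the spread of the sample correlations.

```r
# Sketch: how often do "emergent" correlations appear between 10
# independently drawn intercepts and slopes?
set.seed(1)
chance_cors <- replicate(5000, cor(rnorm(10), rnorm(10)))

# With n = 10, sample correlations of |r| > 0.3 are common even
# though the true correlation is exactly 0
mean(abs(chance_cors) > 0.3)
```

This is why the nonsignificant cor.test above is unsurprising: with so few units, a sample correlation near -0.4 is entirely consistent with a true correlation of zero.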

If you had a reason to assume NO correlation between your random effects, you could specify that as (x||unit) in this way:

lmer3 <- lmer(y ~ x + (x||unit), data = X)

summary(lmer3)

summary(lmer3)$varcor
> summary(lmer3)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + ((1 | unit) + (0 + x | unit))
   Data: X

REML criterion at convergence: 7443.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.1812 -0.6951  0.0223  0.7074  2.8523 

Random effects:
 Groups   Name        Variance Std.Dev.
 unit     (Intercept) 833.2681 28.8664 
 unit.1   x             0.2092  0.4574 
 Residual              93.1011  9.6489 
Number of obs: 1000, groups:  unit, 10

Fixed effects:
            Estimate Std. Error t value
(Intercept)   18.325      9.149   2.003
x              2.010      0.180  11.165

Correlation of Fixed Effects:
  (Intr)
x -0.035
> 
> summary(lmer3)$varcor
 Groups   Name        Std.Dev.
 unit     (Intercept) 28.86638
 unit.1   x            0.45738
 Residual              9.64889

As with the random intercept model, we can use the predict function to get expected values of y based on the marginal or conditional estimates. Note that re.form = NULL is the same as re.form = ~ (x|unit).

X <- X %>% 
  mutate(fit2.m = predict(lmer2, re.form = NA),
         fit2.c = predict(lmer2, re.form = NULL),
         resid2 = resid(lmer2))

As with the random intercept model, one way to visualize the model is to show the conditional intercept/slopes as fitted lines and the conditional residuals as points.

pmain_lmer2 <- X %>%
  ggplot(aes(x = x, y = fit2.m + resid2)) +
  geom_line(aes(y = fit2.c, col = unit), size = 1) +
  geom_point(pch = 16, col = "grey") +
  geom_line(aes(y = fit2.m), col = 1, size = 2) +
  coord_cartesian(ylim = c(-40, 100))
pmain_lmer2
Marginal fit (heavy black line) from the random effect model with random intercepts and slopes with the conditional residuals (grey dots) and conditional fits (thin lines) for each experimental unit, differentiated by color.

Visualizing the random effect variance gets a bit more difficult with two random parameters. One strategy I like is to include an additional plot of the correlation and distribution of the random effects.

Basically, we make a scatter plot of the BLUPs of the slopes and intercepts. Then we make density plots for each and add those to the margins of the plot. (I’m relying heavily on Claus Wilke’s post for the code to build the marginal distributions). Finally, we patch it all together with cowplot.

Cond_DF2 <- as.data.frame(ranef(lmer2)) %>% 
  transmute(unit = grp,
            term = case_when(term == "(Intercept)" ~ "b0_hat",
                             term == "x" ~ "b1_hat"),
            value = condval) %>%
  pivot_wider(id_cols = "unit", names_from = "term", values_from = "value") %>%
  mutate(Intercept_cond = b0_hat + summary(lmer2)$coef[1,1],
         Slope_cond = b1_hat + summary(lmer2)$coef[2,1])

pmain <- Cond_DF2 %>%
  ggplot(aes(x = Intercept_cond, y = Slope_cond)) +
    geom_point(aes(col = unit), size = 3) +
    geom_density2d(bins = 4, col = "grey", adjust = 3)

xdens <- axis_canvas(pmain, axis = "x") +
  geom_density(data = Cond_DF2, aes(x = Intercept_cond), fill = "grey", col = NA, trim = FALSE, adjust = 2)

ydens <- axis_canvas(pmain, axis = "y", coord_flip = TRUE) +
  geom_density(data = Cond_DF2, aes(x = Slope_cond), fill = "grey", col = NA, trim = FALSE, adjust = 2) +
  coord_flip()

p1 <- insert_xaxis_grob(pmain, xdens, grid::unit(.2, "null"), position = "top")
p2 <- insert_yaxis_grob(p1, ydens, grid::unit(.2, "null"), position = "right")

pinsert_lmer2 <- ggdraw(p2)

plot_grid(
  pmain_lmer2,
  pinsert_lmer2,
  nrow = 1
)
Left: Marginal fit (heavy black line) from the random effect model with random intercepts with the conditional residuals (grey dots) and conditional fits (thin lines) for each experimental unit, differentiated by color. Right: Correlation of BLUPs (Best Linear Unbiased Predictors) of random intercept and slope parameters of experimental units, differentiated by color with marginal density distributions.

Bootstrapping uncertainty

For many researchers, one of the most frustrating aspects of mixed models is that estimating confidence intervals and testing the significance of parameters is not straightforward. I highly encourage folks to take a look at Ben Bolker’s thorough considerations on the topic. Dr. Bolker lays out many different problems and solutions depending on the structure of your model.

I think that the most generalized solution is to use non-parametric bootstrapping. This method essentially asks the question, “How would our model fit change if we could go back in time, select different samples, and then rerun our analysis?”

We can’t go back in time, but maybe we CAN assume that our original samples were representative of the population. If so, instead of resampling the actual population, we can resample our original observations WITH REPLACEMENT to approximate resamples of the population. If we do this many times, we can then make intuitive statements like, “95% of the time, if we reran this experiment, we’d expect the main effect to be between X and Y.”
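Before applying this to the mixed model, the core mechanic is worth seeing in isolation. Here is a minimal sketch for a simple statistic (the mean), where `obs` is a hypothetical vector standing in for our original observations:

```r
set.seed(1)
obs <- rnorm(100, mean = 2, sd = 1)  # hypothetical original sample

# Resample WITH replacement and recompute the statistic many times
boot_means <- replicate(1000, mean(sample(obs, replace = TRUE)))

# "95% of the time, we'd expect the mean to fall between these values"
quantile(boot_means, probs = c(0.025, 0.975))
```

The mixed-model version below works the same way; the only extra wrinkles are which rows get resampled and which parameters get stored.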

It is important to stop and consider what re-doing your data generation process would look like. For instance, imagine our mock data had come from 1000 independent random observations that we then categorized into “units” to control for autocorrelation after the fact. If we re-ran the process to generate a new dataset, we would not always get the same number of observations in each “unit”.

However, if our mock data came from an experiment where we planted 100 trees in each of 10 “units”, then when we re-ran the experiment, we could control the number of individuals per unit. We would also need to consider if we would always choose to plant in the same 10 units, or if we would also choose units at random.

The structure of the data generation process can guide our bootstrap resampling strategy. In the first example, we could simply bootstrap all individual observations (although we may need to worry about non-convergence and small sample sizes). In the second example, where unit choice is constrained, we might decide to bootstrap within units. If, in the second example, we could also randomize units, we should probably take a hierarchical approach, first bootstrapping the units and then bootstrapping the observations within each unit.
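As a rough sketch of those two resampling strategies (assuming the `X` dataframe with its `unit` column from earlier), resampling within fixed units versus hierarchically might look like:

```r
library(dplyr)

# Strategy 1: bootstrap within units (unit membership fixed by design).
# slice_sample() resamples within each group when the data are grouped.
BS_within <- X %>%
  group_by(unit) %>%
  slice_sample(prop = 1, replace = TRUE) %>%
  ungroup()

# Strategy 2: hierarchical bootstrap -- resample the units first,
# then resample observations within each sampled unit
units_bs <- sample(unique(X$unit), replace = TRUE)
BS_hier <- lapply(seq_along(units_bs), function(i) {
  X %>%
    filter(unit == units_bs[i]) %>%
    slice_sample(prop = 1, replace = TRUE) %>%
    mutate(unit = paste0("bs_", i))  # relabel so repeated units stay distinct
}) %>%
  bind_rows()
```

The relabeling step in the hierarchical version matters: if a unit is drawn twice, the model should treat the two draws as separate groups rather than merging them back into one.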

NOTE: The problem with non-parametric bootstrapping of this kind is that it can be computationally expensive. One trick is to parallelize the bootstraps across all of your computer’s processors. By default, R uses one processor, so it will fit one bootstrap iteration at a time, sequentially. But the bootstraps are independent and the order doesn’t matter, so if your computer has 8 cores, there is no reason not to fit 8 models simultaneously on 8 processors in 1/8 of the time. Unfortunately, setting up parallel processing can be an adventure of its own. I won’t detail it here, but will try to dedicate a future post to it. If you have a very large dataset and can’t run bootstraps in parallel, you might consider some of the other methods suggested by Dr. Bolker.

Since our dataset is fairly small and simple, I’ll demonstrate how we can use bootstrapping to simultaneously estimate confidence intervals of our model parameters and visualize error bands.

Just keep in mind that if you are fitting models with complex random effect designs, you’ll have to think critically about which elements and levels of variance are most important for your data story. Hopefully, these examples will at least get you started and inspired!

The bootstrapping process begins by initializing empty dataframes to accept the parameter estimates for the fixed effect coefficients and random effect variances. The number of rows in each dataframe will be the same as the number of bootstrap iterations, so we set that first. The number of iterations is the number of times we want to simulate re-doing our data generation. The convention is 1000, but the more the merrier!

Number_of_boots <- 1000

The number of columns for the dataframes will equal the number of fixed effect coefficients and random effect variances. We can extract these from the initial model. First we extract the coefficients, then transpose the table into wide format.

# Extract the fixed effect coefficients.
FE_df <- fixef(lmer2) %>% 
  t() %>%
  as.data.frame()

# Extract the random effects variance and residual variance
RE_df <- VarCorr(lmer2) %>%
  as.data.frame() %>%
  unite("Level", -c(vcov, sdcor)) %>%
  select(-vcov) %>%
  t() %>%
  as.data.frame()

Next, we create empty dataframes to take our bootstraps.

BS_params <- data.frame(matrix(nrow = Number_of_boots, ncol = ncol(FE_df)))
colnames(BS_params) <- colnames(FE_df)

BS_var <- data.frame(matrix(nrow = Number_of_boots, ncol = ncol(RE_df)))
colnames(BS_var) <- RE_df["Level",]

In addition, we will be predicting marginal values from each model. So, we need to create a prediction dataframe with an empty column to store the predicted values. For this example, we only need to predict values for a handful of x values that represent the range of xs. I chose to use 10-quantiles because I want to be able to fit a non-linear confidence band later. If this was a non-linear fit, we might want even more prediction values.

BS_pred <- expand.grid(x = quantile(X$x, probs = seq(0, 1, length.out = 10)),
                       iterration = 1:Number_of_boots,
                       pred = NA)

Finally, we can write a loop that creates a resampled dataset (with replacement) and fits the original model formula to the new dataset. From the new model, we can then extract the fixed and random effects and predict y values for the subset of x values. All of these get stored in their respective dataframes, indexed by the iteration number.

for(i in 1:Number_of_boots){
  BS_X <- slice_sample(X, prop = 1, replace = TRUE)
  BS_lmer <- lmer(formula = lmer2@call$formula,
                  data = BS_X)
  
  BS_params[i,] <- BS_lmer %>%
    fixef() %>%
    t() %>%
    as.data.frame()
  
  BS_var[i,] <- BS_lmer %>%
    VarCorr() %>%
    as.data.frame() %>%
    .$sdcor
  
  BS_pred[which(BS_pred$iterration == i),]$pred <- predict(BS_lmer,
                                                           newdata = BS_pred[which(BS_pred$iterration == i),],
                                                           re.form = ~0)
}

Now we have a dataframe of the marginal (i.e. fixed effect) intercept and slope parameter estimates from 1000 models fit to bootstraps of our original data.

head(BS_params)
> head(BS_params)
  (Intercept)        x
1    18.06942 2.096240
2    18.18005 2.070043
3    18.13506 2.093110
4    18.77862 1.928048
5    18.47963 2.013831
6    18.28875 2.005947

One way to get a sense of our confidence in these parameter estimates is to take a look at their distributions.

BS_hist_x <- BS_params %>%
  ggplot(aes(x = x)) +
  geom_histogram()

BS_hist_intercept <- BS_params %>%
  ggplot(aes(x = `(Intercept)`)) +
  geom_histogram()

BS_hists <- plot_grid(
  BS_hist_intercept,
  BS_hist_x,
  nrow = 1)

BS_hists
Histograms of marginal (i.e. fixed effect) intercept (left) and slope (right) parameters.

These histograms tell us so much more than a typical confidence interval because we can see the full distribution. We can see that the baseline effect on y given x = 0 is around 18.5 and we are very confident it is between 17 and 20. We can also see that the effect of 1 unit change in x is expected to yield about a 2.0 unit change in y and we are extremely confident that the slope is positive and greater than 1.7 but probably less than 2.3.
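If you do need compact numbers to report in a table, the same bootstrap distributions can be summarized directly. A quick sketch, using the `BS_params` dataframe from above:

```r
# 95% percentile intervals (plus the median) for each fixed effect,
# computed column-wise over the bootstrap estimates
apply(BS_params, 2, quantile, probs = c(0.025, 0.5, 0.975))
```

This is the same percentile logic we’ll use for the confidence band later, just applied to the parameters instead of the predictions.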

Although we won’t directly visualize the random effect variance, we can see the estimates in the BS_var dataframe.

BS_var %>% head()

> BS_var %>% head()
  unit_(Intercept)_NA unit_x_NA unit_(Intercept)_x Residual_NA_NA
1            28.30440 0.4427450         0.01009018       9.786112
2            30.31353 0.6265168        -0.42315981       9.365630
3            30.33369 0.5261389        -0.78128072       9.320844
4            28.51896 0.6079722        -0.08575830       9.739881
5            30.50690 0.5835157        -0.64068868       9.718633
6            29.17012 0.5182395        -0.28244154       9.930305

Here, the first column is the estimated variance attributed to the random intercepts, the second column is the variance estimate of the random slopes, and the third column is the correlation between the random effects. The fourth and final column is the residual variation after accounting for the random effects (i.e. the conditional residual variation). This is information you’d want to include in a table of the model output.

We also have a dataframe of 10 predicted y values for each iteration.

head(BS_pred, n = 20)
> head(BS_pred, n = 20)
             x iterration     pred
1  0.008197445          1 18.08660
2  1.098954093          1 20.37309
3  2.231045775          1 22.74622
4  3.266109165          1 24.91596
5  4.323574828          1 27.13267
6  5.667717615          1 29.95031
7  6.672807375          1 32.05722
8  7.748743587          1 34.31264
9  8.793447507          1 36.50259
10 9.968969584          1 38.96677
11 0.008197445          2 18.19702
12 1.098954093          2 20.45493
13 2.231045775          2 22.79841
14 3.266109165          2 24.94104
15 4.323574828          2 27.13003
16 5.667717615          2 29.91247
17 6.672807375          2 31.99305
18 7.748743587          2 34.22028
19 8.793447507          2 36.38286
20 9.968969584          2 38.81624

Rather than using a traditional confidence band (which basically reduces the distribution of our bootstraps down to two points: high and low), I prefer to actually show all of the iterations and let the density of the lines make a kind of optical illusion of a confidence band.

Since our estimates are pretty tightly distributed, I’m using cowplot to show what this looks like while also zooming in to a section.

plot_grid(
  BS_pred %>%
    ggplot(aes(x = x, y = pred)) +
    geom_line(aes(group = iterration), alpha = 0.1, col = "grey50") +
    geom_line(data = X,
              aes(x = x, y = fit2.m)) +
    geom_rect(aes(ymin = 35, ymax = 40,
                  xmin = 8, xmax = 10),
              col = "firebrick",
              fill = NA,
              size = 2),
  BS_pred %>%
    ggplot(aes(x = x, y = pred)) +
    geom_line(aes(group = iterration), alpha = 0.1, col = "black") +
    geom_line(data = X,
              aes(x = x, y = fit2.m),
              col = "grey", 
              size = 2) +
    coord_cartesian(xlim = c(8, 10),
                    ylim = c(35, 40)) +
    geom_rect(aes(ymin = 35, ymax = 40,
                  xmin = 8, xmax = 10),
              col = "firebrick",
              fill = NA,
              size = 2) +
    theme(axis.line = element_blank(),
          axis.text = element_blank(),
          axis.title = element_blank()) +
    labs(x = "", y = ""),
  nrow = 1
)
Marginal effects of linear mixed effect regressions fit to 1000 bootstrap resamples of the original data (thin grey lines in the left panel and thin black lines in the right panel). The model estimates fit to the original data are shown as a thick black line (left) or thick grey line (right). The right panel is a detailed view of the region indicated with a red rectangle in the left panel.

Of course, we can also create a more traditional 90% confidence band by summarizing the 5th and 95th percentiles of the predictions at each x value.

BS_pred %>%
  group_by(x) %>%
  summarise(hi = quantile(pred, 0.95),
            lo = quantile(pred, 0.05)) %>%
  ggplot(aes(x = x)) +
    geom_ribbon(aes(ymin = lo, ymax = hi),
                fill = "grey50",
                alpha = 0.3) +
    geom_line(data = X,
              aes(x = x, y = fit2.m))
Model fit of a linear mixed effect regression (black line) and 90% confidence band (grey band) estimated from 1000 bootstrap fits.

Putting it all together

Putting it all together, here is my preferred visualization of a mixed effect model with random intercepts and slopes, using bootstrapping to display uncertainty.

BS_ci_lines <- BS_pred %>%
  ggplot(aes(x = x, y = pred)) +
    geom_line(aes(group = iterration), alpha = 0.1, col = "grey") +
    geom_line(data = X,
              aes(x = x, y = fit2.m)) +
    geom_point(data = X,
               aes(x = x, y = fit2.m + resid2, col = unit),
               alpha = 0.3,
               pch = 16)

plot_grid(
  BS_ci_lines,
  plot_grid(
    pinsert_lmer2,
    plot_grid(
      BS_hist_intercept,
      BS_hist_x,
      nrow = 1,
      labels = c("C", "D")),
    nrow = 2,
    rel_heights = c(1, 0.7),
    labels = c("B", NA)
  ),
  nrow = 1,
  rel_widths = c(1, 0.7),
  labels = c("A", NA)
)
Linear fixed effect (A, solid black line) of a mixed effect regression fit to the original data and estimates from 1000 bootstrap resamples (A, thin grey lines) with conditional residuals for experimental units (A, points). Correlation of BLUPs (Best Linear Unbiased Predictors) of random intercept and slope parameters of experimental units with marginal density distributions (B). Histograms of marginal (i.e. fixed effect) intercept (C) and slope (D) parameters estimated from 1000 bootstraps. Colors in A and B differentiate experimental units.

Or, if we want to get really fancy with it, we could inset everything into one plot panel.

BS_ci_lines +
  coord_cartesian(ylim = c(-30, 100)) +
  annotation_custom(ggplotGrob(BS_hists),
                    xmin = 5,
                    xmax = 10,
                    ymin = -30,
                    ymax = 5
  ) +
  annotation_custom(ggplotGrob(pinsert_lmer2),
                    xmin = 0,
                    xmax = 4,
                    ymin = 50,
                    ymax = 110
  )
Fractured Aviary https://www.azandisresearch.com/2022/07/29/fractured-aviary/ Fri, 29 Jul 2022 10:51:25 +0000 https://www.azandisresearch.com/?p=2087

 

During the pandemic, my partner, Bayla, and I began taking daily walks down to Yale’s campus. We often noticed dead birds at the base of the glass walls that wrap the Yale School of Management building when we passed by.

 

 

Because we both have working relationships with the Peabody Museum of Natural History, we began saving the bird specimens for the museum’s collection. Through that partnership, we learned that the School of Management building is one of the most lethal pieces of architecture on Yale Campus. We also met Viveca Morris at the Yale Law Ethics and Animals Program who had been helping to organize city-wide bird-strike data collections and spearheading a push to adopt bird-friendly building ordinances in New Haven.

 

 

One of the main barriers to enacting mitigation measures at the SOM building was the lack of a hard accounting of the total number of birds killed, which allowed the administrators of the building to downplay the problem. So, along with Viveca, we began a systematic survey of bird strikes at SOM. I’ll write more about that in a future post.

We also began thinking about the larger picture. How could we get more folks to recognize the magnitude of deaths due to thoughtless architecture? And how could we inspire folks to demand that businesses, architects, and municipalities adopt bird-friendly design?

 

 

A post shared by Bayla Arietta (@baylaart)

 

Bayla began painting some of the specimens we found. She posted a painting of five warblers we collected on a single day at SOM. The response was huge. That image seemed to have struck a chord. We realized that art could be a way to simultaneously introduce the topic and inspire emotions toward enacting change.

 

 

A post shared by Bayla Arietta (@baylaart)

 

Bayla contacted Talon and Antler galleries in Portland, Oregon, which feature some of our favorite contemporary artists and tend toward natural themes.

They agreed to curate a show with us. Over the next few months, Bayla contacted artists whose work fit the theme. In total, 62 artists contributed original pieces to the show, titled “Fractured Aviary”, which hung for the month of June 2022.

If you missed the show, you can see some of my favorites below:

 

Leaving the Dream https://www.azandisresearch.com/2022/06/23/leaving-the-dream/ Thu, 23 Jun 2022 20:37:40 +0000 https://www.azandisresearch.com/?p=2073 I have never not wanted to be a scientist.

As a kid, I would copy down—by hand—the entirety of encyclopedia entries about animals. In fifth grade, I was regularly running natural history experiments (see photo). By senior year of high school, I was determined to earn “PhDs in Biology and Ecology” (see embarrassing photo).

Some historical artifacts of my life-long love for science, uncovered in my mom’s garage. LEFT: Apparently high school Andis wanted to get not one but double doctorates in ecology and biology. RIGHT: Some early practice in natural history observations and empirical data keeping. (Also please enjoy the only photo of me you will ever see without a beard).

That goal might sound ambitious, but not outlandish to most folks, especially those currently in academia. But where I come from, it was in league with aspirations to become an NFL footballer, a rockstar, or lottery jackpot winner—life outcomes that, while theoretically possible, were implausible to the point of fantasy.

I come from a low-income family. Neither of my parents went to college. So, a college education, let alone graduate school, was not an expectation. Much the opposite. The fact that I went to college at all was a universal alignment of serendipity. If it hadn’t been for some friends explaining the process, a recycled plastic frisbee, and my dad’s meager life insurance after his overdose in my junior year, I certainly would not have gone to college.

Nevertheless, I managed to make that dream of a doctorate—from Yale, no less—come true. I even had my dream post-doc lined up with an amazing set of mentors and collaborators.

 

Then I decided to walk away.

That decision was the hardest, but simultaneously, the most obvious decision point of my entire life. Judging from the whinging about a lack of post-docs on science Twitter, it seems like others are taking a long hard look at the prospects of life in the academy and deciding to opt out as well.

For others at the decision point, here are some things to consider:

Hands-down, academics show the greatest disjunct between an overly inflated concept of their own worth and the horribly low monetary valuation of that worth they are willing to accept. The average doctorate stipend in the US is somewhere between $15k and $30k. That’s less than California’s minimum wage, going to lots of folks who already have Master’s degrees.

The vast majority of academics receive significant support from their parents during their graduate degrees. From Morgan et al. (2021).

This paltry payout is enabled because most academics are heavily subsidized by their parents throughout grad school and afterward.  Morgan et al. (2021) showed that among academics, those with more educated parents received more support and encouragement. This translates to significant wealth gaps between first-gen grads and their peers. A Pew study from last year showed that even among college graduates, first-gen households had lower income (-27%) and much less wealth (-38%) than those whose parents were also college grads. Much of this effect is driven by the fact that kids with wealthy parents incur less or no debt for their education, setting them up for positive wealth–instead of negative wealth–going into graduate school and beyond. In addition to debt avoidance, wealthy parents also confer direct cash subsidies like down-payments for houses and inheritance. All of this means that folks with wealthy parents can accept lower wages.

Pew study bar graphs showing that first-gen college grads have significantly lower household income and wealth.
The impact of familial wealth subsidies is not alleviated by getting a degree. Wealth and income gaps persist for first-gen students.

And it turns out that folks who don’t really have to worry about the monetary benefits of a job are more likely to gravitate to jobs with lower than expected pay, but greater non-monetary benefits like prestige or job security.

When your parents provide a greater portion of your adult income, you have more latitude to seek occupations with higher intrinsic quality as opposed to monetary compensation. Post-secondary education is ranked first among occupations with high intrinsic value. Boar and Lashkari, 2022.

Academia is the perfect trifecta of high prestige, high security (with tenure), and low pay. In fact, post-secondary education ranked first in Boar and Lashkari’s (2022) assessment of careers’ intrinsic quality. Folks who receive less than about $50K from their parents (i.e. roughly the amount most academics’ parents give them for a down-payment on a house or to pay for college) are more likely to choose jobs with negative intrinsic quality.

This is compounded by another reason that poor kids are so uncommon in the academy: your parents’ income is the single largest predictor of your early college attainment, far above any other demographic variable (Chetty et al., 2018).

Parental income is the greatest predictor of college attendance (here, attendance means enrollment in at least a two-year or longer degree), far above any other demographic variable. From Chetty et al. (2018).

The fact that most folks with PhDs come from wealthy parents with graduate degrees creates a vicious feedback cycle that drives down salaries from graduate student stipends through faculty salaries.

Figures 3 and 9 from Stansbury and Schultz (2022) show that the percent of academics whose parents do not have degrees has been steadily declining while the share of academics whose parents have graduate degrees is increasing. Figure 9, specifically, shows that this is not simply the effect of more folks having college degrees in general: academics are about 4-6 times more likely than the average person in the US to have a parent with a graduate degree.

Those low doctoral salaries establish an abysmally low first tier in the academic salary ladder. The average salary for a post-doc in the US is $47.5k. (It can be even less appealing internationally; I was offered a European post-doc that would have been < $35K after the exchange rate and taxes.)

The NSF post-doc salary I turned down would have been $56k, the amount suggested by NSF. To someone who just spent 5 years working 60 hours a week for $30k, a raise to $56k seems enormous. But it is a paltry salary for someone with a PhD. Consider that the average salary for a professional clown in the US is nearly $50k, and let that subtle irony sink in.

Folks are beginning to notice the grass on the other side of the fence and realize that it is, in fact, greener.

The story outside of academia is much different. Doctorate degrees are actually worth something. And you don’t even have to sell your soul to industry. I got offers for conservation NGO positions at twice my post-doc salary, and I interviewed with environmental funding organizations hiring at salaries three to five times my post-doc. I had a career in non-profit leadership before graduate school, so I was probably seeing the high end of what the environmental NGO field had to offer. But conservation work is low-paying in general. Other fields have a much higher ceiling.

In almost every field, PhDs can make far more outside of academia. This is especially true for folks with biology, math or computer science degrees. These data come from the National Science Foundation’s “Survey of Earned Doctorates”.

Right now, the field of data science is booming. Given that even the most field-oriented biologist likely spends most of their days staring at an R terminal doing statistics, every biologist is, de facto, an experienced and competitive data scientist. Even in the data science field, you don’t have to trade your morals for your salary. NGOs are also hiring PhDs for data science roles. In fact, the position I ultimately abandoned academia for is with an education non-profit.

I think a lot of academics really want to help make the world a better place through their research. But, the fact that you can make double or triple the salary while doing far more immediately impactful good leaves almost nothing left on the scale in favor of academia.

In my role: I work remotely, have outstanding work-life balance, and a clear promotional track. Compare that to a post-doc where I’d be trying to wrap up unfinished papers from my doctorate on top of a heavy workload, in a temporary position, where the next career step involves competing with over 300 of my peers for a position half-way across the country that pays only slightly more than the people paid to watch Netflix all day.

One of the few benefits that academia can uniquely offer is the promise of tenure. Setting aside the fact that chasing tenure is simply prolonging one’s time chasing a carrot on a stick, tenured positions are dwindling every year. And the chance of transitioning from a post-doc into a tenure track position is abysmal and getting worse. Only about 10% of post-docs end up in tenured roles.

Only about 10% of post-docs in the biological sciences transition into a tenure track role. Cheng, 2021.

One overlooked downside of academia is that in addition to poor pay, the pay-off is delayed. Sure, in those rare cases you might end up with a six-figure, tenured professorship, but that reward is deferred well past the most important years of capital growth.

For instance, doctoral students and postdoc salaries make it basically impossible to save any money (especially for folks with student loans). As an example, I was making $40k in a full time NGO job right out of undergrad. From that income band, getting paid $33k while getting a PhD seemed like a good bet, given the eventual salary advantage of a degree. However, that calculus neglects to consider the life-timing of savings. The advantage of an extra $7k per year in your early 20s has disproportionate outcomes compared to the same amount later in life.

If your academic pay keeps you from maxing out your IRA at $6k/year for 7 years in your mid-20s, that $42k loss compounds to a net loss of roughly $484k by the time you retire.

Consider this: the maximum allowable annual contribution to a Roth IRA is $6k. If the paltry pay of grad stipends and postdocs prevents you from contributing to a Roth IRA for 7 years after undergrad, that mere $42k in lost contributions compounds to a loss of nearly half a million dollars by retirement ($484k). (Do the calculations yourself: NerdWallet calculator)
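The compounding behind that number is easy to sketch yourself. The ages and rate of return below are assumptions, not exact inputs; adjust them to your own situation:

```r
# Each missed $6k contribution (ages 23-29) compounds until age 65.
# A ~6.5% average annual return is assumed here; change `rate` to taste.
rate <- 0.065
missed_ages <- 23:29
sum(6000 * (1 + rate)^(65 - missed_ages))  # roughly $490k with these inputs
```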

And that’s just the cost of deferred retirement savings. The other engine of wealth accumulation in the US is home equity. Did you know: graduate stipends are not considered eligible income when applying for home loans. And don’t expect to wait until a post-doc, either. You need to be in a position for at least two years for it to be eligible income.*

(*There is a caveat: if you have a signed agreement of a continued stipend for at least two years, your stipend could be eligible. But that means you’d have to buy a house at the beginning of your PhD, when most folks can’t afford a down payment. Unless, that is, you have wealthy parents to float you the down payment, or parents who will gift you $300k in cash to buy your house, like one of my colleagues at Yale.)

Ultimately, I’m not saying that a career in academia is a poor life-decision. I’m just saying that a career in academia is a poor life decision if you’re from a poor family.

Folks with external subsidies (even minor ones) have the liberty to decide to follow the dream of academia in a way that those of us who have to generate our own income cannot. Academia runs on those external subsidies. If your family can’t float you during your unpaid summer internships, or loan you cash to pay for the conference that you may or may not get reimbursed for 8 months later, or cover the down payment on your eventual house, … etc., you are going to end up way behind in life.

Mentors should feel ethically compelled to lay out the Sisyphean asymmetry of the academic career path to mentees from low-income backgrounds. If that makes you uncomfortable, the answer is to fix the system, not to mislead mentees with your unexamined ‘luxury beliefs.’

Unfortunately, this reality remains unseen by those currently in the academy. Academics love to worry about inequality, but because most of them are from upper-middle class to rich families, they manage to overlook the enormous impact of wealth inequality in academia. (Hell, most academic institutions actively avoid even collecting the data that would illuminate this reality).

In the end, overlooking the consequence of wealth subsidies leads mentors to encourage any student who shows an interest to pursue an academic career, because they confuse what they wish to be true ("The academy is open to all") with what is, in fact, true ("Academia is a terribly unwise career for folks from poor families").

I contend that mentors should feel ethically compelled to lay out the Sisyphean asymmetry to mentees from low-income backgrounds. If that makes you uncomfortable, the answer is to fix the system, not to mislead mentees with unexamined ‘luxury beliefs.’

Until that system gets fixed, more of us trailer home alumni will keep unhappily walking away from our dream.

[Updated 2022 Sept. 09 with some recent, relevant research papers]

[Updated 2022 Dec. 10 with information from this Pew report]

References:
Boar, C., and Lashkari, D. (2022). Occupational Choice and the Intergenerational Mobility of Welfare. Available at: https://www.nber.org/system/files/working_papers/w29381/w29381.pdf.
Cheng, S. D. (2021). What’s Another Year? The Lengthening Training and Career Paths of Scientists. in (Harvard University Department of Economics). Available at: https://conference.nber.org/conf_papers/f159298.pdf.
Chetty, R., Hendren, N., Jones, M. R., and Porter, S. R. (2018). Race and Economic Opportunity in the United States: An Intergenerational Perspective. doi: 10.3386/w24441.
Morgan, A., Clauset, A., Larremore, D., LaBerge, N., and Galesic, M. (2021). Socioeconomic Roots of Academic Faculty. doi: 10.31235/osf.io/6wjxc.
Schultz, R., Stansbury, A., Albright, A., Bleemer, Z., Cheng, S., Fernández, R., et al. (2022). 22-4 socioeconomic diversity of economics PhDs. Available at: https://www.piie.com/sites/default/files/documents/wp22-4.pdf.
]]>
2073
Chasing Arctic Frogs https://www.azandisresearch.com/2021/08/17/chasing-arctic-frogs/ Tue, 17 Aug 2021 19:13:54 +0000 http://www.azandisresearch.com/?p=1905 A short recipe for adventurous field science

Take me to the photos!

Step 1: Come up with a harebrained scheme.

My labmate Yara and I had been dreaming up the idea of studying wood frog genomes from across the species' range since she started her PhD. Wood frogs have the largest range of any North American amphibian. They also happen to be the only North American amphibian that can survive north of the Arctic Circle.

Our 200 mile route (in orange) from the headwaters of the Ambler River in Gates of the Arctic National Park, down the Kobuk River through Kobuk Valley National Park Wilderness, and out to the village of Noorvik where the Kobuk meets the Arctic Ocean.

Dr. Julie Lee-Yaw had done a similar study back in 2008. She embarked on a road trip from Quebec all the way up to Alaska to collect wood frog tissue. So, our first step was to ask Dr. Lee-Yaw if she would collaborate and share her samples.

Those samples gave us a solid backbone across the wood frog range, but we were missing populations in expansive regions north and west of the road systems. We worked with the Peabody Museum to search for tissue samples already housed in natural history collections around the world. We filled a few gaps, but huge portions of the range were still missing.

 

We knew that there must be samples out there sitting in freezers and lab rooms that were not catalogued in museum databases. So, our next step was to begin sleuthing. We looked up author lists from papers and cold-called leads. I even reached out to friends on Facebook (…which actually turned out to be a big success. The aunt of a friend from undergrad happens to do herpetology research in Galena, Alaska and was able to collect fresh samples for us this year!). This effort greatly expanded our sample coverage with new connections (and friends) from Inuvik and Norman Wells in the Northwest Territories, Churchill on the Hudson Bay, and the Stikine River Delta in Southeast Alaska.

But as the points accumulated on the map, we noticed some glaring holes in our coverage. Most importantly, we had no samples from Northwestern Alaska. Populations in this region are the most distant from the ancestral origin of all wood frogs in the southern Great Lakes. If we wanted a truly “range-wide” representation of wood frog samples, we needed tissue from that blank spot on the map!

Step 2: Convince your advisor and funders it’s a good idea.

This might be the hardest step. In our case, Yara and I were lucky that our advisor, Dave, was immediately supportive of the project. After we made the case for the importance of these samples, funders came around to the idea as well.

Step 3: Make a plan …then remake it …then make a new plan yet again.

Once we knew where we required samples from, we needed to figure out how to get there. Alaska in general is remote, but northwestern Alaska is REALLY remote. The road system doesn’t stretch farther than the middle of the state. All of the communities–mainly small villages–are only accessible by plane, and most of them only have runways for tiny prop planes. Travelling out from the villages into the bush is another layer of difficulty. Most people here either travel by boat on the river or by snowmachine during the winter. Traveling on land, over the soggy and brush-choked permafrost, is brutal and most locals only do it when necessary, if at all.

Prior to academia, I made a career of organizing expeditions to the most remote places in the rugged southeastern archipelago of Alaska. Despite my background, the logistics of the Arctic were inscrutable even to me. Fortunately, I had a couple of friends, Nick Jans and Seth Kantner, who know the area well. In fact, Seth grew up in a cabin out on the Kobuk. (Seth and Nick are both talented authors. I suggest checking out Ordinary Wolves by Seth and The Last Light Breaking by Nick.) With their help, I was able to piece together the skeleton of a trip.

After many logistical iterations, Yara and I decided to follow in the footsteps of local hunters who, for generations, have used the rivers as conduits into the heart of the wilderness. Our plan was to travel down one of the major arterial rivers and hike inland to search for frogs as we went.

Our original itinerary was to raft the 100-mile section of the Kobuk River from just north of Ambler village to the village of Kiana. But at the last minute (literally), our plans changed. As we were loading up the plane, the pilot told us that he couldn't fly into our planned starting point. Instead, he suggested that we fly into a gravel bar 30 miles upriver in Gates of the Arctic. Those "30 miles" turned out to be AIR MILES. Following the river, they added over 60 miles to our trip.

 

We packed two inflatable oar rafts, almost 150 pounds of food, and another 300 pounds of camping, rescue, and science gear into the balloon-wheeled plane. For the next two weeks, we rowed down the swift Ambler River from the headwaters to the confluence with the Kobuk. Then, we rowed down the massively wide and meandering Kobuk River, eventually extending our trip by an additional 30 miles, bypassing Kiana and continuing to Noorvik, the last village on the river.

Step 4: Recruit a crew.

Yara, despite being the world's first and only Saudi Arabian Arctic ecologist, had limited camping experience, but I knew she would be a stellar field partner. Still, I never like traveling in brown bear country with fewer than four people. Plus, expedition research involves too many daily chores for the two of us to manage alone. So, we recruited a team.

Sam Jordan is a dryland ecologist, but he had been willing to help me with my dissertation fieldwork in wetlands before, so I knew he would be willing to defect for a good adventure. Sam is also an exceptional whitewater paddler and all-around outdoor guru. Plus, he's just a great guy (when he leaves his banjo at home). He and I spent two weeks floating the Grand Canyon in the dead of winter, and there are few people I would rather have along on a remote river trip.

Kaylyn Messer and I guided sea kayak expeditions in Southeast Alaska back in our youth. I am a bit particular about how I manage my camp system (read: “extremely picky and fastidious to a fault”) on big trips. Kaylyn is one of the few people as scrupulous as me, but she’s also a super amenable Midwesterner at heart. I knew she’d be a huge help out in the field.

We fell into an effective rhythm on the trip. Each morning we woke, made breakfast, broke camp, packed the boats, and launched early in the day. While one person on each boat rowed, the other checked the maps for frog surveying spots, fished, or photographed. We stopped along the way to bushwhack back into wetlands we'd identified from satellite images. We typically arrived at camp late. Yara and I would set up one tent to process the specimens from the day while Sam and Kay made camp and cooked dinner. One of the hidden disadvantages of 24-hour Arctic sunlight is that it is easy to overwork. Most nights we only managed to get the samples finished, dinner cleaned up, and camp bearproofed with enough time to crawl into tents with just eight hours until beginning again the next day.

Step 5: Do the science.

Doing science in the field is difficult. Tedious dissections seem impossible while baking in the omnipresent sun and being alternately hounded by hundreds of mosquitoes or blasted by windblown sand. Trading lab coats for rain jackets and benchtops for sleeping pads covered in trashbags compounds the trouble. Not to mention, keeping tissues safe and cool. Organization and adaptability go a long way.

On remote, self-supported trips, it is inevitable that equipment fails or is lost. On one of the first days, we discovered that our formalin jar was leaking—and formalin is not something you want sloshing around! We cleaned the boats and found a creative solution to replace the offending container: a 750ml Jack Daniel’s bottle!

Planning ahead and engineering backup plans also helps. One of our main struggles was figuring out how to preserve specimens and get them home. It is illegal to ship alcohol by mail and you can't fly with the high-proof alcohol needed for genetic samples. You can ship formalin, but it is difficult to fly with. To make matters worse, we were flying in and out of "dry" or "damp" villages where alcohol is strictly regulated or forbidden. Also, we happened to be flying out on a Sunday, making it impossible to mail samples home. The solution we arrived at was to ship RNAlater and formaldehyde to our hotel room ahead of time. Tissue would remain stable in RNAlater for a couple of weeks, and we could make formalin to fix the specimens. After fixing, we cycled the specimens through water to leach out the formalin. This made it possible for me to fly with all of the tissue tubes and damp specimens in my carry-on. Other than a few concerned looks from the TSA folks, all of the samples made it back without issue!

Step 6: Enjoy the adventure.

Despite the hard work, there was a lot to appreciate about the Arctic. We witnessed major changes in ecology as we travelled from the steep headwater streams in the mountains to the gigantic Kobuk. Every day was an entirely new scene.

 

Step 7: Forget the hardships.

Looking back, it is really easy to forget the sweltering heat, swarms of mosquitoes, inescapable sun, and freak lightning storms. And, it’s probably better to forget those anyway!

 

]]>
1905
Pulitzer Challenge https://www.azandisresearch.com/2020/12/26/pulitzer-challenge/ Sat, 26 Dec 2020 22:31:43 +0000 http://www.azandisresearch.com/?p=1849
In addition to reading all of the Pulitzer novels, I also challenged myself to find as many of the books as possible at used book stores. Hunting became an added element of fun!

I’ve always been enamored with books, but when I started my PhD, I was worried that I would fall out of the habit of reading for fun. So, I set a goal for myself: read all of the Pulitzer Prize winning fiction novels before I finished my degree.

The Prize for the Novel was one of the original Pulitzers. The first was awarded in 1918 and the competition has run annually since, although the category name was changed from Novel to Fiction in 1948. In ten years, no prize was awarded (most recently in 2012; in 1977, the committee famously passed over Norman Maclean's A River Runs Through It).

So, in total, there are 93 winning novels as of 2020. A few (8) I had read prior to setting my personal challenge. In the end, it took me 5 years to read all 36,518 pages. Some of my favorites were Ironweed, A Confederacy of Dunces, A Bell for Adano, Middlesex, Arrowsmith, All the Light We Cannot See, So Big, and Laughing Boy. I've included my rating for all of the novels below.

I really enjoyed making reading-for-fun an objective challenge. Surprisingly, it not only helped me maintain my reading habit but increased my consumption. Between each Pulitzer, I generally read either a non-fiction book or a sci-fi/fantasy novel, which I also track. In the end, I managed to read an average of slightly over 3 books a month. Given the constant reading required for my degree, there is no way I would have read even a fraction of those books without adding time for reading to my weekly to-do list.

With this challenge over, I am ready to start another. To be honest, I was a bit disappointed in the Pulitzers for including a fair portion of really boring books (I will NEVER read another John Updike novel in my life). So, for my next challenge, I’ve decided to aggregate rankings from as many “Books to Read Before You Die” lists as possible and read the top 100. I’ll publish a follow up post once I generate that list.

My ratings were slightly on the high side and did not seem to correlate much with either how old the title was or the number of pages.

Ratings

Books are arranged by award year within ranks. Here’s the thought process behind my ratings:

5 = I would re-read this book

4 = I would recommend this book

3 = I’m glad I read this book

2 = I would recommend against reading this book

1 = I regret the time I spent reading this book

5

So Big – Ferber (1925)

Arrowsmith – Lewis (1926)

Laughing Boy – La Farge (1930)

The Good Earth – Buck (1932)

The Grapes of Wrath – Steinbeck (1940)

A Bell for Adano – Hersey (1945)

The Old Man and the Sea – Hemingway (1953)

To Kill a Mockingbird – Lee (1961)

Angle of Repose – Stegner (1972)

A Confederacy of Dunces – Toole (1981)

The Color Purple – Walker  (1983)

Ironweed – Kennedy (1984)

A Good Scent from a Strange Mountain – Butler (1993)

The Amazing Adventures of Kavalier & Clay – Chabon (2001)

Middlesex – Eugenides (2003)

The Road – McCarthy (2007)

All the Light We Cannot See – Doerr (2015)

Empire Falls – Russo (2002)

4

The Bridge of San Luis Rey – Wilder (1928)

Gone with the Wind – Mitchell (1937)

The Yearling – Rawlings (1939)

Dragon’s Teeth – Sinclair (1943)

Tales of the South Pacific – Michener (1948)

The Way West – Guthrie (1950)

The Caine Mutiny – Wouk (1952)

A Death in the Family – Agee (1958)

Advise and Consent – Drury (1960)

The Edge of Sadness – O’Connor (1962)

The Reivers – Faulkner (1963)

The Fixer – Malamud (1967)

The Stories of John Cheever – Cheever (1979)

Foreign Affairs – Lurie (1985)

The Shipping News – Proulx (1994)

Martin Dressler: The Tale of an American Dreamer – Millhauser (1997)

The Brief Wondrous Life of Oscar Wao – Diaz (2008)

The Overstory – Powers (2019)

The Nickel Boys – Whitehead (2020)

His Family – Poole (1918)

All the King’s Men – Warren (1947)

The Town – Richter (1951)

The Collected Stories of Katherine Anne Porter – Porter (1966)

The Killer Angels – Shaara (1975)

A Thousand Acres – Smiley (1992)

A Visit from the Goon Squad – Egan (2011)

The Goldfinch – Tartt (2014)

Underground Railroad – Whitehead (2017)

3

Alice Adams – Tarkington (1922)

The Able McLaughlins – Wilson (1924)

Early Autumn – Bromfield (1927)

Scarlet Sister Mary – Peterkin (1929)

Years of Grace – Barnes (1931)

The Store – Stribling (1933)

Lamb in His Bosom – Miller (1934)

The Late George Apley – Marquand (1938)

Journey in the Dark – Flavin (1944)

Andersonville – Kantor (1956)

The Travels of Jaimie McPheeters – Taylor (1959)

The Keepers of the House – Grau (1965)

The Confessions of Nat Turner – Styron (1968)

The Collected Stories of Jean Stafford – Stafford (1970)

Elbow Room – McPherson (1978)

The Executioner’s Song – Mailer (1980)

Lonesome Dove – McMurtry (1986)

The Hours – Cunningham (1999)

The Known World – Jones (2004)

Gilead – Robinson (2005)

March – Brooks (2006)

Tinkers – Harding (2010)

Less – Greer (2018)

Now in November – Johnson (1935)

Beloved – Morrison (1988)

Breathing Lessons – Tyler (1989)

Independence Day – Ford (1996)

The Orphan Master’s Son – Johnson (2013)

2

The Magnificent Ambersons – Tarkington (1919)

The Age of Innocence – Wharton (1921)

One of Ours – Cather (1923)

Honey in the Horn – Davis (1936)

In this Our Life – Glasgow (1942)

Guard of Honor – Cozzens (1949)

A Fable – Faulkner (1955)

The Optimist’s Daughter – Welty (1973)

The Mambo Kings Play Songs of Love – Hijuelos (1990)

Rabbit at Rest – Updike (1991)

The Stone Diaries – Shields (1995)

American Pastoral – Roth (1998)

The Sympathizer – Nguyen (2016)

1

House Made of Dawn – Momaday (1969)

Humboldt’s Gift – Bellow (1976)

Rabbit is Rich – Updike (1982)

A Summons to Memphis – Taylor (1987)

Interpreter of Maladies – Lahiri (2000)

Olive Kitteridge – Strout (2009)

 

Django sat with me as I finished the last book of the challenge, the day after Xmas, 2020.
]]>
1849
The Anatomy of Data Viz https://www.azandisresearch.com/2020/10/07/the-anatomy-of-data-viz/ Thu, 08 Oct 2020 00:50:06 +0000 http://www.azandisresearch.com/?p=1744 When I first started in communications, data viz was hard. You basically had to have a serious knowledge of Adobe Illustrator and Photoshop. At that time, “New Media” was just coming into vogue. We don’t even use that term anymore. Now, all media is new media.

Today it is trivial to make really sexy graphics in a few clicks and keystrokes. But the ease of creation also makes it much easier to produce poorly planned or spurious outputs. It also means that the marketplace of people’s attention is now flooded with loads of other eye-catching data visualizations to compete with.

Now, more than ever, it is important to think strategically about how to present your work. This blog grew out of a guest lecture I gave. It is intended to present some conceptual tools to help you make your data stand out.

To data viz or not to data viz?

Making a stellar data visualization takes time and effort. Even a simple plot for a scientific paper can take a while to get to the final print-ready stage. So, for starters, it is worth considering just how much time your particular data viz project is worth.

It is really easy to go down a rabbit hole making a beautiful visualization, or even an entire data storytelling project, only to have it sit on your computer or collect digital dust in some dark corner of your blog. My most adamant piece of communications advice is that you should spend just as much time planning how to share your work as you do creating it. And after you've got your product, you should again spend the same amount of time actually making sure that people see it. (This rule applies less to journal figures, since the outlet is predetermined, but you should still plan to spend as much time sharing your hot-off-the-press manuscript, with its stunning figures, after it comes out.)

Data visualization or data storytelling?

When people think about great data visualizations, they often think about the flashy and interactive products like those from the Washington Post or New York Times. I also love these interactive visuals, but to me, they are something more than data viz–they are data storytelling. Rather than simply displaying data, data storytelling integrates data as part of a larger narrative. Good data storytelling involves skills that overlap with data viz, but adds much more. For instance, my friend Collin's Story Map of his research on lizards evolving in response to hurricanes is a great example. We learn all about his research and how he produced his data, but very little about the data itself.

One of my favorite data visualizations is this citation network of all Nature publications from the past 150 years. Every point is a paper and every line is a citation. It is easy to see how fields split and merge over time. Click the image to see the interactive visual at Nature’s website.

In this article, I want to focus narrowly on data viz and how we interpret statistics visually. There are loads of plot forms that you can use, and folks are always coming up with new ways to use them, so rather than create an exhaustive list, I want to consider when and how we use data visualizations.

 One quick caveat here: data viz implies that the only way to interpret data is with sight. But there are some really cool projects that display data without visuals, like my friend Lauren, who translated Alaskan tree loss through sound.

Grabbing your attention or focusing your attention?

One of the first questions to ask yourself in defining the purpose of your visual is: am I trying to grab folks' attention or do I want to focus their attention? Human brains are not all that well designed for sustained attention (I go in depth about this in my presentation about scientific presentations), so most of our task as science communicators is simply managing people's attention spans. Flashy and interactive visuals are great for catching your audience's eyes, but they can distract from carefully interrogating specific trends in the data because there is too much to focus on. On the flip side, an equally beautiful but more subdued plot can perfectly highlight a specific point you want to make about your data, but folks might flip or scroll right past it if they are not actively interested. Considering your audience is paramount. For example, in a paper I may include lots of information in a plot, but when I present my work at conferences, I completely strip my figures down to their most basic elements.

One of the reasons we have short attention spans is that our brains have evolved to process lots of information quickly. As a tradeoff, our brains take cognitive shortcuts. If we are clever, we can use visualizations to hack our brains and leverage those shortcuts. As an example, take a look at the two images below. Can you tell which image of stars is randomly placed? Can you tell which set of numbers is random?

Can you tell which set of stars or which set of numbers was randomly generated? The star example is by Richard Muller and the numbers are by Paul May.

Human brains are overly tuned to seek patterns. Often, we see patterns when none are there (maybe this is where the human predilection for superstition, conspiracy theories, and religion comes from). Most people think that the blue stars (B) and number string A are the random sets. That is because we tend to see too much pattern and clustering in the black stars and too many patterns of repeats in number string B. When we see patterns, we assign meaning. In fact, the black stars are randomly placed (the blue stars are overly uniform) and number string A is randomly generated.
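The clumpiness of true randomness is easy to test for yourself. The sketch below (plain Python with simulated coordinates, not Muller's original stars) compares the spacing between neighboring points for a truly random scatter versus an evenly spaced, lightly jittered grid: the random set's spacing varies far more, which is exactly the "clustering" our brains flag as a pattern.

```python
import math
import random

random.seed(42)

def nn_distances(points):
    """Distance from each point to its nearest neighbour."""
    return [min(math.dist(p, q) for j, q in enumerate(points) if j != i)
            for i, p in enumerate(points)]

def cv(xs):
    """Coefficient of variation: spread of spacing relative to its mean."""
    mean = sum(xs) / len(xs)
    sd = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return sd / mean

# Truly random "stars": uniform draws naturally clump and leave voids.
random_pts = [(random.random(), random.random()) for _ in range(100)]

# "Too uniform" stars: a 10x10 grid with slight jitter -- what our
# pattern-seeking intuition expects randomness to look like.
grid_pts = [((i + 0.5 + random.uniform(-0.1, 0.1)) / 10,
             (j + 0.5 + random.uniform(-0.1, 0.1)) / 10)
            for i in range(10) for j in range(10)]

print(f"random scatter: CV of spacing = {cv(nn_distances(random_pts)):.2f}")
print(f"jittered grid:  CV of spacing = {cv(nn_distances(grid_pts)):.2f}")
```

The random scatter comes out with a much larger coefficient of variation than the grid, which is why the genuinely random panel "looks" clustered.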

This is convenient for data viz, because it makes it easy for us to see trends in complicated data. For example, when Nature plotted all of its published papers over the last 150 years and then linked them by citations, the result was incredibly complicated. But our minds tune out most of the noise and instead focus on the major groups where fields merge.

On the flip-side, our minds are quick to spot deviations from patterns, too. For instance, when Campbell et al. plot coding density versus genome size, it is easy to spot the clade of endosymbionts (in green) that deviate from the trend.

 

Figure from Campbell et al. 2014 shows how our mind’s natural pattern seeking also makes it easy for us to spot deviations from trends.

Our brains are also really bad at conceptualizing large numbers. For instance, if I told you that the human genome packs about 3.2 billion letters of information into every cell of your body, while E. coli has just 5 million and the Paris japonica flower has almost 150 billion, the scale might be hard to grasp. But if I compare your genome to the letters in an encyclopedia and visualize the difference, the disparity is clear.

Encyclopedia Genomica. If each letter in the encyclopedia represented one letter of DNA sequence, you could write out the entire genetic code for E. coli in half a volume. A human would take about 10 sets and a Paris japonica flower would need about 495 sets. (I made this visual, but I got the idea from a talk by David Weisrock).
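The arithmetic behind that encyclopedia comparison is worth a quick sketch. In the Python below, the genome sizes come from the text, but LETTERS_PER_SET is an assumed round number (chosen so the human genome lands near 10 sets); the figure itself was scaled from a real encyclopedia's letter count, so its exact totals differ slightly.

```python
# Back-of-the-envelope version of the "Encyclopedia Genomica" comparison.
# Genome sizes (in DNA letters) are the figures quoted in the text.
genome_letters = {
    "E. coli": 5_000_000,
    "human": 3_200_000_000,
    "Paris japonica": 150_000_000_000,
}

# ASSUMPTION for illustration: one encyclopedia set holds ~320 million
# letters, picked so the human genome works out to about 10 sets.
LETTERS_PER_SET = 320_000_000

for name, letters in genome_letters.items():
    print(f"{name}: {letters / LETTERS_PER_SET:,.2f} encyclopedia sets")
```

Even with a rough letters-per-set figure, the ratios are what matter: Paris japonica needs roughly 47 times the shelf space of a human genome, and E. coli barely fills a fraction of one volume.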

Making visuals that strategically hack our brains.

When it comes to visuals, I don’t like prescribing rules. Aesthetics change too quickly. Instead, I think it is more helpful to be strategic about the content of your visuals and treat the aesthetic refinement as an artistic process. 

Scott Berinato’s book Good Charts comes from the perspective of management rather than science, but is, nonetheless, one of the best examples I’ve found of thinking strategically about making visuals. Berinato thinks that visuals fall on two intersecting gradients: Conceptual versus Data-driven (are you dealing with ideas or statistics?) and Exploratory versus Declarative (are you looking for a pattern or are you showing a pattern?).

Categories of data visualizations from Scott Berinato’s book Good Charts.

1. Everyday data viz

Usually, when we think about data viz, we are thinking about graphics that fall into the upper right quadrant, data-driven declarative graphics, what Berinato calls "Everyday data viz." The purpose of these graphics is to highlight specific facts about our data. Most of the figures from scientific papers fall into this category.

Radial mirrored bar plot from a tutorial I made comparing population density to canopy cover across U.S. states.

Within the “everyday data viz” category, there lies a wide range of visualization goals that depends on the intended audience. For example, I made a mirrored radial barplot comparing population density to tree cover. Wrapping this plot into a radial form makes the data more interesting, but actually makes it more difficult to read. If I were to include these data in a scientific paper, I would probably use a dotplot like the one in the top left of the figure. The dotplot displays the same information in a way that is more conducive to quantitative comparison.

With these types of visuals, there is often a tradeoff between simplicity and aesthetics. Usually, simpler is better for scientific audiences. However, sometimes the whole point of a graphic is to demonstrate complexity or variation in the data. For instance, a simple mixed model regression could be easily displayed as a single trend line.

Not only is this super boring, but it misses one of the points of mixed models, which is how we deal with variation in the data. Below are six examples showing the same trend while highlighting the variation in the data in different ways.

Here are six different ways to display the fit of a mixed effect model that explicitly show variation in the data. Often, we are just as interested in displaying our uncertainty in our data as we are in telling the main story. (I made these plots as part of a tutorial on displaying mixed models that I hope to publish soon.)

On the other hand, when giving scientific presentations, we want to highlight the main trend without distracting the audience with noisy variation. In a prior post, I used the fake example below, where the most important trends (bottom figure) are completely buried in the meaningless distraction of too much information (top figures).

These fictitious plots are from my post about better scientific presentations. Depending on the audience and attention spans, you can include more or less information. But scientists most often include WAY MORE information than is needed in plots.

My main point here is that you must be strategic about who your audience is and exactly what you want them to take away from your visuals. It is unlikely that anyone will think as carefully about your graphic as you have. Instead, most folks will take away a fraction of the information you present. So, it is worth being as parsimonious as possible with the content of your graphics. One tip for presentations is to step away from your computer and squint your eyes: if you can't make out the main trend, you probably should strip the plot down. Another tip is to start with the bare axes and explain them to your audience before showing the content of the plot. This way, they already know what to expect and will not be distracted trying to conceptualize what the graphic is saying.

2. Visual discovery

The graphics in the upper right quadrant of Berinato's diagram are like the perfected Pinterest versions of our visuals. Before we get to that point, we will probably plot a ton of graphs during analysis that no one ever sees. Berinato calls these graphs "visual discovery." They fall in the lower right quadrant of data-driven exploratory plots.

As we explore our raw data, it is useful to hack our own brains to discover hidden patterns in our data. Most data is multidimensional and too complex to see every relationship at once. So, we check for relationships among variables and among subsets of variables. This process is usually iterative. The point isn’t to make perfect, pretty graphics–the point is to wrap our minds around the data.

One of my favorite examples of visual discovery involves one of the oldest examples of data viz. 

John Snow’s 1854 map of cholera cases surrounding a London public well.

In the mid-1800s, cholera was sweeping through London. At the time, few understood how the disease was transmitted. John Snow (no, not that Jon Snow), a medical doctor, decided to plot the cases as bar charts of the number of victims at each address on a street map of the city. The map revealed a public well at the center of the epidemic and helped Snow convince skeptical municipal authorities to close it, effectively ending the outbreak.

Visual discovery is probably what scientists spend 80% of their analysis time on (I certainly do). Plotting programs like RStudio or MATLAB (and, to a lesser extent, Excel) make it really easy to play with lots of ways to see our data and to iterate as we narrow in on interesting trends.

3. Idea illustration

The top left quadrant, conceptual and declarative, Berinato calls "Idea illustration." These are usually heuristics, flow charts, or diagrams with the purpose of visually demonstrating a complex idea in picture form. Scientists use these types of graphics often in review or synthesis papers. For example, I made the figures below for a recent review paper on herp thermal evolution. Neither is based on data. The first demonstrates a theoretical process. The second illustrates what real data might look like and how to interpret them. These types of graphics hack the map-reading tendencies of our brains or prime our natural pattern seeking.

Figure from a recent review paper I published as examples of conceptual diagrams.

4. Idea generation

The lower left quadrant Berinato calls "Idea generation." These are the kinds of figures scientists scribble up on whiteboards when we are thinking through experiments. Rarely do these graphics make it out into the world; rather, they help us think through our own ideas. However, sometimes conceptual, exploratory graphics are useful for thinking through hypotheses. For example, I included the graphic below in my dissertation prospectus as a way to think through how gene flow patterns might look in different populations.

Example of an “idea generation” visual that I made for my dissertation prospectus.

Understanding why and how it makes sense to use graphics can save you loads of time, keep you from making spurious plots, and may even lead you to a new discovery. Fortunately, professional plotting tools (like R and GIMP) are freely available. So get out there and start making something beautiful and useful!

 

 

]]>
How to avoid giving terrible presentations https://www.azandisresearch.com/2019/11/11/how-to-avoid-giving-terrible-presentations/ Mon, 11 Nov 2019 19:42:09 +0000 http://www.azandisresearch.com/?p=1574 Recently, I gave a presentation to a class of Yale master's students about how to give better scientific presentations. This is a topic I think about a lot, coming from a background in non-profit communications.

I’ve replicated all of the slides and script below, but first, here are the bullets if you’re short on time.

To make better presentations, try:

Here’s my full presentation:

What is the point of academic presentations? The obvious answer is “to transmit information about our science.” Realistically, most of us are also motivated by little more than adding a new line to our CV. But an underappreciated purpose of presentations is as a vehicle for networking. Presentations are like calling cards: they signal to your audience whether your work is interesting, and they give audience members a brief window into you as a scientist.

We need to give good presentations because, first, the whole damn point is to transmit our information. Bad presentations are a barrier to that transmission. Second, presentations are like our Tinder profiles for collaborators. A bad presentation indicates that you do not care enough about your work or your peers in the audience to bother making a tolerable presentation.

Some might question the wisdom of signing up to give a presentation about not giving bad presentations. But you should be able to guess from the title of my talk that I think the bar is exceptionally low. More often than not, academics give pretty terrible presentations. But there are a few simple things you can do to give really great presentations.

It basically boils down to just two key elements:

Optimize for attention

Minimize distractions

Let me show you what I mean by Optimizing for Attention.

Almost every science presentation I have ever seen is organized the same way that we organize papers.

We start with some big picture background questions, then we talk about our specific question, then we explain our system, then we detail our methods, then we talk through our results, and if we have enough time at the end, we finally get to our conclusion, which is the main punchline of the talk. That is the chocolate at the center of the tootsie-pop.

The problem with this model is that our attention span diminishes over time. So, we end up wasting all of the optimal attention window laying out the least important information, and we wait until half the audience has zoned out before delivering the punchline.

Lindquist & McLean (2011) showed how attention diminishes over time by surveying folks during 45-minute lectures. They sounded a signal at intervals during the presentations and had respondents indicate whether they were thinking about something unrelated to the lecture, called Thoughts or Images Unrelated to Tasks (TUITs; basically, daydreaming). The frequency of daydreaming increased over time, until toward the end of the presentation only about half of the audience was focused at any given moment.

To optimize for attention spans, one option is to flip your presentation to match the audience’s daydreaming frequency. Give the punchline first. Rather than starting by explaining why your question matters, start with your conclusion and then explain how it is relevant.

Even though this might seem strange, it is actually how we read papers. Rarely do we ever read papers linearly. Most people skip to the last line of the abstract to get the punchline, then maybe read the abstract or figure captions, then the conclusions. Maybe you read the methods or the text of the results section next, but the last thing you read is the background or introduction.

Even if you arrange your presentation to make the most of the audience's early attention, we still don't want to lose most of the audience along the way. We'd like to keep folks' attention as much as possible.

We humans are very distractible animals. We spend 47% of our day distracted by other things, and we are most distracted at the very times when we are trying hardest to concentrate!

This makes maintaining attention even more difficult in learning environments, because we will naturally seek out distraction.

For example, kids in heavily decorated classrooms performed 25–35% worse on tests because of the easy distraction from the environment. Interestingly, even without environmental distraction, kids simply switched to being distracted by each other.

In the case of classrooms, there are likely many other benefits of busy classrooms that outweigh the distractions they cause. But in the short span of a presentation intended for adults, we should be aiming for the starkest, least distracting design.

Just to drive home my point that we humans are overly distractible, look at the next slide and time how long it takes you to find the letter “O”.

Now try it again with the next column.

When researchers repeat these kinds of tests over and again, they find that just adding a simple distraction, like the cartoon, substantially increases your processing time because your mind splits your attention to processing the distracting image.

So, given what I just told you about how easy it is to distract a human, what is wrong with this fictitious but typical academic style slide?

The problem with this type of slide is that there is too much going on. Even the most aggressively wielded laser pointer will not be able to focus the audience’s attention to one element at a time without distraction.

It is also important to remember that even the best story can be ruined by a bad storyteller. Poorly practiced and desultory presentations can be a huge distraction. So, in addition to minimizing visual distractions, remember to…

In the next section, I will go over some pointers and presentation hacks to avoid slides like this one.

Using figures

Academics love to put up complex figures with loads of distracting and unnecessary information on slides. Which kind of makes sense, really. After all, you spent hours collecting each little point of data, so to you, every data point is important. But that’s often not the case. Our goal with figures is to tell a story. It’s not to show off how much work we did, or how complicated our designs are. We want to distill our figures down to the smallest possible story units.

Take a look at this figure from an “experiment” I conducted. For this fictitious experiment, I was interested in how beard color and length correlate with crossword completion speed. I went to 16 towns. In each town, I gave 100 bearded folks a crossword, recorded how many cells they completed per minute, and measured their beard length. In every town, half the folks had red beards and half had brown. I also noted whether they had proper beards or goatees (type).

Think about the minimum units of the story here. What are some ways you might be able to make this figure simpler and less distracting?

Here is my revision. All of those original cells told basically the same story. And this image answers my primary question: “How does beard length and/or color impact crossword speed?” There is no need for the other information in the figure.

Now, here is how I would present this slide. I do it in layers, starting by explaining the axes and what we should expect to see.

Then we layer on one bit of information.

Then the next bit of information.

But there are some times when distilling down all of the information into one figure is difficult or doesn’t answer our question.

In this next fictitious experiment, I wanted to know if dragon size correlated with the number of villagers eaten. I recorded three different species of dragons in three different years (years are measured as time from present).

In this case, there is no obvious story--the relationship changes between years and with different dragon species. Also, our sample sizes are very different, so we need a way of conveying that we are more confident in some relationships than others.

Here is how I decided to tell this data story.

First, I start with the blank axis and explain what they mean.

Then I add the first year of information.

Then, I add the second year of information. But I still want folks to see the prior information, so I use selective highlighting to focus their attention. We can see that in all cases, more villagers were consumed. And the rate of growth increased.

Now we add the third element. And we can tell the whole story. Every year, dragons eat more villagers, but species differ in the level of increase. Also, the relationship changes over time with respect to body size for different species of dragon.

To recap the whole story, I might show just the trend lines and confidence bands.
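This layered, selective-highlighting build is easy to do in R. Here is a minimal base-R sketch, with made-up dragon data standing in for the real figures (the variable names and the linear trend are my own assumptions for illustration):

```r
# Made-up dragon data (hypothetical; assumes villagers eaten grows with size and year)
set.seed(1)
size  <- rep(seq(1, 10, length.out = 30), times = 3)
year  <- rep(1:3, each = 30)
eaten <- size * year + rnorm(90, sd = 2)

# Start with blank axes, then layer the years on: earlier years greyed out,
# the newest year highlighted so the audience's attention lands there
plot(size, eaten, type = "n", xlab = "Dragon body size", ylab = "Villagers eaten")
for (y in 1:2) {
  points(size[year == y], eaten[year == y], col = "grey70")
  abline(lm(eaten[year == y] ~ size[year == y]), col = "grey70")
}
points(size[year == 3], eaten[year == 3], col = "firebrick", pch = 16)
abline(lm(eaten[year == 3] ~ size[year == 3]), col = "firebrick", lwd = 2)
```

In a slide deck, each layer becomes its own slide (or animation step), so the audience sees exactly one new piece of the story at a time.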

It can be really hard to strip down figures, especially if you are afraid that someone might question your data. When I make presentations, I always keep all of my original figures in extra slides at the end of my presentation, after the conclusion slide. I never show those slides in the presentation, but if someone asks a specific question about the data, I can easily flip to the more informative figure.

Other tricks and tips

You can use the same kind of selective color to focus attention with text, too.

I also want to touch on some problems that I see too often and tell you how to avoid them.

Have you ever seen a presentation that looked fine on your computer, but came out looking like this at your conference talk?

This happens when you use a font on your computer to make a presentation that is not installed on the computer that you use to display the presentation. The computer defaults to what it thinks the next closest font should be, and it is always wrong.

The easiest solution is to simply export your slides as JPEG image files and put each one back onto a slide. Essentially, your slide is now a picture of your original slide. That way, it will be displayed exactly as you see it on your computer wherever you display it.

One quick word while we are talking about fonts. Please try to pick simple fonts (like those on the left). You should only use the fonts on the right if you are creating a title page for your 5th grade history report.

Have you seen presentations with figures that look like this?

This happens when you enlarge an image in bitmap format. Essentially, you are trying to display more pixels than exist in the original image. The computer interpolates new pixels by averaging adjacent ones, but the result is fuzzy and pixelated. The solution is to either use vector-based images or bitmaps that are as large as or larger than the display size. If you are using an image from a paper, download the largest image size available. If you can only take a screenshot, be sure to blow the image up as large as possible before capturing the screen.
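If you make your figures in R, the difference comes down to which graphics device you export with. A quick sketch (the file names and plot are placeholders):

```r
# Vector output (PDF device): lines and text scale cleanly at any display size
pdf("figure.pdf", width = 8, height = 6)
plot(1:10, (1:10)^2, type = "b")
dev.off()

# Bitmap output (PNG device): export at high resolution so the image is
# at least as large as the size it will be displayed at
png("figure.png", width = 8, height = 6, units = "in", res = 300)
plot(1:10, (1:10)^2, type = "b")
dev.off()
```

Vector formats (PDF, SVG) store shapes rather than pixels, which is why they never go fuzzy no matter how far you zoom.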

And in conclusion…

Please don’t do this at the end of your presentation. No one needs to know every organization that has ever given you money. And it is great to thank people, but if everyone is special, no one is special. Instead of making your final slide into a Guess Who board, consider alternative ways to show gratitude. For example, if an undergrad was integral to an experiment, pop up their photo on the results slide and thank them then. If an advisor was especially helpful, they’ll appreciate a handwritten thank-you note more than a pixelated mug shot at the end of your presentation.

Keep your conclusion slide simple. Use the final slide to give your audience ways to learn more. For instance, I try to make a blog post about my presentation that folks can use to find out more information, check out my references, or see my original figures.

Good luck and happy presenting!

 

]]>
The REAL problem of unpaid internships is us https://www.azandisresearch.com/2019/06/12/the-real-problem-of-unpaid-internships-is-us/ Thu, 13 Jun 2019 01:54:50 +0000 http://www.azandisresearch.com/?p=1420 A few weeks ago I wrote a post about a questionable internship proposal by the Northeast branch of Partners in Amphibian and Reptile Conservation (NEPARC). In the interim, I’ve had a couple of great conversations about the topic, including hearing from folks at PARC (NEPARC regional and national). With just one exception, these conversations have been super supportive and understanding of the issue. I’ve republished (with permission) a response from PARC’s Executive Committee at the bottom of this post.

I’m really impressed with their position. It’s clear that they’ve already thought about the problem of unpaid internships a lot.

I especially want to highlight that I completely empathize with PARC in that these issues are moving targets, especially from the perspective of large, all-volunteer organizations. That PARC is actively working on fixing the problem is to their credit.

I also want to make a strong point that I failed to fully articulate in my last post: the responsibility shouldn’t fall solely on the organizations to fund internships—all of us who appreciate and benefit from the work of those organizations should feel responsible, too.

The responsibility shouldn’t fall solely on the organizations to fund internships—all of us who appreciate and benefit from the work of those organizations should feel responsible, too.

I feel really fortunate that I landed a paid internship (shout out to Sitka Conservation Society) right out of undergrad. When that internship rolled into a salaried position, I was already inculcated with the stance that if we couldn’t afford an intern, then we couldn’t offer an internship. But I’ve also served on the Board of Directors for a couple of non-profits and have struggled to balance the desire to get work done on a slim budget against the temptation to seek free labor from willing unpaid interns.

The root of the unpaid internship issue is the lack of funding for environmental conservation. Grouped along with animal rights and animal welfare groups, the sector receives the least charitable giving: just 2.8% of the $407B in total philanthropic gifts in 2018. Source: Giving USA Foundation 2018 Report.

The root of the problem is that all of us undervalue the important work of non-profits. If our nonprofits were well funded, this issue would never arise. Unfortunately, environmental organizations receive the least philanthropy of any sector (grouped with animal groups, the sector receives just 2.8% of total charitable contributions annually).

I fear that my first post came across as more of a call-out of NEPARC than a general call-to-action. I’m not a fan of call-out culture, so I hope you will join me in this call-to-action to support PARC. I decided to pony up on my offer to support PARC. I really hope you will make a donation too. They suggest that the best ways to support them are to donate to their non-profit partner, the Amphibian and Reptile Conservancy, or buy some sweet PARC swag.

Or, just take a minute to donate to your favorite environmental non-profit, and feel good that you are helping to end the need for unpaid internships.

 

Here is the response from PARC, in full:

Dear Andis,

I’m writing to you on behalf of PARC‘s Executive Committee with regard to your recent blog post: The problem with unpaid pseudo-internships.

First, I’d like to thank you for sharing your perspective and for highlighting actionable steps that PARC must take to ensure equitable and just practices within our organization. I’d also like to apologize for the delayed response; PARC is an all-volunteer organization and it often takes a few days (if not longer) to gather and address feedback from all of the appropriate entities.

We (the National PARC leadership) agree with your views on the issue of unpaid internships in ecology/conservation. This is an issue we have been working to address for the last year. In fact, we have restructured our internships at the national level of PARC (i.e., within the Executive Committee, which oversees the regional and state chapters) to reflect some of the key points addressed in your blog post. In some cases, we have opted to hire contractors rather than creating internships. In other cases, we have opted to provide hourly compensation to interns and to intentionally model the positions in a way that provides the intern with clear learning- and skill-based objectives and opportunities for professional development. With this updated model, we hope that our interns gain as much value from us as we do from them.

As this is a relatively new approach for us, we have not yet developed guidance on this issue for PARC‘s regional and state chapters (again as a volunteer driven organization, these things can take some time). Our extensive discussions were put in place at the national level but never translated into policies and/or guidelines. This, we believe, is a failure on our part. We hope to remedy this shortcoming by taking the following actions:

1 – We will provide time/space for discussion regarding the points you’ve raised in your blog post on our Joint National Steering Committee (which includes regional co-chairs and external partners) and National Diversity, Equity, and Inclusion task team (DEITT) monthly conference calls. Following these discussions, we will work with NE PARC to ensure that appropriate adjustments are made with regard to the social media position.

2 – We are currently developing a best practices document for engaging members and recruiting leaders at the regional and state levels. We will include a section on internships that will provide guidelines for creating equitable and ethical internships.

3 – We will ask the DEITT to provide feedback on our internship guidelines to ensure they reflect PARC‘s goal of providing an equitable platform for our members, partners, and stakeholders to engage in the conservation of amphibians and reptiles.

Thank you again for taking the time to bring this issue to our attention. We are hopeful that in the future, with the assistance of the DEITT, we can be more proactive in addressing these kinds of issues. If you are interested in joining PARC‘s DEI efforts, please consider reaching out to the DEITT co-chairs Neha Savant & David Muñoz (copied on this email). I’ve attached a document that highlights the team’s recent projects/accomplishments.

Best,

Alex Novarro (on behalf of PARC‘s Executive Committee)

]]>