Structural Equation Modeling in R
A.Z. Andis Arietta (azandisresearch.com), 24 Jul 2024

Introduction

Today we are talking about structural equation models (SEM). There are lots of synonyms, sub-categories, and adjacent techniques that you may have heard before–covariance structure analysis, linear structural relations, path analysis, latent variable modeling, causal modeling, confirmatory factor analysis, exploratory factor analysis, latent growth modeling, mixture/multigroup/hierarchical/multilevel structural modeling, construct analysis, etc., etc…

So many names exist because there are a LOT of things that you can do with these types of models. It turns out that allowing for and testing structure in modeling can help solve lots of problems in research.

I think that the best way to understand SEMs is to start with simple regressions. Let’s consider a regression predicting a rooster crowing from the rising of the sun:

y = b0 + b1x + ε

y ~ Rooster Crowing,
x ~ Sun Rising,
b0 = 0,
ε ~ N(0, 1)

This is a silly regression because we all understand the relationship: the rooster goes cock-a-doodle-do when the sun crests the horizon. But in the language of mathematics, there is no reason I can’t rewrite this equation as:

x = (y − ε) / b1

This formulation makes no sense. Basically, we are saying that the rising of the sun is a function of the rooster crowing! Even though this is perfectly viable mathematically, it defies our common sense of causation.
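We can see this symmetry in code. Here is a toy simulation (my own illustration; the variable names are hypothetical) showing that regressing in either direction yields exactly the same R²:

```r
# Toy simulation (hypothetical variable names): regression fit is
# direction-agnostic -- R^2 is the squared correlation either way.
set.seed(1)
sun     <- runif(100, 0, 10)        # the 'cause'
rooster <- 2 * sun + rnorm(100)     # the 'effect'

r2_forward  <- summary(lm(rooster ~ sun))$r.squared  # effect ~ cause
r2_backward <- summary(lm(sun ~ rooster))$r.squared  # cause ~ effect

all.equal(r2_forward, r2_backward)  # TRUE: both directions fit equally well
```

Both fits explain the same share of variance, so nothing in the math privileges one causal direction over the other.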

The language of structural equation modeling offers a way to impose directional structure on mathematical equations. We usually visualize that language as directed graphical models like the one below.

In a graphical model, observed variables are displayed as boxes. Unobserved or latent variables are displayed as circles. Constants are triangles. A functional relationship between variables is displayed as a directional arrow. Non-directional or double-headed arrows indicate a variance or covariance.

The graphical model above represents the same model as the regression in the equation above. Our independent variable x has a linear relationship with the dependent variable y with slope parameter b1. y has a constant intercept b0 of 0. Finally, the residual variation in y not caused by x is assumed to come from some other unobserved cause with its own variance. Rather than thinking of variables as independent or dependent, we use the terms exogenous and endogenous. Exogenous variables are those (like x) that have no incoming paths. These are the most ‘upstream’ variables in the causal paths. Endogenous variables are those that receive causal paths. We’ll see later that some endogenous variables can also be causes of other endogenous variables, but they are still considered endogenous.

In practice, we generally ignore the peripherals and intercepts in structural models, yielding a simplified graph:

Now that we have a shared language, let’s take a look at a toy example to understand why SEMs can be so useful in research.

Motivating example

You will need a handful of packages for this tutorial:

packages_for_sem_workshop <-
  c(
    'tidyverse', # basic data wrangling
    'tidygraph', # graph visualization
    'ggraph', # graph visualization
    'lavaan', # sem tools
    'piecewiseSEM', # sem tools
    'mgcv', # nonlinear modeling
    'lme4', # random effect modeling
    'cvsem' # cross-validating sems
    )

install_and_load_packages <-
  function(x){
    for( i in x ){
      if( ! require( i , character.only = TRUE ) ){
        install.packages( i , dependencies = TRUE )
        library( i , character.only = TRUE )
      }
    }
  }

install_and_load_packages(packages_for_sem_workshop)

Fitting a simple regression

I find that, for most applied researchers, the language of code is more intuitive than the language of math. So, let’s simulate a toy dataset and see how we can fit the same system as a linear model or an SEM.


simple_ex <-
  data.frame(
    x = runif(n = 100, min = 0, max = 10),
    e = rnorm(n = 100, 0, 1)
  )

simple_ex <- simple_ex %>%
  mutate(
    y = 1 + 2.5*x + e
  )

Now, let’s fit a good old-fashioned linear regression model:

fit_simple_ex_lm <- 
  lm(y ~ x, data = simple_ex)

summary(fit_simple_ex_lm)
> summary(fit_simple_ex_lm)

Call:
lm(formula = y ~ x, data = simple_ex)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.61653 -0.49110 -0.01622  0.51680  2.76976 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.08641    0.17916   6.064 2.49e-08 ***
x            2.52441    0.02871  87.913  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8972 on 98 degrees of freedom
Multiple R-squared:  0.9875,	Adjusted R-squared:  0.9874 
F-statistic:  7729 on 1 and 98 DF,  p-value: < 2.2e-16

We can fit the same model using the lavaan package. The syntax for lavaan is to pass the list of model formulas as a single character string, one formula per line.

simple_ex_sem <-
  '
  y ~ x
  y ~ 1
'

fit_simple_ex_sem <-
  sem(model = simple_ex_sem,
      data = simple_ex)

summary(fit_simple_ex_sem)
> summary(fit_simple_ex_sem)
lavaan 0.6-18 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         3

  Number of observations                           100

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  y ~                                                 
    x                 2.524    0.028   88.805    0.000

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .y                 1.086    0.177    6.125    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .y                 0.789    0.112    7.071    0.000

lavaan gives us an informative output. We see that the parameters are estimated with ‘ML’, maximum likelihood, whereas lm estimates via ordinary least squares (OLS). This means that estimates might differ slightly. We will trust the default optimizer, ‘NLMINB‘.

We see that the model is estimating 3 parameters–the regression slope, the intercept, and the residual variance. Note that we typically do not estimate the intercepts in SEMs (I’m doing so here for continuity). To fit the SEM without the intercept, simply remove the 'y ~ 1' from the model list.

We will pass over the Model Tests and focus on the parameter estimates. The slope estimate of 2.524 and the intercept estimate of 1.086 perfectly match our lm estimates. We don’t get an estimate of the residual variance from lm, but we can extract the residuals from the lm model and calculate the variance ourselves.

fit_simple_ex_lm %>%
  resid() %>%
  var()
> fit_simple_ex_lm %>%
+   resid() %>%
+   var()
[1] 0.7968917

That value is slightly different because the ML estimator divides the residual sum of squares by n, while var() divides by n − 1.
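A quick numeric sketch (my own check, not from the fitted objects above) of why the two numbers differ: lavaan's ML variance estimate divides the residual sum of squares by n, while var() divides by n − 1.

```r
# Reproduce both variance estimates from the same residuals.
set.seed(42)
n <- 100
x <- runif(n, 0, 10)
y <- 1 + 2.5 * x + rnorm(n)
fit <- lm(y ~ x)

rss     <- sum(resid(fit)^2)
var_ml  <- rss / n        # ML estimate: what lavaan reports as .y variance
var_unb <- rss / (n - 1)  # unbiased estimate: what var(resid(fit)) returns

all.equal(var_unb, var(resid(fit)))      # identical
all.equal(var_ml, var_unb * (n - 1) / n) # the two differ by a factor (n-1)/n
```

At n = 100 the two estimates differ by only 1%, which matches the small gap we saw above.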

Fitting complex systems

Now that we’ve seen how linear regression models can be fit as a special case of SEM, let’s take a look at an example that shows where SEM surpasses linear regression.

Again, we will use simulated data, but we’ll download the data so that we can’t see the underlying structure used to generate it. Instead, we’ll use SEMs to determine the structure.

source("https://raw.githubusercontent.com/andisa01/202407_SEM_turorial/main/scripts/SEM_tutorial_example_source.R")

The traditional approach

First, let’s analyze this data using linear regression modeling with stepwise variable selection (the technique used by almost half of all ecological and animal behavior researchers, including my past self). In this methodology, we start with a ‘full’ model including all variables and their interactions. Then we drop non-significant parameters, starting with the least parsimonious, and refit the model until only significant parameters remain.

# Fit the full or 'global model'
mod_ex01_full <- lm(Y ~ X1 + X2 + X1:X2, data = example01_data_anon)
summary(mod_ex01_full)
> summary(mod_ex01_full)

Call:
lm(formula = Y ~ X1 + X2 + X1:X2, data = example01_data_anon)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5935  -3.7755   0.1861   3.6929  13.7593 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.163110   7.944663  -0.146    0.884
X1          -0.264413   0.387296  -0.683    0.496
X2           0.155742   0.109397   1.424    0.156
X1:X2        0.002022   0.003950   0.512    0.609

Residual standard error: 5.066 on 196 degrees of freedom
Multiple R-squared:  0.144,	Adjusted R-squared:  0.1309 
F-statistic: 10.99 on 3 and 196 DF,  p-value: 1.06e-06

None of our parameters are significant, so we drop the interaction term.

# Drop the interaction
lm(Y ~ X1 + X2, data = example01_data_anon) %>% 
  summary()
> lm(Y ~ X1 + X2, data = example01_data_anon) %>% 
+   summary()

Call:
lm(formula = Y ~ X1 + X2, data = example01_data_anon)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5962  -3.8684   0.1564   3.6213  13.8291 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -4.73378    3.79270  -1.248  0.21346   
X1          -0.08449    0.16216  -0.521  0.60295   
X2           0.19734    0.07308   2.701  0.00753 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.057 on 197 degrees of freedom
Multiple R-squared:  0.1428,	Adjusted R-squared:  0.1341 
F-statistic: 16.41 on 2 and 197 DF,  p-value: 2.555e-07

Now, x2 is significant, but x1 is not. So, we drop x1.

lm(Y ~ X2, data = example01_data_anon) %>% 
  summary()
> lm(Y ~ X2, data = example01_data_anon) %>% 
+   summary() # Drop X1

Call:
lm(formula = Y ~ X2, data = example01_data_anon)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.4915  -3.8082  -0.0172   3.6272  13.5921 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.28441    2.57317  -1.276    0.203    
X2           0.16227    0.02839   5.716 3.97e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.047 on 198 degrees of freedom
Multiple R-squared:  0.1416,	Adjusted R-squared:  0.1373 
F-statistic: 32.67 on 1 and 198 DF,  p-value: 3.966e-08

And now we have a minimum adequate model to explain the system! We can explain 14% of the variation in y with x2. For every 1-unit increase in x2, we expect y to increase by 0.16. And look at that p-value! Let’s go publish a paper!!!

In fact, this method of stepwise variable selection is so common that you can do it all in one command with dredge() from the MuMIn package.

MuMIn::dredge(mod_ex01_full)
> MuMIn::dredge(mod_ex01_full)
Fixed term is "(Intercept)"
Global model call: lm(formula = Y ~ X1 + X2 + X1:X2, data = example01_data_anon)
---
Model selection table 
   (Int)       X1     X2    X1:X2 df   logLik   AICc delta weight
3 -3.284          0.1623           3 -606.551 1219.2  0.00  0.626
4 -4.734 -0.08449 0.1973           4 -606.413 1221.0  1.81  0.254
8 -1.163 -0.26440 0.1557 0.002022  5 -606.279 1222.9  3.64  0.101
2  4.870  0.31890                  3 -610.048 1226.2  6.99  0.019
1 11.280                           2 -621.824 1247.7 28.49  0.000
Models ranked by AICc(x) 

The complexity of complex systems

The problem with the linear regression approach is the inherent assumption that the independent variables are, well… independent. But in natural systems, there are almost always relationships between the independent variables. For instance, here are just a few structures that could underlie the relationship between y, x1, and x2.

The model structure we derived in our stepwise regression model is the single effect graph in the top left. The model implied by the multiple regression model is a ‘common effect’ structure where the exogenous variables are uncorrelated–in other words x1 and x2 have independent effects on y.

But other structures could exist that we cannot easily capture in regression. For instance, x1 and x2 could have independent effects on y but remain correlated with each other (common effect with correlation (bottom right)). Or x1 may have no direct effect on y, but may affect x2, which affects y in turn (this is called a chain or fully mediated effect (top middle)). Or x1 might directly affect y in addition to the indirect effect (partial mediation (top right)).

x2 may not have any effect on y at all, but may still covary because both are directly affected by x1 (common cause (bottom left)).

Using SEMs to compare structural hypotheses

Given all the possible structures, how can we ever know which governs our study system? Well, the first pass is to use common sense. If x1 is height and x2 is weight, the principle of allometry should have us exclude any model without a relationship between them.

The second pass is to use knowledge of your system from the literature. For example, if x1 is maternal genotype and x2 is F1 genotype (of the progeny), there can be no direct effect of the maternal genotype on y if the species has an annual lifecycle, but the effect might be partially mediated for perennial species.

Once you have a set of plausible structural hypotheses, we can use the mechanics of SEM to ask which structure best fits the data.

For now, we will assume that all 6 hypotheses above are plausible. We’ll fit each in turn. To do so, I’ll introduce a new operator, ~~, which denotes the undirected covariation between two variables (i.e. covariance) or of a variable with itself (i.e. variance).

# single effect
ex01_formula_x2effect <- '
Y ~ X2
X1 ~~ X1
'

ex01_sem_x2effect <- sem(ex01_formula_x2effect, data = example01_data_anon)

summary(ex01_sem_x2effect)
> summary(ex01_sem_x2effect)
lavaan 0.6-18 ended normally after 8 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         3

  Number of observations                           200

Model Test User Model:
                                                      
  Test statistic                               377.733
  Degrees of freedom                                 2
  P-value (Chi-square)                           0.000

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  Y ~                                                 
    X2                0.162    0.028    5.745    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    X1               32.093    3.209   10.000    0.000
   .Y                25.220    2.522   10.000    0.000

We won’t use the summary output for now, so I will exclude it when fitting the rest of the models.

# chain
ex01_formula_chain <- '
X2 ~ X1
Y ~ X2
'

ex01_sem_chain <- sem(ex01_formula_chain, data = example01_data_anon)
# partially mediated
ex01_formula_mediator <- '
X2 ~ X1
Y ~ X2
Y ~ X1
'

ex01_sem_mediator <- sem(ex01_formula_mediator, data = example01_data_anon)
# common cause
ex01_formula_commoncause <- '
X2 ~ X1
Y ~ X1
'

ex01_sem_commoncause <- sem(ex01_formula_commoncause, data = example01_data_anon)
# common effect (uncorrelated)
ex01_formula_commoneffect <- '
Y ~ X1
Y ~ X2
'

ex01_sem_commoneffect <- sem(ex01_formula_commoneffect, data = example01_data_anon)
# common effect (correlated)
ex01_formula_commoneffect2 <- '
Y ~ X1
Y ~ X2
X1 ~~ X2
'

ex01_sem_commoneffect2 <- sem(ex01_formula_commoneffect2, data = example01_data_anon)

We’ll use the anova() command to get some summary statistics on all of the models.

anova(
  ex01_sem_chain, 
  ex01_sem_commoneffect, 
  ex01_sem_commoncause, 
  ex01_sem_x2effect, 
  ex01_sem_commoneffect2, 
  ex01_sem_mediator
)
Chi-Squared Difference Test

                       Df    AIC    BIC    Chisq Chisq diff RMSEA Df diff Pr(>Chisq)    
ex01_sem_commoneffect   0 1218.8 1228.7   0.0000                                        
ex01_sem_commoncause    0 2425.5 2442.0   0.0000       0.00 0.000       0               
ex01_sem_commoneffect2  0 3688.8 3708.6   0.0000       0.00 0.000       0               
ex01_sem_mediator       0 2425.5 2442.0   0.0000       0.00 0.000       0               
ex01_sem_chain          1 2423.8 2437.0   0.2754       0.28 0.000       1     0.5997    
ex01_sem_x2effect       2 2480.4 2490.3 377.7330     377.46 1.372       1     <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
lavaan->lavTestLRT():  
   some models have the same degrees of freedom 

The models are not mutually nested, so we can’t interpret the chi-squared test. However, we can use the two information criteria (AIC and BIC) since those are agnostic to the particular models in comparison. Both AIC and BIC indicate that the uncorrelated common effect structure is the most likely model.

That’s not the end of the story, though…

Cross-validating SEMs

As an applied data scientist, I’ve become very skeptical of parametric goodness-of-fit metrics. After all, I can fit a neural net that perfectly fits any dataset by memorizing it. What I am really interested in is how well a model can make accurate predictions.

Ideally, we would fit our model on one dataset and then collect separate data to validate its predictive accuracy. Given the limitations of field research, however, a more common approach is to split a single dataset into two groups: one for fitting and one for validation. This is called cross-validation.

The simple form of cross-validation gives us only one chance to measure accuracy (or lack thereof, i.e. error). In order to maximize the use of our data and get better estimates, it is standard practice to split the dataset into a number of evenly sized, randomly sampled subsets (i.e. folds). Then we can perform the cross-validation as many times as we have folds. This is called k-fold cross-validation.
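As a sketch of the procedure (my own minimal illustration using lm() and base R, not the internals of cvsem):

```r
# Minimal 10-fold cross-validation by hand with lm().
set.seed(1)
dat   <- data.frame(x = runif(200, 0, 10))
dat$y <- 1 + 2 * dat$x + rnorm(200)

k     <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))  # random fold assignment

cv_mse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = dat[folds != i, ])        # fit on k - 1 folds
  pred <- predict(fit, newdata = dat[folds == i, ])  # predict held-out fold
  mean((dat$y[folds == i] - pred)^2)                 # fold-level error
})

mean(cv_mse)  # average out-of-sample error across the k folds
```

Each observation is held out exactly once, so the averaged error uses all of the data while never scoring a model on points it was fit to.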

We can use the cvsem package to easily conduct a 10-fold cross-validation on all of our models.

# Cross validating
models_to_cv <- 
  cvgather(
    ex01_sem_chain, 
    ex01_sem_commoneffect, 
    ex01_sem_commoncause, 
    ex01_sem_x2effect, 
    ex01_sem_commoneffect2, 
    ex01_sem_mediator
    )

cvsem(
  data = example01_data_anon,
  Models = models_to_cv,
  k = 10
)
Cross-Validation Results of 6 models 
based on  k =  10 folds. 

                   Model E(KL-D)   SE
1         ex01_sem_chain    0.32 0.16
3   ex01_sem_commoncause    0.33 0.16
6      ex01_sem_mediator    0.33 0.16
4      ex01_sem_x2effect    5.90 1.32
5 ex01_sem_commoneffect2    7.16 1.23
2  ex01_sem_commoneffect    7.20 1.23

Both of the ‘common effect’ structures exhibit the worst predictive performance. They have the highest errors (by default, the error is calculated via KL-divergence–the difference between the covariance matrix of the validation set and the covariance matrix implied by the fitted model). To be fair, a very quick look at the relationship between x1 and x2 could have told us that any model that fails to account for the relationship between these two variables is not realistic.

The next highest error comes from the model suggested by our initial stepwise regression method. Thus, based on predictive performance, we can conclusively exclude these hypothesized structures.

The chain, common cause, and mediator structures all have similar predictive accuracy. How can we decide which is the true structure of our system? Unfortunately, there is no statistical method to differentiate them with our data. However, we can still use the structural model to help us design further experiments to select the final model. For example, we could conduct an experiment in which we hold x2 constant in one set of plots and let it vary in another.

If the system were a common cause structure, holding x2 constant would cause no difference in the y values between plots. However, if the system were a fully mediated chain through x2, holding x2 constant would completely decouple the association between x1 and y, whereas controlling x2 in a partially mediated system would only attenuate the covariation between x1 and y.
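We can sketch that thought experiment with a toy simulation (my own illustration, assuming a fully mediated chain x1 -> x2 -> y with made-up coefficients):

```r
# Simulate the experiment under a fully mediated chain: x1 -> x2 -> y.
set.seed(1)
n  <- 1000
x1 <- rnorm(n)

# Control plots: x2 varies naturally with x1, so y tracks x1 through x2.
x2_free <- 2 * x1 + rnorm(n)
y_free  <- 3 * x2_free + rnorm(n)

# Treatment plots: x2 is experimentally held constant.
x2_held <- rep(0, n)
y_held  <- 3 * x2_held + rnorm(n)

cor(x1, y_free)  # strong association transmitted through the chain
cor(x1, y_held)  # near zero: holding x2 severs the x1 -> y path
```

Under a common cause structure, by contrast, the x1–y correlation would survive the intervention because the direct x1 -> y path does not pass through x2.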

In our toy example, the true causal structure will be easy to diagnose when I tell you what the variables really are:

y ~ Shark attacks,
x1 ~ Weather,
x2 ~ Ice cream sales  

Estimating SEM coefficients

You might be tempted to take the SEM for the common cause structure and use that in your paper. I’d advise against it. Since we used all of our data in estimating the structure of the system, we don’t want to reuse the same data to estimate parameter coefficients. That would be double-dipping on our data. Having our cake and eating it too.

Instead, we need to either collect additional data (ideal) or reserve a split of our original data to use as the final fitting data. Since this is a simulated dataset, we can ‘collect’ all of the new data we want.

set.seed(666)

n <- 200

weather <- rnorm(n, mean = 20, sd = 5)

ice_cream_sales <- 50 + 2 * weather + rnorm(n, mean = 0, sd = 5)

shark_attacks <- 5 + 0.3 * weather + rnorm(n, mean = 0, sd = 5)

example01_newdata <- data.frame(shark_attacks, weather, ice_cream_sales)
# common cause
ex01_new_commoncause <- '
ice_cream_sales ~ weather
shark_attacks ~ weather
# shark_attacks ~ 1
'

ex01_sem_new_commoncause <- sem(ex01_new_commoncause, data = example01_newdata)

summary(ex01_sem_new_commoncause)
> summary(ex01_sem_new_commoncause)
lavaan 0.6-18 ended normally after 10 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

  Number of observations                           200

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                    Estimate  Std.Err  z-value  P(>|z|)
  ice_cream_sales ~                                    
    weather            2.131    0.065   32.853    0.000
  shark_attacks ~                                      
    weather            0.419    0.061    6.852    0.000

Covariances:
                     Estimate  Std.Err  z-value  P(>|z|)
 .ice_cream_sales ~~                                    
   .shark_attacks       3.026    1.644    1.841    0.066

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .ice_cream_sals   24.452    2.445   10.000    0.000
   .shark_attacks    21.727    2.173   10.000    0.000

Our coefficient estimates are close, but not spot on the true parameter values we simulated. The reason is that the simulated errors are pretty high (sd = 5). Rather than fitting the final model once, we can borrow another technique from machine learning and repeatedly fit bootstrap replicates. This will both stabilize our estimates and provide a convenient, non-parametric confidence interval.
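As a sketch of the bootstrap idea (my own illustration with a plain lm() on simulated data resembling the example, not the SEM itself):

```r
# Percentile bootstrap of a regression slope on simulated data.
set.seed(1)
n       <- 200
weather <- rnorm(n, mean = 20, sd = 5)
sharks  <- 5 + 0.3 * weather + rnorm(n, mean = 0, sd = 5)
dat     <- data.frame(weather, sharks)

boot_slopes <- replicate(1000, {
  idx <- sample(nrow(dat), replace = TRUE)  # resample rows with replacement
  coef(lm(sharks ~ weather, data = dat[idx, ]))["weather"]
})

mean(boot_slopes)                       # stabilized point estimate
quantile(boot_slopes, c(0.025, 0.975))  # non-parametric 95% CI
```

The same resample-and-refit loop applies to an SEM by passing each resampled data frame to sem() and collecting the coefficients of interest.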

Bootstrapping SEM coefficients

Content to come!

Extensions to SEM

Content to come!

Latent variables

Content to come!

Composite variables

Content to come!

Non-linear relationships

Content to come!

Hierarchical relationships

Content to come!

Background

This tutorial was originally developed for a graduate workshop at Yale University on 25 July 2024.

Biomass Pyramids and Funnel Plots in R with GGplot2
22 Nov 2022

Every once in a while, I come across a plot form that is not in my ggplot repertoire. For example, I occasionally want to make a funnel plot in my work. I’m ashamed to admit that I usually pop over to Illustrator to make those plots.

But last week, a friend asked me if I knew how to make biomass pyramids in ggplot2. Since I love a challenge and a pyramid plot is just an upside down funnel, I figured I would give it a try.

My solution was to hack the geom_bar() functionality to make a mirrored bar plot, then flip the axis.

Here’s a simple example with some simulated data:

library(tidyverse)

X <- data.frame(level = c("Small", "Medium", "Large"), value = c(25, 50, 100)) 

X %>%
  ggplot(aes(x = fct_reorder(level, value), y = value)) +
  geom_bar(
    stat = "identity", 
    width = 0.9
    ) +
  geom_bar(
    data = . %>% mutate(value = -value),
    stat = "identity",
    width = 0.9
    ) +
  geom_text(aes(label = value, y = 0),
            col = "white",
            vjust = 0) +
  coord_flip() +
  theme_minimal() +
  theme(
    axis.title = element_blank(),
    panel.grid = element_blank(),
    axis.text.x = element_blank()
    )

A basic funnel diagram.

If you want a pyramid instead of a funnel, just change value to -value when passing the x = fct_reorder(level, -value) argument to aes().

A basic pyramid plot.

There is a lot you can do to customize the plots from this basic plot. As an example, I decided to loosely replicate the biomass pyramid example image from Wikipedia.

library(cowplot)

Biomass <- data.frame(System = c(rep("Aquatic Ecosystem", 4), rep("Terrestrial Ecosystem", 4)),
                      level = c("Sea Lion", "Herring", "Zooplankton", "Phytoplankton", "Snakes", "Mice", "Grasshoppers", "Grasses"),
                      value = c(1e3, 1e4, 1e5, 1e6, 15.2, 152, 1520, 15200))

Aqua <- Biomass %>%
  filter(System == "Aquatic Ecosystem") %>%
  mutate(value = log10(value)) %>%
  ggplot(aes(x = fct_reorder(level, -value), y = value, fill = as.numeric(fct_reorder(level, abs(value))))) +
  geom_bar(
    stat = "identity", 
    width = 1) +
  geom_bar(
    data = . %>% mutate(value = -value),
    stat = "identity",
    width = 1
  ) +
  geom_text(aes(label = paste0(10^value, " kg"), y = 0),
            col = "goldenrod",
            vjust = 3,
            cex = 4) +
  geom_text(aes(label = level, y = 0),
            col = "grey",
            vjust = 0,
            cex = 8) +
  coord_flip() +
  theme_minimal() +
  theme(
    axis.title = element_blank(),
    panel.grid = element_blank(),
    axis.text = element_blank(),
    legend.position = "none"
  ) +
  scale_fill_gradient(low = "dodgerblue", high = "navyblue") +
  labs(title = "Aquatic Ecosystem")

Terra <- Biomass %>%
  filter(System == "Terrestrial Ecosystem") %>%
  mutate(value = log10(value)) %>%
  ggplot(aes(x = fct_reorder(level, -value), y = value, fill = as.numeric(fct_reorder(level, abs(value))))) +
  geom_bar(
    stat = "identity", 
    width = 1) +
  geom_bar(
    data = . %>% mutate(value = -value),
    stat = "identity",
    width = 1
  ) +
  geom_text(aes(label = paste0(10^value, " kg"), y = 0),
            col = "goldenrod",
            vjust = 3,
            cex = 4) +
  geom_text(aes(label = level, y = 0),
            col = "grey",
            vjust = 0,
            cex = 8) +
  coord_flip() +
  theme_minimal() +
  theme(
    axis.title = element_blank(),
    panel.grid = element_blank(),
    axis.text = element_blank(),
    legend.position = "none"
  ) +
  scale_fill_gradient(low = "palegreen", high = "darkgreen") +
  labs(title = "Terrestrial Ecosystem")

plot_grid(
  Aqua,
  Terra,
  nrow = 1
)

Biomass pyramids for an aquatic and terrestrial ecosystem.

A few notes on this plot. To remove the space between bars, use width = 1 in geom_bar().

You’ll also see that I decided to log-scale the values. You have to transform the data before piping it into the ggplot function; otherwise, when we make the values negative, the log is undefined.

I also decided to color the bars. In order to get the gradient to scale correctly, we need to order the levels factor when we pass it to the fill argument.

I tried valiantly to make side-by-side figures using facet_wrap() and scales = "free", but found it difficult to get the factor level order correct for the color fill. So, I fell back to making separate plots and paneling them together with cowplot.

Web scraping and text analysis in R and GGplot2
30 Aug 2022

I recently needed to learn text mining for a project at work. I generally learn more quickly with a real-world project. So, I turned to a topic I love: Wilderness, to see how I could apply the skills of text scrubbing and natural language processing. You can clone my Git repo for the project or follow along in the post below. The first portion of this post will cover web scraping, then text mining, and finally analysis and visualization.

Introduction

I thought it would be interesting to see how different Wilderness Areas are described. Having spent a lot of time in Wilderness Areas across the country, I know that they are extremely diverse. We can use text mining and subsequent analysis to examine just how that regional diversity is perceived and portrayed.

Infographic depicting the most common and most unique words used to describe US Wilderness Areas
This is the final product visualizing descriptions of Wilderness Areas in the US.

Wilderness.net provides a brief description, which I assume are written by the agency Wilderness managers, for each of the 803 designated Wilderness Areas in the United States [data disclaimer]. For example, here is the page for one of my favorite Wilderness Areas: South Baranof.

As always, we first load packages and set up the environment.

library(rvest) # For web scraping
library(qdap) # For tagging parts of speech
library(tidytext) # Text mining
library(hunspell) # For word stemming
library(quanteda) # Text analysis
library(cowplot) # Composite figures
library(ggrepel) # For text labels in visualizations
library(wordcloud2) # Word cloud visualizations
library(tidyverse) # For everything.

# I find it easier to install wordcloud2 from the repo:
# library(devtools)
# devtools::install_github("lchiffon/wordcloud2")

Web scraping

I found the video tutorials here very helpful. In those tutorials, the author employs a Chrome extension, SelectorGadget, to isolate the page elements that you want to scrape. Unfortunately, Wilderness.net uses an extremely convoluted page structure without CSS. Thus, I could not find a way to isolate the Wilderness descriptions using CSS tags. Instead, I scrape all of the body text and then use strsplit to cleave out the portions of the text I am interested in keeping.

Conveniently, each Wilderness Area gets its own page on Wilderness.net, and those pages are numbered consecutively in the page address. For example, South Baranof Wilderness is number 561: “wilderness.net/visit-wilderness/?ID=561”. This means that we can simply loop through each page, scrape the text, and store it for further analysis.

First, we need to set up an empty matrix that we will populate with the data we scrape from Wilderness.net. From the text, we want to isolate and store the Wilderness name, the state that contains the most acreage, the year of designation, the federal agency (or agencies) charged with its management, and the description text.

Colnames <- c("Wild", "State", "Year", "Agency", "Descrip")
WildText <- matrix(nrow = 804, ncol = length(Colnames))
colnames(WildText) <- Colnames

Now we loop through all of the pages to populate the empty matrix with our data.

for(i in 1:nrow(WildText)){
  link = paste0("https://wilderness.net/visit-wilderness/?ID=", i)
  page = read_html(link)
  content <- page %>%
    html_nodes("#maincontent , h1") %>%
    html_text()

  WildText[i, 1] <- content %>%
    strsplit("wilderness = '") %>%
    unlist() %>%
    `[`(2) %>%
    strsplit("';\nvar") %>%
    unlist() %>%
    `[`(1)
  
  WildText[i, 2] <- content %>%
    strsplit("stateAbbr = '") %>%
    unlist() %>%
    `[`(2) %>%
    strsplit("';\nvar") %>%
    unlist() %>%
    `[`(1)
  
  WildText[i, 3] <- content %>%
    strsplit("yearDesignated = '") %>%
    unlist() %>%
    `[`(2) %>%
    strsplit("';\nvar") %>%
    unlist() %>%
    `[`(1)
  
  WildText[i, 4] <- content %>%
    strsplit("agency = '") %>%
    unlist() %>%
    `[`(2) %>%
    strsplit(";\nvar") %>%
    unlist() %>%
    `[`(1) %>%
    gsub("[^a-zA-Z ]", "", .)
  
   WildText[i, 5] <- content %>%
    strsplit("<h2>Introduction</h2></div>") %>%
    unlist() %>%
    `[`(2) %>%
    strsplit("Leave No Trace\n\t\t\t") %>%
    unlist() %>%
    `[`(1) %>%
    strsplit(";\n\n\n") %>%
    unlist() %>%
    `[`(2)
}

Now we convert the matrix to a tibble and check to make sure our scraping rules captured all of the descriptions.

WildText <- as_tibble(WildText) # Convert the matrix to a tibble.

MissingDescrip <- WildText %>%
  mutate(WID = row_number()) %>%
  filter(is.na(Descrip)) %>%
  .$WID

In MissingDescrip we find that 44 Wilderness Areas are missing descriptions. So, we need to alter the rules a bit and re-scrape those pages.

for(i in MissingDescrip){
  link = paste0("https://wilderness.net/visit-wilderness/?ID=", i)
  page = read_html(link)
  
  WildText[i, 5] <- page %>%
    html_nodes("#maincontent") %>%
    html_text() %>%
    strsplit(paste0(";\nvar WID = '", i, "';\n\n\n")) %>%
    unlist() %>%
    `[`(2) %>%
    strsplit("Leave No Trace\n\t\t\t") %>%
    unlist() %>%
    `[`(1)
}

There are still a couple of Wildernesses with missing information: Wisconsin Islands Wilderness #654 and Okefenokee Wilderness #426. Each of these areas has idiosyncratic text elements, so we can write specific rules to pull the descriptions for each.

# Wisconsin Islands Wilderness #654
link = paste0("https://wilderness.net/visit-wilderness/?ID=", 654)
page = read_html(link)

WildText[654, 5] <- page %>%
  html_nodes("#maincontent") %>%
  html_text() %>%
  strsplit(";\nvar WID = '654';\n\n\n") %>%
  unlist() %>%
  `[`(2) %>%
  strsplit("Closed Wilderness Area") %>%
  unlist() %>%
  `[`(1)

# Okefenokee Wilderness #426
link = paste0("https://wilderness.net/visit-wilderness/?ID=", 426)
page = read_html(link)

WildText[426, 5] <- page %>%
  html_nodes("#maincontent") %>%
  html_text() %>%
  strsplit("WID = '426';\n\n\n") %>%
  unlist() %>%
  `[`(2) %>%
  strsplit("Leave No TracePlan Ahead and Prepare:") %>%
  unlist() %>%
  `[`(1)

The management of many Wilderness Areas is shared between two agencies. We need to parse those combined agency names.

WildText <- WildText %>%
  mutate(
    Agency = case_when(
      Agency == "Bureau of Land ManagementForest Service" ~ "Bureau of Land Management; Forest Service",
      Agency == "Bureau of Land ManagementNational Park Service" ~ "Bureau of Land Management; National Park Service",
      Agency == "Forest ServiceNational Park Service" ~ "Forest Service; National Park Service",
      Agency == "Fish and Wildlife ServiceForest Service" ~ "Fish and Wildlife Service; Forest Service",
      TRUE ~ Agency
    ),
    WID = row_number()
  ) %>%
  filter(!is.na(Wild))

It would be difficult to analyze 804-way comparisons. Instead, I want to group the Wilderness Areas by broad regions. I’m defining the Eastern Region as the states that border the Mississippi plus those to the east. The Western Region is everything to the west. Alaska, which contains almost half of the nation’s Wilderness acreage [link to post], is its own Region. I’ve grouped Hawaii and Puerto Rico into an Island Region, but because there are only 3 Areas in those places, we won’t have enough data to analyze them.

WildText <- WildText %>%
  mutate(Region = case_when(State %in% c("MT", "CA", "NM", "WA", "NV", "AZ", "UT", "OR", "SD", "TX", "CO", "WY", "ND", "ID", "NE", "OK") ~ "West",
                            State %in% c("MN", "FL", "PA", "IL", "TN", "VA", "KY", "MO", "VT", "GA", "MI", "AR", "NJ", "MS", "WI", "LA", "NC", "SC", "ME", "IN", "AL", "WV", "NY", "NH", "MA", "OH") ~ "East",
                            State == "AK" ~ "Alaska",
                            State == "HI" | State == "PR" ~ "Island"))

At this point, I like to save these data so I don’t need to re-scrape every time I run the analysis. Next time, I can simply pick back up by redefining the WildText object.

saveRDS(WildText, "./WildernessDescriptions.rds")
WildText <- readRDS("./WildernessDescriptions.rds")

Text Mining

Now that we’ve successfully scraped our data, we can begin text mining. I found Silge & Robinson’s book, “Text Mining with R”, invaluable.

There are a number of approaches to dealing with text that fall on a spectrum from relatively simple text mining to more complex natural language processing. For this exercise we will conduct a simple analysis that compares the words and their relative frequencies. To do so, we need to decompose the descriptions from word strings to individual words.
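At its core, this decomposition is tokenization. A minimal base-R sketch of the idea, on a made-up sentence (the analysis itself uses qdap and tidytext, which also tag parts of speech):

```r
# Lowercase the text, then split on runs of non-letter characters
tokenize <- function(x) {
  tokens <- unlist(strsplit(tolower(x), "[^a-z]+"))
  tokens[tokens != ""]  # drop empty strings left by leading punctuation
}

tokenize("Jagged peaks rise starkly above the fjords.")
# [1] "jagged"  "peaks"   "rise"    "starkly" "above"   "the"     "fjords"
```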

First, we remove non-text characters.

WT <- WildText %>%
  mutate(Descrip = gsub("\\.", " ", Descrip),
         Descrip = gsub("  ", " ", Descrip),
         Descrip = gsub("[^\x01-\x7F]", "", Descrip))

Next, we tag the parts of speech for each word in the essay.

WT$pos <- with(WT, qdap::pos_by(Descrip))$POStagged$POStagged

POS_list <- strsplit(WT$pos, " ") # Break string down into list of tagged words

Next, we create a tidy dataframe of the parts of speech and retain only the informational words (nouns, verbs, adjectives, and adverbs). We also remove “stop words”: common words that tend not to provide a strong signal (or provide the wrong signal).

WT2 <- data.frame(Wild = rep(WT$Wild, sapply(POS_list, length)), words = unlist(POS_list)) %>% # Convert lists of tagged words into tidy format, one word per row
  separate(words, into = c("word", "POS"), sep = "/") %>% # Create matching columns for the word and the tag
  filter(POS %in% c("JJ", "JJR", "JJS", "NN", "NNS", "RB", "RBR", "RBS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ")) %>% # Keep only the nouns, verbs, adjectives, and adverbs
  anti_join(stop_words) # Remove stop words

Next we lemmatize the words (i.e. extract the stem word, for instance, peaks = peak and starkly = stark).

WT2$stem <- NA
for(i in 1:nrow(WT2)){
  WT2$stem[i] <- hunspell_stem(WT2$word[i]) %>% unlist() %>% .[1]
  print(paste0(sprintf("%.2f", 100 * i / nrow(WT2)), "% completed")) # Progress indicator (multiply by 100 to report a percentage).
}

Finally, we can add the regions from the original dataset onto the tidy dataset and save the data object for future use.

WT2 <- WT2 %>%
  left_join(WildText %>% select(Wild, Region))

At this point, I like to save these data so we don’t need to reprocess every time we run the analysis.

# saveRDS(WT2, "./WildernessByWord.rds")
WT2 <- readRDS("./WildernessByWord.rds")

Analysis and Visualization

After all that effort of scraping text from the web and text mining, we can finally begin to take a look at our data. One of the first questions one might ask is: do managers tend to write longer descriptions for Areas depending on the region?

WT2 %>%
  group_by(Region, Wild) %>%
  summarise(WordCount = length(word)) %>%
  ggplot(aes(x = fct_reorder(Region, WordCount, median), col = Region, y = WordCount)) +
  geom_jitter(size = 3, alpha = 0.4, pch = 16, height = 0) +
  geom_boxplot(width = 0.3, alpha = 0.7) +
  scale_y_log10() +
  scale_color_manual(values = c("#008dc0", "#8ba506", "grey", "#bb5a00"))

Boxplot of word counts per Area showing that Areas in Alaska have more words on average compared to the West or East regions.

Descriptions of Alaskan Wilderness Areas tend to be longer than those for other regions. While descriptions for Areas in Hawaii and Puerto Rico also have a high median, there are too few to make strong comparisons.

Another common question is: which words are used most frequently? We can answer that most simply by looking for the words with the highest use counts across all descriptions.

 WT2 %>% 
    group_by(word) %>%
    tally() %>%
    arrange(desc(n)) %>%
    top_n(20)

Unsurprisingly, “wilderness” is the most frequently used word, by far. Factual descriptors like “feet”, “miles”, “elevation”, and “boundary” are common. Features like “mountain”, “trail”, “river”, “wildlife”, “desert”, and “forest” are also common.

Interestingly, no emotive adjectives make the list. John Muir would be disappointed! We can also pull out only the adjectives.

WT2 %>% 
    filter(POS %in% c("JJ", "JJS", "JJR")) %>%
    group_by(word, Wild) %>%
    tally() %>%
    group_by(Wild) %>%
    filter(n == max(n)) %>%
    group_by(word) %>%
    tally() %>%
    arrange(desc(n)) %>%
    top_n(20)

Other than the prosaic “national” and direction words, we can see some descriptors of the sublime, like “steep”, “rugged”, “rocky”, and “wild”, that would make the transcendentalist progenitors of the wilderness concept proud.

To make the main figure for this project, I really wanted to use word clouds. Word clouds aren’t a great visualization for data interpretation, but they are a fun gimmick and make for interesting visuals!

Unfortunately, R doesn’t have the best support for generating fancy wordclouds. The package wordcloud is solidly reliable, but can only make very basic images. The package wordcloud2 allows for far more customization, but it is extremely buggy and requires some work-arounds on most systems.
I want the regional word clouds to conform to the Region’s shape. I made some simple shape images (here: East, West, Alaska) that we can pass to wordcloud2.

Depending on your system, the image may not populate in RStudio’s plotting window. If you don’t see the plot, try clicking the button to “show in new window” which will open a browser tab. Try refreshing the browser tab a few times until the wordcloud generates. Unfortunately, there is no way to constrain the aspect ratio of the cloud, so you will need to resize your browser window to a square. Then you can right-click and save the image. …like I said, wordcloud2 requires far too many work-arounds, but it is the only option for word clouds with custom shapes in R.

Below I’m showing the code for generating the Western region. It should be easy to alter for the other regions (all of the code is on github).

WT2 %>%
  filter(Region == "West") %>%
  filter(word != "wilderness") %>%
  count(word) %>%
  filter(n > 10) %>%
  wordcloud2(figPath = "./West.png",
             maxRotation = 0,
             minRotation = 0,
             color = "#bb5a00",
             fontFamily = 'serif')

Wordclouds for Alaska, West, and East regions.
I pulled the wordcloud images into Illustrator to make the final adjustments. (You could certainly do all of this in R, but I’m much faster in Illustrator and prefer to use a visual program for graphic design decisions.)

Because wordclouds are not useful for quantitative interpretation, I also want to make some histograms of the most common words associated with each region. Again, I’m only showing code for the Western region below, all of the code is on Github.

WT2 %>%
  filter(Region == "West") %>%
  filter(stem != "wilderness") %>%
  count(stem) %>%
  arrange(desc(n)) %>%
  top_n(40) %>%
  ggplot(aes(x = fct_reorder(stem, n), y = n)) +
  geom_bar(stat = "identity", fill = "#bb5a00", width = 0.7) +
  coord_flip() +
  labs(y = "", x = "") +
  theme(axis.text.y = element_text(color = "#bb5a00"),
        axis.line.y = element_blank(),
        panel.grid.major = element_blank(),
        axis.text = element_text(family = "serif"))

Histograms of the most common words used to describe Wilderness areas in the Alaska, West, and East regions.

Another question we could ask is: which words most distinguish regions from other regions? For example, “mountain” and “trail” are high frequency words in the Western region, but they also occur at high frequency in the Eastern region, as well. So, these terms don’t help us distinguish between regions. Instead we can estimate log ratios of word occurrence between regions. Log ratios conveniently scale symmetrically around zero. Greater absolute values indicate words particularly relevant to one region and smaller values indicate words that are equally relevant to both regions.
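A quick numeric illustration of that symmetry, using toy frequencies rather than the real counts:

```r
# A word twice as frequent in the East, and one twice as frequent in the
# West, get equal magnitudes with opposite signs
log(2 / 1)  # East-leaning: positive
log(1 / 2)  # West-leaning: same magnitude, negative

# Adding 1 to each count (as the code below does) keeps words that are
# absent from one region from producing log(0) = -Inf
log((0 + 1) / (5 + 1))  # finite
```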

wordratios <- WT2 %>%
  filter(word != "wilderness") %>%
  filter(Region == "East" | Region == "West") %>%
  count(stem, Region) %>%
  group_by(stem) %>% # Group so the count filter applies per stem.
  filter(sum(n) > 10) %>%
  ungroup() %>%
  spread(Region, n, fill = 0) %>%
  mutate_if(is.numeric, list(~ (. + 1) / sum(. + 1))) %>% # Add-one smoothing avoids log(0).
  mutate(logratio = log(East / West)) %>%
  arrange(desc(logratio))

wordratios %>%
  arrange(abs(logratio)) # Small log ratios indicate terms that are equally likely to be from East or West

WT2 %>%
  count(stem, Region) %>%
  group_by(stem) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  pivot_wider(names_from = Region, values_from = n, values_fill = 0) %>%
  mutate_if(is.numeric, list(~(. + 1) / (sum(.) + 1))) %>%
  mutate(logratio.EW = log(East / West)) %>%
  arrange(abs(logratio.EW)) %>%
  slice(1:20) %>%
  mutate(class = "mid") %>%
  bind_rows(
    WT2 %>%
      count(stem, Region) %>%
      group_by(stem) %>%
      filter(sum(n) >= 10) %>%
      ungroup() %>%
      pivot_wider(names_from = Region, values_from = n, values_fill = 0) %>%
      mutate_if(is.numeric, list(~(. + 1) / (sum(.) + 1))) %>%
      mutate(logratio.EW = log(East / West)) %>%
      arrange((logratio.EW)) %>%
      slice(1:20) %>%
      mutate(class = "west")
  ) %>%
  bind_rows(
    WT2 %>%
      count(stem, Region) %>%
      group_by(stem) %>%
      filter(stem != "roger") %>%
      filter(sum(n) >= 10) %>%
      ungroup() %>%
      pivot_wider(names_from = Region, values_from = n, values_fill = 0) %>%
      mutate_if(is.numeric, list(~(. + 1) / (sum(.) + 1))) %>%
      mutate(logratio.EW = log(East / West)) %>%
      arrange(desc(logratio.EW)) %>%
      slice(1:20) %>%
      mutate(class = "east")
  ) %>%
  ggplot(aes(x = fct_reorder(stem, logratio.EW), 
             y = logratio.EW, 
             col = class)) +
  geom_segment(aes(xend = fct_reorder(stem, logratio.EW), 
                   y = case_when(class == "west" ~ -0.1,
                                 class == "mid" ~ 0,
                                 class == "east" ~ 0.1),
                   yend = logratio.EW)) +
  geom_point(data = . %>% filter(class == "west"), aes(size = exp(abs(logratio.EW))), pch = 16) +
  geom_point(data = . %>% filter(class == "east"), aes(size = exp(abs(logratio.EW))), pch = 16) +
  geom_text(data = . %>% filter(class == "west"), aes(label = stem, y = 0), hjust = 0) +
  geom_text(data = . %>% filter(class == "east"), aes(label = stem, y = 0), hjust = 1) +
  geom_text(data = . %>% filter(class == "mid"), aes(label = stem, y = 0)) +
  coord_flip() +
  scale_color_manual(values = c("#8ba506", "grey70", "#bb5a00")) +
  theme(axis.text.y = element_blank(),
        axis.line.y = element_blank(),
        panel.grid.major = element_blank()) +
  labs(x = "",
       y = "Log ratio ('Uniqueness' of words for a region)",
       title = "Which words are most unique to a Wilderness?")


In the image above we can see the 20 most unique words that distinguish the Eastern from the Western regions. “Laurel”, “swamp”, “key”, and “bay” are characteristic of the East, while “desert”, “lion”, “wash”, and “alpine” are almost exclusively used to describe Western areas. Words like “safety”, “coastal”, and “glimpse” are commonly used in both regions. Interestingly, the word “west” is used commonly in Eastern descriptions. I was also surprised to see that “lynx” and “glacial” are equally common.

Log ratios aren’t as useful when we have more than two groups to compare. In our case, where we want to compare multiple groups (i.e. West, East, and Alaska), we can find the words that most distinguish one region from the others by computing the term frequency–inverse document frequency (tf-idf). This metric multiplies the frequency of a word within a given “document” by the log of the inverse of the fraction of “documents” that contain the word. In this case, we treat all descriptions from a region as a single “document”. Like log ratios, tf-idf tells us how relevant a term is to a given document, but it allows us to compare across many documents.
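A hand-computed sketch with toy numbers (the counts are invented, not from the real data); tidytext’s bind_tf_idf uses the natural log for the idf term:

```r
# Suppose "desert" makes up 5 of 100 words in the West document and
# appears in 1 of the 3 regional documents
tf  <- 5 / 100      # term frequency within the document
idf <- log(3 / 1)   # log(total documents / documents containing the term)
tf * idf            # ~0.055

# A word appearing in all 3 documents gets idf = log(3/3) = 0, so
# region-agnostic words like "mountain" score zero no matter how common
log(3 / 3)
```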

WT2 %>%
  filter(Region != "Island") %>%
  filter(!is.na(stem)) %>%
  filter(stem != "roger") %>%
  count(Region, stem, sort = TRUE) %>%
  bind_tf_idf(stem, Region, n) %>%
  group_by(Region) %>%
  top_n(10, tf_idf) %>%
  arrange(desc(Region), desc(tf_idf)) %>%
  print(n = 40)
# A tibble: 32 x 6
# Groups:   Region [3]
   Region stem           n       tf   idf   tf_idf
   <chr>  <chr>      <int>    <dbl> <dbl>    <dbl>
 1 West   desert       533 0.00852  0.405 0.00346 
 2 West   wash         125 0.00200  1.10  0.00220 
 3 West   bighorn       95 0.00152  1.10  0.00167 
 4 West   mesa          90 0.00144  1.10  0.00158 
 5 West   golden        79 0.00126  1.10  0.00139 
 6 West   creosote      78 0.00125  1.10  0.00137 
 7 West   fir          203 0.00325  0.405 0.00132 
 8 West   sagebrush     73 0.00117  1.10  0.00128 
 9 West   tortoise      71 0.00114  1.10  0.00125 
10 West   badlands      63 0.00101  1.10  0.00111 
11 East   laurel        38 0.00180  1.10  0.00198 
12 East   hardwood      90 0.00427  0.405 0.00173 
13 East   key           68 0.00323  0.405 0.00131 
14 East   oak           66 0.00313  0.405 0.00127 
15 East   illustrate    20 0.000949 1.10  0.00104 
16 East   swamp         53 0.00251  0.405 0.00102 
17 East   logging       37 0.00175  0.405 0.000712
18 East   maple         34 0.00161  0.405 0.000654
19 East   turkey        34 0.00161  0.405 0.000654
20 East   branch        33 0.00157  0.405 0.000635
21 Alaska fjord         17 0.00243  1.10  0.00267 
22 Alaska anchorage     11 0.00157  1.10  0.00173 
23 Alaska tundra        28 0.00400  0.405 0.00162 
24 Alaska prince         9 0.00128  1.10  0.00141 
25 Alaska wale           9 0.00128  1.10  0.00141 
26 Alaska chuck          8 0.00114  1.10  0.00125 
27 Alaska whale         20 0.00286  0.405 0.00116 
28 Alaska lion          17 0.00243  0.405 0.000984
29 Alaska alpine        16 0.00228  0.405 0.000926
30 Alaska cook           5 0.000714 1.10  0.000784
31 Alaska frigid         5 0.000714 1.10  0.000784
32 Alaska warren         5 0.000714 1.10  0.000784

Words like “desert”, “wash”, “bighorn”, and “mesa” are highly indicative of the West. The East is described most distinctly by its plant species (“laurel”, “oak”, “maple”) and by terms like “key”, which refers to the islands in the southeast. Alaska is dominated by intuitive terms like “fjord”, “tundra”, and “alpine”, and by sea animals like “whale” and sea “lion”. Place names also rise to high relevance for Alaska with terms like “Anchorage”, “Prince of Wales Island”, and “Cook Inlet”.

When plotting differences between East and West in log-ratios, above, it made sense to use a diverging bar graph (or lollipop graph, specifically). But with more than a two-way comparison, visualization gets more complicated.

After a few iterations, I settled on visualizing the most distinctive terms (i.e. terms with highest tf-idf) as growing larger from a common point. I accomplished this by wrapping the plot around a circular coordinate system. Terms that are larger and further from the axes of other regions are more distinctive to the focal region.

Before plotting, I also remove the words associated with proper noun place names.

WT2 %>%
  filter(Region != "Island") %>%
  filter(!is.na(stem)) %>%
  filter(!stem %in% c("roger", "anchorage", "prince", "wale", "cook", "warren", "admiralty", "coronation")) %>%
  count(Region, stem, sort = TRUE) %>%
  bind_tf_idf(stem, Region, n) %>%
  group_by(Region) %>%
  top_n(30, tf_idf) %>%
  ungroup() %>%
  mutate(ordering = as.numeric(as.factor(Region)) + (tf_idf*100),
         stem = fct_reorder(stem, ordering, .desc = FALSE)) %>%
  mutate(tf_idf = tf_idf * 100) %>%
  ggplot(aes(x = fct_relevel(Region, c("East", "West", "Alaska")), label = stem, y = tf_idf, col = Region)) +
  geom_point() +
  scale_color_manual(values = c("#008dc0", "#8ba506", "#bb5a00")) +
  scale_y_log10(limits = c(.025, 0.35)) +
  coord_polar() +
  geom_text_repel(aes(cex = tf_idf), max.overlaps = 100, family = "serif", segment.linetype = 0) +
  theme(panel.grid.major.x = element_blank(),
        axis.line = element_blank(),
        axis.text = element_blank()) +
  labs(x = "", y = "")

Radial plot with more unique words on the outside.

With all of the components created, I pulled everything into Illustrator to design the final infographic.

Infographic depicting the most common and most unique words used to describe US Wilderness Areas
This is the final product visualizing descriptions of Wilderness Areas in the US.

There is certainly a lot more that one could learn from these data. For instance, do descriptions differ by management agency? Would we find stronger divergence in language used to describe North versus South regions in the lower 48 rather than East versus West? Nevertheless, this was a useful exercise for learning a bit more about web scraping, text mining, and word analysis.

Create a radial, mirrored barplot with GGplot https://www.azandisresearch.com/2019/07/19/create-a-radial-mirrored-barplot-with-ggplot/ Fri, 19 Jul 2019 12:21:45 +0000 http://www.azandisresearch.com/?p=1477

 

Among other topics, my lab studies the relationship between forest-obligate frogs and urbanization. During a seminar, I once heard my advisor mention that Connecticut is the perfect state for us because it sits at the top of the rankings for both the greatest percentage of tree cover and the highest population density.

I’ve been meaning to dig into that statement for a while, so when Storytelling With Data encouraged folks to submit radial graphs for their July #SWDchallenge, I took the opportunity.

I pulled the population data from the US Census Bureau and used “People per sq. mile” from the 2010 census for density estimates. The tree cover data came from Nowak and Greenfield (2012, Tree and impervious cover in the United States. Landscape and Urban Planning).

Here’s how I made the graphic:

library(tidyverse)

TP <- read.csv("C:\\Users\\Andis\\Google Drive\\2019_Summer\\TreePop\\CanopyVsPopDensity.csv", header = TRUE) %>%
  select(State, Perc.Tree.Cov, Pop.Den.) %>%
  rename(Trees = Perc.Tree.Cov, Pop = Pop.Den.) %>%
  filter(State != "AK")

> head(TP)
  State Trees   Pop
1    AL    70  94.4
2    AZ    19  56.3
3    AR    57  56.0
4    CA    36 239.1
5    CO    24  48.5
6    CT    73 738.1
> 

Here is the dataset. First we rename the variables and remove Alaska, since it was not included in the tree cover dataset.

ggplot(TP, aes(x = State, y = Trees)) +
  geom_col()

This gives us a column or bar plot of the percent tree cover for each state. Note that we could also use geom_bar(), but geom_col() will be easier to deal with once we start adding more elements to the plot.

Mirrored bar charts are a great way to compare two variables for the same observation point, especially when the variables are in different units. However, we still want to make sure that the scales are at least pretty similar for aesthetic symmetry. In our case, we will actually be asking ggplot to use the same unit scale for both tree cover and population density, so we need to make sure that they are very similar in scale.

The best option would be to standardize both values of tree cover and population density to a common scale by dividing by the respective standard deviation. The problem comes in interpreting the axis because the scale is now in standard deviations and not real-world units.
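A minimal sketch of that standardization, using the six states from the head(TP) output above as toy vectors (scaled this way, both variables have a standard deviation of 1, but the axis would now read in standard deviations):

```r
# Divide each variable by its own standard deviation so both are in
# unitless "standard deviation" units
trees <- c(70, 19, 57, 36, 24, 73)                # percent tree cover
pop   <- c(94.4, 56.3, 56.0, 239.1, 48.5, 738.1)  # people per sq. mile

trees_std <- trees / sd(trees)
pop_std   <- pop / sd(pop)

sd(trees_std)  # 1 -- comparable scales, but units are no longer real-world
sd(pop_std)    # 1
```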

For this graphic, I’m going to cheat a little. Since we will eventually be removing our y-axis completely, we can get away with our values being approximately congruent. Since tree cover is a percentage up to 100, I decided to simply scale population density to a similar magnitude.

The greatest population density is Rhode Island’s, at 1018.1 people per square mile. We can divide by 10 to express density in units of tens of people per square mile, which scales our range of density values down to 0.58 to 101.81, on par with the 0 to 100 range of the tree cover scale.

> max(TP$Pop)
[1] 1018.1
> 

TP <- mutate(TP, Pop.10 = Pop/10)

> range(TP$Pop.10)
[1]   0.58 101.81
>

Now we can add the population density to the figure. To make this a mirror plot, we just need to make the values for population density negative. Also, I gave each variable a different fill color so we could tell them apart.

TP <- mutate(TP, Pop.10 = -Pop.10)

ggplot(TP, aes(x = State)) +
  geom_col(aes(y = Trees), fill = "#5d8402") +
  geom_col(aes(y = Pop.10), fill = "#817d79") 

I’ve always loved the vertically oriented mirrored bar plots used so often by FiveThirtyEight. The problem is that these tall charts can’t fit on a wide-format presentation slide. And it is hard to read the horizontally oriented plot above. I realized that if I could wrap a mirrored bar chart into a circle, it can fit in any format. All that we need to do to make this into a circular plot is to add the coord_polar() element.

ggplot(TP, aes(x = State)) +
  geom_col(aes(y = Trees), fill = "#5d8402") +
  geom_col(aes(y = Pop.10), fill = "#817d79") +
  coord_polar()

Now we have a mirrored, radial bar plot. But this is super ugly and not very intuitive to read. The first useful adjustment we can make is to order the states to highlight the comparison we are interested in. In this case, we are trying to highlight the states that simultaneously have the greatest tree cover and highest population densities. One easy solution would be to rank-order the states by either of those variables. For instance, we can order by tree cover rank.

ggplot(TP, aes(x = reorder(State, Trees))) +
  geom_col(aes(y = Trees), fill = "#5d8402") +
  geom_col(aes(y = Pop.10), fill = "#817d79") +
  coord_polar()

But that doesn’t really highlight the comparison because having lots of trees doesn’t really correlate with having lots of people.

ggplot(TP, aes(x = Pop, y = Trees)) +
  geom_point(size = 4, alpha = 0.7) +
  theme_minimal()

Instead, we can directly highlight the comparison by computing a new variable that simultaneously accounts for tree cover rank and population density rank. We cannot simply average rankings because that will produce a lot of ties. Also, if a state has a really low rank in one variable, it can discount the higher rank of the other variable. We can deal with this by using the mean of the squared rank orders of each variable (similar to mean squared error in regression). Also, note that since we want the largest values to be rank 1, we need to find the rank of the negative values.
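A toy illustration of why squaring before averaging helps, with made-up ranks for two hypothetical states:

```r
# One state ranked 1st and 5th; one ranked 3rd in both variables
mean(c(1, 5))     # 3 -- plain averaging: tied with...
mean(c(3, 3))     # 3 -- ...the consistently mid-ranked state

mean(c(1, 5)^2)   # 13 -- squaring breaks the tie
mean(c(3, 3)^2)   # 9  -- the state decent at both now scores better (lower)
```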

TP <- TP %>%
  mutate(TreeRank = rank(-Trees), PopRank = rank(-Pop)) %>%
  mutate(SqRank = ((TreeRank^2) + (PopRank^2)) / 2) %>% # Parentheses so we average both squared ranks.
  mutate(RankOrder = rank(SqRank))

ggplot(TP, aes(x = reorder(State, RankOrder))) +
  geom_col(aes(y = Trees), fill = "#5d8402") +
  geom_col(aes(y = Pop.10), fill = "#817d79") +
  coord_polar()

Next, we can improve the readability of the plot. Since our y-axis isn’t technically comparable, we can get rid of the axis label and ticks altogether using theme_void(), then we tell ggplot to label all of the states for us and to place the labels at position y = 100.

ggplot(TP, aes(x = reorder(State, RankOrder))) +
  geom_col(aes(y = Trees), fill = "#5d8402") +
  geom_col(aes(y = Pop.10), fill = "#817d79") +
  geom_text(aes(y = 100, label = State)) +
  coord_polar() +
  theme_void()

This plot is pretty worthless without some numbers to help us interpret what the bar heights represent. We can add those just as we added the state labels. In order to keep the character lengths short, we need to round the values. Also, now that the scale units are independent, I decided to further scale the population density values to units of 100 people per square mile simply by dividing the density by 100. The values tend to overlap, so we also need to make the font smaller.

ggplot(TP, aes(x = reorder(State, RankOrder))) +
  geom_col(aes(y = Trees), fill = "#5d8402") +
  geom_text(aes(y = 10, label = round(Trees, 2)), size = 3)+
  geom_col(aes(y = Pop.10), fill = "#817d79") +
  geom_text(aes(y = -10, label = round(Pop/100, 1)), size = 3)+
  geom_text(aes(y = 100, label = State)) +
  coord_polar() +
  theme_void()

This looks okay, but we can make it look even better. First we can adjust the limits of the y-axis. We can use the negative limit to create a white circle in the center, essentially pushing all of the data towards the outer ring instead of dipping down to the very central point.

Also, it bothers me that the bars cut into the value labels. We can adjust the position of the labels conditionally so that labels too big to fit in the bar are set outside of the bar using an ifelse() statement.

We can use the same type of ifelse() statement to conditionally color the labels so that those inside the bars are white while those outside the bars match the colors of the bars. We just need to include scale_color_identity() to let ggplot know that we are directly providing the name of the color.

ggplot(TP, aes(x = reorder(State, RankOrder))) +
  geom_col(aes(y = Trees), fill = "#5d8402") +
  geom_text(aes(y = ifelse(Trees >= 15, 8, (Trees + 10)), color = ifelse(Trees >= 15, 'white', '#5d8402'), label = round(Trees, 2)), size = 3)+
  geom_col(aes(y = Pop.10), fill = "#817d79") +
  geom_text(aes(y = ifelse(Pop.10 <= -15, -8, (Pop.10 - 10)), color = ifelse(Pop.10 <= -15, 'white', '#817d79'), label = round(Pop/100, 1)), size = 3)+
  geom_text(aes(y = 100, label = State)) +
  coord_polar() +
  scale_y_continuous(limits = c(-150, 130)) +
  scale_color_identity() +
  theme_void()

 

The way the state labels are so crowded on the right, but not on the left bugs me. We can set a standard distance, like y = 50, but then conditionally bump out the values if they would interfere with the bar.

And finally, ggplot builds in an obnoxious amount of white space around circular plots. We can manually reduce the white area by adjusting the plot margins.

ggplot(TP, aes(x = reorder(State, RankOrder))) +
  geom_col(aes(y = Trees), fill = "#5d8402") +
  geom_text(aes(y = ifelse(Trees >= 15, 8, (Trees + 10)), color = ifelse(Trees >= 15, 'white', '#5d8402'), label = round(Trees, 2)), size = 3)+
  geom_col(aes(y = Pop.10), fill = "#817d79") +
  geom_text(aes(y = ifelse(Pop.10 <= -15, -8, (Pop.10 - 10)), color = ifelse(Pop.10 <= -15, 'white', '#817d79'), label = round(Pop/100, 1)), size = 3)+
  geom_text(aes(y = ifelse(Trees <= 50 , 60, Trees + 15), label = State)) +
  coord_polar() +
  scale_y_continuous(limits = c(-150, 130)) +
  scale_color_identity() +
  theme_void() +
  theme(plot.margin=grid::unit(c(-20,-20,-20,-20), "mm"))

And there we have it. I made some final touches like changing the font and adding a legend in Illustrator.

 

Quals https://www.azandisresearch.com/2018/05/25/quals/ Fri, 25 May 2018 11:20:57 +0000 https://www.azandisresearch.com/?p=484 Qualifying exams are the written and oral tests PhD students must take to prove that they are experts in their field before embarking on a dissertation. And they are one of the most terrifying and anxiety-infused processes in an academic career.

After sleeplessly struggling through my own quals, and afterwards reflecting on the process with friends, I’ve compiled my thoughts/suggestions/tips into this post for those staring down the barrel of their exams.

I’ll admit that I had never heard of qualifying exams until well into the first year of my PhD, and I had absolutely no idea what they entailed or how they transpired. I’m a first-generation student, so maybe most folks are better acquainted with the concept. For those who are academically naïve like me, scroll to the end for an overview of the quals process.

Tips for quals:

Prep:

In my experience quals felt a lot like running a marathon through a Ninja Warrior gauntlet, while wearing a T-Rex costume and being chased by a flock of angry mallards.

In other words, you’ll have a lot on your plate. You won’t want to stop for day-to-day chores like paying bills and cooking dinner. Prior to your exams, I suggest trying to preemptively clear as much off your to-do list as possible so that you won’t be distracted nor will you compound your stress load.

For instance, you might pre-pay your bills, do a big grocery run to stock the kitchen with brain-stimulating snacks, and even pre-prepare freezer meals to last the duration of your exams. Set your email auto-responder to a plea-ful apology. If you are really savvy, you might even consider giving your partner a few extra backrubs or doing a couple extra chores for your labmates in an effort to bank some reciprocity points that you can cash in for favors during your exams.

Studying:

In some institutions, it is common for committees to give reading lists of critical papers prior to the exams. This isn’t common in my program, but nonetheless, there were about a hundred papers that I knew I would want to have at hand and a few hundred more that I wanted to be able to recall quickly.

You don’t want to be spending precious minutes of exam time searching through databases for the right paper. So, make sure that you have a good reference manager on board, and consider blocking out a chunk of time each day leading up to your exam to organize the keywords and folders you expect to be relevant to your topic.

It is also a huge timesaver to have a reference manager that integrates with your word processor. When you are in the zone, there is nothing more frustrating than breaking the flow to stop and figure out your parenthetical citations. (I use Paperpile and strongly advocate for it.)

Do some mental calisthenics to prep your expectations:

Great, now you’ve read and cataloged the relevant literature, banked some favors, and stocked your fridge—you are prepared for your exams! …right?! No matter how confident you feel about your exams, there is bound to be a point during your writing when you come to one or more of the following realizations:

  • There is a lot more literature on your topic than you thought
  • You missed a whole boatload of important papers
  • Your points are not original
  • You know nothing
  • Everyone else is smarter
  • You will certainly fail your exams
  • It might not be too late to go into law school
  • Maybe being a barista for the rest of your life wouldn’t be so bad
  • Perhaps it would be preferable to live in a cave in the woods forever
  • Etc

I call this the “Dark Times”–when the gravity of your examination flattens any modicum of confidence and extinguishes all motivation but the desire to eat Nutella by the spoonful and rewatch the entire first season of the Office in bed.

What I actually learned, the quality of my exam answers, and how I felt about my answers were vastly different. The big dips in the red line are what I call “the Dark Times.”

I’m sure there are countless self-help blogs littered with suggestions for coping with stress, anxiety, and melancholy. But, for now, I share the mind-hack that worked well for me during quals, one that comes courtesy of the Stoic philosopher Seneca: “premeditatio malorum” (literally, “premeditation of evils”). The idea is to think about all of the bad things that could happen and all the poorer potential alternatives to your situation, thereby inoculating yourself against emotional distress if those bad scenarios come to fruition. For example, a few critical remarks from a committee member during orals might be emotionally difficult to take in, but if you’ve been imagining an even worse alternative wherein your exam ends with your committee hurling cabbage and screaming insults, a few critical remarks are no big deal by contrast.

Meditating on poorer alternatives puts things in perspective. After all, when you really think about it, the exam process is just sitting down to do nothing but think about one topic for a few entirely undistracted weeks (and presumably, since this is the topic you’ve chosen to dedicate your life to, it is a topic you are really interested in). If you imagine that you could have spent those weeks endlessly swimming away from hungry sharks, or listening to Dave Matthews Band on repeat, or continuously flossing… quals seem like downright eudaimonia in contrast.

Another mind-hack is to remind yourself of the purpose of qualifying exams—to push you into the shady corners of your own knowledge and the rarefied boundaries of your field. If you start to feel like you don’t know anything, that’s a good sign that you are uncovering a gap in your own knowledge, a gap you can spend the next few years filling. Or, maybe you’ve discovered an unexplored horizon in the field that you can spend the next few years illuminating.

Try to keep in mind that the purpose of quals is to show you what you do not know. Quals are kind of the antidote to the academic Dunning-Kruger effect.

It might also be helpful to remember that your committee would never have let you initiate the exam process if they did not expect you to succeed. If you were able to answer all of their questions with ease, it would be less an indication of your expertise than an indication that they failed to design their questions well enough.

Plan a perspective parallax:

I hit two severe “Dark Times” during my written exams. In both cases, the thing that halted my unchecked plummet into existential duress was an unplanned conversation with peers who had already gone through the exam process. It came as such a relief to hear that they had shared many of the same feelings during their exams. It was heartening to hear that they too discovered papers they had missed, that they questioned their entire graduate prospects, and that despite it all, they came through the process successfully. Sometimes that simple perspective-check from an empathetic peer is a powerful tool to steer your thoughts away from the perigee of despair.

Celebrate:

After my exams, I went to the bar to celebrate. Over beers I had to confess that I kinda liked the process. Yes, it was stressful. Yes, I would not want to revisit the Dark Times. But I am also incredibly grateful for the chance to spend four solid weeks thinking about these topics. I reckon I learned as much in those weeks as I did in the previous three semesters. The task of explaining a topic to someone else is an entirely different ballgame from convincing yourself that you understand it, and doing so solidifies your understanding unequivocally.

I have a wonderful cohort who made a serious effort to organize a party after each person passed exams. Even when we couldn’t celebrate in person, we managed virtual toasts over WhatsApp.

At the end of the process, don’t forget to celebrate. More importantly, remember to celebrate your peers going through the same process. After all, we’re all inhabiting this Ivory Tower together, so let’s make it a party!

 

What are qualifying exams?

Qualifying exams (a.k.a. Quals, Preliminary Exams, Candidacy Exams, etc) act like a turnstile on the path toward one’s dissertation. The purpose of these exams is to demonstrate that you possess the requisite substrate of knowledge in your field. A passing mark on qualifying exams is an endorsement that you are ready and able to define a novel research topic and carry it through into a dissertation. With the exams successfully completed, a PhD student is said to advance to candidacy for a PhD degree, and becomes a PhD candidate.

The structure of the exams is different at every institution, but most seem to include written and oral components. Other than that, the form of exams varies widely, even between programs in the same institution. So, rather than analyze all of the many exam structures, I’ll quickly explain how exams worked for my program at the Yale School of Forestry and Environmental Studies (FES).

At FES, exams tend to center around a student’s prospectus, a document outlining the research plan for their dissertation. For the first year or two of one’s PhD studies, the prospectus is a living document that helps in formulating research questions and outlining a plan for answering them. Concurrently with crafting a prospective research plan, students are expected to begin putting together a committee. This is the committee that will preside over the qualifying exam, and it is usually also the group that will eventually assess the final dissertation. In my case, I have four committee members, all of whom are experts in various aspects of my proposed research.

At FES, the convention is to draft an extremely thorough and extensive prospectus with full introduction, methods, analysis, and preliminary results sections, most of which will be imported into future publications and the dissertation. Over the months leading up to the exams, the prospectus is circulated to the whole committee, both to refine it and to give the committee a sense of what the student might be missing.

I planned to take my exams at the end of my third semester (which was pushed back to the beginning of my fourth semester), which is a little earlier than average for the program. Most students qualify in their fifth or sixth semester.

At FES, the exam is structured in two parts. First, the committee delivers 2-3 essay questions, which the student is allotted two weeks to answer. After a brief break, the committee convenes for the oral component. In the oral exam, the student gives a presentation of the prospectus, and then the committee gets about two hours to ask questions about the prospectus and the student’s answers to the written component. Once the committee has exhausted its questions, it goes into closed-door deliberations and, within 15-30 minutes, either accepts, accepts with reservations, or denies the student’s advancement to candidacy.

The whole process is unequivocally nerve-wracking. After all, you just spent months developing a research plan, then invited a panel of the top experts in that field to grill you on your shortcomings over the course of 3-4 weeks. Although it tends to reduce even the most confident PhD student into a sobbing, melancholic mess of existential crisis, I unrepentantly appreciated the opportunity and hopefully came out with some tips for other soon-to-be PhD candidates.

Hemispherical light estimates https://www.azandisresearch.com/2018/02/16/hemispherical-light-estimates/ Fri, 16 Feb 2018 03:13:59 +0000 https://www.azandisresearch.com/?p=297
Hemispherical canopy photo used to estimate light profiles of vernal pools. Note that I lightened this image for aesthetics–this is not what your photos should look like for canopy estimates.

Despite the fact that foresters have been estimating forest canopy characteristics for over a century, the techniques and interpretations for these measurements are surprisingly inconsistent. As part of my dissertation research, I wanted to ask what I thought was a simple question: “How much light reaches different vernal pools over the season?” It took a lot of literature searching, a lot of emails, and a lot of trial and error to discover that this is actually not as simple a question as I originally thought.

But in the end, I’ve developed what I think is a very sound workflow. In the hopes of saving other researchers a journey through the Wild West of canopy-based light estimates, I decided to publish my notes in a series of blog posts.

  • In this first post, I’ll cover the rationale behind hemispherical photo measurements.
  • The second post will compare hardware and address measuring/calibrating lens projections.
  • The third post will focus on capturing images in the field.
  • Finally, the fourth post will be a detailed walk-through of my software and analysis pipeline, including automation for batch processing multiple images.

Why measure light with hemispherical photos?

Light is an important environmental variable. Obviously the light profile in the forest drives photosynthesis in understory plants, but it is also important for ectothermic wildlife and can impact abiotic factors like snow retention. An ecologist interested in any of these processes should be interested in the light environment at specific points in the habitat. Although light intensity at a point can be measured directly with photocells (Getz 1968), this requires continuous measurement at each point of interest over the entire seasonal window of interest. For most applications, this would be intractable. A more common method involves measuring the canopy above a point and estimating the amount of sky that is obstructed in order to infer the amount of light incident at that location.

Fortunately, since both foresters and ecologists are interested in the canopy, the field has produced a multitude of resources for measuring it. However, foresters are more often interested in estimating the character of the canopy itself, and the canopy metrics they use are not always directly relatable to questions in ecology.

For instance, while foresters are generally interested in canopy COVER, ecologists might be more interested in canopy CLOSURE. These terms are similar enough that they are often (incorrectly) used interchangeably. According to Korhonen et al. (2006) and Jennings et al. (1999), canopy cover is “the proportion of the forest floor covered by the vertical projection of the tree crowns” while canopy closure is “the proportion of sky hemisphere obscured by vegetation when viewed from a single point” (see the figure below).

From Korhonen et al. (2006)

The choice between measuring cover versus closure depends on both the scale and perspective of your research. For instance, forest-wide measurements of photosynthetic surface will probably be estimated from remote sensing data which is a vertical projection (i.e. canopy cover). However, if you are interested in the amount of light reaching a specific understory plant or a vernal pool, canopy closure is more relevant.

The disparity in perspective can be an issue in downstream analysis too. For instance, some analysis procedures consider only the outer edge of a tree crown or contiguous canopy and ignore gaps within this envelope. This kind of analysis is much less useful if your interest is in the light filtering through canopy gaps.

Canopy estimation:

There are three main ways to estimate canopy characteristics: direct field measurements, indirect modelling from other parameters, and remote sensing. The choice of measurement methods employed will largely be determined by the scope of the question and the scale-accuracy required. For the purpose of my research, I am interested in estimating light environments for a specific set of vernal pools; so, direct measurements are most useful. However, if I wanted to know how different tree composition influences light environments of ponds on average across forest habitats, I might want to try modelling that relationship followed by ground-truthing with direct measurements.

The rest of this post will focus on directly measuring canopy closure to estimate light environments.

Measurement methods:

The most popular methods for directly measuring closure are hemispherical photography and spherical densiometers. A spherical densiometer (Lemmon 1956) is basically a concave or convex mirror that reflects the entire canopy. A grid is drawn over the mirror, and researchers count the number of grid cells showing sky versus canopy.
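As a concrete illustration, converting densiometer counts to a closure estimate is simple arithmetic. This sketch assumes the common Lemmon-style scoring convention of 24 grid squares visualized as four quarter-square dots each (96 dots, so each dot is worth roughly 1.04% of the view); check your instrument’s instructions, since scoring conventions vary:

```r
# Hedged sketch of spherical densiometer arithmetic, assuming the
# 96-dot convention (each dot ~ 1.04% of the reflected view)
dots_sky    <- 30               # dots scored as open sky
pct_open    <- dots_sky * 1.04  # percent of the view that is sky
pct_closure <- 100 - pct_open   # percent canopy closure
pct_closure                     # 68.8
```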

Hemispherical photos use a similar principle. A true hemispherical lens projects a 180 degree picture of the canopy that can then be processed to determine percentage of sky versus canopy.

This is a great example of both a densiometer (top left) and a hemispherical photo (top right). Bottom left is an example grid for densiometer estimates, and bottom right is an example view from a densiometer with the grid overlaid on the photo. Borrowed from Eric Nordberg.

The advantages of densiometers are that they are cheap, easy to operate, and usable in any light conditions. The exact converse is true of hemispherical photography, which can be expensive, difficult to operate, and requires very narrow light conditions (I’ll get into the particulars in post 3).

So, why would anyone use hemispherical photos over densiometers?

The main disadvantage of a densiometer is that its estimates are instantaneous snapshots, whereas hemispherical photos can be integrated over time to estimate continuous light measurements. It is easiest to explain this with an example.

Imagine you want to know how much light is received by a specific plant on the forest floor at the very edge of a clearcut (see figure below). A closure estimate from a densiometer would indicate 50% closure no matter the orientation of the tree line. However, if that little plant is in the northern hemisphere, we know that it will receive much less light if the clearcut lies to the north than if the clearcut lies to the south due to the angle and orientation of the sun.

Figure by me, A.Z. Andis.

The advantage of hemispherical photos (if taken with known orientation and location) is that they can be used to integrate light values over time with respect to the direction of the sun relative to gaps in the canopy. This means that with a single photo (or two photos for deciduous canopies) one can calculate the path of the sun and estimate the total light received by the plant in our example at any point in time or cumulatively over a range of time.

An ancillary advantage of photos is that they can be archived along with the scripts used for processing, which makes the entire analysis easily reproducible.

I’ll go much further in depth in later posts, but as a preview, here is a general overview of how light estimates are calculated from hemispherical photos:

  1. Hemispherical photos are captured in the field at a specific location and known compass orientation.
  2. Images are converted to binary using thresholding algorithms, such that each pixel is either black (canopy) or white (sky).
  3. Software can then be used to project a sun path onto the image.
  4. Models parameterized with average direct radiation, indirect radiation, and cloudiness can then estimate the total radiation filtering through the gaps represented as white pixels in the photo at any point in time or averaged over time ranges.
  5. The result is a remarkably accurate point estimate of incident light at any time in the season.
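The core of steps 2–4 can be sketched in R. Everything below is simulated: the binary image is random rather than a thresholded photo, and the sun path and radiation weights are toy stand-ins for the astronomical and atmospheric models a real pipeline would use:

```r
# Hedged sketch of the light-estimation pipeline; all inputs simulated.
set.seed(1)

# Step 2 output: binary fisheye image, 1 = sky pixel, 0 = canopy
n   <- 200                                # image is n x n pixels
img <- matrix(rbinom(n * n, 1, 0.4), n, n)

# Whole-sky gap fraction (complement of canopy closure)
gap_fraction <- mean(img)

# Step 3: project a toy sun path onto the image. For an equidistant
# fisheye projection, pixel radius scales linearly with zenith angle.
zenith  <- seq(80, 20, length.out = 25)   # degrees from vertical
azimuth <- seq(90, 270, length.out = 25)  # east -> south -> west
r  <- (zenith / 90) * (n / 2)             # radius from image center
px <- round(n / 2 + r * sin(azimuth * pi / 180))
py <- round(n / 2 - r * cos(azimuth * pi / 180))

# Step 4: sum direct radiation passing through sky pixels along the
# path (toy weights: radiation peaks when the sun is highest)
direct    <- cos(zenith * pi / 180)
sky_hits  <- img[cbind(py, px)] == 1
total_dir <- sum(direct[sky_hits])
cat(sprintf("Gap fraction: %.2f; direct light transmitted: %.0f%%\n",
            gap_fraction, 100 * total_dir / sum(direct)))
```

A real pipeline replaces the random matrix with a thresholded photo, the toy sun path with positions computed from date and latitude, and the cosine weights with a parameterized radiation model.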

I’ll be drafting subsequent posts in the coming weeks, so be sure to check back in!

Be sure to check out my other posts about canopy research that cover the theory, hardware, field sampling, and analysis pipelines for hemispherical photos.

Also, check out my new method of canopy photography with smartphones and my tips for taking spherical panoramas.


References:

Getz, L. L. (1968). A Method for Measuring Light Intensity Under Dense Vegetation. Ecology 49, 1168–1169. doi:10.2307/1934505.

Jennings, S. B., Brown, N. D., and Sheil, D. (1999). Assessing forest canopies and understorey illumination: canopy closure, canopy cover and other measures. Forestry 72, 59–74. doi:10.1093/forestry/72.1.59.

Korhonen, L., Korhonen, K. T., Rautiainen, M., and Stenberg, P. (2006). Estimation of forest canopy cover: a comparison of field measurement techniques. Silva Fennica 40, 577–588. Available at: http://www.metla.fi/silvafennica/full/sf40/sf404577.pdf.

Lemmon, P. E. (1956). A Spherical Densiometer For Estimating Forest Overstory Density. For. Sci. 2, 314–320. doi:10.1093/forestscience/2.4.314.

 
