Structural Equation Modeling in R

Introduction

Today we are talking about structural equation models (SEM). There are lots of synonyms, sub-categories, and adjacent techniques that you may have heard before–covariance structure analysis, linear structural relations, path analysis, latent variable modeling, causal modeling, confirmatory factor analysis, exploratory factor analysis, latent growth modeling, mixture/multigroup/heirarchical/multilevel structural modeling, construct analysis, etc., etc…

So many names exist because there are a LOT of things that you can do with these types of models. It turns out that allowing for and testing structure in modeling can help solve lots of problems in research.

I think that the best way to understand SEMs is to start with simple regressions. Let’s consider a regression predicting a rooster crowing from the rising of the sun:

y = b0 + b1x + ε

y ~ Rooster Crowing,
x ~ Sun Rising,
b0 = 0,
ε ~ N(0, 1)

This is a silly regression because we all understand the relationship: the rooster goes cock-a-doodle-do when the sun crests the horizon. But in the language of mathematics, there is no reason I can’t rewrite this equation as:

x = ( yε ) b1-1

This formulation make no sense. Basically, we are saying that the rising of the sun is a function of the rooster crowing! Even though this is totally mathematically viable, it defies our common sense of causation.

The language of structural equation modeling allows one way to impose some directional structure on mathematical equations. We usually visualize that language as directed graphical models like the one below.

In a graphical model, the observed variables are displayed as boxes Unobserved or latent variables are displayed as circles. Constants are triangles. The functional relationship between variables is displayed as directional arrows. Non-directional or double-headed arrows indicate a variance or covariance.

This graphical model above is the same model as the regression in the equation above. Our independent variable x has a linear relationship with the dependent variable y with the slope parameter b1. y has a constant intercept b0 of 0. Finally, the residual variation in y not caused by x is assumed to come from some other unobserved cause with it’s own variance. Rather than thinking of variables as independent or dependent, we used the terms exogenous or endogenous. Exogenous variables are those (like x) that have no incoming paths. These are the most ‘upstream’ variables in the causal paths. Endogenous variables are those that receive causal paths. We’ll see later that some endogenous variables can also be causes of other endogenous variables, but they are still considered endogenous.

In practice, we generally ignore the peripherals and intercepts in structural models, yielding a simplified graph:

Now that we have a shared language. Let’s take a look at a toy example to understand why SEMs can be so useful in research.

Motivating example

You will need a handful of packages for this tutorial:

packages_for_sem_workshop <-
  c(
    'tidyverse', # basic data wrangling
    'tidygraph', # graph visualization
    'ggraph', # graph visualization
    'lavaan', # sem tools
    'piecewiseSEM', # sem tools
    'mgcv', # nonlinear modeling
    'lme4', # random effect modeling
    'cvsem' # cross-validating sems
    )

install_and_load_packages <-
  function(x){
    for( i in x ){
      if( ! require( i , character.only = TRUE ) ){
        install.packages( i , dependencies = TRUE )
        library( i , character.only = TRUE )
      }
    }
  }

install_and_load_packages(packages_for_sem_workshop)

Fitting a simple regression

I find that, for most applied researchers, the language of code is more intuitive than the language of math. So, let’s simulate a toy dataset and see how we can fit the same system as a linear model or an SEM.


simple_ex <-
  data.frame(
    x = runif(n = 100, min = 0, max = 10),
    e = rnorm(n = 100, 0, 1)
  )

simple_ex <- simple_ex %>%
  mutate(
    y = 1 + 2.5*x + e
  )

Now, let’s fit a good old-fashioned linear regression model:

fit_simple_ex_lm <- 
  lm(y ~ x, data = simple_ex)

summary(fit_simple_ex_lm)
> summary(fit_simple_ex_lm)

Call:
lm(formula = y ~ x, data = simple_ex)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.61653 -0.49110 -0.01622  0.51680  2.76976 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.08641    0.17916   6.064 2.49e-08 ***
x            2.52441    0.02871  87.913  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8972 on 98 degrees of freedom
Multiple R-squared:  0.9875,	Adjusted R-squared:  0.9874 
F-statistic:  7729 on 1 and 98 DF,  p-value: < 2.2e-16

We can fit the same model using the lavaan package. The syntax for lavaan is to pass the list of models as a simple character string separated by next lines.

simple_ex_sem <-
  '
  y ~ x
  y ~ 1
'

fit_simple_ex_sem <-
  sem(model = simple_ex_sem,
      data = simple_ex)

summary(fit_simple_ex_sem)
> summary(fit_simple_ex_sem)
lavaan 0.6-18 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         3

  Number of observations                           100

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  y ~                                                 
    x                 2.524    0.028   88.805    0.000

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .y                 1.086    0.177    6.125    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .y                 0.789    0.112    7.071    0.000

Lavaan give us an informative output. We see that the parameters are estimated with ‘ML’, maximum likelihood, which differs from ‘lm’ which estimates via ordinary least squares (OLS). This means that estimates might differ slightly. We will trust the default optimizer ‘NLMINB‘.

We see that the model is estimating 3 parameters–the regression slope, the intercept, and the residual variance. Note that we typically do not estimate the intercepts in SEMs (I’m doing so here for continuity). To fit the SEM without the intercept, simply remove the 'y ~ 1' from the model list.

We will pass over the Model Tests and focus on the parameter estimates. The slope estimate of 2.524 and the intercept estimate of 1.086 perfectly match our lm estimates. We don’t get an estimate of the residual variance from lm, but we can extract the residuals from the lm model and calculate the variance ourselves.

fit_simple_ex_lm %>%
  resid() %>%
  var()
> fit_simple_ex_lm %>%
+   resid() %>%
+   var()
[1] 0.7968917

That values is a bit off due to the difference in estimation techniques, I believe.

Fitting complex systems

Now that we’ve seen how linear regression models can be fit as a special case of SEM, let’s take a look at an example that shows where SEM surpasses linear regression.

Again, we will use simulated data, but we’ll download the data so that we can’t see the underlying structure used to generate it. Instead, we’ll use SEMs to determine the structure.

source("https://raw.githubusercontent.com/andisa01/202407_SEM_turorial/main/scripts/SEM_tutorial_example_source.R")

The traditional approach

First, let’s analyze this data using linear regression modeling with stepwise variable selection like almost (the technique use by almost half of all ecological and animal behavioral researcher (including my past self)). In this methodology, we start with a ‘full’ model including all variables and their interactions. Then, we drop non-significant parameters starting with the least parsimonious and refit the model until only significant parameters remain.

# Fit the full or 'global model'
mod_ex01_full <- lm(Y ~ X1 + X2 + X1:X2, data = example01_data_anon)
summary(mod_ex01_full)
> summary(mod_ex01_full)

Call:
lm(formula = Y ~ X1 + X2 + X1:X2, data = example01_data_anon)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5935  -3.7755   0.1861   3.6929  13.7593 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.163110   7.944663  -0.146    0.884
X1          -0.264413   0.387296  -0.683    0.496
X2           0.155742   0.109397   1.424    0.156
X1:X2        0.002022   0.003950   0.512    0.609

Residual standard error: 5.066 on 196 degrees of freedom
Multiple R-squared:  0.144,	Adjusted R-squared:  0.1309 
F-statistic: 10.99 on 3 and 196 DF,  p-value: 1.06e-06

None of our parameters are significant, so we drop the interaction term.

# Drop the interaction
lm(Y ~ X1 + X2, data = example01_data_anon) %>% 
  summary()
> lm(Y ~ X1 + X2, data = example01_data_anon) %>% 
+   summary()

Call:
lm(formula = Y ~ X1 + X2, data = example01_data_anon)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5962  -3.8684   0.1564   3.6213  13.8291 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -4.73378    3.79270  -1.248  0.21346   
X1          -0.08449    0.16216  -0.521  0.60295   
X2           0.19734    0.07308   2.701  0.00753 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.057 on 197 degrees of freedom
Multiple R-squared:  0.1428,	Adjusted R-squared:  0.1341 
F-statistic: 16.41 on 2 and 197 DF,  p-value: 2.555e-07

Now, x2  is significant, but x1 is not. So, we drop x1.

lm(Y ~ X2, data = example01_data_anon) %>% 
  summary()
> lm(Y ~ X2, data = example01_data_anon) %>% 
+   summary() # Drop X1

Call:
lm(formula = Y ~ X2, data = example01_data_anon)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.4915  -3.8082  -0.0172   3.6272  13.5921 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.28441    2.57317  -1.276    0.203    
X2           0.16227    0.02839   5.716 3.97e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.047 on 198 degrees of freedom
Multiple R-squared:  0.1416,	Adjusted R-squared:  0.1373 
F-statistic: 32.67 on 1 and 198 DF,  p-value: 3.966e-08

And, now we have a minimum adequate model to explain the system! We can explain 14% of the variation of y with x2. For every 1 unit increase in x2, we expect y to increase by 0.16. And look at that p-value! Let’s go publish a paper!!!

In fact, this method of stepwise variable selection is so common that you can do it all in one command with dredge() from the MuMIn package.

MuMIn::dredge(mod_ex01_full)
> MuMIn::dredge(mod_ex01_full)
Fixed term is "(Intercept)"
Global model call: lm(formula = Y ~ X1 + X2 + X1:X2, data = example01_data_anon)
---
Model selection table 
   (Int)       X1     X2    X1:X2 df   logLik   AICc delta weight
3 -3.284          0.1623           3 -606.551 1219.2  0.00  0.626
4 -4.734 -0.08449 0.1973           4 -606.413 1221.0  1.81  0.254
8 -1.163 -0.26440 0.1557 0.002022  5 -606.279 1222.9  3.64  0.101
2  4.870  0.31890                  3 -610.048 1226.2  6.99  0.019
1 11.280                           2 -621.824 1247.7 28.49  0.000
Models ranked by AICc(x) 

The complexity of complex systems

The problem with the linear regression approach is that there is an inherent assumption that the independent variables are, well… independent. But, in natural systems, there are almost always relationships between the independent variables. For instance, here are just a few structures that could underly the relationship between y, x1, and x2.

The model structure we derived in our stepwise regression model is the single effect graph in the top left. The model implied by the multiple regression model is a ‘common effect’ structure where the exogenous variables are uncorrelated–in other words x1 and x2 have independent effects on y.

But other structures could exist that we cannot easily capture in regression. For instance, For instance,  x1 and x2 have independent effects on y but remain correlated with each other (common cause with correlation (bottom right)). Or x1 may have no direct effect on y, but may affect x2 which affects y in turn (this is called a chain or fully mediated effect (top middle)). Or, x1 might directly affect y in addition to the indirect effect (partial mediator (top right)).

x2 may not have any effect on y at all, but may still covary because both are directly affected by x1 (common cause (bottom left)).

Using SEMs to compare structural hypotheses

Given all the possible structures, how can we ever know which governs our study system? Well, the first pass is to use common sense. If x1 is height and x2 is weight, the principle of allometry should have us exclude any model without a relationship between them.

The second pass is to use knowledge of your system from the literature. For example, if x1 is maternal genotype and x2 is F1 genotype (of the progeny), there can be no direct effect of the maternal effect on y if the species has an annual lifecycle, but might be partially mediated for perennial species.

Once you have a set of plausible structural hypotheses, we can use the mechanics of SEM to ask which structure best fits the data.

For now, we will assume that all 6 hypotheses above are plausible. We’ll fit each in turn. To do so, I’ll introduce a new operator ~~ which relates the undirected covariation between two variables (i.e. covariance) or with itself (i.e. variance).

# single effect
ex01_formula_x2effect <- '
Y ~ X2
X1 ~~ X1
'

ex01_sem_x2effect <- sem(ex01_formula_x2effect, data = example01_data_anon)

summary(ex01_sem_x2effect)
> summary(ex01_sem_x2effect)
lavaan 0.6-18 ended normally after 8 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         3

  Number of observations                           200

Model Test User Model:
                                                      
  Test statistic                               377.733
  Degrees of freedom                                 2
  P-value (Chi-square)                           0.000

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  Y ~                                                 
    X2                0.162    0.028    5.745    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    X1               32.093    3.209   10.000    0.000
   .Y                25.220    2.522   10.000    0.000

We won’t use the summary output for now, so I will exclude it when fitting the rest of the models.

# chain
ex01_formula_chain <- '
X2 ~ X1
Y ~ X2
'

ex01_sem_chain <- sem(ex01_formula_chain, data = example01_data_anon)
# partially mediated
ex01_formula_mediator <- '
X2 ~ X1
Y ~ X2
Y ~ X1
'

ex01_sem_mediator <- sem(ex01_formula_mediator, data = example01_data_anon)
# common cause
ex01_formula_commoncause <- '
X2 ~ X1
Y ~ X1
'

ex01_sem_commoncause <- sem(ex01_formula_commoncause, data = example01_data_anon)
# common effect (uncorrelated)
ex01_formula_commoneffect <- '
Y ~ X1
Y ~ X2
'

ex01_sem_commoneffect <- sem(ex01_formula_commoneffect, data = example01_data_anon)
# common effect (correlated)
ex01_formula_commoneffect2 <- '
Y ~ X1
Y ~ X2
X1 ~~ X2
'

ex01_sem_commoneffect2 <- sem(ex01_formula_commoneffect2, data = example01_data_anon)

We’ll use the anova() command to get some summary statistics on all of the models.

anova(
  ex01_sem_chain, 
  ex01_sem_commoneffect, 
  ex01_sem_commoncause, 
  ex01_sem_x2effect, 
  ex01_sem_commoneffect2, 
  ex01_sem_mediator
)
Chi-Squared Difference Test

                       Df    AIC    BIC    Chisq Chisq diff RMSEA Df diff Pr(>Chisq)    
ex01_sem_commoneffect   0 1218.8 1228.7   0.0000                                        
ex01_sem_commoncause    0 2425.5 2442.0   0.0000       0.00 0.000       0               
ex01_sem_commoneffect2  0 3688.8 3708.6   0.0000       0.00 0.000       0               
ex01_sem_mediator       0 2425.5 2442.0   0.0000       0.00 0.000       0               
ex01_sem_chain          1 2423.8 2437.0   0.2754       0.28 0.000       1     0.5997    
ex01_sem_x2effect       2 2480.4 2490.3 377.7330     377.46 1.372       1     <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Warning message: lavaan->lavTestLRT():  
   some models have the same degrees of freedom 

The models are not mutually nested, so we can’t interpret the chi-squared test. However, we can use the two information criteria (AIC and BIC) since those are agnostic to the particular models in comparison. Both AIC and BIC indicate that the uncorrelated common effect structure is the most likely model.

That’s not the end of the story, though…

Cross-validating SEMs

As an applied data scientist, I’ve become very skeptical of parametric goodness-of-fit metrics. After all, I can fit a neural net that perfectly fits any dataset by memorizing it. What I am really interested is in how well a model can make accurate predictions.

Ideally, we would fit our model on one dataset and then collect separate data to validate its predictive accuracy. This is called cross-validation. Given the limitations of field research, however, a more common approach is to split your single dataset into two groups. One for fitting and one for validation.

The simple form of cross-validation only gives us one chance to measure accuracy (or lack thereof, i.e. error). In order to maximize the use of our data and get better estimates, it is a standard practice to split the dataset into a given number of evenly distributed, randomly sampled splits (i.e. folds). Then we can perform the cross-validation as many times as we have folds. This is called k-fold cross-validation.

We can use the library cvsem to easily conduct a 10-fold cross validation on all of our models.

# Cross validating
models_to_cv <- 
  cvgather(
    ex01_sem_chain, 
    ex01_sem_commoneffect, 
    ex01_sem_commoncause, 
    ex01_sem_x2effect, 
    ex01_sem_commoneffect2, 
    ex01_sem_mediator
    )

cvsem(
  data = example01_data_anon,
  Models = models_to_cv,
  k = 10
)
Cross-Validation Results of 6 models 
based on  k =  10 folds. 

                   Model E(KL-D)   SE
1         ex01_sem_chain    0.32 0.16
3   ex01_sem_commoncause    0.33 0.16
6      ex01_sem_mediator    0.33 0.16
4      ex01_sem_x2effect    5.90 1.32
5 ex01_sem_commoneffect2    7.16 1.23
2  ex01_sem_commoneffect    7.20 1.23

Both of the ‘common cause’ structures exhibit the worst predictive performance. They have the highest errors (by default, the error is calculated via KL-Divergence–the difference between the covariance matrix of the validation set and the covariance matric implied by the fitted model). To be fair, a very quick look at the relationship between x1 and x2 could have told us that any model that fails to account for the relationship of these two models is not realistic.

The next highest error comes from the model suggested by our initial stepwise regression method. Thus, based on predictive performance, we can conclusively exclude these hypothesized structures.

The chain, common cause, and mediator structures all have similar predictive accuracy. How can we decide which is the true structure of our system?  Unfortunately, there is no statistical method to differentiate with our data. However, we can still use the structural model to help us develop further experiments to select the final model. For example, we could envision conducting an experiment where we held x2 constant in one set of plots and let it vary in another.

If the system were a common cause structure, holding x2 would cause no difference in the y values between plots. However, if the system were a fully mediated chain through x2 , holding x2 would completely decouple the association between x1 and y, whereas controlling x2 in a partially mediated system would only attenuate the covariation between x1 and y.

In our toy example, the true causal structure will be easy to diagnose when I tell you what the variables really are:

y ~ Shark attacks,
x1 ~ Weather,
x2 ~ Ice cream sales  

Estimating SEM coefficients

You might be tempted to take the SEM for the common cause structure and use that in your paper. I’d advise against it. Since we used all of our data in estimating the structure of the system, we don’t want to reuse the same data to estimate parameter coefficients. That would be double-dipping on our data. Having our cake and eating it too.

Instead, we need to either collect additional data (ideal) or reserve a split of our original data to use as the final fitting data. Since this is a simulated dataset, we can ‘collect’ all of the new data we want.

set.seed(666)

n <- 200

weather <- rnorm(n, mean = 20, sd = 5)

ice_cream_sales <- 50 + 2 * weather + rnorm(n, mean = 0, sd = 5)

shark_attacks <- 5 + 0.3 * weather + rnorm(n, mean = 0, sd = 5)

example01_newdata <- data.frame(shark_attacks, weather, ice_cream_sales)
# common cause
ex01_new_commoncause <- '
ice_cream_sales ~ weather
shark_attacks ~ weather
# shark_attacks ~ 1
'

ex01_sem_new_commoncause <- sem(ex01_new_commoncause, data = example01_newdata)

summary(ex01_sem_new_commoncause)
> summary(ex01_sem_new_commoncause)
lavaan 0.6-18 ended normally after 10 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

  Number of observations                           200

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                    Estimate  Std.Err  z-value  P(>|z|)
  ice_cream_sales ~                                    
    weather            2.131    0.065   32.853    0.000
  shark_attacks ~                                      
    weather            0.419    0.061    6.852    0.000

Covariances:
                     Estimate  Std.Err  z-value  P(>|z|)
 .ice_cream_sales ~~                                    
   .shark_attacks       3.026    1.644    1.841    0.066

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .ice_cream_sals   24.452    2.445   10.000    0.000
   .shark_attacks    21.727    2.173   10.000    0.000

Our coefficient estimates are close, but not spot on the true parameter values we simulated. The reason is that the simulated errors are pretty high (sd = 5). Rather than fitting the final model once, we can borrow another technique from machine learning and repeatedly fit bootstrap replicates. This will both stabilize our estimates and provide a convenient, non-parametric confidence interval.

Boostrapping SEM coefficients

Content to come!

Extensions to SEM

Content to come!

Laten variables

Content to come!

Composite variables

Content to come!

Non-linear relationships

Content to come!

Hierarchical relationships

Content to come!

Background

This tutorial was originally developed for a graduate workshop at Yale University on 2024 07 25.