Psychometrics

Animated MCMC plot

Psychometrics: Machine Learning meets Psychology

Joshua Wiley, M.A.

Senior Analyst — Elkhart Group

What is psychometrics?

“field of study concerned with the theory and technique of psychological measurement”

Wikipedia

What is psychometrics?

But really I am just going to talk about how to combine sets of variables using latent variables

Example 1

Inflammation is the body's natural response to bacteria or other pathogens

In response to an infection, wound, or surgery, inflammation is appropriate and adaptive

Inflammation also responds to stress

Sometimes, people are under chronic stress and always have low grade inflammation

Example 1

Chronic inflammation is related to negative outcomes

Increased risk of cardiovascular disease

Tumor growth and metastasis

Increased sensitivity to rejection

Fatigue, social withdrawal, depression

Example 1

Wouldn't it be great if we could predict inflammation?

Or even measure it?

Inflammation is elevated levels of: IL-1, IL-6, TNF-a, NF-kB, CRP, fibrinogen, adhesion molecules

No single biological measure captures all of inflammation

Midlife in the United States

A national longitudinal study of health & well-being (MIDUS)

We will use three measures of inflammation from the blood:

age and waist-to-hip ratio (whr) are predictors

Biological data on about 1,200 adults

Example 1

require(semutils)
require(xtable)
desc <- SEMSummary(~ ., data = midus)
desc.table <- APAStyler(desc, type = "cor")
plot(desc, points = FALSE)
Heatmap of the correlation matrix among study variables

Example 1

Descriptive Statistics for MIDUS
N M SD 1. 2. 3. 4. 5.
1. age 1231 54.54 11.69 - 0.13 0 0.17 0.11
2. whr 1231 0 1 - 0.11 0.18 0.05
3. crp 1231 0 1 - 0.52 0.54
4. il6 1231 -0.01 0.99 - 0.44
5. fibrinogen 1231 0 1 -

Aside on knitr

knitr made these slides --- code is dynamically linked with text and output automagically included

Written by Yihui Xie

Official website: http://yihui.name/knitr/

The book: Dynamic Documents with R and knitr, is coming out July from CRC Press

I am giving a longer presentation on knitr to the LA R User Group this summer, check http://elkhartgroup.com/ for details

Example 1

Factor Model

Let \(\mathbf{y_i}\) be all the dependent variables (factor indicators), \(\boldsymbol{\nu}\) be a vector of the intercepts, \(\boldsymbol{\Lambda}\) be the parameter vector of factor loadings, \(\boldsymbol{\eta_i}\) be the factor scores, \(\boldsymbol{\varepsilon_i}\) be the residuals

But \(\boldsymbol{\eta_i}\) is completely missing (latent)! We will need identifiability constraints.

Think the matrix equivalent of regression

\[ \mathbf{y_i} = \boldsymbol{\nu + \Lambda\eta_i + \varepsilon_i} \]

We model the variance covariance matrix

\[ \boldsymbol{\Sigma = \Lambda\Psi\Lambda^{T} + \Theta} \]

Where \(\boldsymbol{\Sigma}\) is the population variance covariance matrix, \(\boldsymbol{\Psi}\) is the variance covariance matrix of \(\boldsymbol{\eta_i}\), and \(\boldsymbol{\Theta}\) is the residual variance covariance matrix

Factor Model

To identify the model, we can either fix the factor variance(s), by convention to 1, or fix one loading (an element of \(\boldsymbol{\Lambda}\) ), again by convention to 1

Also impose a structure that \(\boldsymbol{\Theta} = diag(\boldsymbol{\varepsilon_i})\) --- conditioning on the latent factors, the data are independent

In practice, we can relax some of those constraints as long as we do not relax too many

In terms of the distribution of the data, this boils down to assuming that:

\[ \mathbf{y_i} \sim MVN(\boldsymbol{\nu + \Lambda\eta_i}, diag(\boldsymbol{\varepsilon_i})) \]

Using Stan

There are many ways to estimate these models, and lots of dedicated software

To start, we will write out the models in the bayesian software, Stan in R

The data are the three inflammatory markers, crp, il6, and fibrinogen

# setup data in the format for Stan
cfa.data <- list(N = nrow(midus), K = 3L, L = 2L, x = matrix(unlist(midus[, 
    c("crp", "il6", "fibrinogen")]), ncol = 3))

Using Stan

Stan code: eventually it is one big model, but we will break it into chunks

cfa.model <- 'data {
  int<lower=0> N; // number of  observations
  int<lower=0> K; // vars
  int<lower=0> L; // number of loadings
  real x[N, K];
}'

The first chunk declares teh data. This includes the type of the variables (integer, real (i.e., decimales)), bounds on them, if applicable, and the names

Brackets are used to indicate a matrix, and double forward slash are comments

Using Stan

cfa.model <- paste0(cfa.model, '
parameters {
  real alpha[K];
  real lambda[L];
  real f[N]; // factor
  real<lower=0> sigma2[K]; // residual variances
  real<lower=0> fsigma2; // factor variances
}')

Here we define the parameter section, including all the parameters used in the models

The brackets here specify vectors

Using Stan

This section defines transformations of the parameters

We calculate the standard deviation of the variances

cfa.model <- paste0(cfa.model, '
transformed parameters {
  real<lower=0> sigma[K];
  real<lower=0> fsigma;

  fsigma <- sqrt(fsigma2);
  for (j in 1:K) {
    sigma[j] <- sqrt(sigma2[j]);
  }
}')

Using Stan

This section defines the actual model

Priors on the intercepts and loadings are \(\sim \mathcal{N}(0, 100)\), priors on the standard deviations are \(\sim Inv-Gamma(.001, .001)\)

The data values are then normally distributed on the location parameter, and their standard deviations

cfa.model <- paste0(cfa.model, '
model {
  for (j in 1:K) {
    alpha[j] ~ normal(0, 100);
    sigma[j] ~ inv_gamma(.001, .001);
  }
  for (j in 1:L) {
    lambda[j] ~ normal(0, 100);
  }
  fsigma ~ inv_gamma(.001, .001);

  f ~ normal(0, fsigma);

  for (i in 1:N) {
    x[i, 1] ~ normal(alpha[1] + f[i], sigma[1]);
    x[i, 2] ~ normal(alpha[2] + lambda[1] * f[i], sigma[2]);
    x[i, 3] ~ normal(alpha[3] + lambda[2] * f[i], sigma[3]);
  }
}')

Using Stan

Compiles and run --- note compiling can take a bit as the c++ code is compiled to an excecutable

require(rstan)
cfa.1 <- stan(model_code = cfa.model, data = cfa.data, iter = 500, 
    chains = 2)

# example of how to use an already compiled model and run
# the chain longer
cfa.2 <- stan(fit = cfa.1, data = cfa.data, iter = 2500, chains = 2)

Using Stan

Just the simple summary Stan gives by default

print(cfa.2, pars = c("alpha", "lambda", "sigma2", "fsigma"))
## Inference for Stan model: cfa.model.
## 2 chains, each with iter=2500; warmup=1250; thin=1; 
## post-warmup draws per chain=1250, total post-warmup draws=2500.
## 
##           mean se_mean  sd 2.5% 25% 50% 75% 98% n_eff Rhat
## alpha[1]   0.0       0 0.0 -0.1 0.0 0.0 0.0 0.1   934    1
## alpha[2]   0.0       0 0.0 -0.1 0.0 0.0 0.0 0.0  1425    1
## alpha[3]   0.0       0 0.0 -0.1 0.0 0.0 0.0 0.1  1443    1
## lambda[1]  0.8       0 0.0  0.7 0.8 0.8 0.8 0.9   534    1
## lambda[2]  0.9       0 0.1  0.8 0.8 0.9 0.9 1.0   403    1
## sigma2[1]  0.4       0 0.0  0.3 0.4 0.4 0.4 0.4   655    1
## sigma2[2]  0.6       0 0.0  0.5 0.6 0.6 0.6 0.6  1075    1
## sigma2[3]  0.5       0 0.0  0.5 0.5 0.5 0.6 0.6   788    1
## fsigma     0.8       0 0.0  0.7 0.8 0.8 0.8 0.9   429    1
## 
## Samples were drawn using NUTS2 at Thu Jun 13 08:47:42 2013.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Example 1

Using Stan

Now we can get and plot the posterior distributions of the coefficients

params <- extract(cfa.2, c("lambda", "alpha", "sigma2", "fsigma2"))
require(reshape2) # reshape data to long
require(ggplot2) # graphing
p <- ggplot(melt(params$lambda), aes(value)) +
  geom_histogram() + facet_wrap(~Var2)
print(p)
plot of chunk unnamed-chunk-13

Using Stan

Posteriors for intercepts

p %+% melt(params$alpha)
plot of chunk unnamed-chunk-14

Using Stan

Posteriors for residual variables

p %+% melt(params$sigma2)
plot of chunk unnamed-chunk-15

Multilevel Equivalent

We can fit a similar model using multilevel models, by reshaping long

lmidus <- melt(cbind(midus, id = 1:nrow(midus)), id.vars = c("id", "age", "whr"))
require(MCMCglmm)
m <- MCMCglmm(value ~ 0 + variable, random = ~ id, data = lmidus,
  pr=TRUE, pl=TRUE, verbose=FALSE)
# plot random effects versus latent factor scores
plot(colMeans(m$Sol)[-(1:3)], colMeans(extract(cfa.2, "f")$f))
plot of chunk unnamed-chunk-16

Structural Equations

In Stan, it is easy to extend the factor model to see what other factors predict inflammation

cfa2.model <- '
data {
  int<lower=0> N; // number of  observations
  int<lower=0> K; // vars
  int<lower=0> L; // number of loadings
  real x[N, K];
  real age[N];
  real whr[N];
}
parameters {
  real alpha[L]; // note
  real lambda[L];
  real beta[L]; // regression coefficients
  real f[N]; // factor
  real cons; // factor intercept
  real<lower=0> sigma2[K]; // residual variances
  real<lower=0> fsigma2; // factor residual variance
}
transformed parameters {
  real<lower=0> sigma[K];
  real<lower=0> fsigma;

  fsigma <- sqrt(fsigma2);
  for (j in 1:K) {
    sigma[j] <- sqrt(sigma2[j]);
  }
}'

Structural Equations

cfa2.model <- paste0(cfa2.model, '
model {
  for (j in 1:K) {
    sigma[j] ~ inv_gamma(.001, .001);
  }
  for (j in 1:L) {
    alpha[j] ~ normal(0, 100);
    lambda[j] ~ normal(0, 100);
    beta[j] ~ normal(0, 100);
  }

  cons ~ normal(0, 100);
  fsigma ~ inv_gamma(.001, .3);

  for (i in 1:N) {
    f[i] ~ normal(cons + beta[1] * age[i] + beta[2] * whr[i], fsigma);
    x[i, 1] ~ normal(f[i], sigma[1]);
    x[i, 2] ~ normal(alpha[1] + lambda[1] * f[i], sigma[2]);
    x[i, 3] ~ normal(alpha[2] + lambda[2] * f[i], sigma[3]);
  }
}
')

Structural Equations

Now we setup the data and run

cfa2.data <- list(N = nrow(midus), K = 3L, L = 2L,
  x = matrix(unlist(midus[, c("crp", "il6", "fibrinogen")]), ncol = 3),
  age = midus$age, whr = midus$whr)

cfa2 <- stan(model_code = cfa2.model, data = cfa2.data, iter = 2500, chains = 2)

Using Stan

Print a parameter summary

print(cfa2, pars = c("alpha", "lambda", "cons", "beta", "sigma2", 
    "fsigma"))
## Inference for Stan model: cfa2.model.
## 2 chains, each with iter=2500; warmup=1250; thin=1; 
## post-warmup draws per chain=1250, total post-warmup draws=2500.
## 
##           mean se_mean  sd 2.5%  25%  50%  75%  98% n_eff
## alpha[1]   0.0       0 0.0 -0.1  0.0  0.0  0.0  0.0   859
## alpha[2]   0.0       0 0.0 -0.1  0.0  0.0  0.0  0.0   422
## lambda[1]  0.9       0 0.1  0.8  0.8  0.8  0.9  1.0   150
## lambda[2]  0.9       0 0.1  0.8  0.8  0.9  0.9  1.0   130
## cons      -0.3       0 0.1 -0.6 -0.4 -0.3 -0.2 -0.1   508
## beta[1]    0.0       0 0.0  0.0  0.0  0.0  0.0  0.0   512
## beta[2]    0.1       0 0.0  0.1  0.1  0.1  0.1  0.2  1061
## sigma2[1]  0.4       0 0.0  0.3  0.4  0.4  0.4  0.5   149
## sigma2[2]  0.6       0 0.0  0.5  0.5  0.6  0.6  0.6   430
## sigma2[3]  0.5       0 0.0  0.5  0.5  0.5  0.6  0.6   281
## fsigma     0.8       0 0.0  0.7  0.7  0.8  0.8  0.8   135
##           Rhat
## alpha[1]     1
## alpha[2]     1
## lambda[1]    1
## lambda[2]    1
## cons         1
## beta[1]      1
## beta[2]      1
## sigma2[1]    1
## sigma2[2]    1
## sigma2[3]    1
## fsigma       1
## 
## Samples were drawn using NUTS2 at Thu Jun 13 08:53:37 2013.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Using Stan

Traceplots of loadings

rstan::traceplot(cfa2, "lambda")
plot of chunk unnamed-chunk-21

Example 2 - Frequentist

Example 2

Table 1. The Perceived Cancer Control and Threat Scale
  Not at all true A little true Somewhat true Quite true Completely true
1. I feel that cancer is a serious threat to my life 1 2 3 4 5
2. I feel that getting cancer (or a second cancer) will interfere with important life goals 1 2 3 4 5
4. I feel overwhelmed when I think about getting cancer (or a second cancer) 1 2 3 4 5
7. I feel in control when I think about my future health 1 2 3 4 5
8. I can control taking actions to prevent getting cancer (or a second cancer) 1 2 3 4 5
9. I can control detecting cancer at the earliest possible stage 1 2 3 4 5

Example 2

desc.table <- APAStyler(SEMSummary(~ ., data = midus), type = "cor")
print(xtable(desc.table$table, caption = "Descriptive Statistics for Cancer Threat/Control"), 
    type = "html", file = "", html.table.attributes = NULL, comment = FALSE, 
    caption.placement = "top")
Descriptive Statistics for Cancer Threat/Control
N M SD 1. 2. 3. 4. 5.
1. age 1231 54.54 11.69 - 0.13 0 0.17 0.11
2. whr 1231 0 1 - 0.11 0.18 0.05
3. crp 1231 0 1 - 0.52 0.54
4. il6 1231 -0.01 0.99 - 0.44
5. fibrinogen 1231 0 1 -

Example 2

You can find more details on this sample, the questionnaire, and models here

To try out this and other code, download a dataset simulated based on the real data (cannot share for confidentiality reasons)

http://joshuawiley.com/files/simwdat.csv

Example 2

Example 2

We can start by focusing on control at time 3, and use a multilevel measurement model

ldat <- melt(cbind(wdat[, 4:6], id = 1:nrow(wdat)), id.vars = "id")
head(ldat)  # look at data
##   id variable value
## 1  1   prc7T3     4
## 2  2   prc7T3     2
## 3  3   prc7T3     3
## 4  4   prc7T3     3
## 5  5   prc7T3     3
## 6  6   prc7T3     4
require(lme4)  # frequentist mixed effects models
m <- lmer(value ~ 0 + variable + (1 | id), data = ldat, REML = FALSE)

Example 2

summary(m)
## Linear mixed model fit by maximum likelihood 
## Formula: value ~ 0 + variable + (1 | id) 
##    Data: ldat 
##   AIC  BIC logLik deviance REMLdev
##  5182 5210  -2586     5172    5186
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  id       (Intercept) 0.740    0.860   
##  Residual             0.565    0.752   
## Number of obs: 1848, groups: id, 616
## 
## Fixed effects:
##                Estimate Std. Error t value
## variableprc7T3    3.268      0.046    71.0
## variableprc8T3    3.539      0.046    76.9
## variableprc9T3    3.615      0.046    78.5
## 
## Correlation of Fixed Effects:
##             vrb7T3 vrb8T3
## varblprc8T3 0.567        
## varblprc9T3 0.567  0.567

Example 2

We can extract, store, and plot the best linear unbiased predictors (BLUPs) of the random intercept

# mixed model estimates
mm.yhat <- ranef(m)$id[["(Intercept)"]]
hist(mm.yhat)
plot of chunk unnamed-chunk-28

lavaan

There is lots of specialized software for latent variable models as are often used in psychometrics

One of these in R is lavaan

It is available as a free R package, and you can find examples and details online at http://lavaan.org

require(lavaan)

lavaan

Setting up models is straightforward and lavaan uses sensible defaults so

lavaan.model <- '
  control =~ prc7T3 + prc8T3 + prc9T3
  prc7T3 ~~ e * prc7T3
  prc8T3 ~~ e * prc8T3
  prc9T3 ~~ e * prc9T3
'
# estimate it
fit <- cfa(lavaan.model, data = wdat)
summary(fit)
## lavaan (0.5-13) converged normally after  17 iterations
## 
##   Number of observations                           616
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic               33.032
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Parameter estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
##                    Estimate  Std.err  Z-value  P(>|z|)
## Latent variables:
##   control =~
##     prc7T3            1.000
##     prc8T3            1.058    0.059   17.795    0.000
##     prc9T3            1.051    0.059   17.741    0.000
## 
## Variances:
##     prc7T3    (e)     0.565    0.023
##     prc8T3    (e)     0.565    0.023
##     prc9T3    (e)     0.565    0.023
##     control           0.689    0.068

That is it. Unlike Stan where everything has to be written, lavaan uses defaults so less code to write

lavaan

We can also get predicted factor scores and compare those with the multilevel measurement model

lavaan.yhat <- as.vector(predict(fit))
plot(mm.yhat, lavaan.yhat)
abline(0, 1)
plot of chunk unnamed-chunk-31

lavaan

Easier to extend the models in lavaan than in Stan or mixed model framework

We can do a two factor model as in the diagram for threat and control at time 3

lavaan.model <- '
  control  =~ prc7T3 + prc8T3 + prc9T3
  threat  =~ prc1T3 + prc2T3 + prc4T3 + prc7T3
'
# estimate it
fit <- cfa(lavaan.model, data = wdat)

lavaan

summary(fit)
## lavaan (0.5-13) converged normally after  30 iterations
## 
##   Number of observations                           616
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic               29.487
##   Degrees of freedom                                 7
##   P-value (Chi-square)                           0.000
## 
## Parameter estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
##                    Estimate  Std.err  Z-value  P(>|z|)
## Latent variables:
##   control =~
##     prc7T3            1.000
##     prc8T3            1.356    0.098   13.828    0.000
##     prc9T3            1.306    0.094   13.963    0.000
##   threat =~
##     prc1T3            1.000
##     prc2T3            1.220    0.095   12.830    0.000
##     prc4T3            0.912    0.075   12.237    0.000
##     prc7T3           -0.325    0.053   -6.105    0.000
## 
## Covariances:
##   control ~~
##     threat           -0.106    0.031   -3.366    0.001
## 
## Variances:
##     prc7T3            0.717    0.050
##     prc8T3            0.402    0.052
##     prc9T3            0.470    0.051
##     prc1T3            0.776    0.066
##     prc2T3            0.560    0.078
##     prc4T3            1.038    0.072
##     control           0.480    0.062
##     threat            0.728    0.088

lavaan

Can get a path diagram using the semPlot package

require(semPlot)
semPaths(fit, what = "par")
plot of chunk unnamed-chunk-34

Mplus

There are also powerful commercial software for estimating latent variable models

Mplus http://statmodel.com/ estimates latent variables, and the observed variables can be nearly any type (continuous, categorical, survival), mixture models, and even has a Bayesian estimator

To make using Mplus easier, Michael Hallquist wrote the MplusAutomation packge, and I codevelop it with him

You can get more information from github here or see a recent presentation I gave on it

Mplus

We can fit the same two factor model from lavaan in Mplus

require(MplusAutomation)
cd(tempdir(), pre <- "cfa", num <- "prc_lgm")

cfa <- mplusObject(
  TITLE = "Factor Model for PRC Control and Threat;",
  MODEL = "
  Threat BY prc1T3 prc2T3 prc4T3 prc7T3;
  Control BY prc7T3 prc8T3 prc9T3;",
SAVEDATA = "
  FILE IS estimates.dat;
  SAVE = FSCORES;",
usevariables = paste0("prc", c(1, 2, 4, 7, 8, 9), "T3"),
rdata = wdat)

# estimate the model
m.cfa <- mplusModeler(cfa, paste0(pre, num, ".dat"),
  paste0(pre, num, ".inp"), run=TRUE)

Mplus

summary of the model object (mostly fit indices used in factor analysis and SEM)

summary(m.cfa)
##  Factor Model for PRC Control and Threat 
## 
## Estimated using ML 
## Number of obs: 616, number of (free) parameters: 20 
## 
## Model: Chi2(df = 7) = 29.487, p = 1e-04 
## Baseline model: Chi2(df = 15) = 1121.756, p = 0 
## 
## Fit Indices: 
## 
## CFI = 0.98, TLI = 0.956, SRMR = 0.035 
## RMSEA = 0.072, 90% CI [0.047, 0.1], p < .05 = 0.075 
## AIC = 10791.398, BIC = 10879.863

Mplus

A simple plot of coefficients

plot(m.cfa)
plot of chunk unnamed-chunk-37

Review

Packages

Models

Questions?

/

#
fake yves saint laurent crossbody wallet Heels wedding christian louboutin outlet replica buy Wholesale replica store celine bags sunglasses authentic louis vuitton outlet bags authentic louis vuitton outlet purses Official celine outlet replica handbags mini luggage tote Replica store shop women crossbody purses yves saint laurent Fake shop chloe handbags bags official price Bags wallet buy fake shop yves saint laurent Christian louboutin replica store shop heels shoes Celine outlet online handbags luggage the best Wallet handbags outlet sale chloe shop Purse women replica store prada best bags kelly hermes shop store online Wholesale yves saint laurent crossbody bags replica chloe handbags sales Bags clutch authentic online prada fake replica Heels shoes authentic best christian louboutin sale fake Replica outlet bags trapeze 2015 celine hermes Official price Outlet Cheap bags handbags canada goose saleCanada Goose Outletceline boston bagscheap canada goose
Bags wallet yves saint laurent online 2015 replica store yves saint laurent Heren sale Bags trapeze cheap celine outlet store artsy bag louis vuitton 2015 women clutch yves saint laurent outlet online Sale replica authentic wholesale chloe handbags bags Yves saint laurent outlet sale real women totes bags bags Garden Party hermes online best store fake online Real bags handbags Outlet sale hermes Louis vuitton sale fake official price luggage wallet replica prada sunglasses usa Store shop heels shoes christian louboutin best real celine replica purses yves saint laurent avslag replica Celine bags mini outlet replica wholesale Celine fake replica online handbags price yves saint laurent sac discount canada goose salecheap canada goose saleceline outletcanada goose black friday
red bottoms men celine bag outlet prada bags celine luggage sale red bottoms heels prada wallet on sale celine mini luggage tote replica christian louboutin wedding shoes christian louboutin pumps chloe cheap wallets for sale hermes replica bags celine handbags outlet prada bags on sale christian louboutin sneakers hermes belt replica hermes replica bags for sale celine bag sale prada sunglasses cheap prada sunglasses cheap red bottoms men buy chloe wallets prada sunglasses on sale christian louboutin women hermes outlet handbags hermes bag replica red bottoms sneakers prada shoes on sale hermes replica belt yves saint laurent crossbody purses yves saint laurent wallet replica chloe cheap wallets for sale prada sunglasses cheap celine phantom outlet outlet celine red bottoms heels celine bag replica kelly bag hermes bag outlet christian louboutin heels