## Overview

Examples using linear regression and linear regression with smooths

ggplot2 for graphics and visualization

Synthesize models and visualization to

• Visualize data to inform models and
• condition on models to explore data

## Basics

require(ggplot2)
(p <- ggplot(d, aes(mentalHealth)) + geom_histogram(binwidth = 2)) How would you summarize the distribution?

## Basics

You could use the expectation $E(y) = 52.92$

p + geom_vline(xintercept = mean(d$mentalHealth), size = 2, colour = "red") ## Basics (p <- ggplot(d, aes(mastery, mentalHealth)) + geom_point()) How would you summarize this distribution? ## Basics You could use the conditional expectation $$E(y | f(x))$$ where $$f(\cdot)$$ is some function, say linear m <- lm(mentalHealth ~ mastery, data = d)  $\widehat{mentalHealth} = 36.9 + 5.14 mastery$ p + geom_abline(intercept = coef(m), slope = coef(m)) ## Multiple dimensions The conditional expectation generalizes to $$\mathbb{R}^{N}$$, but graphs do not. coef(lm(mentalHealth ~ mastery + neuroticism, d))[["mastery"]]  ##  -1  Under certain conditions, $$E(y | f(x_1, x_2)) \approx E( y | f(x_1) ~~ | ~~ f(x_2 | f(x_1))~)$$ m1 <- lm(cbind(mentalHealth, mastery) ~ neuroticism, d) coef(lm(mentalHealth ~ mastery, rd <- as.data.frame(resid(m1))))[["mastery"]]  ##  -1  ## Multiple dimensions We can take advantage of this for data and model visualization p <- ggplot(d, aes(mastery, mentalHealth)) + geom_point() + stat_smooth(method = "lm") require(gridExtra) grid.arrange(p, p %+% rd, ncol = 2) ## Multiple dimensions This concept also generalizes: $$E( y | f(x_1, \ldots, x_{n - 1}) ~~ | ~~ f(x_n | f(x_1, \ldots, x_{n - 1}))~)$$ rd <- as.data.frame(resid(lm(cbind(mentalHealth, mastery) ~ age + hope + neuroticism, d))) p %+% rd ## Multiple dimensions require(GGally) ggpairs(d[, c(1, 2, 11, 12)], lower = list(continuous = "smooth")) ## Other Functional Forms So far simple linear functional forms, but $$f(\cdot)$$ can be any When form is unknown, splines or smooth parameters are useful, especially for "nuissance" variables Many options -- mgcv package has the gam function for generalized additive models with a general smooth function s() defaulting to thin-plate Also the splines package for b-splines bs() or natural cubic splines ns() you can use in any modeling function require(mgcv) require(splines)  ## Other Functional Forms # fit and store models m.linear <- lm(positiveAffect ~ deltaTime, d) m.quadratic <- lm(positiveAffect ~ deltaTime + I(deltaTime^2), d) m.smooth <- gam(positiveAffect ~ s(deltaTime, k = 5), data = d) # data set of raw and predicted values pdat <- cbind(d[, c("deltaTime", "positiveAffect")], linear = fitted(m.linear), quadratic = fitted(m.quadratic), smooth = fitted(m.smooth)) # plot of raw data with separate coloured lines by model p <- ggplot(pdat, aes(deltaTime, positiveAffect)) + geom_point() + geom_line(aes(y = linear), colour = "green", size=1.2) + geom_line(aes(y = quadratic), colour = "blue", size=1.2) + geom_line(aes(y = smooth), colour = "red", size=1.2)  ## Other Functional Forms ## Other Functional Forms Days since treatment, visual acuity (logMAR), and age are nuissance variables; we care about mastery & positive affect ldat <- melt(d[, c("deltaTime", "logMAR1", "age", "mastery", "positiveAffect")], id.vars="positiveAffect") ggplot(ldat, aes(value, positiveAffect)) + geom_point() + stat_smooth(se=FALSE) + facet_wrap(~variable, scales="free_x") ## Other Functional Forms # smooth model - k = 10 for high flexibility m.smooth <- gam(positiveAffect ~ s(deltaTime, k = 10) + s(logMAR1, k = 10) + s(age, k = 10), data = d) # linear model only m.linear <- lm(positiveAffect ~ deltaTime + logMAR1 + age, data = d) # pull out residuals in a data set pdat <- data.frame(mastery = d$mastery,
residualPositiveAffect = c(resid(m.smooth), resid(m.linear)),
Type = rep(c("smooth", "linear"), each = nrow(d)))

# make the graph (separate by model type)
p <- ggplot(pdat, aes(mastery, residualPositiveAffect)) +
geom_point() + stat_smooth(method="lm") +
facet_wrap(~ Type)


## Other Functional Forms # correlation between residuals
cor(resid(m.smooth), resid(m.linear))

##  0.96


## Review

Packages

• ggplot2
• gridExtra
• GGally
• mgcv
• reshape

Graphics Functions

• ggplot() - setup graph
• geom_point() - add points
• geom_histogram() - histogram
• stat_smooth() - add smooth line
• ggpairs() - matrix of graphs
• aes() - set aesthetics (e.g., x axis, colour)

Model Functions

• lm() - linear model
• gam() - additive model
• coef() - extract coefficients
• resid() - extract residuals
• fitted() - extract fitted values
• s() - smooth a term
• I() - R commands in formulae

/

#