knitr: Automatically embedding R output in documents

Joshua Wiley, M.A.

Senior Analyst — Elkhart Group


I: What are 'dynamic documents'?

"source documents containing both program code and narratives" (Yihui Xie)

program code: e.g., for statistical analyses, making graphs

narratives: literate text, explaining the results or output from the program code

I: Why write dynamic documents?

I: Why not write dynamic documents?

If you already dynamic documents decrease errors and save time.

I: Resources

II: Basics

Markup, Code Chunks, Graphs and Output

II: Markdown

knitr supports many markup languages, but to start, markdown is nice and simple

Headers (level 1, 2, 3): #, ##, ###

Bold, italics: **x**, _x_


| Column 1 | Column 2 | Column 3 |
|:---------|---------:|:--------:|
|   left   |   right  | center   |
|   row 2  |   text   | text     |


# graphics packages
require(ggplot2)

# boxplot
qplot(cut, price, data = diamonds, geom = "boxplot")

# linear regression
summary(lm(price ~ carat, data = diamonds))

II: R & Markdown

R Markdown files: .Rmd

Write regular markdown for most of the file

Put R code in chunks. Chunks start with “```{r}” and end with “```”

Chunk options go at the start, between the braces: “```{r, options}”

Next is a simple, but complete, R markdown file

II: R & Markdown

# Diamond Cut, Size, and Price

Diamonds with an _ideal_ cut have a lower median price.

# graphics packages
require(ggplot2)

# boxplot
qplot(cut, price, data = diamonds, geom = "boxplot")

One explanation for this unexpected finding is that ideal cut diamonds also tend to be **smaller**, and size is related to price.

summary(lm(price ~ carat, data = diamonds))
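In the actual .Rmd file, each block of R code above sits between chunk delimiters. A minimal sketch that writes the file out with the delimiters in place and knits it. Assumptions: the file is written to a temporary directory as example1.Rmd, the first chunk loads ggplot2, and the knitting step runs only if knitr and ggplot2 are installed.

```r
# Lines of the example file, with the chunk delimiters written out
rmd <- c(
  "# Diamond Cut, Size, and Price",
  "",
  "Diamonds with an _ideal_ cut have a lower median price.",
  "",
  "```{r}",
  "# graphics packages",
  "require(ggplot2)",   # assumption: ggplot2 is loaded in the first chunk
  "",
  "# boxplot",
  "qplot(cut, price, data = diamonds, geom = \"boxplot\")",
  "```",
  "",
  paste("One explanation for this unexpected finding is that ideal cut",
        "diamonds also tend to be **smaller**, and size is related to price."),
  "",
  "```{r}",
  "summary(lm(price ~ carat, data = diamonds))",
  "```"
)

# Write the file to a temporary directory (the location is a placeholder)
rmd_file <- file.path(tempdir(), "example1.Rmd")
writeLines(rmd, rmd_file)

# Knit to markdown only when the needed packages are available
if (requireNamespace("knitr", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  md_file <- knitr::knit(rmd_file,
                         output = file.path(tempdir(), "example1.md"),
                         quiet = TRUE)
}
```

Every chunk opener needs a matching closing line of three backticks; a missing one is a common reason a file fails to knit.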

II: R & Markdown Results

We run require(knitr); knit2html("example1.rmd") and the result is

II: Adding Options

The defaults are really easy to use, but for a more polished report, we often need to change them.

All the options available are documented here

To start, we can resize the image, add a caption, and hide the source code, using these options: fig.height, fig.width, fig.cap, echo

We will also inline some output and show how to use MathJax for \(\LaTeX\) equations

II: R & Markdown

# Diamond Cut, Size, and Price

Diamonds with an _ideal_ cut have a lower median price.

```{r fig.height=3, fig.width=4, fig.cap="Boxplot of Diamond Prices by Cut", echo=FALSE}
# graphics packages
require(ggplot2, quietly=TRUE)

# boxplot
qplot(cut, price, data = diamonds, geom = "boxplot")
```

One explanation for this unexpected finding is that ideal cut diamonds also tend to be **smaller**.  
The mean is \(E(carat | cut = Ideal) = `r mean(subset(diamonds, cut == "Ideal")$carat)`\), and size is related to price.

```{r echo=FALSE}
summary(lm(price ~ carat, data = diamonds))
```
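Options that repeat across chunk headers can also be set once for the whole document. A minimal sketch, assuming it runs in a setup chunk at the top of the file; the values just repeat the ones used in the example.

```r
library(knitr)

# Set document-wide defaults; individual chunk headers can still override them
opts_chunk$set(fig.height = 3, fig.width = 4, echo = FALSE)

# Check what later chunks will inherit
opts_chunk$get("echo")
```

Options that differ per chunk, such as fig.cap, are better left in the chunk header.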

II: R & Markdown Results

II: Customizing

There are many more knitr options, but often, to further refine reports, we need to customize the output from R itself.

We are also going to look at a new type of markup, \(\LaTeX\). The convention for these files is to use the .rnw extension.

One common output from R is tables, whether snippets of data, descriptive information, or summaries of analyses. The xtable package converts lots of R model output, matrices, and data frames to a nice tabular output for \(\LaTeX\) or HTML (suitable for HTML or markdown files).
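A short sketch of xtable on its own, using the mtcars data that ships with R instead of the diamonds data so that it runs without ggplot2. Inside a knitr chunk, the print() calls would go in a chunk with results='asis' so the markup is passed through untouched.

```r
library(xtable)

# Fit a small model on a built-in data set
m <- lm(mpg ~ hp, data = mtcars)

# Convert the coefficient table to an xtable object
tab <- xtable(m, digits = 2)

# LaTeX output; floating = FALSE leaves out the surrounding table environment
print(tab, floating = FALSE)

# HTML output, suitable for HTML or markdown files
print(tab, type = "html")
```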

II: R & \(\LaTeX\)

\documentclass{article} \usepackage{floatrow}
\begin{document}
<< include=FALSE >>=
require(ggplot2); require(xtable) # load packages
@
\section{Diamond Cut, Size, and Price}
Diamonds with an \emph{ideal} cut have a lower median price.
One explanation is that ideal cut diamonds are \emph{smaller}.
The mean is $E(carat | cut = Ideal) =
\Sexpr{mean(subset(diamonds, cut == "Ideal")$carat)}$,
and size is related to price (shown in the table).

\begin{figure}[!h] \begin{floatrow} \capbtabbox{
<< echo=FALSE, results='asis' >>=
print(xtable(coef(summary(lm(price ~ carat, data = diamonds)))[, -3]), floating = FALSE)
@
}{ \caption{Regression predicting diamond price by size} }
\ffigbox{
<< echo=FALSE >>=
qplot(cut, price, data = diamonds, geom = "boxplot")
@
}{ \caption{Diamond price by quality of cut} }
\end{floatrow} \end{figure} \end{document}

II: R & \(\LaTeX\) Results

Image of PDF from LaTeX for Example 3

II: Customizing

We saw a few new commands:

II: Reusing Chunks

knitr stores code in chunks; if we name a chunk, we can reuse it

Chunks are named by adding text at the beginning of the code chunk start tag
(e.g., << name, options >>)

Reusing chunks also allows us to show code in a nonlinear order

The next example shows how to use chunk names, and more advanced customization of output within R and with knitr, as well as how to use referencing

II: R & \(\LaTeX\)

\documentclass{article} \begin{document}
<< include=FALSE >>=
require(xtable) # load packages
@
<< chunk1, include=FALSE >>=
m <- lm(mpg ~ hp, data = mtcars)
@
\section{Resampling Residuals}
We can bootstrap, by resampling the residuals from a model
and refitting.  Figure \ref{fig:chunk2} shows the ``data''
from 4 resamples added to the $\hat{y}$s from Table \ref{tab:lr}.
<< chunk2, echo=FALSE, fig.pos='!h', fig.cap='Resample Plots', out.width=".5\\linewidth", fig.align='center' >>=
yhat <- predict(m)
set.seed(10); par(mfrow = c(2, 2), mar = rep(1, 4))
invisible(replicate(4, with(mtcars, plot(hp, yhat + sample(resid(m), , TRUE)))))
@
$\ldots$ In this case, the model was a linear regression:
<< chunk1 >>=
@
and the coefficients are
<< echo=FALSE, results='asis' >>=
print(xtable(m, caption="Linear Regression", label="tab:lr", digits=2))
@
\end{document}

II: R & \(\LaTeX\) Results

Image of PDF from LaTeX for Example 4

III: Advanced

Workflow, Caching, Graphic Devices, Hooks

III: Workflow

As you move from simple, single-file projects to ones with multiple files, an automated system can lead to some new issues

By default, knitr names output files based on the input files but with a different extension (e.g., .rnw becomes .tex)

Graphic plot files go into a relative subdirectory, figure/ by default

To have code in one place and the output in another (e.g., on a production server), we can customize the output file and directory

III: Workflow

In code_setup.R
options(width = 100, digits = 2, warn=-1, width.cutoff=140)
opts_chunk$set(warning=FALSE, message=FALSE, echo=FALSE, width.cutoff=140)
suppressPackageStartupMessages(require(rCharts, quietly=TRUE))
In my .Rhtml files
if (MO6LOCK) {
} else {

Run via knit("file.Rhtml", "path/to/go/file.html")

III: Workflow

Watch out for files overwriting each other's figures: unnamed chunks get the same names in every file, and by default all figures go to the same figure/ subdirectory. Use the fig.path option to set a custom one, put each file in a separate directory, or name chunks (well)
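One way to keep figures from different files apart is to derive fig.path from the input file name. A sketch only: the helper name knit_to is hypothetical, the paths are placeholders, and the .html output name assumes an .Rhtml input.

```r
library(knitr)

# Hypothetical helper: knit one input file to a chosen output directory,
# giving its figures a prefix based on the input file name
knit_to <- function(input, output_dir) {
  stem <- tools::file_path_sans_ext(basename(input))

  # Figures from report.Rhtml get names starting with figure/report-
  opts_chunk$set(fig.path = file.path("figure", paste0(stem, "-")))

  dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

  # knit() returns the path of the file it wrote
  knit(input, output = file.path(output_dir, paste0(stem, ".html")))
}

# Usage (placeholder paths):
# knit_to("file.Rhtml", "path/to/go")
```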

If you will be moving files, watch out for absolute vs. relative file paths. If you are putting a presentation online but do not have your own server, you can upload images and link the URLs, in which case you want an absolute path

In the default setup, any change to the document requires recompiling: whether the change is in data, code, or narrative

Given that you are mixing different types of code (R, some markup language, etc.), it is nice to have a good editor. RStudio is pretty awesome; I use Emacs + ESS

Version control can help recover changes if you overwrite something you want back, and help you keep track of your progress. I like git, and GitHub is free and helps make the process easy

III: Caching

For slow computations, rerunning everything when only one code chunk changed is tedious. Even worse, you have to rerun if you find a typo in your narrative

You can cache code chunks in knitr by setting the cache.path option and the flag cache=TRUE in each chunk you want cached

Caching creates a database of the R objects, saved graphs, and text output from a chunk, as well as an md5 hash of the chunk.
As long as the md5 is the same, the chunk is not rerun, which brings up...

Dependencies: if a code chunk depends on another, you must include dependson or try out knitr's automatic dependency detection (autodep), which is based on global variables
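The hash idea can be sketched in a few lines of base R. This is an illustration of the principle, not knitr's actual implementation; chunk_hash, run_cached, and cache are hypothetical names.

```r
# Hypothetical helper: md5 hash of a chunk's code, using base R only
chunk_hash <- function(code) {
  f <- tempfile()
  on.exit(unlink(f))
  writeLines(code, f)
  unname(tools::md5sum(f))
}

# Toy cache: results stored in an environment, keyed by hash
cache <- new.env()

# Evaluate `code` only if its hash is not in the cache yet
run_cached <- function(code) {
  key <- chunk_hash(code)
  if (is.null(cache[[key]])) {
    message("running chunk")
    cache[[key]] <- eval(parse(text = code))
  }
  cache[[key]]
}

run_cached("mean(1:10)")   # evaluated
run_cached("mean(1:10)")   # same hash: served from the cache
run_cached("mean(1:100)")  # code changed: evaluated again
```

The sketch also shows why dependencies matter: the hash covers only the chunk's own code, so a chunk whose code is unchanged is not rerun even when an object it uses was changed by an earlier chunk.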

III: Caching

You can specify code chunks that are dependencies using positive integers (e.g., dependson = c(1, 4) for chunks 1 and 4) or negative integers for previous chunks (e.g., dependson = c(-1, -2) for the previous two chunks)

Caching saves not only objects but also the text output and graphics. Watch out, though, for side effects that are not cached, like loading packages that are needed later or setting global chunk options

If you use chunk referencing by reusing the same name, you cannot cache both chunks. Instead, use a new name and use ref.label to reference the other chunk

cache.extra is a great way to tie the cache to extra information, such as the version of R or a package, the random seed, etc.; when that information changes, the chunk is rerun
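A sketch of these caching options set as document-wide defaults. It is assumed to run in an uncached setup chunk; the cache.path prefix is a placeholder, and the values passed to cache.extra are just examples.

```r
library(knitr)

opts_chunk$set(
  cache = TRUE,
  cache.path = "cache/report-",  # placeholder prefix for this document's cache files
  autodep = TRUE,                # let knitr work out dependencies from global variables
  # Extra values folded into the cache check: a new R or knitr version
  # means cached chunks are run again
  cache.extra = list(R.version.string, as.character(packageVersion("knitr")))
)
```

A chunk that must always run, such as one that loads packages, can switch caching off again with cache=FALSE in its own header.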

III: Graphics Engines

By default, knitr will pick the graphics device to use by the file type. For example, png for HTML files, PDF for \(\LaTeX\).

Although knitr has default graphics devices, you can use almost any graphics device in R in knitr. Common examples are: pdf, ps, png, bmp, svg.

Finally, you can specify multiple devices (e.g., dev=c('pdf', 'png')) so that, for example, both png and pdf image files are made for each graph. This is useful for R markdown, which you may want to process to HTML or, using pandoc, to PDF.
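A short sketch of the multiple-device setting applied to every chunk, assuming it runs in a setup chunk; the same dev value can be given in a single chunk header instead.

```r
library(knitr)

# Ask for both a PDF and a PNG version of every plot
opts_chunk$set(dev = c("pdf", "png"))

# Confirm the setting that later chunks will inherit
opts_chunk$get("dev")
```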

III: Custom Hooks

knitr has a default set of functions, called hooks, that control how code in chunks is processed. These include functions that change what happens when an option is set

For advanced users or special cases, you may want to write custom hooks. Writing your own hooks allows you to control what happens before and after a chunk is processed.

Custom hooks can be used by setting additional options on the code chunk. Combined, this allows flexible customization, such as a simple option to set margins, make a grid layout for graphics, set up animations, etc.

You can also set output hooks, to customize the output from R, such as how warnings and error messages are formatted.
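A sketch of both kinds of hook, assuming the code sits in an early chunk inside the document. The option name small.mar is an arbitrary choice, and the bold formatting in the warning hook assumes markdown output.

```r
library(knitr)

# Chunk hook: runs before and after any chunk that sets small.mar=TRUE
knit_hooks$set(small.mar = function(before, options, envir) {
  if (before) par(mar = c(4, 4, 0.1, 0.1))  # tighter margins for the chunk's plots
})

# Output hook: controls how warning text is written to the output;
# here it is wrapped in markdown bold markers
knit_hooks$set(warning = function(x, options) {
  paste0("\n**", trimws(x), "**\n")
})
```

A chunk then opts in by adding small.mar=TRUE to its header, while the warning hook applies to every chunk that produces a warning.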