Setting up R session

Install or update the latest version of the packages mgcv and itsadug:

install.packages(c("mgcv", "itsadug"))

Load libraries:

library(mgcv)
library(itsadug)
# indicate that messages of itsadug must be printed to the command line
infoMessages("off")

Check R version and package versions:

R.version.string
## [1] "R version 3.2.1 (2015-06-18)"
packageVersion("mgcv")
## [1] '1.8.6'
packageVersion("itsadug")
## [1] '1.0.2'

If these versions do not agree with the ones included in the Lab Assignment file, please let us know.


Random effects (optional)

Download the source file nonlinear_0.1.tar.gz and install the packages by the following commands in R:

download.file("http://www.sfs.uni-tuebingen.de/~jvanrij/packages/nonlinear_0.1.tar.gz", "nonlinear_0.1.tar.gz")
install.packages("nonlinear_0.1.tar.gz", type="source", repos=NULL)

If the package is installed succesfully, you could remove the source file (delete it).

Run the demo about random effects as follows:

library(nonlinear)
demoRandom()

Inspection of the data

Download the data pupilsize_S1.rds and load the data in R:

download.file("http://www.sfs.uni-tuebingen.de/~jvanrij/LSA2015/Data/pupilsize_S1.rds", "pupilsize_S1.rds", mode="wb")
dat <- readRDS('pupilsize_S1.rds')

Inspect the data frame:

head(dat)
str(dat)

The data set is a subset of the data, with only 10 participants included in order to run some more complex models.

The column Pupil contains the raw pupil dilation measure (blinks are removed). The data is sampled with 500 Hz, but downsampled to 50 Hz (i.e., a measurement each 20 ms).

Plot the grandaverage of the data. You can use your own code, or complete the code in the R block below.

# First we group the timestamps in bins of 100 ms:
dat$Timebin <- timeBins(dat$Time, 100)

# Calculate the grand averages with tapply or aggregate for the different timebins.
# Complete the code: 
#   - replace the dots in line A with the column of dat 
#     that contains the data to be averaged.
#   - replace the dots in line B with the column(s) of dat 
#     thatgroup the data.

avg <- aggregate( dat$Pupil,    # line A
      list( Timebin=dat$Timebin ),         # line B
      mean, na.rm=TRUE)

# Plot the averages:
# Complete the code: 
#   - replace the dots in line A with the values to plot on the x-axis
#   - replace the dots in line B with the values to plot on the y-axis
plot( avg$Timebin ,  # line A
      avg$x ,  # line B
      type='l', main="Grand average",
      xlab='Time (ms)', ylab='Pupil size',
      ylim=c(1000,1200), 
      lwd=2, 
      bty='n')
# Note: see help(par) for info about the different parameters

Alternative ways to use aggregate to plot the data:

avg <- aggregate( Pupil ~ Timebin, data=dat, mean, na.rm=TRUE)
# or:
avg <- aggregate( Pupil ~ Timebin + Context, data=dat, mean, na.rm=TRUE)

Example of plotting two lines, one for each condition:

# First we group the timestamps in bins of 100 ms:
dat$Timebin <- timeBins(dat$Time, 100)

# Calculate the grand averages with tapply or aggregate for the different timebins.
# Complete the code: 
#   - replace the dots in line A with the column of dat 
#     that contains the data to be averaged.
#   - replace the dots in line B with the column(s) of dat 
#     thatgroup the data.

avg <- aggregate( dat$Pupil,    # line A
      list( Timebin=dat$Timebin, Context=dat$Context),         # line B
      mean, na.rm=TRUE)

# Plot the averages:
# Complete the code: 
#   - replace the dots in line A with the values to plot on the x-axis
#   - replace the dots in line B with the values to plot on the y-axis
plot( avg[avg$Context=="AP", ]$Timebin ,  # line A
      avg[avg$Context=="AP", ]$x ,  # line B
      type='l', main="Grand averages",
      xlab='Time (ms)', ylab='Pupil size',
      ylim=c(1000,1200), 
      lwd=2, 
      bty='n')
lines(avg[avg$Context=="PA", ]$Timebin ,  # line A
      avg[avg$Context=="PA", ]$x, lwd=2, lty=2 )
legend("bottomright",
       legend=c("AP", "PA"),
       lwd=2, lty=c(1,2))
# Note: see help(par) for info about the different parameters
  • What does the Timebin column represent? In case you want to see the code for calculating the timebins type timeBins in the command line. Why would it sometimes be useful to plot Timebin rather than raw timestamps?

The timebin column groups the timestamps in larger time bins of 100 ms. This makes the line smoother and allows for inspection of the major effects.

  • Inspect the plot. Why would there be such a sudden increase in dilation after 3000 ms? Hint: type tapply(dat$Time, list(dat$Event), max) in the commandline.

Most time series run until 3000 ms, so the averages after 3000 ms are based on limited data points. That’s why these might go more extreme.


Baseline model

a. Smooth

Check the numer of rows in the data. With more than 10.000 data points you should use the function bam() rather than gam(). Also when you’re planning to include many random smooths and other complex model terms, bam() is more efficient.

nrow(dat)
## [1] 44093

Run the model with a nonlinear effect for Time. Replace the dots ... with a nonlinear smooth for Time.

m1 <- bam( Pupil ~ s(Time), data=dat  )
Inspect the summary (summary(m1)).
summary(m1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Pupil ~ s(Time)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1090.137      1.963   555.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(Time) 5.013  6.114 58.81  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.00808   Deviance explained = 0.819%
## fREML = 3.2806e+05  Scale est. = 1.6979e+05  n = 44093
  • What does the intercept represent?
    The intercept is a contant value to be added to the other modelterms (here only s(Time)). Because we don’t have any categorical predictors included, it is roughly the mean value of pupil dilation (1089.92).
  • What does the summary of smooth terms tell about the wigglyness of the smooth? (See the column edf)

    The edf is 5, which means that the smooth is not a straight line (in that case, the edf would be 1).

  • What does it mean that the smooth term is significant?

    The smooth is at some point significantly different from a flat horizontal zero line. The confidence intervals of the smooth do not contain the zero line.

Use the following code to visualize the smooth.

par(mfrow=c(1,2), cex=1.1)

# Partial effect plot:
plot(m1, select=1, shade=TRUE)
abline(h=0)

# Fitted values:
plot_smooth(m1, view="Time")

What is the difference between the two plots? What do the plots represent?

The first plot visualizes the partial effect of the smooth of Time. So it selectively visualizes the first line of the smooth term summary, but no other terms are included. The second plot visualizes the summed or fitted effects, which are the model’s predictions. This plot sets every value in the model to a certain value and sums all these effects.

b. Random intercepts

Participants differ considerably in pupil dilation size. Visualize the individual patterns. You could use the code below, or use another plotting method.

boxplot(Pupil ~ Subject, data=dat)

# or alternatively:
library(lattice)
xyplot(Pupil ~ Time | Subject, pch=16, cex=.5, data = dat)

To take into account differences between participants we could add a random intercept for each subject to the model. Replace the dots ... with the random intercepts for Subject and random intercepts for Item.

m2 <- bam( Pupil ~ s(Time) 
           + s(Subject, bs='re')
           + s(Item, bs='re'), data=dat  )

Inspect the summary (summary(m2)).

  • What does the intercept represent? The intercept is a contant value to be added to the other modelterms (here only s(Time)).

  • What does a random intercept terms represent? Adjustments from the intercept for each subject and item, drawn from a normal distribution. For example, a negative value of a random intercept moves the whole fitted effects for that particular subject downwards. These random effects account for differences between subjects and items in average pupil dilation size.

Use the following code to visualize the smooth.

# make sure to have messages on:
infoMessages("on")
# split plo window in two:
par(mfrow=c(1,2), cex=1.1)

# Partial effect plot:
plot(m2, select=1, shade=TRUE)
abline(h=0)

# Fitted values:
plot_smooth(m2, view="Time")
## Summary:
##  * Time : numeric predictor; with 30 values ranging from 0.000000 to 3243.000000. 
##  * Subject : factor; set to the value(s): s08. 
##  * Item : factor; set to the value(s): 30.

  • What do the plots represent? (Also pay attention to the summary in the command window for the plot_smooth function.) The first plot visualizes the partial effect of Time, i.e. the first line of the smooth term summary. The second plot visualizes the summed effects over time, i.e., the intercept and values for Subject and Item are included. The message in the commandline suggests that the plot is the fitted effect of Time for subject “s08” and item “30”.

  • Run the plot_smooth function again, but now add rm.ranef=TRUE as argument: plot_smooth(m2, view="Time", rm.ranef=TRUE). Why did the plot change so much?

plot_smooth(m2, view="Time", rm.ranef=TRUE)
## Summary:
##  * Time : numeric predictor; with 30 values ranging from 0.000000 to 3243.000000. 
##  * Subject : factor; set to the value(s): s08. (Might be canceled as random effect, check below.) 
##  * Item : factor; set to the value(s): 30. (Might be canceled as random effect, check below.) 
##  * NOTE : The following random effects columns are canceled: s(Subject),s(Item)
## 

Now the effects of subject and item (intercept adjustments) have been canceled out. Now the confidence intervals have increased, because this plot now shows the uncertainty around the interval. In the previous plot the confidence bands were smaller because the model was more certain about the intercept of the specific subject and item combination.

We now inspect the random effects. Run the following code.

par(mfrow=c(1,3), cex=1.1)

# Partial effect plot:
plot(m2, select=2)

#  get the values for individual subjects:
rand <- get_random(m2)
# note that get_random returns a list with
# the random adjustments for each random 
# intercept or slope:
rand
## $`s(Subject)`
##        s01        s02        s03        s04        s05        s07 
## -101.35582 -236.97476   39.30879   69.15500 -702.93196 -203.94801 
##        s08        s09        s10        s06 
##  386.88710  -55.72687  849.04654  -43.46002 
## 
## $`s(Item)`
##          1          2          3          4          5          6 
## -22.382871 -45.003660  26.324149  17.031028  14.617727 -39.481939 
##          7          8          9         10         11         12 
## -30.403224 -23.352711  27.480479  26.427325  24.066908  82.179005 
##         13         14         15         16         17         18 
## -79.583049   7.162514  53.285312 110.270764 -43.767303 -51.976279 
##         19         20         21         22         23         24 
## -62.005568 -48.999765  13.031326  18.448216  76.034841 -17.816977 
##         25         26         27         28         29         30 
## -37.125977  16.777733  -6.531431 -28.830664  23.932075  34.731534 
##         31         32 
## -42.589029   8.049511
# similar plot with subj names added:
xy <- qqnorm(rand[[1]], main='s(Subject)', 
    col='steelblue', pch=16, bty='n') # <- some additional arguments that could be left out
qqline(rand[[1]])

text( xy$x, rand[[1]], labels=names( rand[[1]] ),
      pos=3, col="steelblue", xpd=TRUE) # <- some additional arguments that could be left out

# same for items:
xy <- qqnorm(rand[[2]], main='s(Item)', 
    col='steelblue', pch=16, bty='n') # <- some additional arguments that could be left out
qqline(rand[[2]])

text( xy$x, rand[[2]], labels=names( rand[[2]] ),
      pos=3, col="steelblue", xpd=TRUE) # <- some additional arguments that could be left out