Random effects and autocorrelation (optional)

Download the source file nonlinear_0.1.tar.gz and install the package with the following commands in R:

download.file("http://www.sfs.uni-tuebingen.de/~jvanrij/packages/nonlinear_0.1.tar.gz", "nonlinear_0.1.tar.gz")
install.packages("nonlinear_0.1.tar.gz", type="source", repos=NULL)

If the package was installed successfully, you can delete the source file.

Run the demo about autocorrelation as follows:

library(nonlinear)
demoAR()

Setting up R session

Load packages:

library(mgcv)
library(itsadug)

Check R version and package versions:

R.version.string
## [1] "R version 3.2.1 (2015-06-18)"
packageVersion("mgcv")
## [1] '1.8.6'
packageVersion("itsadug")
## [1] '1.0.2'

If these versions do not agree with the ones included in the Lab Assignment file, please let us know.


Loading data

The data discussed in this lab session was collected during two consecutive courses (2014, 2015) at the CCP Spring Training in Experimental Psycholinguistics (STEP) in Edmonton, Canada. The data was collected in a noisy environment with participants who had already seen the experiment. The experiment is a simplified version of an eye-tracking experiment by Vince Poretta (in prep; 2015).

Download the data to your working directory and load it in R (see more info about loading data below).

download.file("http://www.sfs.uni-tuebingen.de/~jvanrij/LSA2015/Data/gazedata.rda", "countdat.rda")

load('countdat.rda')

Inspection of the data

Inspect the data with the following two commands. Pay attention to the columns starting with InterestAreaLabel.

head(countdat)
str(countdat)
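
A few additional checks can be useful at this point; a minimal sketch, assuming the columns Timebin and RecordingSessionLabel that head() and str() show:

dim(countdat)                                   # number of rows and columns
length(unique(countdat$RecordingSessionLabel))  # number of participants
range(countdat$Timebin)                         # time window covered by the timebins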

Visualizing the data

In this assignment we are interested in the looks to the Target (e.g., “shop”) in comparison with the looks to the Onset Competitor (e.g., “shut”), the Rhyme Competitor (e.g., “drop”), and a Distractor (e.g., “fill”).

First, we plot the data. The data plots provide a sanity check for the GAMMs: if the pattern of the smooths is completely off (a different range of values, or a completely different shape), you may want to investigate what causes these differences.

Calculate the grand averages of the proportions of looks to the different areas of interest per timebin:

# IMPORTANT: note the spaces in the column names:
tail(colnames(countdat), 15)
##  [1] "RhymecompReg"                    "OnsetcompFreq"                  
##  [3] "OnsetcompReg"                    "DistractorFreq"                 
##  [5] "DistractorReg"                   "Vowel"                          
##  [7] "Stimtype"                        "Datafile"                       
##  [9] "NativeEnlish"                    "StartTarget"                    
## [11] "InterestAreaLabel.OnsetComp_IA " "InterestAreaLabel.Target_IA "   
## [13] "InterestAreaLabel.RhymeComp_IA " "InterestAreaLabel.Distract_IA " 
## [15] "InterestAreaLabel.NA"
# removing spaces from column names, because these will give trouble:
colnames(countdat) <- gsub("\\ ","", colnames(countdat))
# calculating proportions:
countdat$TotalCount  <- with(countdat,  InterestAreaLabel.Target_IA  
    + InterestAreaLabel.OnsetComp_IA 
    + InterestAreaLabel.RhymeComp_IA   
    + InterestAreaLabel.Distract_IA 
    + InterestAreaLabel.NA )

countdat$propTarget <- with(countdat, InterestAreaLabel.Target_IA  / TotalCount )
countdat$propOnset <- with(countdat, InterestAreaLabel.OnsetComp_IA  / TotalCount )
countdat$propRhyme <- with(countdat, InterestAreaLabel.RhymeComp_IA  / TotalCount )
countdat$propDistr <- with(countdat, InterestAreaLabel.Distract_IA  / TotalCount )
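
As an optional sanity check, the four proportions plus the proportion of samples outside any interest area should sum to 1 for every row (rows where TotalCount is 0 give NaN); a minimal sketch, reusing the columns created above:

summary(with(countdat, propTarget + propOnset + propRhyme + propDistr 
    + InterestAreaLabel.NA / TotalCount))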

Plot the grand averages per timebin. Use your preferred plotting functions, or use the code below.

# Setup empty plot window:
emptyPlot(range(countdat$Timebin), c(0,1), 
    h=.5,
    main='Looks to interest areas', xlab='Time (ms)', ylab="Proportion")

# Distractor
avg <- aggregate(propDistr ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propDistr, col="black", lty=2)
# Onset competitor
avg <- aggregate(propOnset ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propOnset, col="indianred", lwd=2)
# Rhyme competitor
avg <- aggregate(propRhyme ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propRhyme, col="steelblue", lwd=2)
# Target
avg <- aggregate(propTarget ~ Timebin, data=countdat, mean, na.rm=TRUE)
lines(avg$Timebin, avg$propTarget, col="green", lwd=2)
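
To keep the four lines apart, a legend can be added with base R's legend(); a minimal sketch using the colors and line types from the code above:

# add a legend matching the plotting code above:
legend('topright', legend=c("Target", "Onset competitor", "Rhyme competitor", "Distractor"),
    col=c("green", "indianred", "steelblue", "black"), 
    lwd=c(2, 2, 2, 1), lty=c(1, 1, 1, 2), bty='n')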

As most people seem to have found the target before 2000 ms after word onset, we reduce the data by excluding all data points recorded after 2000 ms:

countdat <- countdat[countdat$Timebin < 2000,]

Analyzing looks to target versus looks to onset competitor

Create a new column that contains a matrix with looks to the target (success) and looks to the onset competitor (failures).

countdat$TargetLooks <- cbind(countdat$InterestAreaLabel.Target_IA, countdat$InterestAreaLabel.OnsetComp_IA)
head(countdat)
str(countdat)

cbind() data

If you are working in RStudio, the command View(countdat) will show something different from head(countdat) for the column TargetLooks.

The cause for these differences is the use of the function cbind: cbind combines vectors into different columns of a matrix. Try for example:

test1 <- 1:10
test2 <- 2:11

mat <- cbind(test1,test2)
class(mat)
## [1] "matrix"

We have stored a matrix with two columns in the data frame countdat under the name TargetLooks. So each cell of the column countdat$TargetLooks has two values:

head(countdat$TargetLooks)
##      [,1] [,2]
## [1,]    0    0
## [2,]   14    0
## [3,]   20    0
## [4,]   20    0
## [5,]   20    0
## [6,]   20    0

The first column indicates the counts of successes (i.e., looks to the target), and the second column indicates the counts of failures (i.e., looks to the onset competitor).

Regression functions (e.g., glm, glmer, bam) will recognize this structure as binomial data: counts of successes and failures.
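
To see how regression functions treat such a two-column response in isolation, here is a minimal toy sketch with invented counts (not the lab data), fitted with glm() and family binomial:

# toy example with invented counts of successes and failures:
succ <- c(2, 5, 9, 14, 18)
fail <- c(18, 15, 11, 6, 2)
x <- 1:5
toy <- glm(cbind(succ, fail) ~ x, family=binomial)
coef(toy)   # estimates are on the logit (log odds) scale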

As there are more than 10,000 rows in the data, we will use the function bam():

nrow(countdat)
## [1] 20958

a. Trend over time

Run a GAM model that includes the following nonlinear effects:

  • a main effect of Timebin

  • a main effect of TrialIndex

  • a random smooth for subjects over Timebin (include k=5 to reduce the running time); the subjects are indicated by the column RecordingSessionLabel, but we first copy this into a column Subject (which saves typing).

Because logistic models take more time to run, you can leave out other random effects (it will take too long to complete this lab session).

# order the data with function start_event:
countdat <- start_event(countdat, column="Timebin", event="Event")

# convert RecordingSessionLabel to a factor, and store it in a column with a shorter name:
countdat$Subject <- as.factor(countdat$RecordingSessionLabel)

m1 <- bam(TargetLooks ~  s(Timebin) + s(TrialIndex)  
          + s(Timebin, Subject, bs='fs', m=1, k=5), 
          data=countdat, family="binomial")
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has
## repeated 1-d smooths of same variable.
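
Because the data were ordered with start_event(), we could optionally inspect the autocorrelation in the residuals of this model; a minimal sketch using itsadug's acf_resid():

# optional: check the autocorrelation in the residuals of m1
acf_resid(m1)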

Inspect the summary of the model.

summary(m1)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## TargetLooks ~ s(Timebin) + s(TrialIndex) + s(Timebin, Subject, 
##     bs = "fs", m = 1, k = 5)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.4412     0.5366   6.413 1.43e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                        edf  Ref.df Chi.sq p-value    
## s(Timebin)           7.455   7.981  217.7  <2e-16 ***
## s(TrialIndex)        6.201   7.355  318.2  <2e-16 ***
## s(Timebin,Subject) 112.074 124.000 6470.6  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.421   Deviance explained = 22.3%
## fREML = 1.1538e+05  Scale est. = 1         n = 20958
  • What does the intercept represent in this model? Does the intercept indicate that people on average look more to the target or more to the onset competitor?

The intercept is the log odds ratio between looks to the target and looks to the onset competitor. As the intercept is well above 0, the corresponding proportion is above 0.5, and in fact close to 1. This means that overall there are more looks to the target than to the onset competitor.
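
As a quick check, the estimated intercept can be back-transformed from the logit scale to a proportion with plogis(), the inverse logit:

# back-transform the intercept from logits to a proportion:
plogis(coef(m1)["(Intercept)"])
# roughly 0.97: on average far more looks to the target than to the onset competitor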

  • What does the smooth over Timebin represent?

The partial effect of Timebin: how the log odds of looks to the target versus looks to the onset competitor change over time.

More info on logits (binomial count data)

The regression model uses the counts of target and onset competitor looks to calculate a logit, which is the log odds ratio between looks to the target and looks to the onset competitor: \(log(\frac{N_{target.looks}}{N_{onsetcomp.looks}})\) (or, following the logit definition, \(log(\frac{N_{target.looks}}{N - N_{target.looks}})\) with \(N = N_{target.looks} + N_{onsetcomp.looks}\)).

  • If \(N_{target.looks}\) equals \(N_{onsetcomp.looks}\), the ratio is 1 and the logit is 0 (log(1) = 0).

  • If \(N_{target.looks}\) is larger than \(N_{onsetcomp.looks}\), the ratio is higher than 1 and the logit is positive (e.g., log(2) = 0.6931472).

  • If \(N_{target.looks}\) is smaller than \(N_{onsetcomp.looks}\), the ratio is lower than 1 and the logit is negative (e.g., log(.5) = -0.6931472). These three cases are computed in R below.
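
A minimal check of the three cases directly in R (qlogis() is the logit function, log(p/(1-p))):

qlogis(0.5)    # equal counts: logit of 0
log(20/10)     # twice as many target looks: positive (0.6931472)
log(10/20)     # half as many target looks: negative (-0.6931472)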

Illustration of the relation between logits and proportions:

Code for plotting the logits:

total <- 500
successes <- 1:total

proportions <- successes / total
logits <- log( proportions / (1-proportions))

plot(proportions, logits)

Back transformation of logits to proportions:

plot(logits, plogis(logits))

Visualizing the trend over time:

par(mfrow=c(1,2))
plot(m1, select=1, scale=0, shade=TRUE)
abline(h=0)

plot_smooth(m1, view="Timebin", rm.ranef=TRUE)

Interpretation of the plots.

  • Why is the shape of the smooth different from the pattern of target looks that we saw in the data plots?

In the models we compare the ratio of target looks and onset competitor looks. The earlier plot visualized the proportions of looks to target, onset competitor, rhyme competitor and distractor looks.
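
To make this connection visible in the data, we could also plot the observed log odds of target versus onset competitor looks per timebin; a rough sketch, reusing the count columns from above (a constant of .5 is added to avoid taking the log of zero):

# observed log odds of target vs. onset competitor looks per timebin:
avg <- aggregate(cbind(InterestAreaLabel.Target_IA, InterestAreaLabel.OnsetComp_IA) ~ Timebin,
    data=countdat, sum)
plot(avg$Timebin, 
    log((avg$InterestAreaLabel.Target_IA + .5) / (avg$InterestAreaLabel.OnsetComp_IA + .5)),
    type='l', xlab='Time (ms)', ylab='Observed log odds')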

  • What do these plots tell us about the looks to the target versus the looks to the onset competitor?
  • Left plot: partial effect. Over time the log odds increase, so over time the looks to the target increase and the looks to the onset competitor decrease.
  • Right plot: summed effects (fitted values, intercept included). Here we see that the log odds ratio is estimated around zero at the start of the trial, which means an equal number of looks to the target and the onset competitor. Gradually the looks to the target increase. From around 700 ms after sound onset, the participants look significantly more to the target than to the onset competitor, because the smooth is significantly different from zero. Below is an illustration of how we could use the plot to determine at which timebin people can distinguish between target and onset competitor:
out <- plot_smooth(m1, view="Timebin", rm.ranef=TRUE)
(diff <- find_difference(out$fv$fit, out$fv$CI, xVals=out$fv$Timebin))
## $start
## [1] 708.6207
## 
## $end
## [1] 1950
## 
## $xVals
## [1] TRUE
addInterval(0, lowVals=diff$start, highVals = diff$end, col='red', lwd=2)
abline(v=c(diff$start, diff$end), lty=3, col='red')
text(mean(c(diff$start, diff$end)), 1, "sign. more looks to target", col='red', font=3)

For interpretation of the results, we could transform the summed effects back to proportions with the function plot_smooth:

par(mfrow=c(1,1))
plot_smooth(m1, view="Timebin", rm.ranef=TRUE, 
            transform=plogis, ylim=c(0,1))
abline(h=.5, v=700, col='red', lty=2)
  • When do people start to look significantly more to the target than to the onset competitor?

As we already derived from the logit plot, we see that from around 700 ms after sound onset the proportion of target looks is significantly above .5.


b. Nonlinear interactions

Run a new GAMM model that includes a nonlinear interaction between Timebin and TrialIndex, and a random smooth for subjects (Subject) over Timebin (include k=5 for the random effect to reduce the running time).

m2 <- bam(TargetLooks ~  te(Timebin, TrialIndex) 
          + s(Timebin, Subject, bs='fs', m=1, k=5),
          data=countdat, family="binomial")

Inspect the summary of model m2.

summary(m2)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## TargetLooks ~ te(Timebin, TrialIndex) + s(Timebin, Subject, bs = "fs", 
##     m = 1, k = 5)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.4212     0.5741   5.959 2.53e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                           edf Ref.df Chi.sq p-value    
## te(Timebin,TrialIndex)  21.82  22.62   1594  <2e-16 ***
## s(Timebin,Subject)     113.72 124.00   6511  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.415   Deviance explained = 21.1%
## fREML = 1.2794e+05  Scale est. = 1         n = 20958
  • What does the intercept represent in this model?

Again the log odds ratio between looks to the target and looks to the onset competitor: a constant value that is added to the interaction surface and the random effects.

  • What does it mean that the smooth term te(Timebin, TrialIndex) is significant in the summary?

The interaction surface is at some point significantly different from a flat plane through zero. However, this does not mean that the interaction itself is significant: it could be that the two main effects included in the interaction cause the surface to be significantly different from a flat zero surface.
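
One way to evaluate the interaction itself is to decompose the tensor product into main effects plus a 'pure' interaction term with ti(); a sketch, not part of this lab (it takes a while to run):

# decompose te() into two main effects and a pure interaction term:
m2b <- bam(TargetLooks ~ s(Timebin) + s(TrialIndex)
           + ti(Timebin, TrialIndex)
           + s(Timebin, Subject, bs='fs', m=1, k=5),
           data=countdat, family="binomial")
summary(m2b)   # the ti() term now reflects only the interaction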

Compare this model with model m1 with the function compareML.

compareML(m1,m2)
## m1: TargetLooks ~ s(Timebin) + s(TrialIndex) + s(Timebin, Subject, 
##     bs = "fs", m = 1, k = 5)
## 
## m2: TargetLooks ~ te(Timebin, TrialIndex) + s(Timebin, Subject, bs = "fs", 
##     m = 1, k = 5)
## 
## Model m1 preferred: lower fREML score (12561.705), and lower df (1.000).
## -----
##   Model    Score Edf Difference     Df
## 1    m2 127943.4   8                  
## 2    m1 115381.6   7  12561.705 -1.000
## 
## AIC difference: -345.21, model m1 has lower AIC.
  • Could you conclude that the interaction between Timebin and TrialIndex is significant?

No. The model without the interaction is preferred, because its fREML score is considerably lower and it also uses fewer degrees of freedom to model the data. Since the model with only the two main effects is preferred, the nonlinear trend over Timebin does not change shape across trials: TrialIndex only shifts the trend by its own (nonlinear) main effect.

Visualize the interaction:

par(mfrow=c(1,2), cex=1.1)

# plot the partial effect:
plot(m2, select=1, rug=FALSE, main="Interaction", 
     cex.axis=2, cex.lab=2) # parameters for increasing font size
# Now with colors:
pvisgam(m2, select=1, view=c("Timebin", "TrialIndex"), zlim=c(-5,14), main="")
## [1] "Tensor(s) to be plotted: te(Timebin,TrialIndex)"