(from Richard McElreath, Statistical Rethinking, 2nd edition)

Assumption

  • 20 people born each year
  • Uniform happyness at birth, never changes
  • At 18 years old, eligible to marry. Probability of mariage in each year proportional to happiness.
  • Married people remain married until death.
  • At age 65, move to south coast of Spain

Purpose of simulation: show that "controlling" for marriage status creates a correlation between age and happiness, even though we know there is no correlation.

Pseudocode:

for t in 1:N: for p in 1:20: u ~ uniform(0, 1) add person (yob=t, id=p, happyness=u, marriageYear=-1, active=1) to data end for (y, i, h, m, a) in data: if a == 1 and t-y >= 65: a <- 0 end if a == 1 and t-y >= 18 and m == -1: x ~ uniform(0, 1) if x < h m = t end end end end

In [1]:
options(repr.plot.width=5, repr.plot.height=5)
N = 1000
d <- c()
for (t in 1:N) {
    for (p in 1:20) {
        u <- runif(1)
        d <- rbind(d, c(t, p, u, -1, 1))
    }

    for (k in 1:nrow(d)) {
        if (d[k, 5] == 1 & t-d[k, 1] >= 65) {
            d[k, 5] <- 0
        }
        if (d[k, 5] ==1 & t-d[k, 1] >= 18 & d[k, 4] == -1) {
            x <- runif(1)
            if (x < d[k, 3]) {
                d[k, 4] <- t
            }
        }
    }
}
d <- data.frame(d)
colnames(d) = c('yearOfBirth', 'id', 'happyness', 'yearOfMarriage', 'active')
In [2]:
d
A data.frame: 20000 × 5
yearOfBirthidhappynessyearOfMarriageactive
<dbl><dbl><dbl><dbl><dbl>
1 10.62603998210
1 20.17610604290
1 30.63755080200
1 40.19059768190
1 50.62739305190
1 60.88193941190
1 70.78229004190
1 80.94091459190
1 90.39370558190
1100.56506059200
1110.42473782190
1120.91578508190
1130.02277007270
1140.24671927220
1150.26506390220
1160.78611475190
1170.54984230220
1180.25997598190
1190.28386308190
1200.50323644210
2 10.11223783270
2 20.53179581200
2 30.83149464200
2 40.30098566200
2 50.04900386-10
2 60.22119243280
2 70.10302163360
2 80.37679712200
2 90.29198968220
2100.07820483280
⋮⋮⋮⋮⋮
999110.49704276-11
999120.02731464-11
999130.59909935-11
999140.32947182-11
999150.93678503-11
999160.58662415-11
999170.30133592-11
999180.90417594-11
999190.47295345-11
999200.36885327-11
1000 10.10986075-11
1000 20.72820126-11
1000 30.30867509-11
1000 40.16785673-11
1000 50.46957143-11
1000 60.62324436-11
1000 70.11293581-11
1000 80.10722321-11
1000 90.33942030-11
1000100.23653725-11
1000110.16829029-11
1000120.57695470-11
1000130.80887565-11
1000140.80032018-11
1000150.40082283-11
1000160.37687157-11
1000170.30728460-11
1000180.17809839-11
1000190.97013598-11
1000200.37275293-11
In [3]:
d$ageOfMarriage <- d$yearOfMarriage - d$yearOfBirth
d$ageOfMarriage[d$yearOfMarriage == -1] <- -1

hist(d[d$yearOfMarriage > 0,]$ageOfMarriage, freq=F, nclass=30, 
     xlab='age of marriage', main='Distribution of age of marriage')

d contains all relevant information. Let us draw some random data points from this log.

In [5]:
drawSample <- function() {
    yob <- sample(N, 1) ### uniformly distributed random integer from 1..N
    p <- sample(20, 1) ### uniformly distributed random integer from 1..20
    s <- d[(d$yearOfBirth == yob) & (d$id == p),]
    age <- sample(0:65, 1) ### uniformly distributed random integer from 0..65
    married <- 0
    if ((yob+age > as.numeric(s$yearOfMarriage)) & (as.numeric(s$yearOfMarriage) > -1)) {
        married <- 1
    }
    return(c(age, s$happyness, c('single', 'married')[1+married]))
}
In [6]:
df <- data.frame(t(replicate(10000, drawSample())))
colnames(df) <- c('age', 'happyness', 'married')
df$age <- as.numeric(as.character(df$age))
df$happyness <- as.numeric(as.character(df$happyness))
In [7]:
plot(happyness~married, df)
In [51]:
plot(married ~ happyness, df)
In [52]:
plot(age~married, df)
In [53]:
plot(married~age, df)
In [61]:
plot(happyness~age, df, cex=.3)
In [67]:
summary(lm(happyness ~ age, df))
Call:
lm(formula = happyness ~ age, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50269 -0.24986  0.00556  0.25416  0.49694 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.035e-01  5.691e-03  88.462   <2e-16 ***
age         -9.763e-06  1.512e-04  -0.065    0.949    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2881 on 9998 degrees of freedom
Multiple R-squared:  4.172e-07,	Adjusted R-squared:  -9.96e-05 
F-statistic: 0.004171 on 1 and 9998 DF,  p-value: 0.9485
In [81]:
summary(lm(happyness ~ age + married, df))
Call:
lm(formula = happyness ~ age + married, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5738 -0.2495 -0.0028  0.2475  0.6427 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.6532495  0.0100565   64.96   <2e-16 ***
age           -0.0029129  0.0002199  -13.25   <2e-16 ***
marriedsingle -0.1572672  0.0087688  -17.93   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2836 on 9997 degrees of freedom
Multiple R-squared:  0.03117,	Adjusted R-squared:  0.03098 
F-statistic: 160.8 on 2 and 9997 DF,  p-value: < 2.2e-16
In [83]:
library(ggplot2)

ggplot(df, aes(x=age, y=happyness, col=married)) + 
    geom_point(size=.1) +
    geom_smooth(method='lm') +
    geom_smooth(col='black', method='lm')
In [ ]: