For this example, we use a dataset from the package mlmRev.

library(mlmRev)
## Loading required package: lme4
## Loading required package: Matrix

The dataset is called Exam, and contains simulated data about examresults of children. However, since it is simulated data with an multilevel structure, we can rename the variables to something more related to RECAP. Which is what we will do, to show how such a multilevel structure works. In this case we assume that each cohort study collected the same variables (in the same units), there are no missings and the cohort studies started at the same time. Note, that this situation will practically never happen. However, to keep things simple in order to explain the multilevel analysis we will assume it holds.

In this example we want to explain birthweight by gestational age, gender and a cohort specific variable homebirth (the number of home births per 100 childbirths) to keep things simple. Moreover, for each child we know to which cohort study it belongs.

To get the data from the package mlmRev run the following code chunk

data(Exam) #get the data
child.data <- Exam[, c(1, 2, 4, 7, 8, 10)] #keep only a few variables
names(child.data) <- c("cohort", "birthweightnorm", "homebirth", "gestational.agenorm", "gender", "child") #rename the variables
head(child.data)
##   cohort birthweightnorm homebirth gestational.agenorm gender child
## 1      1           0.261     0.166               0.619      F   143
## 2      1           0.134     0.166               0.206      F   145
## 3      1          -1.724     0.166              -1.365      M   142
## 4      1           0.968     0.166               0.206      F   141
## 5      1           0.544     0.166               0.371      F   138
## 6      1           1.735     0.166               2.189      M   155

The simulated variables in this example where more or less standard normally distributed. This makes it easy to change the variables to gram (for birthweight) and weeks (for gestational age). We assume that the average birthweight is equal to 1325 gram with a standard deviation of 75. The average gestational age is 30 weeks with a standard deviation equal to 0.65. When running the following chunk of code the standard normally distributed variables are changed to variables in grams and weeks.

#change the standardized birthweight to birthweight in gram
child.data$birthweight <- (child.data$birthweightnorm)*75+1325

#Change the standardized gestational age to gestational age in weeks in two decimals
child.data$gestational.age <- (child.data$gestational.agenorm)*0.65+30
child.data$gestational.age <- round(child.data$gestational.age, digits=2)

#keep only the relevant variables
child.data <- child.data[, c(1, 3, 5, 6, 7, 8)]

#Give each child it own specific childnumber (instead of per cohort study)
child.data$childnr <- seq.int(nrow(child.data))

The example contains 65 schools (we renamed them to cohort studies). To make this example more relatable to RECAP we will combine some similar schools/cohort studies by running the following chunk of code. We end up with cohorts A till T (20 cohort studies).

child.data$cohort <- as.numeric(child.data$cohort)
child.data[child.data$cohort %in%  c('1', '20', '11', '52'), 1] <- 'A'
child.data[child.data$cohort %in% c('2', '3', '55'), 1] <- 'B'
child.data[child.data$cohort %in% c('4', '29', '33', '49'), 1] <- 'C'
child.data[child.data$cohort %in% c('5', '7', '21'), 1] <- 'D'
child.data[child.data$cohort %in% c('6', '53', '63'), 1] <- 'E'
child.data[child.data$cohort %in% c('8', '15', '47', '48'), 1] <- 'F'
child.data[child.data$cohort %in% c('9', '26', '44', '54'), 1] <- 'G'
child.data[child.data$cohort %in% c('10', '16', '31', '40'), 1] <- 'H'
child.data[child.data$cohort %in% c('12', '61', '56'), 1] <- 'I'
child.data[child.data$cohort %in% c('13', '17', '36', '45'), 1] <- 'J'
child.data[child.data$cohort %in% c('14', '24', '62'), 1] <- 'K'
child.data[child.data$cohort %in% c('18', '42', '57'), 1] <- 'L'
child.data[child.data$cohort %in% c('19', '43', '60'), 1] <- 'M'
child.data[child.data$cohort %in% c('22', '46'), 1] <- 'N'
child.data[child.data$cohort %in% c('23', '25', '37'), 1] <- 'O'
child.data[child.data$cohort %in% c('27', '32', '34'), 1] <- 'P'
child.data[child.data$cohort %in% c('28', '59'), 1] <- 'Q'
child.data[child.data$cohort %in% c('30', '58', '64'), 1] <- 'R'
child.data[child.data$cohort %in% c('35', '39', '41'), 1] <- 'S'
child.data[child.data$cohort %in% c('38', '50', '51', '65'), 1] <- 'T'

#Sort the data by cohort
child.data$cohort <- sort(child.data$cohort, decreasing=FALSE)

Since we combined different cohort studies, we should redefine the cohort specific variable homebirth which should have the same value for each child in the same cohort. For the new (combined) cohorts we will take the average value of the cohort specific variable of the cohorts that were combined.

#Make one cohort variable
child.data[child.data$cohort == 'A', 2] <- mean(child.data[child.data$cohort == 'A', 2])
child.data[child.data$cohort == 'B', 2] <- mean(child.data[child.data$cohort == 'B', 2])
child.data[child.data$cohort == 'C', 2] <- mean(child.data[child.data$cohort == 'C', 2])
child.data[child.data$cohort == 'D', 2] <- mean(child.data[child.data$cohort == 'D', 2])
child.data[child.data$cohort == 'E', 2] <- mean(child.data[child.data$cohort == 'E', 2])
child.data[child.data$cohort == 'F', 2] <- mean(child.data[child.data$cohort == 'F', 2])
child.data[child.data$cohort == 'G', 2] <- mean(child.data[child.data$cohort == 'G', 2])
child.data[child.data$cohort == 'H', 2] <- mean(child.data[child.data$cohort == 'H', 2])
child.data[child.data$cohort == 'I', 2] <- mean(child.data[child.data$cohort == 'I', 2])
child.data[child.data$cohort == 'J', 2] <- mean(child.data[child.data$cohort == 'J', 2])
child.data[child.data$cohort == 'K', 2] <- mean(child.data[child.data$cohort == 'K', 2])
child.data[child.data$cohort == 'L', 2] <- mean(child.data[child.data$cohort == 'L', 2])
child.data[child.data$cohort == 'M', 2] <- mean(child.data[child.data$cohort == 'M', 2])
child.data[child.data$cohort == 'N', 2] <- mean(child.data[child.data$cohort == 'N', 2])
child.data[child.data$cohort == 'O', 2] <- mean(child.data[child.data$cohort == 'O', 2])
child.data[child.data$cohort == 'P', 2] <- mean(child.data[child.data$cohort == 'P', 2])
child.data[child.data$cohort == 'Q', 2] <- mean(child.data[child.data$cohort == 'Q', 2])
child.data[child.data$cohort == 'R', 2] <- mean(child.data[child.data$cohort == 'R', 2])
child.data[child.data$cohort == 'S', 2] <- mean(child.data[child.data$cohort == 'S', 2])
child.data[child.data$cohort == 'T', 2] <- mean(child.data[child.data$cohort == 'T', 2])

child.data$homebirth <- ((child.data$homebirth+0.8)/2)*100
child.data$homebirth <- round(child.data$homebirth, digits=0)

Now, the data is ready to be used for multilevel modelling. Note that the multilevel structure is as follows: level 1 contains the childeren and level 2 contains the cohort studies.

We can plot the birthweight (the outcome we want to explain) for each cohort studie included in the study by running the following chunk of code:

plot(as.factor(child.data$cohort), child.data$birthweight,
xlab="cohort study", ylab="birthweight", main= "Boxplot of the birthweights")

From this plot, we can see that there is variation in birthweight between the different cohort studies: the median birthweight differs per cohort study. Moreover, the variability of birthweight within each cohort studies might also differ: the size of the white boxes (the first quantile - third quantile) differ per cohort study.

First, we will start with a linear regression model that does not take the multilevel structure into account.

#Normal linear regression model without cohort specific variable
LS.model <- lm(birthweight ~ gestational.age + gender, data = child.data)
summary(LS.model)
## 
## Call:
## lm(formula = birthweight ~ gestational.age + gender, data = child.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -192.25  -38.97    1.34   40.25  218.08 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -714.68      43.97   -16.3  < 2e-16 ***
## gestational.age    68.16       1.46    46.6  < 2e-16 ***
## genderM           -12.73       1.93    -6.6  4.5e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.1 on 4056 degrees of freedom
## Multiple R-squared:  0.357,  Adjusted R-squared:  0.357 
## F-statistic: 1.13e+03 on 2 and 4056 DF,  p-value: <2e-16
#Normal linear regression model with cohort specific variable
LS.model1 <- lm(birthweight ~ gestational.age + gender + homebirth, data = child.data)
summary(LS.model1)
## 
## Call:
## lm(formula = birthweight ~ gestational.age + gender + homebirth, 
##     data = child.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -203.34  -38.09    1.23   41.01  208.87 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -663.0199    44.2425  -14.99  < 2e-16 ***
## gestational.age   65.5935     1.4952   43.87  < 2e-16 ***
## genderM          -12.2867     1.9167   -6.41  1.6e-10 ***
## homebirth          0.6234     0.0848    7.35  2.3e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59.7 on 4055 degrees of freedom
## Multiple R-squared:  0.365,  Adjusted R-squared:  0.365 
## F-statistic:  778 on 3 and 4055 DF,  p-value: <2e-16

In both models all variables are statistically significant. Besides ignoring the multilevel structure, one commonly used method is to add a dummy variable for each cohort study. This means including 20 dummy variables.

#Normal linear regression model without cohort specific variable
LS.model2 <- lm(birthweight ~ gestational.age + gender + factor(cohort), data = child.data)
summary(LS.model2)
## 
## Call:
## lm(formula = birthweight ~ gestational.age + gender + factor(cohort), 
##     data = child.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -227.70  -37.96    2.14   39.79  194.29 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -602.60      44.31  -13.60  < 2e-16 ***
## gestational.age    65.46       1.46   44.79  < 2e-16 ***
## genderM           -14.22       1.94   -7.32  3.1e-13 ***
## factor(cohort)B    -3.83       6.02   -0.64     0.52    
## factor(cohort)C   -29.03       4.93   -5.89  4.2e-09 ***
## factor(cohort)D   -43.35       5.64   -7.68  2.0e-14 ***
## factor(cohort)E   -40.13       5.78   -6.94  4.4e-12 ***
## factor(cohort)F   -52.20       5.22  -10.00  < 2e-16 ***
## factor(cohort)G   -29.06       6.15   -4.73  2.4e-06 ***
## factor(cohort)H   -42.17       5.28   -7.99  1.7e-15 ***
## factor(cohort)I   -33.49       6.17   -5.43  6.1e-08 ***
## factor(cohort)J   -34.76       5.08   -6.85  8.7e-12 ***
## factor(cohort)K   -32.10       5.08   -6.32  2.9e-10 ***
## factor(cohort)L   -31.37       5.35   -5.86  5.0e-09 ***
## factor(cohort)M   -45.05       5.66   -7.95  2.3e-15 ***
## factor(cohort)N   -33.05       5.85   -5.65  1.8e-08 ***
## factor(cohort)O   -44.21       6.51   -6.79  1.3e-11 ***
## factor(cohort)P    10.93       6.81    1.61     0.11    
## factor(cohort)Q     2.74       6.87    0.40     0.69    
## factor(cohort)R   -30.72       6.27   -4.90  1.0e-06 ***
## factor(cohort)S   -34.69       6.18   -5.62  2.1e-08 ***
## factor(cohort)T   -27.15       5.23   -5.19  2.2e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.3 on 4037 degrees of freedom
## Multiple R-squared:  0.397,  Adjusted R-squared:  0.394 
## F-statistic:  127 on 21 and 4037 DF,  p-value: <2e-16
#Normal linear regression model with cohort specific variable
LS.model3 <- lm(birthweight ~ gestational.age + gender + homebirth + factor(cohort), data = child.data)
summary(LS.model3)
## 
## Call:
## lm(formula = birthweight ~ gestational.age + gender + homebirth + 
##     factor(cohort), data = child.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -227.70  -37.96    2.14   39.79  194.29 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -749.206     50.310  -14.89  < 2e-16 ***
## gestational.age   65.461      1.461   44.79  < 2e-16 ***
## genderM          -14.215      1.943   -7.32  3.1e-13 ***
## homebirth          2.715      0.523    5.19  2.2e-07 ***
## factor(cohort)B  -17.404      7.587   -2.29   0.0218 *  
## factor(cohort)C   -1.886      4.761   -0.40   0.6920    
## factor(cohort)D  -24.342      4.997   -4.87  1.1e-06 ***
## factor(cohort)E  -42.845      6.047   -7.09  1.6e-12 ***
## factor(cohort)F  -25.052      5.066   -4.94  7.9e-07 ***
## factor(cohort)G  -34.487      6.685   -5.16  2.6e-07 ***
## factor(cohort)H   14.838      9.351    1.59   0.1126    
## factor(cohort)I   91.393     21.961    4.16  3.2e-05 ***
## factor(cohort)J   41.260     12.601    3.27   0.0011 ** 
## factor(cohort)K   30.343     10.174    2.98   0.0029 ** 
## factor(cohort)L    6.635      6.453    1.03   0.3039    
## factor(cohort)M   14.676     10.019    1.46   0.1430    
## factor(cohort)N   10.389      7.609    1.37   0.1722    
## factor(cohort)O   18.232     10.959    1.66   0.0963 .  
## factor(cohort)P   13.640      6.611    2.06   0.0392 *  
## factor(cohort)Q    5.459      6.680    0.82   0.4138    
## factor(cohort)R   15.431      8.300    1.86   0.0631 .  
## factor(cohort)S   25.034     10.325    2.42   0.0154 *  
## factor(cohort)T       NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.3 on 4037 degrees of freedom
## Multiple R-squared:  0.397,  Adjusted R-squared:  0.394 
## F-statistic:  127 on 21 and 4037 DF,  p-value: <2e-16

When adding dummy variables for each cohort (without the cohort specific variable), we can see that almost all dummy variables are statistically significant. However, when the cohort specific variable is added to the model, we can see that a lot of dummy variables are not statistically significant anymore. This is because the cohort specific variable already explains some of the variability between the cohort studies. Moreover, for one of the dummy variables the estimate is non-available (NA). This is due because the variable is linearly related to another one.

When using the model structure where the cohort studies are represented by dummy variables, one assumes that the observations are still independent of each other. However, children from one cohort study might be more similar than to a randomly choosing other child from one of the other cohort studies. Or to put it more simple, a child from two parents is probably more similar to another child from the same parents (brother or sister) then a randomly chosen other child. Thus, adding dummy variables for each cohort study does not take this correlation into account. This could potentially lead to wrongly calculated standard errors (too low) leading to overstatement of the statistical significance.

One way to take this correlation structure into account is by means of a mixed effects model. We will in this workshop only look at the random intercepts model, since the workshop tries to explain why some methods might be needed for RECAP and not go into full detail of these methods. With the random intercepts model each cohort study has its own intercept consisting of a fixed part (which is similar for each cohort study) and a random part. This random part is different for each cohort study, however will on average be equal to zero. Running the following chunk of code will estimate a random intercepts model:

randint.model <- lmer(birthweight ~ gestational.age + gender + homebirth + (1|cohort), data=child.data)
summary(randint.model)
## Linear mixed model fit by REML ['lmerMod']
## Formula: birthweight ~ gestational.age + gender + homebirth + (1 | cohort)
##    Data: child.data
## 
## REML criterion at convergence: 44566
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.851 -0.651  0.034  0.688  3.335 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  cohort   (Intercept)  210     14.5    
##  Residual             3403     58.3    
## Number of obs: 4059, groups:  cohort, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)     -660.056     44.685   -14.8
## gestational.age   65.472      1.461    44.8
## genderM          -14.081      1.938    -7.3
## homebirth          0.692      0.277     2.5
## 
## Correlation of Fixed Effects:
##             (Intr) gsttn. gendrM
## gestatinl.g -0.965              
## genderM     -0.066  0.046       
## homebirth   -0.184 -0.070  0.011

Note, that all the coefficient estimates are significant (|t-value| > 2). Moreover, the coefficient estimate for the cohort specific variable is now much smaller than in the model with dummy variables for each cohort study (140.5824). Thus, in the model with a random intercept the variation between the cohort studies is explained by the random intercepts instead of this variable.

To get the random parts of the intercept for each of the cohort studies we can run the following chunk of code:

ranef(randint.model)
## $cohort
##   (Intercept)
## A      18.438
## B      11.298
## C      -2.289
## D     -17.353
## E     -19.372
## F     -24.114
## G      -9.648
## H      -7.456
## I      16.265
## J       4.135
## K       3.331
## L      -1.847
## M      -9.361
## N      -2.059
## O      -7.588
## P      27.190
## Q      20.036
## R       0.651
## S       0.238
## T      -0.495

We can also calculate the ICC. Remember the rule of thumb: if the ICC > 5% it is advised to use a mixed effects model.

varcor <- as.data.frame(VarCorr(randint.model))
ICC <- varcor[1,4]/(varcor[1,4] + varcor[2,4])
ICC
## [1] 0.0581

Hence, for this study it was a good choice to use a mixed effects model.