This is an R Markdown document. When you click the Knit button an HTML document will be generated that includes both content as well as the output of any embedded R code chunks within the document. Moreover, clicking on the green triangles in the right upper corner of code chunks will run small parts of the code. This will be most convenient when we go through all the practicals step by step. Moreover, it is possible to following everything we do by means of the HTML document.
First, we need to install packages that we need during the workshop.
install.packages(c("mice", "lme4", "dplyr", "plyr", "mlmRev"))
In practical I, we are using only the plyr package.
library(plyr)
Lets assume we have two datsets, which we want to combine. This can be done in two ways: join and add. When we want to join two datasets they need to have some similar subjects (the variables may differ). When we want to add two datasets they need to contain similar variables, but may contain different subjects.
First, we will generate two datasets A and B, which have some similar subjects and different variables:
#the randomly generated numbers will be same each time we use this set.seed
set.seed(40917)
df <- data.frame(subject = seq(1, 15, 1),
mean = seq(10, 24, 1),
sd = seq(2, 2.14, 0.01))
datasetA <- cbind(seq(1, 15,1),
data.frame(matrix(rnorm(15*3), 15, 3) * df$sd + df$mean))
datasetA[, 2:4] <- round(datasetA[, 2:4], 2)
names(datasetA) <- c("subjectID", "X1", "X2", "X3")
df <- data.frame(subject = seq(1, 15, 1),
mean = seq(110, 124, 1),
sd = seq(2, 2.14, 0.01))
datasetB <- cbind(seq(8, 22,1),
data.frame(matrix(rnorm(15*3), 15, 3) * df$sd + df$mean))
datasetB[, 2:4] <- round(datasetB[, 2:4], 2)
names(datasetB) <- c("subjectID", "X4", "X5", "X6")
This leads to the following summary statistics, where datasetA contains subjects 1 to 15, and datasetB contains subjects 8 to 22:
datasetA #subjectIDs from 1 to 15
## subjectID X1 X2 X3
## 1 1 7.93 8.53 8.76
## 2 2 6.92 11.91 11.28
## 3 3 8.80 13.02 13.50
## 4 4 12.60 11.15 12.29
## 5 5 13.23 13.08 9.33
## 6 6 14.39 16.65 14.55
## 7 7 12.24 14.87 13.65
## 8 8 10.79 18.58 15.09
## 9 9 18.26 15.21 16.57
## 10 10 16.66 17.40 19.75
## 11 11 20.67 19.79 20.33
## 12 12 20.05 21.01 21.83
## 13 13 22.31 19.55 20.96
## 14 14 22.94 24.52 28.03
## 15 15 26.90 24.45 22.41
datasetB #subjectIDs from 8 tot 22.
## subjectID X4 X5 X6
## 1 8 111.15 108.22 112.64
## 2 9 111.85 106.89 110.61
## 3 10 112.05 112.41 109.47
## 4 11 110.50 116.91 107.40
## 5 12 116.88 115.25 113.11
## 6 13 115.87 113.88 114.52
## 7 14 117.34 116.56 116.28
## 8 15 121.13 114.77 118.78
## 9 16 117.66 118.53 119.89
## 10 17 119.91 119.56 120.85
## 11 18 119.10 121.02 120.11
## 12 19 123.36 121.97 120.31
## 13 20 121.59 121.11 118.18
## 14 21 123.76 122.39 126.91
## 15 22 125.10 125.48 123.59
With Inner join only keep the subjects that exists in both datasets:
AB.innerjoin <- join(datasetA, datasetB, by = "subjectID", type = "inner")
AB.innerjoin #keep subjects 8 to 15
## subjectID X1 X2 X3 X4 X5 X6
## 1 8 10.79 18.58 15.09 111.15 108.22 112.64
## 2 9 18.26 15.21 16.57 111.85 106.89 110.61
## 3 10 16.66 17.40 19.75 112.05 112.41 109.47
## 4 11 20.67 19.79 20.33 110.50 116.91 107.40
## 5 12 20.05 21.01 21.83 116.88 115.25 113.11
## 6 13 22.31 19.55 20.96 115.87 113.88 114.52
## 7 14 22.94 24.52 28.03 117.34 116.56 116.28
## 8 15 26.90 24.45 22.41 121.13 114.77 118.78
Note, we have 8 observations (for subjectID 8 untill 15) and that for each subject we have an observation for each variabele.
With Full outer join keep all subjects:
AB.fullouterjoin <- join(datasetA, datasetB, by = "subjectID", type = "full")
AB.fullouterjoin #keep all subjects
## subjectID X1 X2 X3 X4 X5 X6
## 1 1 7.93 8.53 8.76 NA NA NA
## 2 2 6.92 11.91 11.28 NA NA NA
## 3 3 8.80 13.02 13.50 NA NA NA
## 4 4 12.60 11.15 12.29 NA NA NA
## 5 5 13.23 13.08 9.33 NA NA NA
## 6 6 14.39 16.65 14.55 NA NA NA
## 7 7 12.24 14.87 13.65 NA NA NA
## 8 8 10.79 18.58 15.09 111.15 108.22 112.64
## 9 9 18.26 15.21 16.57 111.85 106.89 110.61
## 10 10 16.66 17.40 19.75 112.05 112.41 109.47
## 11 11 20.67 19.79 20.33 110.50 116.91 107.40
## 12 12 20.05 21.01 21.83 116.88 115.25 113.11
## 13 13 22.31 19.55 20.96 115.87 113.88 114.52
## 14 14 22.94 24.52 28.03 117.34 116.56 116.28
## 15 15 26.90 24.45 22.41 121.13 114.77 118.78
## 16 16 NA NA NA 117.66 118.53 119.89
## 17 17 NA NA NA 119.91 119.56 120.85
## 18 18 NA NA NA 119.10 121.02 120.11
## 19 19 NA NA NA 123.36 121.97 120.31
## 20 20 NA NA NA 121.59 121.11 118.18
## 21 21 NA NA NA 123.76 122.39 126.91
## 22 22 NA NA NA 125.10 125.48 123.59
Note, we have 22 observations and there are some non-availables (NA’s) for each variabele. We have NA’s for X1 till X3 for subjectID 16 till 22 and NA’s for X4 till X6 for subjectID 1 till 7.
With Master join (left outer join) keep all subjects of one dataset and only the matching rows of the other:
AB.leftjoin <- join(datasetA, datasetB, by = "subjectID", type = "left")
AB.leftjoin #keep all subjects from datasets A and match rows from B
## subjectID X1 X2 X3 X4 X5 X6
## 1 1 7.93 8.53 8.76 NA NA NA
## 2 2 6.92 11.91 11.28 NA NA NA
## 3 3 8.80 13.02 13.50 NA NA NA
## 4 4 12.60 11.15 12.29 NA NA NA
## 5 5 13.23 13.08 9.33 NA NA NA
## 6 6 14.39 16.65 14.55 NA NA NA
## 7 7 12.24 14.87 13.65 NA NA NA
## 8 8 10.79 18.58 15.09 111.15 108.22 112.64
## 9 9 18.26 15.21 16.57 111.85 106.89 110.61
## 10 10 16.66 17.40 19.75 112.05 112.41 109.47
## 11 11 20.67 19.79 20.33 110.50 116.91 107.40
## 12 12 20.05 21.01 21.83 116.88 115.25 113.11
## 13 13 22.31 19.55 20.96 115.87 113.88 114.52
## 14 14 22.94 24.52 28.03 117.34 116.56 116.28
## 15 15 26.90 24.45 22.41 121.13 114.77 118.78
Note, that we have 15 observations and NA’s for X4 till X6 for subjectIDs 1 till 7.
With Detail join (right outer join) keep all subjects of one dataset and only the matching rows of the other:
AB.rightjoin <- join(datasetA, datasetB, by = "subjectID", type = "right")
AB.rightjoin #keep all subjects from datasets A and match rows from B
## subjectID X1 X2 X3 X4 X5 X6
## 1 8 10.79 18.58 15.09 111.15 108.22 112.64
## 2 9 18.26 15.21 16.57 111.85 106.89 110.61
## 3 10 16.66 17.40 19.75 112.05 112.41 109.47
## 4 11 20.67 19.79 20.33 110.50 116.91 107.40
## 5 12 20.05 21.01 21.83 116.88 115.25 113.11
## 6 13 22.31 19.55 20.96 115.87 113.88 114.52
## 7 14 22.94 24.52 28.03 117.34 116.56 116.28
## 8 15 26.90 24.45 22.41 121.13 114.77 118.78
## 9 16 NA NA NA 117.66 118.53 119.89
## 10 17 NA NA NA 119.91 119.56 120.85
## 11 18 NA NA NA 119.10 121.02 120.11
## 12 19 NA NA NA 123.36 121.97 120.31
## 13 20 NA NA NA 121.59 121.11 118.18
## 14 21 NA NA NA 123.76 122.39 126.91
## 15 22 NA NA NA 125.10 125.48 123.59
Note, that we have 15 observations and NA’s for X1 till X3 for subjectIDs 16 till 22.
Besides joining datasets, we can also add datasets. In this case we measured the same variables (not all have to be the same) on different subjects. First we will simulate two datasets C and D, with some variables similar and different subjects.
df <- data.frame(subject = seq(1, 15, 1),
mean = seq(10, 24, 1),
sd = seq(2, 2.14, 0.01))
datasetC <- cbind(seq(1, 15,1),
data.frame(matrix(rnorm(15*3), 15, 3) * df$sd + df$mean))
datasetC[, 2:4] <- round(datasetC[, 2:4], 2)
names(datasetC) <- c("subjectID", "X1", "X2", "X3")
df <- data.frame(subject = seq(1, 15, 1),
mean = seq(10, 24, 1),
sd = seq(2, 2.14, 0.01))
datasetD <- cbind(seq(16, 30,1),
data.frame(matrix(rnorm(15*3), 15, 3) * df$sd + df$mean))
datasetD[, 2:4] <- round(datasetD[, 2:4], 2)
names(datasetD) <- c("subjectID", "X4", "X2", "X3")
This leads to the following summary statistics, where datasetA contains subjectsIDs from 1 to 15 with variables X1, X2, and X3 and datasetB contains subjectIDs from 16 tot 30 with variables X4, X2, and X3:
datasetC
## subjectID X1 X2 X3
## 1 1 12.77 11.80 10.72
## 2 2 12.34 8.49 10.17
## 3 3 15.12 11.81 12.06
## 4 4 11.76 15.55 16.45
## 5 5 14.84 9.63 14.08
## 6 6 15.55 12.72 17.96
## 7 7 16.24 18.50 18.45
## 8 8 16.46 18.64 20.14
## 9 9 16.20 19.10 18.49
## 10 10 16.20 16.59 17.47
## 11 11 20.18 19.48 18.51
## 12 12 18.52 24.05 20.25
## 13 13 20.36 21.62 21.29
## 14 14 23.79 23.56 22.06
## 15 15 25.21 22.06 22.43
datasetD
## subjectID X4 X2 X3
## 1 16 9.76 10.29 10.18
## 2 17 10.17 11.91 10.36
## 3 18 9.08 13.56 14.37
## 4 19 13.75 11.21 11.29
## 5 20 13.90 11.99 14.75
## 6 21 16.34 18.64 14.43
## 7 22 18.44 14.91 16.33
## 8 23 15.44 18.09 14.31
## 9 24 14.35 18.63 19.76
## 10 25 22.58 22.20 19.36
## 11 26 17.62 17.02 18.80
## 12 27 22.59 19.09 26.22
## 13 28 23.22 23.63 17.00
## 14 29 26.57 24.15 19.24
## 15 30 23.21 20.21 24.71
When adding two dataframes that do not have all the same variables there are two options: 1. Drop the variables that are not similar 2. Keep the variables that are not similar and put them equal to NA for the other dataset.
This look as follows when we drop the variables that are not similar (in this case X1 in dataset C and X4 in dataset D):
datasetC.dropX1 <- subset(datasetC, select = c("subjectID", "X2", "X3"))
datasetD.dropX4 <- subset(datasetD, select = c("subjectID", "X2", "X3"))
datasetC.dropX1
## subjectID X2 X3
## 1 1 11.80 10.72
## 2 2 8.49 10.17
## 3 3 11.81 12.06
## 4 4 15.55 16.45
## 5 5 9.63 14.08
## 6 6 12.72 17.96
## 7 7 18.50 18.45
## 8 8 18.64 20.14
## 9 9 19.10 18.49
## 10 10 16.59 17.47
## 11 11 19.48 18.51
## 12 12 24.05 20.25
## 13 13 21.62 21.29
## 14 14 23.56 22.06
## 15 15 22.06 22.43
datasetD.dropX4
## subjectID X2 X3
## 1 16 10.29 10.18
## 2 17 11.91 10.36
## 3 18 13.56 14.37
## 4 19 11.21 11.29
## 5 20 11.99 14.75
## 6 21 18.64 14.43
## 7 22 14.91 16.33
## 8 23 18.09 14.31
## 9 24 18.63 19.76
## 10 25 22.20 19.36
## 11 26 17.02 18.80
## 12 27 19.09 26.22
## 13 28 23.63 17.00
## 14 29 24.15 19.24
## 15 30 20.21 24.71
Now that we dropped variables X1 and X4 we are left with two datasets that contain the same variables. Hence, we can add them.
add.CD.drop <- rbind(datasetC.dropX1, datasetD.dropX4)
add.CD.drop #subjectID are from 1 to 30.
## subjectID X2 X3
## 1 1 11.80 10.72
## 2 2 8.49 10.17
## 3 3 11.81 12.06
## 4 4 15.55 16.45
## 5 5 9.63 14.08
## 6 6 12.72 17.96
## 7 7 18.50 18.45
## 8 8 18.64 20.14
## 9 9 19.10 18.49
## 10 10 16.59 17.47
## 11 11 19.48 18.51
## 12 12 24.05 20.25
## 13 13 21.62 21.29
## 14 14 23.56 22.06
## 15 15 22.06 22.43
## 16 16 10.29 10.18
## 17 17 11.91 10.36
## 18 18 13.56 14.37
## 19 19 11.21 11.29
## 20 20 11.99 14.75
## 21 21 18.64 14.43
## 22 22 14.91 16.33
## 23 23 18.09 14.31
## 24 24 18.63 19.76
## 25 25 22.20 19.36
## 26 26 17.02 18.80
## 27 27 19.09 26.22
## 28 28 23.63 17.00
## 29 29 24.15 19.24
## 30 30 20.21 24.71
However, normally we want to avoid dropping variables since they contain information. Hence, another way to add two datasets is to keep the variables that are not similar and make them NA for the other dataset:
datasetC.addX4 <- cbind(datasetC, rep("NA", 15))
names(datasetC.addX4) <- c("subjectID", "X1", "X2", "X3", "X4")
datasetD.addX1 <- as.data.frame(cbind(datasetD$subjectID, rep("NA", 15),
datasetD$X2, datasetD$X3, datasetD$X4))
names(datasetD.addX1) <- c("subjectID", "X1", "X2", "X3", "X4")
Now the datasets look as follows:
datasetC.addX4
## subjectID X1 X2 X3 X4
## 1 1 12.77 11.80 10.72 NA
## 2 2 12.34 8.49 10.17 NA
## 3 3 15.12 11.81 12.06 NA
## 4 4 11.76 15.55 16.45 NA
## 5 5 14.84 9.63 14.08 NA
## 6 6 15.55 12.72 17.96 NA
## 7 7 16.24 18.50 18.45 NA
## 8 8 16.46 18.64 20.14 NA
## 9 9 16.20 19.10 18.49 NA
## 10 10 16.20 16.59 17.47 NA
## 11 11 20.18 19.48 18.51 NA
## 12 12 18.52 24.05 20.25 NA
## 13 13 20.36 21.62 21.29 NA
## 14 14 23.79 23.56 22.06 NA
## 15 15 25.21 22.06 22.43 NA
datasetD.addX1
## subjectID X1 X2 X3 X4
## 1 16 NA 10.29 10.18 9.76
## 2 17 NA 11.91 10.36 10.17
## 3 18 NA 13.56 14.37 9.08
## 4 19 NA 11.21 11.29 13.75
## 5 20 NA 11.99 14.75 13.9
## 6 21 NA 18.64 14.43 16.34
## 7 22 NA 14.91 16.33 18.44
## 8 23 NA 18.09 14.31 15.44
## 9 24 NA 18.63 19.76 14.35
## 10 25 NA 22.2 19.36 22.58
## 11 26 NA 17.02 18.8 17.62
## 12 27 NA 19.09 26.22 22.59
## 13 28 NA 23.63 17 23.22
## 14 29 NA 24.15 19.24 26.57
## 15 30 NA 20.21 24.71 23.21
Now, we can add the two datasets:
add.CD.keep <- rbind(datasetC.addX4, datasetD.addX1)
add.CD.keep
## subjectID X1 X2 X3 X4
## 1 1 12.77 11.8 10.72 NA
## 2 2 12.34 8.49 10.17 NA
## 3 3 15.12 11.81 12.06 NA
## 4 4 11.76 15.55 16.45 NA
## 5 5 14.84 9.63 14.08 NA
## 6 6 15.55 12.72 17.96 NA
## 7 7 16.24 18.5 18.45 NA
## 8 8 16.46 18.64 20.14 NA
## 9 9 16.2 19.1 18.49 NA
## 10 10 16.2 16.59 17.47 NA
## 11 11 20.18 19.48 18.51 NA
## 12 12 18.52 24.05 20.25 NA
## 13 13 20.36 21.62 21.29 NA
## 14 14 23.79 23.56 22.06 NA
## 15 15 25.21 22.06 22.43 NA
## 16 16 NA 10.29 10.18 9.76
## 17 17 NA 11.91 10.36 10.17
## 18 18 NA 13.56 14.37 9.08
## 19 19 NA 11.21 11.29 13.75
## 20 20 NA 11.99 14.75 13.9
## 21 21 NA 18.64 14.43 16.34
## 22 22 NA 14.91 16.33 18.44
## 23 23 NA 18.09 14.31 15.44
## 24 24 NA 18.63 19.76 14.35
## 25 25 NA 22.2 19.36 22.58
## 26 26 NA 17.02 18.8 17.62
## 27 27 NA 19.09 26.22 22.59
## 28 28 NA 23.63 17 23.22
## 29 29 NA 24.15 19.24 26.57
## 30 30 NA 20.21 24.71 23.21
So to conclude, we can join and add datasets. If we have observations from similar subjects on different variabiles we can join the datasets in four ways:
When two datasets measures some similar variables on different subjects we can add theses datasets. To do this we have to decide on how to handle variables that were not included in both datasets: