This is an R Markdown document. When you click the Knit button an HTML document will be generated that includes both content as well as the output of any embedded R code chunks within the document. Moreover, clicking on the green triangles in the right upper corner of code chunks will run small parts of the code. This will be most convenient when we go through all the practicals step by step. Moreover, it is possible to following everything we do by means of the HTML document.

Combining Datasets & Missing Data

Lets assume we have two datsets, which we want to combine. This can be done in two ways: join and add. When we want to join two datasets they need to have some similar subjects (the variables may differ). When we want to add two datasets they need to contain similar variables, but may contain different subjects.

Join two datasets

First, we will generate two datasets A and B, which have some similar subjects and different variables:

#the randomly generated numbers will be same each time we use this set.seed
set.seed(40917) 
df <- data.frame(subject = seq(1, 15, 1), 
                 mean = seq(10, 24, 1), 
                 sd = seq(2, 2.14, 0.01))

datasetA <- cbind(seq(1, 15,1), 
                  data.frame(matrix(rnorm(15*3), 15, 3) * df$sd + df$mean))
datasetA[, 2:4] <- round(datasetA[, 2:4], 2)
names(datasetA) <- c("subjectID", "X1", "X2", "X3")

df <- data.frame(subject = seq(1, 15, 1), 
                 mean = seq(110, 124, 1), 
                 sd = seq(2, 2.14, 0.01))
datasetB <- cbind(seq(8, 22,1), 
                  data.frame(matrix(rnorm(15*3), 15, 3) * df$sd + df$mean))
datasetB[, 2:4] <- round(datasetB[, 2:4], 2)
names(datasetB) <- c("subjectID", "X4", "X5", "X6")

This leads to the following summary statistics, where datasetA contains subjects 1 to 15, and datasetB contains subjects 8 to 22:

datasetA #subjectIDs from 1 to 15

##    subjectID    X1    X2    X3
## 1          1  7.93  8.53  8.76
## 2          2  6.92 11.91 11.28
## 3          3  8.80 13.02 13.50
## 4          4 12.60 11.15 12.29
## 5          5 13.23 13.08  9.33
## 6          6 14.39 16.65 14.55
## 7          7 12.24 14.87 13.65
## 8          8 10.79 18.58 15.09
## 9          9 18.26 15.21 16.57
## 10        10 16.66 17.40 19.75
## 11        11 20.67 19.79 20.33
## 12        12 20.05 21.01 21.83
## 13        13 22.31 19.55 20.96
## 14        14 22.94 24.52 28.03
## 15        15 26.90 24.45 22.41

datasetB #subjectIDs from 8 tot 22.

##    subjectID     X4     X5     X6
## 1          8 111.15 108.22 112.64
## 2          9 111.85 106.89 110.61
## 3         10 112.05 112.41 109.47
## 4         11 110.50 116.91 107.40
## 5         12 116.88 115.25 113.11
## 6         13 115.87 113.88 114.52
## 7         14 117.34 116.56 116.28
## 8         15 121.13 114.77 118.78
## 9         16 117.66 118.53 119.89
## 10        17 119.91 119.56 120.85
## 11        18 119.10 121.02 120.11
## 12        19 123.36 121.97 120.31
## 13        20 121.59 121.11 118.18
## 14        21 123.76 122.39 126.91
## 15        22 125.10 125.48 123.59

Inner Join

With Inner join only keep the subjects that exists in both datasets:

AB.innerjoin <- join(datasetA, datasetB, by = "subjectID", type = "inner")
AB.innerjoin #keep subjects 8 to 15

##   subjectID    X1    X2    X3     X4     X5     X6
## 1         8 10.79 18.58 15.09 111.15 108.22 112.64
## 2         9 18.26 15.21 16.57 111.85 106.89 110.61
## 3        10 16.66 17.40 19.75 112.05 112.41 109.47
## 4        11 20.67 19.79 20.33 110.50 116.91 107.40
## 5        12 20.05 21.01 21.83 116.88 115.25 113.11
## 6        13 22.31 19.55 20.96 115.87 113.88 114.52
## 7        14 22.94 24.52 28.03 117.34 116.56 116.28
## 8        15 26.90 24.45 22.41 121.13 114.77 118.78

Note, we have 8 observations (for subjectID 8 untill 15) and that for each subject we have an observation for each variabele.

Full Outer Join

With Full outer join keep all subjects:

AB.fullouterjoin <- join(datasetA, datasetB, by = "subjectID", type = "full")
AB.fullouterjoin #keep all subjects

##    subjectID    X1    X2    X3     X4     X5     X6
## 1          1  7.93  8.53  8.76     NA     NA     NA
## 2          2  6.92 11.91 11.28     NA     NA     NA
## 3          3  8.80 13.02 13.50     NA     NA     NA
## 4          4 12.60 11.15 12.29     NA     NA     NA
## 5          5 13.23 13.08  9.33     NA     NA     NA
## 6          6 14.39 16.65 14.55     NA     NA     NA
## 7          7 12.24 14.87 13.65     NA     NA     NA
## 8          8 10.79 18.58 15.09 111.15 108.22 112.64
## 9          9 18.26 15.21 16.57 111.85 106.89 110.61
## 10        10 16.66 17.40 19.75 112.05 112.41 109.47
## 11        11 20.67 19.79 20.33 110.50 116.91 107.40
## 12        12 20.05 21.01 21.83 116.88 115.25 113.11
## 13        13 22.31 19.55 20.96 115.87 113.88 114.52
## 14        14 22.94 24.52 28.03 117.34 116.56 116.28
## 15        15 26.90 24.45 22.41 121.13 114.77 118.78
## 16        16    NA    NA    NA 117.66 118.53 119.89
## 17        17    NA    NA    NA 119.91 119.56 120.85
## 18        18    NA    NA    NA 119.10 121.02 120.11
## 19        19    NA    NA    NA 123.36 121.97 120.31
## 20        20    NA    NA    NA 121.59 121.11 118.18
## 21        21    NA    NA    NA 123.76 122.39 126.91
## 22        22    NA    NA    NA 125.10 125.48 123.59

Note, we have 22 observations and there are some non-availables (NA’s) for each variabele. We have NA’s for X1 till X3 for subjectID 16 till 22 and NA’s for X4 till X6 for subjectID 1 till 7.

Master Join

With Master join (left outer join) keep all subjects of one dataset and only the matching rows of the other:

AB.leftjoin <- join(datasetA, datasetB, by = "subjectID", type = "left")
AB.leftjoin #keep all subjects from datasets A and match rows from B

##    subjectID    X1    X2    X3     X4     X5     X6
## 1          1  7.93  8.53  8.76     NA     NA     NA
## 2          2  6.92 11.91 11.28     NA     NA     NA
## 3          3  8.80 13.02 13.50     NA     NA     NA
## 4          4 12.60 11.15 12.29     NA     NA     NA
## 5          5 13.23 13.08  9.33     NA     NA     NA
## 6          6 14.39 16.65 14.55     NA     NA     NA
## 7          7 12.24 14.87 13.65     NA     NA     NA
## 8          8 10.79 18.58 15.09 111.15 108.22 112.64
## 9          9 18.26 15.21 16.57 111.85 106.89 110.61
## 10        10 16.66 17.40 19.75 112.05 112.41 109.47
## 11        11 20.67 19.79 20.33 110.50 116.91 107.40
## 12        12 20.05 21.01 21.83 116.88 115.25 113.11
## 13        13 22.31 19.55 20.96 115.87 113.88 114.52
## 14        14 22.94 24.52 28.03 117.34 116.56 116.28
## 15        15 26.90 24.45 22.41 121.13 114.77 118.78

Note, that we have 15 observations and NA’s for X4 till X6 for subjectIDs 1 till 7.

Detail Join

With Detail join (right outer join) keep all subjects of one dataset and only the matching rows of the other:

AB.rightjoin <- join(datasetA, datasetB, by = "subjectID", type = "right")
AB.rightjoin #keep all subjects from datasets A and match rows from B

##    subjectID    X1    X2    X3     X4     X5     X6
## 1          8 10.79 18.58 15.09 111.15 108.22 112.64
## 2          9 18.26 15.21 16.57 111.85 106.89 110.61
## 3         10 16.66 17.40 19.75 112.05 112.41 109.47
## 4         11 20.67 19.79 20.33 110.50 116.91 107.40
## 5         12 20.05 21.01 21.83 116.88 115.25 113.11
## 6         13 22.31 19.55 20.96 115.87 113.88 114.52
## 7         14 22.94 24.52 28.03 117.34 116.56 116.28
## 8         15 26.90 24.45 22.41 121.13 114.77 118.78
## 9         16    NA    NA    NA 117.66 118.53 119.89
## 10        17    NA    NA    NA 119.91 119.56 120.85
## 11        18    NA    NA    NA 119.10 121.02 120.11
## 12        19    NA    NA    NA 123.36 121.97 120.31
## 13        20    NA    NA    NA 121.59 121.11 118.18
## 14        21    NA    NA    NA 123.76 122.39 126.91
## 15        22    NA    NA    NA 125.10 125.48 123.59

Note, that we have 15 observations and NA’s for X1 till X3 for subjectIDs 16 till 22.

Add two datasets

Besides joining datasets, we can also add datasets. In this case we measured the same variables (not all have to be the same) on different subjects. First we will simulate two datasets C and D, with some variables similar and different subjects.

df <- data.frame(subject = seq(1, 15, 1), 
                 mean = seq(10, 24, 1), 
                 sd = seq(2, 2.14, 0.01))
datasetC <- cbind(seq(1, 15,1), 
                  data.frame(matrix(rnorm(15*3), 15, 3) * df$sd + df$mean))
datasetC[, 2:4] <- round(datasetC[, 2:4], 2)
names(datasetC) <- c("subjectID", "X1", "X2", "X3")

df <- data.frame(subject = seq(1, 15, 1), 
                 mean = seq(10, 24, 1), 
                 sd = seq(2, 2.14, 0.01))
datasetD <- cbind(seq(16, 30,1), 
                  data.frame(matrix(rnorm(15*3), 15, 3) * df$sd + df$mean))
datasetD[, 2:4] <- round(datasetD[, 2:4], 2)
names(datasetD) <- c("subjectID", "X4", "X2", "X3")

This leads to the following summary statistics, where datasetA contains subjectsIDs from 1 to 15 with variables X1, X2, and X3 and datasetB contains subjectIDs from 16 tot 30 with variables X4, X2, and X3:

datasetC

##    subjectID    X1    X2    X3
## 1          1 12.77 11.80 10.72
## 2          2 12.34  8.49 10.17
## 3          3 15.12 11.81 12.06
## 4          4 11.76 15.55 16.45
## 5          5 14.84  9.63 14.08
## 6          6 15.55 12.72 17.96
## 7          7 16.24 18.50 18.45
## 8          8 16.46 18.64 20.14
## 9          9 16.20 19.10 18.49
## 10        10 16.20 16.59 17.47
## 11        11 20.18 19.48 18.51
## 12        12 18.52 24.05 20.25
## 13        13 20.36 21.62 21.29
## 14        14 23.79 23.56 22.06
## 15        15 25.21 22.06 22.43

datasetD

##    subjectID    X4    X2    X3
## 1         16  9.76 10.29 10.18
## 2         17 10.17 11.91 10.36
## 3         18  9.08 13.56 14.37
## 4         19 13.75 11.21 11.29
## 5         20 13.90 11.99 14.75
## 6         21 16.34 18.64 14.43
## 7         22 18.44 14.91 16.33
## 8         23 15.44 18.09 14.31
## 9         24 14.35 18.63 19.76
## 10        25 22.58 22.20 19.36
## 11        26 17.62 17.02 18.80
## 12        27 22.59 19.09 26.22
## 13        28 23.22 23.63 17.00
## 14        29 26.57 24.15 19.24
## 15        30 23.21 20.21 24.71

When adding two dataframes that do not have all the same variables there are two options: 1. Drop the variables that are not similar 2. Keep the variables that are not similar and put them equal to NA for the other dataset.

Drop variables

This look as follows when we drop the variables that are not similar (in this case X1 in dataset C and X4 in dataset D):

datasetC.dropX1 <- subset(datasetC, select = c("subjectID", "X2", "X3"))
datasetD.dropX4 <- subset(datasetD, select = c("subjectID", "X2", "X3"))

datasetC.dropX1

##    subjectID    X2    X3
## 1          1 11.80 10.72
## 2          2  8.49 10.17
## 3          3 11.81 12.06
## 4          4 15.55 16.45
## 5          5  9.63 14.08
## 6          6 12.72 17.96
## 7          7 18.50 18.45
## 8          8 18.64 20.14
## 9          9 19.10 18.49
## 10        10 16.59 17.47
## 11        11 19.48 18.51
## 12        12 24.05 20.25
## 13        13 21.62 21.29
## 14        14 23.56 22.06
## 15        15 22.06 22.43

datasetD.dropX4

##    subjectID    X2    X3
## 1         16 10.29 10.18
## 2         17 11.91 10.36
## 3         18 13.56 14.37
## 4         19 11.21 11.29
## 5         20 11.99 14.75
## 6         21 18.64 14.43
## 7         22 14.91 16.33
## 8         23 18.09 14.31
## 9         24 18.63 19.76
## 10        25 22.20 19.36
## 11        26 17.02 18.80
## 12        27 19.09 26.22
## 13        28 23.63 17.00
## 14        29 24.15 19.24
## 15        30 20.21 24.71

Now that we dropped variables X1 and X4 we are left with two datasets that contain the same variables. Hence, we can add them.

add.CD.drop <- rbind(datasetC.dropX1, datasetD.dropX4)
add.CD.drop #subjectID are from 1 to 30.

##    subjectID    X2    X3
## 1          1 11.80 10.72
## 2          2  8.49 10.17
## 3          3 11.81 12.06
## 4          4 15.55 16.45
## 5          5  9.63 14.08
## 6          6 12.72 17.96
## 7          7 18.50 18.45
## 8          8 18.64 20.14
## 9          9 19.10 18.49
## 10        10 16.59 17.47
## 11        11 19.48 18.51
## 12        12 24.05 20.25
## 13        13 21.62 21.29
## 14        14 23.56 22.06
## 15        15 22.06 22.43
## 16        16 10.29 10.18
## 17        17 11.91 10.36
## 18        18 13.56 14.37
## 19        19 11.21 11.29
## 20        20 11.99 14.75
## 21        21 18.64 14.43
## 22        22 14.91 16.33
## 23        23 18.09 14.31
## 24        24 18.63 19.76
## 25        25 22.20 19.36
## 26        26 17.02 18.80
## 27        27 19.09 26.22
## 28        28 23.63 17.00
## 29        29 24.15 19.24
## 30        30 20.21 24.71

Keep Variables

However, normally we want to avoid dropping variables since they contain information. Hence, another way to add two datasets is to keep the variables that are not similar and make them NA for the other dataset:

datasetC.addX4 <- cbind(datasetC, rep("NA", 15))
names(datasetC.addX4) <- c("subjectID", "X1", "X2", "X3", "X4")

datasetD.addX1 <- as.data.frame(cbind(datasetD$subjectID, rep("NA", 15), 
                                      datasetD$X2, datasetD$X3, datasetD$X4))
names(datasetD.addX1) <- c("subjectID", "X1", "X2", "X3", "X4")

Now the datasets look as follows:

datasetC.addX4

##    subjectID    X1    X2    X3 X4
## 1          1 12.77 11.80 10.72 NA
## 2          2 12.34  8.49 10.17 NA
## 3          3 15.12 11.81 12.06 NA
## 4          4 11.76 15.55 16.45 NA
## 5          5 14.84  9.63 14.08 NA
## 6          6 15.55 12.72 17.96 NA
## 7          7 16.24 18.50 18.45 NA
## 8          8 16.46 18.64 20.14 NA
## 9          9 16.20 19.10 18.49 NA
## 10        10 16.20 16.59 17.47 NA
## 11        11 20.18 19.48 18.51 NA
## 12        12 18.52 24.05 20.25 NA
## 13        13 20.36 21.62 21.29 NA
## 14        14 23.79 23.56 22.06 NA
## 15        15 25.21 22.06 22.43 NA

datasetD.addX1

##    subjectID X1    X2    X3    X4
## 1         16 NA 10.29 10.18  9.76
## 2         17 NA 11.91 10.36 10.17
## 3         18 NA 13.56 14.37  9.08
## 4         19 NA 11.21 11.29 13.75
## 5         20 NA 11.99 14.75  13.9
## 6         21 NA 18.64 14.43 16.34
## 7         22 NA 14.91 16.33 18.44
## 8         23 NA 18.09 14.31 15.44
## 9         24 NA 18.63 19.76 14.35
## 10        25 NA  22.2 19.36 22.58
## 11        26 NA 17.02  18.8 17.62
## 12        27 NA 19.09 26.22 22.59
## 13        28 NA 23.63    17 23.22
## 14        29 NA 24.15 19.24 26.57
## 15        30 NA 20.21 24.71 23.21

Now, we can add the two datasets:

add.CD.keep <- rbind(datasetC.addX4, datasetD.addX1)
add.CD.keep

##    subjectID    X1    X2    X3    X4
## 1          1 12.77  11.8 10.72    NA
## 2          2 12.34  8.49 10.17    NA
## 3          3 15.12 11.81 12.06    NA
## 4          4 11.76 15.55 16.45    NA
## 5          5 14.84  9.63 14.08    NA
## 6          6 15.55 12.72 17.96    NA
## 7          7 16.24  18.5 18.45    NA
## 8          8 16.46 18.64 20.14    NA
## 9          9  16.2  19.1 18.49    NA
## 10        10  16.2 16.59 17.47    NA
## 11        11 20.18 19.48 18.51    NA
## 12        12 18.52 24.05 20.25    NA
## 13        13 20.36 21.62 21.29    NA
## 14        14 23.79 23.56 22.06    NA
## 15        15 25.21 22.06 22.43    NA
## 16        16    NA 10.29 10.18  9.76
## 17        17    NA 11.91 10.36 10.17
## 18        18    NA 13.56 14.37  9.08
## 19        19    NA 11.21 11.29 13.75
## 20        20    NA 11.99 14.75  13.9
## 21        21    NA 18.64 14.43 16.34
## 22        22    NA 14.91 16.33 18.44
## 23        23    NA 18.09 14.31 15.44
## 24        24    NA 18.63 19.76 14.35
## 25        25    NA  22.2 19.36 22.58
## 26        26    NA 17.02  18.8 17.62
## 27        27    NA 19.09 26.22 22.59
## 28        28    NA 23.63    17 23.22
## 29        29    NA 24.15 19.24 26.57
## 30        30    NA 20.21 24.71 23.21

So to conclude, we can join and add datasets. If we have observations from similar subjects on different variabiles we can join the datasets in four ways:

Inner join
Outer join
Master join
Detail join

When two datasets measures some similar variables on different subjects we can add theses datasets. To do this we have to decide on how to handle variables that were not included in both datasets:

Drop these variables
Keep these variables

I: Combining Datasets & Missing Data

Manon Grevinga, Stef van Buuren

Practical 1 of 6