1.8 Exercises

Exercise 1.1 (Reporting practice) What are the reporting practices in your field? Take a random sample of articles that have appeared during the last 10 years in the leading journal in your field. Select only those that present quantitative analyses, and address the following topics:

  1. Did the authors report that there were missing data?

  2. If not, can you infer from the text that there must have been missing data?

  3. Did the authors discuss how they handled the missing data?

  4. Were the missing data properly addressed?

  5. Can you detect a trend over time in reporting practice?

  6. Would the editors of the journal be interested in your findings?

Exercise 1.2 (Loss of information) Suppose that a dataset consists of 100 cases and 10 variables. Each variable contains 10% missing values. What is the largest possible subsample under listwise deletion? What is the smallest? If the missing values in each variable are MCAR and occur independently across variables, how many cases can be expected to remain?
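
One way to check a pencil-and-paper answer to the MCAR part is a small simulation. The sketch below is only an illustration in base R. It assumes that each variable has exactly 10 missing values, placed at random and independently of the other variables. The names n_sim and count_complete are hypothetical, and the seed is arbitrary.

```r
# Simulation sketch: how many cases survive listwise deletion?
# Assumption: exactly 10% missing per variable, positions drawn at random
# and independently across variables (MCAR).
set.seed(1)                      # arbitrary seed, for reproducibility only

n <- 100                         # number of cases
p <- 10                          # number of variables
n_miss <- round(0.10 * n)        # missing values per variable
n_sim <- 2000                    # number of simulated datasets (hypothetical choice)

count_complete <- function() {
  # n x p logical matrix: TRUE marks a missing cell
  miss <- replicate(p, seq_len(n) %in% sample(n, n_miss))
  # a case survives listwise deletion only if none of its cells is missing
  sum(rowSums(miss) == 0)
}

remaining <- replicate(n_sim, count_complete())

summary(remaining)               # distribution of the complete-case sample size
hist(remaining, main = "Cases remaining under listwise deletion",
     xlab = "Number of complete cases")
```

Comparing the simulated mean with the analytic expectation, and the simulated range with the theoretical bounds, shows how far a typical MCAR dataset lies from the best and worst cases.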

Exercise 1.3 (Stochastic regression imputation) The correlation of the data in Figure 1.4 is equal to 0.33. This is relatively low compared to the other correlations reported in Section 1.3. This seems to contradict the statement that stochastic regression imputation does not bias the correlation. Could this low correlation be due to random variation?

  1. Rerun the code with a different seed value. What is the correlation now?

  2. Write a loop to apply stochastic regression imputation with the seed increasing from 1 to 1000. Calculate the regression weight and the correlation for each solution, and plot a histogram of each. What are the mean, minimum and maximum values of the correlation?

  3. Do your results indicate that stochastic regression imputation alters the correlation?
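
A possible skeleton for the loop in step 2 is sketched below. It assumes that the data behind Figure 1.4 are the Ozone and Solar.R columns of airquality, and that stochastic regression imputation is obtained with the mice package through method = "norm.nob" with a single imputation and a single iteration. Neither detail is stated in this section, so adjust them to match the code you are rerunning. The helper impute_once is a hypothetical name.

```r
library(mice)

# Assumption: these two columns are the data shown in Figure 1.4
data <- airquality[, c("Ozone", "Solar.R")]

# Hypothetical helper: impute once with a given seed and return the
# regression weight of Solar.R and the correlation in the completed data
impute_once <- function(seed) {
  imp <- mice(data, method = "norm.nob", m = 1, maxit = 1,
              seed = seed, printFlag = FALSE)
  completed <- complete(imp)
  c(weight = unname(coef(lm(Ozone ~ Solar.R, data = completed))[2]),
    correlation = cor(completed$Ozone, completed$Solar.R))
}

n_seeds <- 1000
results <- t(sapply(seq_len(n_seeds), impute_once))  # one row per seed

# Mean, minimum and maximum of the correlation over all seeds
summary(results[, "correlation"])

hist(results[, "weight"], main = "Regression weight", xlab = "Weight")
hist(results[, "correlation"], main = "Correlation", xlab = "Correlation")
```

Keeping the per-seed results in a matrix makes it easy to look up afterwards which seeds produced the extreme correlations.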

Exercise 1.4 (Stochastic regression imputation (continued)) The largest correlation found in the previous exercise exceeds the value found in Section 1.3.4. This seems odd since the correlation of the imputed values under regression imputation is equal to 1, and hence the imputed data have a maximal contribution to the overall correlation.

  1. Can you explain why this could happen?

  2. Adapt the code from the previous exercise to test your explanation. Was your explanation satisfactory?

  3. If not, can you think of another reason, and test that? Hint: Find out what is special about the solutions with the largest correlations.

Exercise 1.5 (Nonlinear model) The model fitted to the airquality data in Section 1.4.3 is a simple linear model. Inspection of the residuals reveals that there is a slight curvature in the average of the residuals.

  1. Start from the complete cases, and use plot(fit) to obtain diagnostic plots. Can you explain why the curvature shows up?

  2. Experiment with solutions, e.g., by transforming Ozone or by adding a quadratic term to the model. Can you make the curvature disappear? Does the amount of explained variance increase?

  3. Does the curvature also show up in the imputed data? If so, does the same solution work? Hint: You can assess the \(j^\mathrm{th}\) fitted model by getfit(fit, j), where fit was created by with(imp,...).

  4. Advanced: Do you think your solution would necessitate drawing new imputations?
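
The following sketch shows one way to set up the comparisons in steps 1 to 3. It assumes that the linear model of Section 1.4.3 regresses Ozone on Wind, Temp and Solar.R, and that the imputations come from a default call to mice. Both are assumptions, as are the seed and the object names fit_cc, fit_log and fit_quad. The two alternative models are merely candidates to experiment with, not a recommended solution.

```r
library(mice)

# Step 1: complete-case fit; lm() drops incomplete rows by default
fit_cc <- lm(Ozone ~ Wind + Temp + Solar.R, data = airquality)
plot(fit_cc, which = 1)          # residuals versus fitted values

# Step 2: two candidate remedies (illustrative choices only)
fit_log  <- lm(log(Ozone) ~ Wind + Temp + Solar.R, data = airquality)
fit_quad <- lm(Ozone ~ Wind + I(Wind^2) + Temp + Solar.R, data = airquality)
plot(fit_log, which = 1)
plot(fit_quad, which = 1)

# Explained variance; the log model is on a different outcome scale,
# so its R-squared is not directly comparable with the other two
r2 <- c(linear    = summary(fit_cc)$r.squared,
        quadratic = summary(fit_quad)$r.squared,
        log       = summary(fit_log)$r.squared)
print(r2)

# Step 3: the same diagnostic on imputed data (assumed default imputation)
imp <- mice(airquality, seed = 1, printFlag = FALSE)
fit <- with(imp, lm(Ozone ~ Wind + Temp + Solar.R))
plot(getfit(fit, 1), which = 1)  # residual plot for the first imputed dataset
```

Replacing the index 1 in getfit(fit, 1) by another value between 1 and the number of imputations shows whether the pattern is stable across imputed datasets.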