11.1 Long and wide format

Longitudinal data can be coded into “long” and “wide” formats. A wide dataset will have one record for each individual. The observations made at different time points are coded as different columns. In the wide format every measure that varies in time occupies a set of columns. In the long format there will be multiple records for each individual. Some variables that do not vary in time are identical in each record, whereas other variables vary across the records. The long format also needs a “time” variable that records the time in each record, and an “id” variable that groups the records from the same person.

A simple example of the wide format is

id age Y1 Y2
 1  14 28 22
 2  12 34 16
 3  ...

In the long format, this dataset looks like

id age time  Y
 1  14    1 28
 1  14    2 22
 2  12    1 34
 2  12    2 16
 3  ...
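
The reshaping from wide to long can be sketched in R as follows. This is a minimal illustration that assumes the tidyr package is installed and uses pivot_longer(), the newer tidyr replacement for gather(). The object names wide and long are chosen here for illustration only.

```r
library(tidyr)

# Wide format: one record per individual (values from the wide example above).
wide <- data.frame(
  id  = c(1, 2),
  age = c(14, 12),
  Y1  = c(28, 34),
  Y2  = c(22, 16)
)

# Stack Y1 and Y2 into a single column Y. The column suffix (1 or 2)
# becomes the explicit "time" variable that the long format requires.
long <- pivot_longer(
  wide,
  cols = c(Y1, Y2),
  names_to = "time",
  names_prefix = "Y",
  values_to = "Y"
)
long$time <- as.integer(long$time)
long <- as.data.frame(long)

print(long)
```

The time-invariant variable age is repeated in every record of the same person, while Y varies across records.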

Note that the concepts of long and wide are general, and also apply to cross-sectional data. For example, we have seen the long format before in Section 5.1.3, where it referred to stacked imputed data that was produced by the complete() function. The basic idea is the same.

Both formats have their advantages. If the data are collected at the same time points, the wide format has no redundancy or repetition. Elementary statistical computations like calculating means, change scores, age-to-age correlations between time points, or the \(t\)-test are easy to do in this format. The long format is better at handling irregular and missed visits. Also, the long format has an explicit time variable available that can be used for analysis. Graphs and statistical analyses are easier in the long format.
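
To illustrate why elementary computations are easy in the wide format, the sketch below uses base R only. The data are simulated for illustration (they are not taken from any study), and the variable names follow the Y1, Y2 convention of the example above.

```r
# Simulated wide data: two measurements per person (illustrative values only).
set.seed(1)
n <- 30
wide <- data.frame(id = 1:n, Y1 = rnorm(n, mean = 30, sd = 5))
wide$Y2 <- wide$Y1 - 5 + rnorm(n, mean = 0, sd = 3)

# Means per time point: one column per occasion, so colMeans() suffices.
means <- colMeans(wide[, c("Y1", "Y2")])

# Change score: a simple difference between two columns.
wide$change <- wide$Y2 - wide$Y1

# Correlation between the two time points.
r12 <- cor(wide$Y1, wide$Y2)

# Paired t-test of the mean change between occasion 1 and occasion 2.
tt <- t.test(wide$Y2, wide$Y1, paired = TRUE)

print(means)
print(r12)
print(tt)
```

Each computation is a one-liner because every occasion has its own column. The same quantities would require grouping or reshaping in the long format.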

Applied researchers often collect, store and analyze their data in the wide format. Classic ANOVA and MANOVA techniques for repeated measures and structural equation models for longitudinal data assume the wide format. Modern multilevel techniques and statistical graphs, however, work only from the long format. The distinction between the two formats is a first stumbling block for those new to longitudinal analysis.
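
As an illustration of a multilevel analysis in the long format, the sketch below fits a random-intercept growth model. It assumes the nlme package (one of several possible choices, and not one prescribed by the text) and uses simulated data. A few records are deleted to mimic missed visits, which the long format accommodates by simply having fewer rows for those persons.

```r
library(nlme)

# Simulated long data: 40 persons, 4 planned visits each (illustrative only).
set.seed(2)
n <- 40
long <- data.frame(id = rep(1:n, each = 4), time = rep(0:3, times = n))
u <- rnorm(n, mean = 0, sd = 2)                 # person-specific intercepts
long$Y <- 30 - 2 * long$time + u[long$id] + rnorm(nrow(long), mean = 0, sd = 1)

# Missed visits: in the long format these records are simply absent.
long <- long[-c(3, 10, 27), ]

# Random-intercept model with a linear effect of the explicit time variable.
fit <- lme(Y ~ time, random = ~ 1 | id, data = long)
print(fixef(fit))
```

The model uses the id variable to group records and the time variable as a predictor, which is why it needs both columns to be present explicitly.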

Singer and Willett (2003) advise storing data in both formats. The wide and long formats can easily be converted into one another by means of the gather() and spread() functions in tidyr (Wickham and Grolemund 2017). The wide-to-long conversion can usually be done without a problem. The long-to-wide conversion can be difficult. If individuals are seen at different times, direct conversion is impractical: the number of columns in the wide format becomes overly large, and each column contains many missing values. An ad hoc solution is to create homogeneous time groups, which then become the new columns in the wide format. Such regrouping leads to a loss of precision in the time variable. For some studies this need not be a problem, but for others it will be.
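
The difficulty of the long-to-wide conversion with irregular visit times, and the ad hoc regrouping into time groups, can be sketched as follows. The example assumes the tidyr package and uses pivot_wider(), the newer replacement for spread(). The visit times, the outcome values, the cut points, and the names sparse, wave, and binned are all made up for illustration.

```r
library(tidyr)

# Long data with irregular visit times (hypothetical values).
long <- data.frame(
  id   = c(1, 1, 2, 2, 3, 3),
  time = c(0.9, 2.2, 1.1, 1.8, 1.0, 2.4),
  Y    = c(28, 22, 34, 16, 30, 25)
)

# Direct conversion: every distinct time becomes its own column,
# so the result is wide and mostly missing.
sparse <- pivot_wider(long, id_cols = id, names_from = time,
                      values_from = Y, names_prefix = "Y_")
print(sparse)

# Ad hoc solution: group times into two homogeneous time groups first.
# The cut points 0, 1.5 and 3 are an assumption made for this example.
long$wave <- cut(long$time, breaks = c(0, 1.5, 3), labels = c("1", "2"))
binned <- pivot_wider(long, id_cols = id, names_from = wave,
                      values_from = Y, names_prefix = "Y")
print(binned)
```

In the binned result, visits at times 0.9, 1.0 and 1.1 are all treated as occasion 1, which is the loss of precision in the time variable that the text refers to.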

Multiple imputation is somewhat more convenient in the wide format. Apart from the fact that the columns are ordered in time, there is nothing special about the imputation problem. We may thus apply techniques for single-level data to longitudinal data. Section 11.2 discusses an imputation technique in the wide format in a clinical trial application with the goal of performing a statistical analysis according to the intention-to-treat (ITT) principle. The longitudinal character of the data helps to specify the imputation model.
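
A minimal sketch of imputation in the wide format is given below. It assumes the mice package with its default settings and uses simulated data with values deleted at random from Y2. It is not the imputation model of Section 11.2, only an illustration that wide data can be passed to a single-level imputation routine as is.

```r
library(mice)

# Simulated wide data (illustrative only): age, and the outcome at two occasions.
set.seed(3)
n <- 100
wide <- data.frame(age = round(runif(n, 10, 16)),
                   Y1  = rnorm(n, mean = 30, sd = 5))
wide$Y2 <- 0.8 * wide$Y1 + rnorm(n, mean = 0, sd = 3)

# Make 25 of the second measurements missing.
wide$Y2[sample(n, 25)] <- NA

# Single-level multiple imputation: each occasion is just another column.
imp <- mice(wide, m = 5, seed = 123, printFlag = FALSE)

# First completed dataset, and a pooled analysis over all five.
completed <- mice::complete(imp, 1)
fit <- with(imp, lm(Y2 ~ Y1 + age))
pooled <- pool(fit)
print(summary(pooled))
```

An identifier column is left out of the data on purpose, since mice() would otherwise use it as a predictor by default.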

Multiple imputation of longitudinal data in the long format can be done by multilevel imputation techniques. See Chapter 7 for an overview. Section 11.3 discusses multiple imputation in the long format. The application defines a common time raster for all persons. Multiple imputations are drawn for each raster point. The resulting imputed datasets can be converted to, and analyzed in, the wide format if desired. This approach is a more principled way to deal with the information loss problem discussed previously. The procedure aligns times to a common raster, hence the name time raster imputation (cf. Section 11.3).
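
The data-preparation step behind a common time raster can be sketched in base R as follows. This is only an illustration under assumptions: the visit times, outcome values, raster points, and the names raster, grid, and augmented are hypothetical, and the multilevel imputation step itself is not shown.

```r
# Long data with irregular visit times (hypothetical values).
long <- data.frame(
  id   = c(1, 1, 2, 2, 2),
  time = c(0.9, 2.2, 1.1, 1.8, 3.1),
  Y    = c(28, 22, 34, 16, 19)
)

# Common time raster shared by all persons (an assumption for this example).
raster <- c(1, 2, 3)

# One extra record per person and raster point, with the outcome missing.
grid <- expand.grid(id = unique(long$id), time = raster)
grid$Y <- NA_real_
grid$raster <- TRUE      # flags records that sit on the raster
long$raster <- FALSE     # flags the originally observed records

# Append the raster records and sort by person and time.
augmented <- rbind(long, grid)
augmented <- augmented[order(augmented$id, augmented$time), ]
rownames(augmented) <- NULL
print(augmented)
```

After the missing outcomes at the raster points have been imputed, the records with raster equal to TRUE form a regular dataset that can be converted to the wide format without creating sparse columns.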