R setup

To get started with growthcleanr, the growthcleanr package must be installed. To install the latest growthcleanr release from CRAN:

install.packages("growthcleanr")

To install the latest development version from GitHub using devtools:

devtools::install_github("carriedaymont/growthcleanr", ref="main")

Installing growthcleanr will install several additional packages in turn.

Further installation details and notes can be found under Installation.

Data preparation

Once the growthcleanr package and its dependencies are installed, the minimal input data vectors needed for cleaning data are:

  1. subjid - a unique identifier for each subject; if you have long (64-bit) integer subjid values, add the bit64 R package to your install list to ensure R will process these correctly
  2. param - a string designating the measurement being ‘HEIGHTCM’, ‘LENGTHCM’, or ‘WEIGHTKG
  3. agedays - a numeric value for age in days at time of measurement; will be rounded down (floor) to nearest integer if needed (note that there are 30.4375 days in one month; see also the additional notes on calculating agemos from whole numbers of months
  4. sex - a numeric value of 0 (male) or 1 (female)
  5. measurement - numeric value of measurement in cm for height or kg for weight

Note: the dataset should be sorted by subjid, param, and agedays, as demonstrated in the Example below.

Example

Starting with a dataset such as the following:

subjid param agedays sex measurement
1 HEIGHTCM 2790 0 118.5
1 HEIGHTCM 3677 0 148.22
1 WEIGHTKG 2790 0 23.8
1 WEIGHTKG 3677 0 38.41
2 HEIGHTCM 2112 1 118.2
2 HEIGHTCM 2410 1 123.9
2 HEIGHTCM 2708 1 128.03
2 HEIGHTCM 3029 1 135.3
2 WEIGHTKG 2112 1 25.84
2 WEIGHTKG 2410 1 28.61
2 WEIGHTKG 2708 1 28.61
2 WEIGHTKG 3029 1 34.5

Note that in this example, we see data from two subjects, sorted as described above, with two pairs of height/weight measurements for the first patient and four pairs for the second. Most of these values pass the “eyeball test”, but the third weight measurement for subjid 2 looks like it might be a value that has been carried forward from the previous value, given that their same ageday height has changed by several centimeters but their third height value is exactly the same as their second. This is an example of an implausible value and exclusion type that growthcleanr might identify.

The cleangrowth() function provided by growthcleanr will return a vector that specifies, for each measurement, whether to:

  • Include” the measurement,
  • Conclude that the measurement is “Missing”,
  • Or to “Exclude” the measurement (numerous variations of Exclude exist, as specified in Understanding growthcleanr output).

For a data.frame object source_data containing growth data:

library(growthcleanr)

# prepare data as a data.table
data <- as.data.table(source_data)

# set the data.table key for better indexing
setkey(data, subjid, param, agedays)

# generate new exclusion flag field using function
cleaned_data <- data[, gcr_result := cleangrowth(subjid, param, agedays, sex, measurement)]

# extract data limited only to values flagged for inclusion:
only_included_data <- cleaned_data[gcr_result == "Include"]

If our example dataset above were named source_data, examining cleaned_data would show:

subjid param agedays sex measurement gcr_result
1 HEIGHTCM 2790 0 118.5 Include
1 HEIGHTCM 3677 0 148.22 Include
1 WEIGHTKG 2790 0 23.8 Include
1 WEIGHTKG 3677 0 38.41 Include
2 HEIGHTCM 2112 1 118.2 Include
2 HEIGHTCM 2410 1 123.9 Include
2 HEIGHTCM 2708 1 128.03 Include
2 HEIGHTCM 3029 1 135.3 Include
2 WEIGHTKG 2112 1 25.84 Include
2 WEIGHTKG 2410 1 28.61 Include
2 WEIGHTKG 2708 1 28.61 Exclude-Carried-Forward
2 WEIGHTKG 3029 1 34.5 Include

In this sample output, we see that growthcleanr has indeed flagged the third weight measurement for subjid 2 as an apparent carried forward value. The measurement remains in the dataset, and it is up to the researcher to determine which exclusions to accept directly or investigate further.

This example uses the default configuration options for cleangrowth(). Longer examples demonstrating some common options are in Usage, and full details about all options are documented in Configuration.

To compute percentiles and Z-scores using the included CDC macro, continue with these next steps:

# Use the built-in utility function to convert the input observations to wide
# format for BMI calculation
cleaned_data_wide <- longwide(cleaned_data)

# Compute simple BMI
cleaned_data_bmi <- simple_bmi(cleaned_data_wide)

# Compute Z-scores and percentiles
cleaned_data_bmiz <- ext_bmiz(cleaned_data_bmi)

The longwide() function defaults to converting only records included by cleangrowth(), with options to change this default, and ext_bmiz() will return the wide version of the cleaned data with many metrics added. Both of these functions are described in Computing BMI percentiles and Z-scores.