With the growthcleanr package installed as described in Installation, load it with library(growthcleanr).

Basic operations using example synthetic data

For convenience, growthcleanr ships with an example synthetic dataset created using Synthea, with support for simulated growth measurement errors based on the protocol included as supplementary material to the Daymont et al. paper. This dataset is automatically loaded with the growthcleanr library, and is called syngrowth.

dim(syngrowth)
## [1] 77721     6
head(syngrowth)
##   id                               subjid    param agedays sex measurement
## 1  1 001aa16d-bf0e-a077-3b3d-5ab8b58545ad HEIGHTCM    1435   1       102.6
## 2  2 001aa16d-bf0e-a077-3b3d-5ab8b58545ad WEIGHTKG    1435   1        16.6
## 3  3 001aa16d-bf0e-a077-3b3d-5ab8b58545ad HEIGHTCM    1806   1       102.6
## 4  4 001aa16d-bf0e-a077-3b3d-5ab8b58545ad WEIGHTKG    1806   1        16.6
## 5  5 001aa16d-bf0e-a077-3b3d-5ab8b58545ad WEIGHTKG    1806   1        19.6
## 6  6 001aa16d-bf0e-a077-3b3d-5ab8b58545ad HEIGHTCM    2177   1       117.0

It can be used as in the following example. Note that processing the example data with cleangrowth() will likely take a few minutes to complete.

library(data.table)
library(dplyr)

# Convert the `syngrowth` data frame to a `data.table`
data <- as.data.table(syngrowth)

# `setkey()` creates an efficient sorting key on the `data.table`; this is required
# for `cleangrowth()`
setkey(data, subjid, param, agedays)

# Add a column `gcr_result` using `cleangrowth`
cleaned_data <- data[, gcr_result := cleangrowth(subjid, param, agedays, sex, measurement)]

# View a sample of the results
head(cleaned_data)
      id                               subjid sex  agedays    param measurement                  gcr_result
1: 83330 002986c5-354d-bb9d-c180-4ce26813ca28   1 20489.22 HEIGHTCM       151.1                     Include
2: 83332 002986c5-354d-bb9d-c180-4ce26813ca28   1 20860.22 HEIGHTCM       151.1                     Include
3: 83334 002986c5-354d-bb9d-c180-4ce26813ca28   1 20860.22 HEIGHTCM       150.6 Exclude-Same-Day-Extraneous
4: 83335 002986c5-354d-bb9d-c180-4ce26813ca28   1 21231.22 HEIGHTCM       151.1                     Include
5: 83337 002986c5-354d-bb9d-c180-4ce26813ca28   1 21602.22 HEIGHTCM       151.1                     Include
6: 83339 002986c5-354d-bb9d-c180-4ce26813ca28   1 21623.22 HEIGHTCM       151.1                     Include

# Summarize results by result type
cleaned_data %>% group_by(gcr_result) %>% tally(sort=TRUE)
# A tibble: 26 x 2
   gcr_result                      n
   <fct>                       <int>
 1 Include                     61652
 2 Exclude-Extraneous-Same-Day 11263
 3 Exclude-Carried-Forward      7093
 4 Exclude-Same-Day-Extraneous  4010
 5 Exclude-Same-Day-Identical    623
 6 Exclude-SD-Cutoff             175
 7 Exclude-EWMA-8                139
 8 Exclude-Distinct-3-Or-More    125
 9 Exclude-BIV                   108
10 Exclude-EWMA-Extreme           99
# … with 16 more rows

If you are able to run these steps and see a similar result, you have the growthcleanr package installed correctly. The resulting cleaned_data can be reviewed, subsetted, and compared in more detail using all the tools R provides.
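As a minimal sketch of that kind of review, the snippet below builds a small hypothetical table shaped like cleaned_data and keeps only the measurements marked Include. The toy object and its rows are invented for illustration and are not real cleangrowth() output.

```r
library(data.table)

# Hypothetical stand-in for `cleaned_data`; these rows are invented for illustration
toy <- data.table(
  subjid      = c("a", "a", "a", "b", "b"),
  param       = c("HEIGHTCM", "HEIGHTCM", "WEIGHTKG", "WEIGHTKG", "WEIGHTKG"),
  agedays     = c(1435, 1806, 1435, 2177, 2177),
  measurement = c(102.6, 102.6, 16.6, 19.6, 19.9),
  gcr_result  = factor(c("Include", "Exclude-Carried-Forward", "Include",
                         "Include", "Exclude-Same-Day-Extraneous"))
)

# Keep only the rows marked for inclusion
included <- toy[gcr_result == "Include"]

# Count included and excluded measurements per parameter
counts <- toy[, .(n = .N), by = .(param, is_included = gcr_result == "Include")]
print(counts)
```

The same two expressions should work on the real cleaned_data, since they rely only on the param and gcr_result columns shown above.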

Basic configuration options

For complete information about the options that can be set on the cleangrowth() function, see Configuration. A few examples follow.

This example shows three configuration options in use:

cleaned_data <- data[,
  gcr_result_both := cleangrowth(
    subjid, param, agedays, sex, measurement,
    lt3.exclude.mode = "flag.both",
    ref.data.path = "inst/extdata/",
    quietly = FALSE
  )
]
  • lt3.exclude.mode = "flag.both" excludes both measurements when a subject has only two unexcluded measurements of one parameter type, at least one of them is implausible, and there are no same age-day measurements of the other parameter.

  • ref.data.path = "inst/extdata/" shouldn’t be necessary if you have the growthcleanr package installed, but if you are running growthcleanr directly from its source you may need to specify the full path to this directory.

  • quietly = FALSE enables verbose output, marking the progress of the algorithm through its many processing steps. This can be very helpful while testing.
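If you do need to set ref.data.path, one way to locate the reference data directory of an installed copy is base R's system.file(). This is only a sketch: it assumes the installed package exposes its reference files under extdata (R installs the contents of a package's inst/ directory at the package root), and system.file() returns an empty string when the package or directory is not found.

```r
# Look up the `extdata` directory of an installed growthcleanr, if any.
# Assumption: the installed package keeps its reference data under `extdata`.
ref_path <- system.file("extdata", package = "growthcleanr")

if (nzchar(ref_path)) {
  message("Reference data found at: ", ref_path)
} else {
  message("growthcleanr reference data not found; install the package or pass the source path.")
}
```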

This example shows built-in options for processing data in parallel batches, which can speed the process while working with large data sets:

cleaned_data <- data[,
  gcr_result_both := cleangrowth(
    subjid, param, agedays, sex, measurement,
    parallel = TRUE,
    num.batches = 4,
    log.path = "logs"
  )
]
  • parallel tells cleangrowth() to run in parallel; the default is FALSE

  • num.batches specifies how many batches to process in parallel

The best num.batches value for your environment may vary depending on the computing resources you have available. If you do not specify num.batches, growthcleanr will estimate a batch count based on R functions for checking the system hardware. If you are working with large datasets, you may need to experiment with these options to determine the best settings for your needs. You may also find it helpful to review the additional notes on Working with large datasets.

The default value of log.path is ".", the current working directory. growthcleanr will write batch-specific log files to the log.path directory, and will create the directory if necessary.
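Before choosing num.batches, it can help to check how many cores R can see. The sketch below uses detectCores() from R's bundled parallel package. Leaving one core free is a common rule of thumb rather than a growthcleanr recommendation, and this is not necessarily how growthcleanr computes its own default.

```r
library(parallel)

# Number of cores R can detect; may be NA on some platforms
cores <- detectCores()

# Rule of thumb (an assumption, not package guidance): leave one core free,
# and fall back to a single batch when the count is unknown
batches <- if (is.na(cores)) 1L else max(1L, cores - 1L)
print(batches)
```

The resulting value could then be passed as num.batches = batches together with parallel = TRUE.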

Note that if you run growthcleanr in parallel with multiple batches and see warnings such as the following, they can be ignored.

Warning messages:
1: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
2: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’

These warnings come from one of growthcleanr’s dependencies and do not indicate failure or improper execution.

This final example demonstrates two options specific to adult subjects, adult_cutpoint and weight_cap.

cleaned_data <- data[,
  gcr_result_both := cleangrowth(
    subjid, param, agedays, sex, measurement,
    adult_cutpoint = 18,
    weight_cap = 181.4
  )
]
  • adult_cutpoint defines a subject age in years at which to cut growthcleanr analysis into separate sets for pediatric and adult subjects. The default is 20 years, and this may be lowered to 18, as shown above.

  • weight_cap specifies a known hard limit clamp for weight values. Sometimes data shared for research may have measurement values clamped to hard limits for privacy protection. The growthcleanr adult algorithm can be configured to watch for a known weight cap and adjust its assessment accordingly.
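Two unit details are worth checking when setting these options: adult_cutpoint is given in years while the agedays column is in days, and a weight cap should be expressed in the same unit as the WEIGHTKG measurements. The arithmetic below is a sketch; the 365.25 days-per-year factor and the idea that a dataset's cap was originally stated as 400 lb are assumptions made for illustration.

```r
# Approximate age in days at an 18-year cutpoint (assumes 365.25 days per year)
cutpoint_years <- 18
cutpoint_days <- cutpoint_years * 365.25
print(cutpoint_days)

# Convert a hypothetical 400 lb cap to kilograms (1 lb = 0.45359237 kg)
cap_lb <- 400
cap_kg <- round(cap_lb * 0.45359237, 1)
print(cap_kg)
```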

The full algorithm for assessing adult measurements is described in detail in Adult algorithm.