growthcleanr package installed as described in
Installation, it can be loaded as a
growthcleanr ships with an example
synthetic dataset created using Synthea, with
support for simulated growth measurement errors based on the protocol
included as supplementary material to the Daymont et
al. paper. This dataset is automatically loaded with the
growthcleanr library, and is called
##  77721 6
## id subjid param agedays sex measurement ## 1 1 001aa16d-bf0e-a077-3b3d-5ab8b58545ad HEIGHTCM 1435 1 102.6 ## 2 2 001aa16d-bf0e-a077-3b3d-5ab8b58545ad WEIGHTKG 1435 1 16.6 ## 3 3 001aa16d-bf0e-a077-3b3d-5ab8b58545ad HEIGHTCM 1806 1 102.6 ## 4 4 001aa16d-bf0e-a077-3b3d-5ab8b58545ad WEIGHTKG 1806 1 16.6 ## 5 5 001aa16d-bf0e-a077-3b3d-5ab8b58545ad WEIGHTKG 1806 1 19.6 ## 6 6 001aa16d-bf0e-a077-3b3d-5ab8b58545ad HEIGHTCM 2177 1 117.0
It can be used as in the following example. Note that processing the
example data with
cleangrowth() will likely take a few
minutes to complete.
library(data.table) library(dplyr) # Convert the `syngrowth` data frame to a `data.table` <- as.data.table(syngrowth) data # `setkey()` creates an efficient sorting key on the `data.table`; this is required # for `cleangrowth()` setkey(data, subjid, param, agedays) # Add a column `gcr_result` using `cleangrowth` <- data[, gcr_result := cleangrowth(subjid, param, agedays, sex, measurement)] cleaned_data # View a sample of the results head(cleaned_data) id subjid sex agedays param measurement gcr_result1: 83330 002986c5-354d-bb9d-c180-4ce26813ca28 1 20489.22 HEIGHTCM 151.1 Include 2: 83332 002986c5-354d-bb9d-c180-4ce26813ca28 1 20860.22 HEIGHTCM 151.1 Include 3: 83334 002986c5-354d-bb9d-c180-4ce26813ca28 1 20860.22 HEIGHTCM 150.6 Exclude-Same-Day-Extraneous 4: 83335 002986c5-354d-bb9d-c180-4ce26813ca28 1 21231.22 HEIGHTCM 151.1 Include 5: 83337 002986c5-354d-bb9d-c180-4ce26813ca28 1 21602.22 HEIGHTCM 151.1 Include 6: 83339 002986c5-354d-bb9d-c180-4ce26813ca28 1 21623.22 HEIGHTCM 151.1 Include # Summarize results by result type %>% group_by(gcr_result) %>% tally(sort=TRUE) cleaned_data # A tibble: 26 x 2 gcr_result n<fct> <int> 1 Include 61652 2 Exclude-Extraneous-Same-Day 11263 3 Exclude-Carried-Forward 7093 4 Exclude-Same-Day-Extraneous 4010 5 Exclude-Same-Day-Identical 623 6 Exclude-SD-Cutoff 175 7 Exclude-EWMA-8 139 8 Exclude-Distinct-3-Or-More 125 9 Exclude-BIV 108 10 Exclude-EWMA-Extreme 99 # … with 16 more rows
If you are able to run these steps and see a similar result, you have
growthcleanr package installed correctly. The resulting
cleaned_data can be reviewed, subsetted, and compared in
more detail using all the tools R provides.
This example shows three configuration options in use:
cleaned_data <- data[, gcr_result_both := cleangrowth( subjid, param, agedays, sex, measurement, lt3.exclude.mode = "flag.both", ref.data.path = "inst/extdata/", quietly = FALSE ) ]
lt3.exclude.mode = "flag.both" will exclude both
measurements for a subject if they only have two unexcluded measurements
of one parameter type with at least one implausible value and no same
age-day measurements of the other parameter.
ref.data.path = "inst/extdata" shouldn’t be
necessary if you have the
growthcleanr package installed,
but if you are running it from its source directly you may need to
specify its full path.
quietly = F enables verbose output, marking the
progress of the algorithm through its many processing steps. This can be
very helpful while testing.
This example shows built-in options for processing data in parallel batches, which can speed the process while working with large data sets:
cleaned_data <- data[, gcr_result_both := cleangrowth( subjid, param, agedays, sex, measurement, parallel = TRUE, num.batches = 4, log.path = "logs" ) ]
cleangrowth() to run in
parallel; the default is
num.batches specifies how many batches to process in
num.batches value for your environment may vary
depending on the computing resources you have available. If you do not
estimate a batch count based on R functions for checking the system
hardware. If you are working with large datasets, you may need to
experiment with these options to determine the best settings for your
needs. You may also find it helpful to review the additional notes on Working with large datasets.
The default value of
current working directory.
growthcleanr will write
batch-specific log files to the
log.path directory, and
will create the directory if necessary.
Note that if you run
growthcleanr in parallel with
multiple batches and see warning errors such as the following, they can
: Warning messages1: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’ 2: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
This set of warnings is coming from one of
growthcleanr’s dependencies, and does not indicate either
failure or improper execution.
This final example demonstrates using two option specific to adult
cleaned_data <- data[, gcr_result_both := cleangrowth( subjid, param, agedays, sex, measurement, adult_cutpoint = 18, weight_cap = 181.4 ) ]
adult_cutpoint defines a subject age in years at
which to cut
growthcleanr analysis into separate sets for
pediatric and adult subjects. The default is 20 years, and this may be
lowered to 18, as shown above.
weight_cap specifies a known hard limit clamp for
weight values. sometimes data shared for research may have measurement
values clamped to hard limits for privacy protection. The growthcleanr
adult algorithm can be configured to watch for a known weight cap and
adjust its assessment accordingly.
The full algorithm for assessing adult measurements is described in detail in Adult algorithm.