growthcleanr offers several options for configuration, with default values set to address common cases. You may wish to experiment with the settings to discover which work best for your research and your dataset.

All of the following options may be set as additional parameters in the call to cleangrowth().

The following options change the behavior of the growthcleanr algorithm.

  • recover.unit.error - default FALSE; when FALSE, measurements identified as unit errors (e.g., apparent height values in inches instead of centimeters) will be flagged but not corrected, when TRUE these values will be corrected and included as valid measurements for cleaning.

  • sd.extreme - default 25; a very extreme value check on modified (recentered) Z-scores used as a first-pass elimination of clearly implausible values, often due to misplaced decimals.

  • z.extreme - default 25; similar usage as sd.extreme, for absolute Z-scores.

NOTE: many different steps in the growthcleanr algorithm use highly refined techniques to identify implausible values that require more subtlety to detect. These default values for sd.extreme and z.extreme are set very high by design to eliminate completely implausible values early in the process; these extreme values require none of growthcleanr’s additional, more subtle approaches to be identified as exclusions. If sd.extreme and z.extreme are configured to be lower values, measurements eliminated at the early step which checks against these extreme values will not be further considered using later techniques.

  • include.carryforward - default FALSE; if set to TRUE, growthcleanr will skip algorithm step 9, which identifies carried forward measurements, and will not flag these values for exclusion.

  • ewma.exp - default -1.5; the exponent used for weighting measurements when calculating exponentially weighted moving average (EWMA). This exponent should be negative to weight growth measurements closer to the measurement being evaluated more strongly. Exponents that are further from zero (e.g., -3) will increase the relative influence of measurements close in time to the measurement being evaluated compared to using the default exponent.

  • - defaults to using CDC reference data from year 2000; supply a file path to use alternate reference data. Note that when running from an installed growthcleanr package (e.g. having called library(growthcleanr)), this path does not need to be specified. Developers testing the source code directly from the source directory will need to specify this as well.

  • error.load.mincount - default 2; minimum count of exclusions on parameter for one subject before considering excluding all measurements.

  • error.load.threshold - default 0.5; threshold to exceed for percentage of excluded measurement count relative to count of included other measurement (e.g., if 3 of 5 WTs are excluded, and 5 corresponding HTs are included, this exceeds the 0.5 threshold and the other two WTs will be excluded).

  • lt3.exclude.mode - default default; determines type of exclusion procedure to use for 1 or 2 measurements of one type without matching same ageday measurements for the other parameter. Options include:

    • default - standard growthcleanr approach
    • flag.both - in case of two measurements with at least one beyond thresholds, flag both instead of one (as in default)
  • sd.recenter - default NA; specifies how to recenter medians. May be a data frame or table w/median SD-scores per day of life by gender and parameter, or “nhanes” or “derive” as a character vector.

    • If sd.recenter is specified as a data set, use the data set
    • If sd.recenter is specified as “nhanes”, use NHANES reference medians
    • If sd.recenter is specified as “derive”, derive from input
    • If sd.recenter is not specified or NA:
      • If the input set has at least 5,000 observations, derive medians from input
      • If the input set has fewer than 5,000 observations, use NHANES

    If specifying a data set, columns must include param, sex, agedays, and sd.median (referred to elsewhere as “modified Z-score”), and those medians will be used for centering. This data set must include a row for every ageday present in the dataset to be cleaned; the NHANES reference medians include a row for every ageday in the range (731-7305 days). A summary of how the NHANES reference medians were derived is below under NHANES reference data.

  • adult_cutpoint - default 20; number between 18 and 20, describing age limit in years above which the pediatric algorithm should not apply (< adult_cutpoint), and the adult algorithm should apply (>= adult_cutpoint). Numbers outside this range will be changed to the closest number within the range.

  • weight_cap - default Inf; weight_cap Positive number, describing a weight cap in kg (rounded to the nearest .1, +/- .1) within an adult dataset. This may be used when a dataset shared for research is known to clamp values for privacy reasons. If there is no weight cap, set to Inf (default). This option is not used with pediatric subjects.

Operational options

The following options change the execution of the program overall, with no effect on the algorithm itself.

  • parallel - default FALSE; set to TRUE to run growthcleanr in parallel. Running in parallel will split the input data into batches (while keeping all records for each subject together) and process each batch on a different processor/core to maximize throughput. Recommended for large datasets with more than 100K rows.

  • num.batches - specifies the number of batches to run in parallel; only applies if parallel is set to TRUE. Defaults to the number of workers returned by the getDoParWorkers() function in the foreach package. Note that processing in parallel may affect overall system performance.

  • sdmedian.filename - filename for optionally saving sd.median data calculated on the input dataset to as CSV. Defaults to "", for which this data will not be saved. Use for extracting medians for parallel processing scenarios other than the built-in parallel option. See notes on large data sets for details.

  • sdrecentered.filename - filename to save re-centered data to as CSV. Defaults to ““, for which this data will not be saved. Useful for post-processing and debugging.

  • adult_columns_filename - default ""; Name of file to save original adult data, with additional output columns to as CSV. Defaults to ““, for which this data will not be saved. Useful for post-analysis. For more information on this output, please see README.

  • quietly - default TRUE; when TRUE, displays function messages and will output log files when parallel is TRUE.

  • log.path - default "."; sets directory for batch log file output when processing with parallel = TRUE. A new directory will be created if necessary.

NHANES reference medians

growthcleanr releases up to 1.2.4 offered two options for recentering medians, either the default of deriving medians from the input set, or supplying an externally-defined set of medians. These left out an option for researchers working with either small datasets or with data which might otherwise not be representative of the population, as deriving medians from the input set in those cases might be problematic. To provide a standard default reference to address these latter cases, a set of medians were derived from the National Health and Nutrition Examination Survey (NHANES). A summary of that process is below. As of release 1.2.5, the default behavior is:

  • If sd.recenter is specified as a data set, use the data set
  • If sd.recenter is specified as nhanes, use NHANES
  • If sd.recenter is specified as derive, derive from input
  • If sd.recenter is not specified or NA:
    • If the input set has at least 5,000 observations, derive medians from input
    • If the input set has fewer than 5,000 observations, use NHANES

With the verbose cleangrowth() option quietly = FALSE, the recentering medians approach used will be noted in the output. If the input set has fewer than 100 observations for any age-year, this will also be noted in the output.

Derivation process

The NHANES reference medians are based primarily on data from NHANES 2009-2010 through 2017-2018, including approximately 39,000 heights/lengths and weights from children and adolescents between the ages of 0 months and <240 months. Weight and height SD scores were calculated from the L, M, and S parameters for the CDC growth charts were used as the reference to calculate weight and height SD scores for the NHANES 2009-2010 through 2017-2018 samples. Based on the distributions of age-days in children at 0 months, an age adjustment was made based on the median number of days among these infants. This adjustment was made after consultation with the National Center for Health Statistics confirmed that a general assumption of ages occurring at the midpoint of the indicated integer month of age did not apply to children recorded as 0 months, and uses 0.75 months instead.

Weights were supplemented with a random sample of birthweights from NCHS’s Vital Statistics Natality Birth Data for 2018. These had sample weights assigned so that the sum of the sample weights for the sample equalled the sum of the sample weights for each month for infants in NHANES, as NHANES is a multi-stage complex survey. The reference data was then smoothed using the svysmooth() function in the R survey package to estimate the weight and height SD scores for each day up to 7,305 days, with a bandwidth chosen to balance between over- and under-fitting, and interpolation between the estimates from this function was used to obtain an estimate for each day of age. Predictions from a regression model fit to smoothed height SDs between 23 and 365 days (the youngest child in NHANES had an estimated age in days of 23) were used to extend smoothed height SD scores to children between 1 and 22 days of age.