Clean growth measurements

cleangrowth(
  subjid,
  param,
  agedays,
  sex,
  measurement,
  recover.unit.error = FALSE,
  sd.extreme = 25,
  z.extreme = 25,
  lt3.exclude.mode = "default",
  height.tolerance.cm = 2.5,
  error.load.mincount = 2,
  error.load.threshold = 0.5,
  sd.recenter = NA,
  sdmedian.filename = "",
  sdrecentered.filename = "",
  include.carryforward = FALSE,
  ewma.exp = -1.5,
  ref.data.path = "",
  log.path = NA,
  parallel = FALSE,
  num.batches = NA,
  quietly = TRUE,
  adult_cutpoint = 20,
  weight_cap = Inf,
  adult_columns_filename = "",
  prelim_infants = FALSE
)

Arguments

subjid

Vector of unique identifiers for each subject in the database.

param

Vector identifying the type of each measurement; may be 'WEIGHTKG', 'WEIGHTLBS', 'HEIGHTCM', 'HEIGHTIN', 'LENGTHCM', or 'HEADCM'. The distinction between 'HEIGHTCM'/'HEIGHTIN' and 'LENGTHCM' only affects z-score calculations between ages 24 and 35 months (730 to 1095 days). All linear measurements below 731 days of life (age 0-23 months) are interpreted as supine length, and all linear measurements above 1095 days of life (age 36+ months) are interpreted as standing height. Note: at the moment, all 'LENGTHCM' values are converted to 'HEIGHTCM'. In the future, the algorithm will be updated to consider this difference. Additionally, imperial 'HEIGHTIN' and 'WEIGHTLBS' measurements are converted to metric during algorithm calculations.
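
The imperial-to-metric conversion described above can be sketched in base R. This is only an illustration: to_metric is a hypothetical helper, not a growthcleanr function, and the conversion constants are the standard definitions rather than values confirmed from the package source.

```r
# Hypothetical helper (not part of growthcleanr): convert imperial rows to
# metric and relabel them with the corresponding metric param value.
to_metric <- function(param, measurement) {
  is_in  <- param == "HEIGHTIN"
  is_lbs <- param == "WEIGHTLBS"
  measurement[is_in]  <- measurement[is_in] * 2.54         # inches -> cm
  measurement[is_lbs] <- measurement[is_lbs] * 0.45359237  # pounds -> kg
  param[is_in]  <- "HEIGHTCM"
  param[is_lbs] <- "WEIGHTKG"
  data.frame(param = param, measurement = measurement,
             stringsAsFactors = FALSE)
}

converted <- to_metric(c("HEIGHTIN", "WEIGHTLBS", "HEIGHTCM"),
                       c(40, 44, 101.6))
```

Rows that are already metric pass through unchanged, so the helper can be applied to a mixed vector.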

agedays

Numeric vector containing the age in days at each measurement.

sex

Vector identifying the sex of each subject; may be 'M', 'm', or 0 for males, or 'F', 'f', or 1 for females.
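
If the source data mixes these encodings, it can help to normalize them before calling cleangrowth. The sketch below maps the accepted values to 0/1; normalize_sex is a hypothetical helper, and treating unrecognized values as NA is an assumption of this example, not documented package behavior.

```r
# Hypothetical helper: map 'M'/'m'/0 to 0 and 'F'/'f'/1 to 1.
# Anything else becomes NA (an assumption made for this sketch).
normalize_sex <- function(sex) {
  s <- toupper(as.character(sex))
  out <- rep(NA_integer_, length(s))
  out[s %in% c("M", "0")] <- 0L
  out[s %in% c("F", "1")] <- 1L
  out
}

sex_codes <- normalize_sex(c("M", "f", "0", "1", "x"))
```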

measurement

Numeric vector containing the actual measurement data. Weights are expected in kilograms (kg) and linear measurements (height or length) in centimeters (cm), unless param marks the value as imperial ('WEIGHTLBS' or 'HEIGHTIN').

recover.unit.error

Indicates whether the cleaning algorithm should attempt to identify unit errors (i.e., inches vs. cm, lbs vs. kg). If unit errors are identified, the value will be corrected and retained within the cleaning algorithm as a valid measurement. Defaults to FALSE.

sd.extreme

Measurements more than sd.extreme standard deviations from the mean (either above or below) will be flagged as invalid. Defaults to 25.

z.extreme

Measurements with an absolute z-score greater than z.extreme will be flagged as invalid. Defaults to 25.
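
Both sd.extreme and z.extreme describe a symmetric cutoff on a standardized score. A minimal sketch of that comparison, using made-up scores rather than values computed by the package:

```r
# Hypothetical standardized scores for five measurements
scores <- c(-30, -2.1, 0.4, 24.9, 26)
z.extreme <- 25

# TRUE where the score lies beyond the cutoff in either direction
flagged <- abs(scores) > z.extreme
```

Only the first and last values exceed the default cutoff of 25 in absolute value.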

lt3.exclude.mode

Determines the exclusion procedure to use when there are only 1 or 2 measurements of one type without matching same-ageday measurements for the other parameter. Options are "default" (the standard growthcleanr approach) and "flag.both" (when there are two measurements of one type without matching values for the other parameter, flag both for exclusion if they are beyond the threshold). Defaults to "default".

height.tolerance.cm

Maximum decrease in height (in cm) tolerated between sequential measurements. Defaults to 2.5.

error.load.mincount

Minimum count of exclusions on a parameter before considering excluding all of its measurements. Defaults to 2.

error.load.threshold

Threshold for the ratio of excluded measurement count to included measurement count that must be exceeded before excluding all measurements of either parameter. Defaults to 0.5.
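
The interaction between error.load.mincount and error.load.threshold can be sketched as follows. This is a simplified reading of the two descriptions, not the package's implementation: too_many_errors is a hypothetical helper, and counting every non-'Include' code as an exclusion is an assumption of the sketch.

```r
# Hypothetical helper: decide whether one parameter's measurements for a
# subject carry too many errors. Assumes every code other than "Include"
# counts as an exclusion, which may be broader than the real algorithm.
too_many_errors <- function(codes, mincount = 2, threshold = 0.5) {
  n_excluded <- sum(codes != "Include")
  n_included <- sum(codes == "Include")
  if (n_excluded < mincount) {
    return(FALSE)  # too few exclusions to consider dropping everything
  }
  if (n_included == 0) {
    return(TRUE)   # nothing left to keep
  }
  (n_excluded / n_included) > threshold
}

one_error   <- too_many_errors(c("Include", "Include", "Include",
                                 "Exclude-SD-Cutoff"))
heavy_load  <- too_many_errors(c("Include", "Include",
                                 "Exclude-SD-Cutoff", "Exclude-EWMA-8"))
light_load  <- too_many_errors(c(rep("Include", 5),
                                 "Exclude-SD-Cutoff", "Exclude-EWMA-8"))
```

With the defaults, a single exclusion never triggers the rule, two exclusions against two inclusions do (ratio 1), and two exclusions against five inclusions do not (ratio 0.4).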

sd.recenter

Specifies how to recenter medians. May be a data frame or table with median SD-scores per day of life by sex and parameter, or "NHANES" or "derive" as a character vector.

If sd.recenter is specified as a data set, that data set is used.
If sd.recenter is specified as "NHANES", the NHANES reference medians are used.
If sd.recenter is specified as "derive", the medians are derived from the input.
If sd.recenter is not specified or is NA:
- If the input set has at least 5,000 observations, the medians are derived from the input.
- If the input set has fewer than 5,000 observations, the NHANES reference medians are used.

If specifying a data set, columns must include param, sex, agedays, and sd.median (referred to elsewhere as "modified Z-score"), and those medians will be used for recentering. A summary of how the NHANES reference medians were derived is available in README.md. Defaults to NA.
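
The selection rules above can be summarized as a small decision function. This is a sketch of the documented behavior only: choose_recentering is a hypothetical name, and matching the character options case-insensitively is an assumption of the example.

```r
# Hypothetical helper mirroring the documented sd.recenter rules.
# Returns a label describing which medians would be used.
choose_recentering <- function(sd.recenter = NA, n_obs) {
  if (is.data.frame(sd.recenter)) {
    required <- c("param", "sex", "agedays", "sd.median")
    missing_cols <- setdiff(required, names(sd.recenter))
    if (length(missing_cols) > 0) {
      stop("sd.recenter is missing columns: ",
           paste(missing_cols, collapse = ", "))
    }
    return("supplied")
  }
  if (is.character(sd.recenter)) {
    key <- tolower(sd.recenter)  # assumption: case-insensitive matching
    if (key %in% c("nhanes", "derive")) {
      return(key)
    }
    stop("unrecognized sd.recenter value: ", sd.recenter)
  }
  # Not specified or NA: fall back on the size of the input
  if (n_obs >= 5000) "derive" else "nhanes"
}

large_input <- choose_recentering(NA, n_obs = 5000)
small_input <- choose_recentering(NA, n_obs = 4999)
forced      <- choose_recentering("NHANES", n_obs = 10)
```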

sdmedian.filename

Name of the CSV file in which to save the sd.median data calculated on the input dataset. Defaults to "", in which case this data will not be saved. Use this to extract medians for parallel processing scenarios other than the built-in parallel option.

sdrecentered.filename

Name of the CSV file in which to save the re-centered data. Defaults to "", in which case this data will not be saved. Useful for post-processing and debugging.

include.carryforward

Determines whether carried-forward values are kept in the output. Defaults to FALSE.

ewma.exp

Exponent to use for weighting measurements in the exponentially weighted moving average calculations. Defaults to -1.5. This exponent should be negative in order to weight growth measurements closer to the measurement being evaluated more strongly. Exponents that are further from zero (e.g. -3) will increase the relative influence of measurements close in time to the measurement being evaluated compared to using the default exponent.
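
To illustrate the effect of the exponent, the sketch below computes normalized weights for the neighbors of one measurement. It assumes weights proportional to (age gap in days + offset)^ewma.exp. The ewma_weights helper and the offset of 5 days are illustrative assumptions, not the package's confirmed formula.

```r
# Hypothetical helper: weights that the other measurements of a subject
# would receive when evaluating measurement i.
ewma_weights <- function(agedays, i, ewma.exp = -1.5, offset = 5) {
  gap <- abs(agedays - agedays[i])
  w <- (gap + offset)^ewma.exp
  w[i] <- 0      # the evaluated measurement does not weight itself
  w / sum(w)     # normalize so the weights sum to 1
}

agedays   <- c(100, 130, 400, 800)
w_default <- ewma_weights(agedays, i = 2)                 # ewma.exp = -1.5
w_steep   <- ewma_weights(agedays, i = 2, ewma.exp = -3)
```

The measurement at day 100 is closest to the one at day 130, and its share of the total weight is larger with the exponent -3 than with the default -1.5.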

ref.data.path

Path to reference data. If not supplied, the year 2000 Centers for Disease Control and Prevention (CDC) reference data will be used.

log.path

Path to log file output when running in parallel (non-quiet mode). Default is NA. A new directory will be created if necessary. Set to NA to disable log files.

parallel

Determines if function runs in parallel. Defaults to FALSE.

num.batches

Specify the number of batches to run in parallel. Only applies if parallel is set to TRUE. Defaults to the number of workers returned by the getDoParWorkers function in the foreach package.

quietly

Determines whether function messages are displayed and whether log files (parallel only) are generated. Defaults to TRUE.

adult_cutpoint

Number between 18 and 20, giving the age (in years) that separates the two algorithms: the pediatric algorithm is applied to ages below the cutpoint (< adult_cutpoint), and the adult algorithm to ages at or above it (>= adult_cutpoint). Numbers outside this range will be changed to the closest number within the range. Defaults to 20.
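
A minimal sketch of the clamping and the resulting split between algorithms. clamp_cutpoint is a hypothetical helper, and converting agedays to years with 365.25 days per year is an assumption made for the example.

```r
# Hypothetical helper: move out-of-range cutpoints to the nearest bound.
clamp_cutpoint <- function(adult_cutpoint) {
  min(max(adult_cutpoint, 18), 20)
}

cutpoint <- clamp_cutpoint(25)  # out of range, so it becomes 20

# Assumption: 365.25 days per year when converting agedays to years
age_years <- c(6570, 7300, 7400) / 365.25
algorithm <- ifelse(age_years < cutpoint, "pediatric", "adult")
```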

weight_cap

Positive number, describing a weight cap in kg (rounded to the nearest .1, +/- .1) within the adult dataset. If there is no weight cap, set to Inf. Defaults to Inf.

adult_columns_filename

Name of the CSV file in which to save the original adult data together with additional output columns. Defaults to "", in which case this data will not be saved. Useful for post-analysis. For more information on this output, please see README.

prelim_infants

TRUE/FALSE. Run the in-development release of the infants algorithm (expands pediatric algorithm to improve performance for children 0 – 2 years). Not recommended for use in research. For more information regarding the logic of the algorithm, see the vignette 'Preliminary Infants Algorithm.' Defaults to FALSE.

Value

Vector of exclusion codes for each of the input measurements.

Possible values for each code are:

'Include', 'Unit-Error-High', 'Unit-Error-Low', 'Swapped-Measurements', 'Missing',
'Exclude-Carried-Forward', 'Exclude-SD-Cutoff', 'Exclude-EWMA-Extreme', 'Exclude-EWMA-Extreme-Pair',
'Exclude-Extraneous-Same-Day',
'Exclude-EWMA-8', 'Exclude-EWMA-9', 'Exclude-EWMA-10', 'Exclude-EWMA-11', 'Exclude-EWMA-12', 'Exclude-EWMA-13', 'Exclude-EWMA-14',
'Exclude-Min-Height-Change', 'Exclude-Max-Height-Change',
'Exclude-Pair-Delta-17', 'Exclude-Pair-Delta-18', 'Exclude-Pair-Delta-19',
'Exclude-Single-Outlier', 'Exclude-Too-Many-Errors', 'Exclude-Too-Many-Errors-Other-Parameter'
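
Because the result is one code per input measurement, it can be tabulated or turned into a logical filter directly. The vector below is a made-up stand-in for real cleangrowth output, used only to show the pattern.

```r
# Made-up result vector standing in for the output of cleangrowth()
clean_result <- factor(c("Include", "Include", "Exclude-Carried-Forward",
                         "Exclude-SD-Cutoff", "Include"))

code_counts <- table(clean_result)       # number of measurements per code
keep        <- clean_result == "Include" # TRUE for retained measurements
n_kept      <- sum(keep)
```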

Examples

# \donttest{
# Run calculation using a small subset of given data
df_stats <- as.data.frame(syngrowth)
df_stats <- df_stats[df_stats$subjid %in% unique(df_stats[, "subjid"])[1:5], ]

clean_stats <- cleangrowth(subjid = df_stats$subjid,
                           param = df_stats$param,
                           agedays = df_stats$agedays,
                           sex = df_stats$sex,
                           measurement = df_stats$measurement)

# Once processed you can filter data based on result value
df_stats <- cbind(df_stats, "clean_result" = clean_stats)
clean_df_stats <- df_stats[df_stats$clean_result == "Include", ]

# Parallel processing: run using 2 cores and batches
clean_stats <- cleangrowth(subjid = df_stats$subjid,
                           param = df_stats$param,
                           agedays = df_stats$agedays,
                           sex = df_stats$sex,
                           measurement = df_stats$measurement,
                           parallel = TRUE,
                           num.batches = 2)
#> Warning: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
# }