Adult algorithm

There will be notes in italics throughout the introductory material discussing similarities and differences between growthcleanr-adults and growthcleanr-pediatrics for those who are familiar with growthcleanr-pediatrics and are interested. The nonitalicized text of the narrative does not assume any prior knowledge of growthcleanr-pediatrics.

Goal

The overall goal of growthcleanr-adults is to prepare datasets of height (HT) and weight (WT) data for adults (18y to 65y) for analysis by identifying and excluding HT and WT values that are implausible for an individual in the dataset. For many researchers and analysts who are focused on people with atypical anthropometrics, including severe obesity or anorexia nervosa, excluding values that are implausible for an individual rather than values that are atypical for a population is particularly important.

Growthcleanr-adults is designed to be as flexible as possible. In addition to data on HT and/or WT and a subject identifier, growthcleanr-adults requires only a variable for age to run, making it feasible to run on datasets from many sources. To date, it has been designed and tested in outpatient data. It is designed to work in people with many conditions that can affect anthropometrics, and modifications can likely be made to accommodate additional conditions that users identify. It can work on datasets of almost any size, from a dataset with one subject with one observation to datasets with hundreds of millions of observations. Subjects are evaluated completely independently from each other. HT and WT measurements for each subject are also evaluated separately from each other, with two small exceptions that are detailed in the description of each step.

Growthcleanr-adults does not require sex, which is required for growthcleanr-pediatrics to determine standard deviation scores. There is no recentering step as there is in growthcleanr-pediatrics. Also, except for the swap step and the 1 distinct value step (similar to ‘singles’) in which BMI is generated, both of which can be bypassed, there is no use of the other parameter as there is in some steps of growthcleanr-pediatrics.

Key challenges and overall strategy

It is not too difficult to evaluate a subject’s HT or WT values and determine that at least one of them is likely implausible. It is much harder to confirm that there is an implausible measurement and especially to determine which measurement is the problem. Also, once there is one implausible measurement for a subject, there is an increased risk of finding an implausible measurement among the remaining values. Once there are two or more implausible measurements for a subject for HT or WT, it takes a lot of work to identify correctly which measurements are the problem.

To deal with the above challenges, growthcleanr-adults has multiple steps. In general, the steps exclude more extreme errors earlier and less extreme errors later, and rarely are multiple errors for a subject excluded in one step. Additionally, much of each step is devoted to checks to confirm that any value should be excluded and determine which value that should be.

These general principles are the same as for growthcleanr-pediatrics.

Overall, it was expected that WT in adults could remain stable or could have multiple gain/loss cycles within a subject. Expected changes included shorter term changes (fluid shifts, WT loss after pregnancy, short term WT loss/gain such as during/after illness) as well as longer term changes (intentional WT loss with or without surgery, WT gain with or without prior WT loss, slower gain during pregnancy). Although we do not have data on pregnancy or other diagnoses, it is reassuring that the proportion of excluded WTs is very similar for males and females, making it less likely that WT changes common in pregnancy are being excluded by growthcleanr.

For HT, three overall patterns were allowed in adults of this age, with an expectation of one inch of measurement error in either direction. The first pattern was stable HT. The second pattern was up to two episodes of HT loss without evidence of subsequent HTs inconsistent with the proposed HT loss pattern. The third pattern was HT gain for people whose first measurement occurred before age 25 years (although most gain in HT was allowed during the first years after 18) and who demonstrated one or more periods of HT gain without evidence of subsequent HTs inconsistent with the proposed HT gain pattern.

The expected patterns of WT and HT are two of the biggest differences with growthcleanr-pediatrics. Although growthcleanr-pediatrics accounted for weight loss, it was not written into the code the way it is for growthcleanr-adults, and growthcleanr-pediatrics would likely exclude some WTs in people with gains and losses that growthcleanr-adults can handle. Also, growthcleanr-pediatrics could not handle a HT loss of >3 cm (and that 3 cm was meant to cover measurement error, not actual HT loss).

In the primary dataset used to derive growthcleanr-adults, many values are repeated for both HT and WT. To a certain extent, this is expected because HT especially is generally fairly stable in adults in this age range, and for some adults, especially over periods of a couple of years, WT is fairly stable. However, other values show fairly clear evidence of being pulled forward from a prior visit (e.g., subject Z with a WT pattern of 61, 63, 67, 72, 61, 72, 74, 61, 81, 61, 92). These have to be addressed specifically because otherwise they can distort the findings for other measurements; for example, you can imagine that for subject Z the WT=81 might be identified as implausible, even though on inspection it is far more likely that the surrounding 61 kg WTs did not represent new measurements. Additionally, when a subject has 20 identical HTs or WTs, it is extremely unlikely that they provide the same strength of information that 20 similar but not all identical HTs or WTs provide. Therefore, for both WT and HT, later parts of the code distinguish subjects by how many distinct values they have; 2 distinct WTs each repeated 5 times are treated more similarly to how 2 total WTs would be treated than to how 10 completely distinct WTs would be treated. In many, but not all, steps, if one member of a repeated value group is excluded, all members of that group are excluded. A lot of attention was paid to avoid inappropriately excluding repeated values.

Repeated values are different in very important ways from carried forwards in growthcleanr-pediatrics. The biggest differences are: 1) In growthcleanr-adults, most repeated values are retained (in growthcleanr-pediatrics essentially all carried forwards are excluded). 2) In growthcleanr-adults, repeated values are identified as part of the repeated value group regardless of whether they are consecutive (in growthcleanr-pediatrics carried forwards are only consecutive). 3) In growthcleanr-adults, groups of repeated values often (not always) get excluded in the same step (this is not a possibility in growthcleanr-pediatrics because later carried forward values are excluded early, and nonconsecutive values are not identified).

Changes in recorded HT that are commonly seen in EHR data can have a substantial impact on BMI that is likely not reflective of differences in adiposity in most patients. For this reason, in addition to cleaning HT data we can optionally output a mean HT. For patients in whom we see evidence of HT gain or loss, the mean HT changes as the HT changes; otherwise, the mean HT is averaged over all of the subject’s included HTs. People using the dataset can decide to use the original HT or the mean HT. Note that in all cases, based on the way the algorithm works, the mean HT will be within 2 inches of the original HT.

While the adult algorithm is considered complete, several areas for potential improvement remain. These are noted as next steps for future work.

Input

Growthcleanr-adults works better with:

Precise HT and WT measurements – Growthcleanr-adults uses metric data. When extracting and converting data, it is useful to conserve as much precision as possible.
Difference between measurements in days, or as precisely as possible – Growthcleanr-adults is not especially dependent on a precise starting age, and if an exact starting age in days is not known it is reasonable to estimate it. Relatively small differences in ages could affect whether certain values are excluded in the HT gain and loss steps, but this would likely affect a very small number of subjects. Growthcleanr-adults is more dependent on precise age intervals. The allowable differences between WTs change with the interval between the WTs, and the rate of this change is high over short intervals. If age intervals are not known to the day and must be estimated, the algorithm will likely be considerably less reliable for short intervals between WTs.
As many (actually measured) measurements as possible per subject – In general, the more measurements the better. For example, if a researcher is interested in data from 2015-2018 but has data from 2012-2020, growthcleanr-adults will clean the 2015-2018 data better if all the data is put into the algorithm, even if data from some years will later be excluded. An exception to the “more is better” rule is that measurements known to be carried forward from prior measurements should be excluded before running growthcleanr-adults. For example, if a HT is recorded during a telephone visit and is identical to a prior HT, it was likely not measured on that day. Excluding these measurements before running growthcleanr-adults will result in better data cleaning.

Selected calculations

Exponentially weighted moving average (EWMA)

Many steps in the algorithm rely on an exponentially weighted moving average (EWMA – we pronounce it “yooma”). This is an average of all of a subject’s WT or HT values, weighted by how far a measurement is in days of age from the measurement of interest. Measurements closer in age are weighted much more heavily, and the influence of measurements farther away drops off fairly quickly. The EWMA is in some ways a predicted value for the WT or HT at a given age, given the other values for a subject. A WT or HT does not contribute to its own EWMA. There is also the EWMA-BEF, which is the EWMA excluding the value just before the value of interest, and the EWMA-AFT, the EWMA excluding the value just after the value of interest. For the first measurement for a subject EWMA-BEF is the same as the EWMA. Then there are the D-EWMA, D-EWMA-BEF, and D-EWMA-AFT, the differences between a WT or HT and the EWMAs for the same day. Often the sign of the D-EWMA doesn’t matter, so we focus on its absolute value, the absD-EWMA. In many steps, especially for WT, the D-EWMA is one of the criteria to determine if a value gets excluded. Often D-EWMA-BEF and D-EWMA-AFT are required as well, serving as extra checks to exclude only the correct measurements.
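The EWMA quantities described above can be sketched in a few lines of Python. This is a minimal illustration, not the package's code: the weighting function is not given in this document, so the sketch assumes weights of the form (offset + days apart)^exponent, and the offset and exponent values below are hypothetical placeholders. The function names are also illustrative.

```python
def ewma_at(agedays, values, i, skip=None, offset=5.0, exponent=-5.0):
    """EWMA for measurement i: weighted mean of the subject's other values.

    offset and exponent are hypothetical placeholders (assumptions),
    not the parameters used by growthcleanr-adults.
    """
    num = den = 0.0
    for j, (age, val) in enumerate(zip(agedays, values)):
        if j == i or j == skip:
            continue  # a value never contributes to its own EWMA
        weight = (offset + abs(age - agedays[i])) ** exponent
        num += weight * val
        den += weight
    return num / den if den > 0 else None


def ewma_table(agedays, values):
    """EWMA, EWMA-BEF, EWMA-AFT and D-EWMA for every measurement (sorted by age)."""
    rows = []
    for i, val in enumerate(values):
        ewma = ewma_at(agedays, values, i)
        # EWMA-BEF drops the value just before i; for the first value there is
        # nothing to drop, so it equals the EWMA. EWMA-AFT is handled the same way.
        bef = ewma_at(agedays, values, i, skip=i - 1)
        aft = ewma_at(agedays, values, i, skip=i + 1)
        rows.append({
            "ewma": ewma,
            "ewma_bef": bef,
            "ewma_aft": aft,
            "d_ewma": None if ewma is None else val - ewma,
        })
    return rows


# Four WTs (kg) measured 30 days apart; the third is far from its neighbors.
rows = ewma_table([0, 30, 60, 90], [76, 78, 177, 76])
```

With these placeholder weights, the D-EWMA for the 177 kg value is roughly 100 kg, while the other values have their EWMAs pulled upward by it, which is one reason EWMA-BEF and EWMA-AFT are useful as extra checks.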

There are important differences in EWMA between growthcleanr-adults and growthcleanr-pediatrics. First, the parameters are different so that closer WTs/HTs are counted much more heavily in growthcleanr-adults. This is because of WT gain and loss, but the new parameters worked well for HT as well. Also, instead of making an EWMA of standard deviation scores, in adults we just make the EWMA of the raw WT or HT itself. So, the EWMAs and D-EWMAs are in kgs and cms, and the cutoffs are very different than for pediatrics.

Extrapolation

One issue that came up repeatedly for WT is that with WT gain/loss, there would be values, including local mins/maxes, that looked plausible on inspection but would come up as implausible by EWMA criteria. One check in a late step that helped to avoid excluding these plausible values was extrapolating a WT using the two prior values (P1 and P2 are used to make P-Extrap-WT) or the two next values (N1 and N2 are used to make N-Extrap-WT). Then, the algorithm checks if the WT of interest (WT_i) is between the further prior/next value and the extrapolated value (so if WT_i is between P2 and P-Extrap-WT or if WT_i is between N2 and N-Extrap-WT) with some allowable error on either side. Generally, if a WT_i is between these values for either the prior or the next extrapolation it will not get excluded in this step through standard criteria (there may be alternate criteria it can meet).

An example of the extrapolation calculations is provided below:

	Agedays	WT
P2	10000	61
P1	10014	65
WT_i	10028	72
N1	10056	68
N2	10112	58

Prior WT extrapolation

Subject gained 4 kg in 14 days from P2 to P1
There are 14 days from P1 to WT_i, so extrapolating would give 4 more kg on top of P1
P-Extrap-WT = 65+4 = 69
Allowable range for WT_i is P2 to P-Extrap-WT (61-69) with 5 kg error on either side (56-74)
WT_i (72) is within the allowable range, therefore, this WT_i would not get excluded through standard criteria.

Next WT extrapolation

Subject “gained” 10 kg in 56 days from N2 to N1 (helps to think about this going back in time)
There are 28 days from N1 to WT_i, and 28=56*0.5, so extrapolating would give 10*0.5 more kg on top of N1
N-Extrap-WT = 68+5 = 73
Allowable range for WT_i is N2 to N-Extrap-WT (58-73) with 5 kg error on either side (53-78)
WT_i (72) is within the allowable range; nothing changes because it was already ineligible to get excluded through standard criteria.
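The two worked extrapolations above can be reproduced with a short Python sketch. The function names are illustrative, and the 5 kg allowable error is taken from this example only; the step itself may use different or additional criteria.

```python
def extrapolate(far, near, target_age):
    """Extend the line through two (agedays, wt) points to target_age.

    far is P2 or N2, near is P1 or N1. Assumes the two points fall on
    different days, so the slope is defined.
    """
    far_age, far_wt = far
    near_age, near_wt = near
    slope = (near_wt - far_wt) / (near_age - far_age)
    return near_wt + slope * (target_age - near_age)


def extrapolation_check(far, near, target_age, wt_i, error=5.0):
    """Return the extrapolated WT, the allowable range, and whether WT_i is in it."""
    extrap = extrapolate(far, near, target_age)
    low = min(far[1], extrap) - error
    high = max(far[1], extrap) + error
    return extrap, (low, high), low <= wt_i <= high


# Values from the table above.
prior = extrapolation_check((10000, 61), (10014, 65), 10028, 72)
nxt = extrapolation_check((10112, 58), (10056, 68), 10028, 72)
```

Here prior gives an extrapolated WT of 69 with an allowable range of 56 to 74, and nxt gives 73 with a range of 53 to 78, matching the hand calculations.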

Description of each step

Note that not all steps are done for both HT and WT; this is indicated for each step in the left of the step title box.

Please note that although HT is evaluated in growthcleanr-adults in cm, it is often measured in inches. Many allowable HT errors are based on multiples of inches for this reason.

Description of Steps Format

Each step is listed with its step number, its name, and exclusion output. If a new output variable is created, it will be listed in the text in bold.

Initial Height and Weight Steps Together

Step 1H 1W, BIV (Biologically implausible values): HT/WT BIV

Biologically implausible values. Excludes WT<20, WT>500, HT<50, HT>244. These are extremely broad limits on purpose. Keeping out of scale measurements in (WT=3000) can cause problems, but otherwise excluding very extreme measurements that are not in line with a person’s other measurements does not interfere with data cleaning. Very little is gained by making these numbers more conservative, and there is potential for removing plausible measurements.
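As a minimal sketch, the BIV limits above amount to a simple range check. The names below are illustrative, and the sketch assumes WT in kg and HT in cm, consistent with the algorithm's use of metric data.

```python
# Limits from the BIV step; values strictly outside them are excluded.
BIV_LIMITS = {"WT": (20.0, 500.0), "HT": (50.0, 244.0)}


def is_biv(param, value):
    """True if a WT (kg) or HT (cm) falls outside the BIV limits."""
    low, high = BIV_LIMITS[param]
    return value < low or value > high
```

For example, is_biv("WT", 3000) is True, while a WT of exactly 20 is not excluded because the limit is WT<20.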

2W, RV (Repeated values)

In this step, WTs that are identical to another WT and are not the first WT of that identical group have EXC-WTR set to “RV”; the EXC-WT is still “INC”. Note this will sometimes be redone in future steps and will just be designated as “RV redone” in the body of the description. For WT we will discuss using firstRV, in which case we will only use the first of an RV group (excluding all with EXC-WTR==“RV”). If we do this, and we exclude the first of an RV group, we generally exclude all subsequent members of the RV group. This ultimately results in a very small number of exclusions. For other steps we will use allRV. Either way, we need to update both EXC-WT and EXC-WTR at the end of each WT step.
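The RV flagging can be sketched as follows. This is an illustration only: it assumes the WTs are already sorted by age and that repeated values are exactly equal as recorded; the function name is hypothetical.

```python
def flag_repeated_values(weights):
    """Return an EXC-WTR-style flag per WT: "RV" for repeats, "INC" otherwise.

    The first member of each group of identical WTs stays "INC";
    repeats are flagged whether or not they are consecutive.
    """
    seen = set()
    flags = []
    for wt in weights:
        flags.append("RV" if wt in seen else "INC")
        seen.add(wt)
    return flags


# Subject Z from the discussion of repeated values.
z_flags = flag_repeated_values([61, 63, 67, 72, 61, 72, 74, 61, 81, 61, 92])
```

Using firstRV then corresponds to keeping only the entries flagged "INC", and allRV to keeping every entry.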

3H 3W, Temp extraneous

We identify where there is more than one WT or HT for the same subject, parameter, and day of age (designated same days). For each day with same days, we select the value closest to the median WT or HT for that subject (not including any same days) and set all other same days to “temp extraneous”. Then RVs are redone for WT. This will often be redone in future steps and will be designated as “temp extraneous is redone” – this includes redoing WT RVs.
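A minimal sketch of the temp extraneous selection for one subject and one parameter is shown below. Two behaviors are assumptions not stated above: when every day has same days, the sketch falls back to the median of all values, and ties in distance from the median are broken by keeping the first value. The function name is illustrative.

```python
from collections import defaultdict
from statistics import median


def flag_temp_extraneous(agedays, values):
    """Keep one value per day; flag the other same days as temp extraneous."""
    by_day = defaultdict(list)
    for idx, day in enumerate(agedays):
        by_day[day].append(idx)

    # Median of values on days with a single measurement (same days left out).
    singles = [values[ix[0]] for ix in by_day.values() if len(ix) == 1]
    # Assumption: fall back to all values if no single-measurement days exist.
    reference = median(singles) if singles else median(values)

    flags = ["INC"] * len(values)
    for ix in by_day.values():
        if len(ix) > 1:
            # Assumption: min() keeps the first value when distances tie.
            keep = min(ix, key=lambda k: abs(values[k] - reference))
            for k in ix:
                if k != keep:
                    flags[k] = "temp extraneous"
    return flags


flags = flag_temp_extraneous([100, 200, 200, 300], [70, 71, 95, 72])
```

In this example the median of the non-same-day WTs is 71, so the 71 kg value is kept for day 200 and the 95 kg value is set to temp extraneous.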

4W, Weight cap: WT Weight-cap, Weight-cap-identical

Some hospitals, vendors, or other data providers may place caps on elements of the data. It seems that a cap on the upper limit of weight is most likely to affect this algorithm. If the user enters a parameter indicating that there is a weight cap in the provided dataset, this step will do three things. If no weight cap is entered, the algorithm will skip to step 5. A) If a subject has all WTs equal to the cap, it will exclude all WTs and all HTs for that subject with code weight-cap-identical. B) If a subject has at least 1 WT equal to the cap, and that WT (or WTs) is not in line with the subject’s other WTs, it will exclude the WT(s) equal to the cap with code weight-cap. C) If a subject has at least 1 WT equal to the cap that IS in line with the subject’s other WT, then there is the concern that the capped WT is inaccurate, and the subject’s true WT is higher. In this case the WT(s) equal to the cap is/are retained. Temp extraneous is redone.

Section Break: Special Cases

In this section, we will exclude WTs and HTs that meet criteria for certain special cases, such as unit errors. This helps us exclude some errors early that are relatively extreme but may be less extreme than we could exclude without the additional evidence provided by the special case data. These steps are most useful for WT but also can help for HT. In general, the special case steps involve comparing the recorded value with the same value transformed in a manner based on which special case is being tested. The recorded and transformed values are then compared with the values surrounding them using EWMA and by calculating the difference between the recorded/transformed values and the prior and next values. Criteria are developed for each type of special case. If the recorded value deviates from the surrounding values in a specific way AND the transformed value fits the surrounding values using fairly strict criteria, the special case is considered met and the recorded value is excluded. The method does not correct special cases; it just excludes them. FirstRV is used for all WT special case steps except for swaps. At the end of all special case steps (except swaps), if the first RV of a group is excluded, all other members of that RV group are excluded with an RV appended to their error code (e.g., EXC-WTR=“RV Hundreds”). At the end of each special case step, temp extraneous is redone.

5H 5W, Special case, Hundreds: HT/WT Hundreds

This step evaluates whether there is evidence that the WT is off by 100 or 200 kg from the expected value (or off by 100, 200, or 300 lbs.), or if the HT is off by 100 cm below the expected value. For example, imagine subject A had WTs of 76kg, 78kg, 177kg, 76kg. Each of those 4 values would be tested for each type of hundred special case. When the 177 kg WT was tested, one transformed value would involve subtracting 100 kg, for a transformed value of 77kg which is in line with the other measurements. This WT of 177kg will therefore meet criteria to be excluded in the Hundreds step.

6H 6W, Special case, Unit errors: HT/WT Unit-errors

This step evaluates whether there is evidence that the WT is recorded in pounds when the dataset indicates it was recorded in kilograms, and/or that HT is recorded in inches when the dataset indicates it was recorded in centimeters. For example, if subject B had HTs 170.2, 170.5, 67, 170.5, 170, the HT=67 (67*2.54=170.18) would be recognized as a unit error and excluded in this step. Again, firstRV is used for WT and allRV for HT.
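The recorded-versus-transformed comparison for unit errors can be sketched as below. This is a deliberately simplified stand-in: the actual step compares values using EWMA and the prior and next values, whereas this sketch compares against the median of the subject's other values with a hypothetical tolerance. The function name and tolerance are assumptions.

```python
from statistics import median

CM_PER_INCH = 2.54
KG_PER_LB = 0.45359237


def unit_error_candidates(values, factor, tolerance):
    """Indexes whose recorded value does not fit but whose transformed value does.

    factor converts the suspected unit to the expected one (inches to cm,
    or pounds to kg). tolerance is a hypothetical cutoff, not the
    algorithm's actual criteria.
    """
    flagged = []
    for i, recorded in enumerate(values):
        others = values[:i] + values[i + 1:]
        if not others:
            continue
        reference = median(others)
        transformed = recorded * factor
        recorded_fits = abs(recorded - reference) <= tolerance
        transformed_fits = abs(transformed - reference) <= tolerance
        if not recorded_fits and transformed_fits:
            flagged.append(i)
    return flagged


# Subject B from the example above, with a hypothetical 1 inch tolerance.
b_flagged = unit_error_candidates([170.2, 170.5, 67, 170.5, 170], CM_PER_INCH, 2.54)
```

Only the HT=67 is flagged, because 67 inches converts to 170.18 cm, which is in line with the subject's other HTs.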

7H 7W, Special case, Transpositions: HT/WT Transpositions

This step evaluates whether there is evidence that the 10s and 1s digits of HT or WT are transposed. For example, if subject C had WTs of 95, 94, 59, 96, 95.5, the recorded value would be 59 and the transformed value would be 95. The WT=59 would be excluded as a transposed WT in this step. One of the checks in the transpositions step is that the tens and ones digits have to be at least 3 apart to consider something a potential transposition. So, if subject D had WTs of 75, 74, 57, 76, 75.5, the WT=57 would not be excluded as a transposition.
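The transposition transform and the digit-distance check can be sketched as follows. The sketch covers only the transform; the criteria for deciding whether the transformed value fits the surrounding values are not shown. Keeping any fractional part unchanged is an assumption, and the function name is illustrative.

```python
def transpose_tens_ones(value):
    """Swap the tens and ones digits, or return None if they are < 3 apart."""
    whole = int(value)
    frac = value - whole  # assumption: the fractional part is left as recorded
    tens = (whole // 10) % 10
    ones = whole % 10
    if abs(tens - ones) < 3:
        return None  # digits too close to be considered a transposition
    swapped = whole - (tens * 10 + ones) + (ones * 10 + tens)
    return swapped + frac
```

For subject C, transpose_tens_ones(59) returns 95; for subject D, transpose_tens_ones(57) returns None because 5 and 7 are only 2 apart.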

8 (H&W), Special case, Swaps: HT/WT Swaps

This step evaluates whether there is evidence that the WT and HT are swapped with each other. This step only evaluates metric values; in other words, it does not evaluate the case in which there is a combination of unit errors and a swap. For example, if subject E had HTs of 164, 165, 72, 164, 164 and WTs on the same days of 73.1, 71, 165, 72.2, 73.2, the HT=72 and the WT=165 would be recognized and excluded as swaps. In order to compare as many HTs and WTs as possible, allRVs are used for WT. Other members of an RV group are not excluded at the end of a swaps step. Temp extraneous is redone for WT but not for HT, because the final extraneous step is the next step for HT.

Separate HT Algorithm Steps

9H, Same day extraneous: Same-day-extraneous/Same-day-identical

This is the final step in which same day extraneous HTs (multiple HT measurements for the same subject on the same day) are evaluated. The criteria to keep a same day extraneous value are fairly strict. First, if all HTs on the same day are identical, one is set to be included and the rest are excluded. Then, it is determined if all HTs on the same day for a subject vary by a trivial amount (<1 inch). If so, the recorded values are replaced with the mean of the same day extraneous values for that day. The original values are stored in orig_wt_sde. Then, if more than 25% of days for a subject have same day extraneous values OR if there are adjacent days that have same day extraneous values, all remaining same day extraneous values for that subject are excluded. For remaining subjects with same day extraneous values, if there is 1 (and only 1) HT that is within 1 cm of the HT before it, within 1 cm of the HT after it, and within 1 cm of the median HT, that value is included and the rest of the same day extraneous values on that day are excluded. If no value meets those criteria on that day, all same day extraneous values for that day are excluded.

10Ha, Distinct pairs: Distinct-pairs

In this step, subjects for whom there are only 2 remaining distinct HTs are evaluated. So subject F with 2 different HTs (170, 171) and subject G with 8 HTs but only 2 distinct HTs (162, 162, 162, 163, 162, 162, 163, 162) are both evaluated in this step. In this step either all HTs or no HTs are excluded, because if it is determined that the pair of HTs is implausible it is generally too difficult to tell which of the pair should remain included. If both HTs are within 2 inches (5.08 cm) of each other, both HTs remain included. If the HTs are not within 2 inches of each other, they may stay if they meet loss or gain criteria. Loss criteria are primarily designed to identify people with a vertebral fracture. They require a loss of HT that is below a limit based on whether the subject is under or over 50 years old AND that the HT remains stable after the initial loss (for example 170, 170, 163, 163, 163 meets loss criteria but 170, 170, 163, 163, 170 does not). Gain criteria only apply to people with an initial age <25 years and allow HT gain that is dependent on the starting age and the age interval; gain criteria also require that the HT remains stable after the initial gain. The variables pairhtloss and pairhtgain indicate that values met the loss or gain criteria; these variables also facilitate calculation of meanht at the end of the algorithm.

10Hb, Distinct 3+: Distinct-3-or-more

In this step, subjects for whom there are 3 or more remaining distinct HTs are evaluated. If all HTs are within a 2-inch range, they all remain included. Otherwise, the algorithm attempts to identify a band of HTs within 2 inches of each other that has at least 1.5x as many HTs as the number of HTs outside the band. If there are multiple bands of HTs that meet these criteria, the algorithm selects the band with the highest HT number score (score = total HTs within band + 0.5 * distinct HTs within band) with ties broken by a measure of the distance between the mean of the HTs within the band and the HTs within the band, with the goal of identifying the band that is most distinct from the HTs outside the band. If no band is identified, all HTs are excluded. If a band is identified, HTs outside the band are excluded. Then, in loss and gain steps, subjects with HTs excluded in this step may be reincluded if they meet specific criteria. The loss and gain criteria are similar to step 10Ha except that two separate episodes of HT loss are allowed, and up to 5 separate episodes of HT gain are allowed (all starting before age 25y). The variables htlossgroup and htgaingroup indicate that HTs were reallowed because of loss/gain criteria in this step and also indicate which HTs are grouped with which HTs. These variables also facilitate calculation of meanht at the end of the algorithm.
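The band search and score can be sketched as below. Several details are assumptions: candidate bands are anchored with each distinct HT as the lower edge, the 2-inch width is treated as inclusive, and ties in score simply keep the first band found rather than applying the distance-based tie-break described above. The loss and gain reinclusion is not shown, and the function name is illustrative.

```python
TWO_INCHES_CM = 5.08


def best_height_band(heights):
    """Return (score, low, high, excluded) for the best band, or None.

    A band qualifies if it holds at least 1.5x as many HTs as fall outside it.
    score = total HTs within band + 0.5 * distinct HTs within band.
    """
    best = None
    for low in sorted(set(heights)):
        high = low + TWO_INCHES_CM
        in_band = [h for h in heights if low <= h <= high]
        n_outside = len(heights) - len(in_band)
        if len(in_band) < 1.5 * n_outside:
            continue
        score = len(in_band) + 0.5 * len(set(in_band))
        # Assumption: on equal scores the first band found is kept.
        if best is None or score > best[0]:
            excluded = [h for h in heights if not (low <= h <= high)]
            best = (score, low, high, excluded)
    return best


band = best_height_band([170, 171, 170.5, 172, 150, 171])
```

In this example the band starting at 170 cm holds 5 HTs (4 distinct), giving a score of 7.0, and the HT of 150 falls outside the band.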

11H, Mean HT

In this step a mean HT for each subject is calculated that can provide a more stable HT, particularly useful for BMI calculations. The mean was purposely chosen over the median in order to ensure that more common measurements carry more weight. All HTs that will be averaged will be within 2 inches of each other, so the difference between the mean and median, and between the mean and the recorded HT, will generally be small. For most subjects, meanht will simply be the mean of all HTs for that subject. Exceptions are described below. If HTs were kept in step 10Ha because of allowed loss or gain (pairhtloss is 1 or pairhtgain is 1), then there are only 2 distinct HTs, but they are identified as representing two separate states and should not be averaged, so each distinct HT serves as its own mean for these subjects. Similarly, if HTs were kept in step 10Hb because of allowed loss or gain (htlossgroup is 1-3 or htgaingroup is 1-6), then only HT values within each loss group or gain group should be averaged together.

Separate WT Algorithm Steps

9W, Extreme EWMA: Extreme-EWMA

In this step, EWMA is used to remove remaining extreme implausible measurements before choosing final same day extraneous values and excluding more moderate errors. Extreme EWMA is done first with firstRV and then with allRV because doing it with either alone missed important errors in some datasets. The cutoffs for absD-EWMA are >60 kg (so a value must be more than 60 kg from its EWMA or ‘predicted’ value) for WTs that are within one year of at least one other WT, and >100 kg for WTs that are not within one year of another WT. There are sometimes multiple extreme EWMA values within one subject, so the extreme EWMA steps are run multiple times, excluding the most extreme value for each subject with each round until no values are excluded. The firstRV extreme EWMA step is run multiple times first, then the allRV extreme EWMA step is run multiple times. RVs are then redone.

10W, Same day extraneous: Same-day-extraneous/Same-day-identical

This is the final step in which same day extraneous WTs (multiple WT measurements for the same subject on the same day) are evaluated. AllRV are used. The criteria to keep a same day extraneous value are fairly strict. First, if all WTs on the same day are identical, one is set to be included and the rest are excluded. Then, it is determined if all WTs on the same day for a subject vary by a trivial amount (<1 kg). If so, the recorded values are replaced with the mean of the same day extraneous values for that day. The original values are stored in orig_wt_sde. Then, if more than 25% of days for a subject have same day extraneous values OR if there are adjacent days that have same day extraneous values, all remaining same day extraneous values for that subject are excluded. For remaining subjects with same day extraneous values, the algorithm determines if there is 1 WT with an absD-EWMA<2 and that there is only 1 WT with an absD-EWMA<5. The goal here is to find one WT that ‘fits’ and confirm that the rest of the WTs do not ‘fit.’ If those criteria are met, the WT with an absD-EWMA<2 is included and the rest for that day are excluded as same day extraneous. If either of those criteria is not met, all same day extraneous values for that day are excluded. RVs are redone.
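The final selection rule for one day's same day extraneous WTs can be sketched as follows, given each candidate's absD-EWMA. The sketch reads "there is 1 WT with an absD-EWMA<2" as exactly one, which is an interpretation; the function name is illustrative.

```python
def pick_same_day_weight(abs_d_ewma):
    """Return the index of the WT to keep for one day, or None to exclude all.

    abs_d_ewma holds the absD-EWMA (kg) of each same day WT.
    """
    fits = [i for i, d in enumerate(abs_d_ewma) if d < 2]
    near = [i for i, d in enumerate(abs_d_ewma) if d < 5]
    # One value must fit (<2) and no other value may come close (<5).
    if len(fits) == 1 and len(near) == 1:
        return fits[0]
    return None
```

For example, absD-EWMAs of 0.5 and 12 keep the first WT, while 0.5 and 3 exclude both because the second WT is too close to fitting to be confidently ruled out.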

11Wa, Distinct ordered pairs: Distinct-ordered-pairs

In this step the algorithm evaluates subjects with 2 remaining distinct WTs that are in order so that all of one WT come before all of another WT. As examples, 45, 45, 48, 48, 48 would be an ordered pair. And 102, 107 would also be an ordered pair. But 81, 81, 74, 81, 74 would not be an ordered pair. The amount that the WTs in an ordered pair are allowed to differ is determined by the difference in ages between the last day of distinct WT 1 and the first day of distinct WT 2. The formula is wtallow = 4 + 18 * ln(1 + ageinterval [in months]). If the absolute difference between WT1 and WT2 is below wtallow, all WTs stay included. If the absolute difference is above wtallow, all WTs are excluded. Very low WTs that are not consistent with a subject’s other WT do not necessarily trigger the wtallow criteria; for this reason, a limit on the percent change in WT for subjects with low WTs is also added to this step.
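The wtallow formula can be evaluated with a short sketch. The conversion from days to months (30.4375 days per month) is an assumption, since the conversion is not stated here, and the percent-change limit for low WTs is not included. The function names are illustrative.

```python
import math

DAYS_PER_MONTH = 30.4375  # assumption: average month length (365.25 / 12)


def wtallow(interval_days):
    """Allowed WT difference (kg): 4 + 18 * ln(1 + age interval in months)."""
    months = interval_days / DAYS_PER_MONTH
    return 4 + 18 * math.log(1 + months)


def ordered_pair_stays(wt1, wt2, interval_days):
    """True if the ordered pair differs by less than wtallow."""
    return abs(wt2 - wt1) < wtallow(interval_days)
```

With this conversion, wtallow is about 7.7 kg for WTs one week apart, about 16.5 kg at one month, and about 50 kg at one year, showing how quickly the allowance grows over short intervals.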

11Wb, Moderate EWMA: Moderate-EWMA

In this step the algorithm evaluates subjects with 3 or more remaining distinct WTs or 2 remaining distinct WTs that are unordered (see distinct ordered pairs above for definition). AllRV are used. In this step, multiple additional checks are made to EWMA to avoid excluding WTs that are likely consistent with a subject who is gaining or losing WT. The primary additional checks that help to evaluate whether a WT is in line with a subject’s surrounding WTs are a) the extrapolation calculation described near the beginning of this document that evaluates whether a WT is within limits set by the surrounding values and an extrapolated value and b) a check called “interpolation” which simply checks if the WT of interest is between the WTs before and after it with some allowed error. If a WT is within the bounds of the extrapolation and/or interpolation checks, it cannot get excluded by meeting standard criteria and must meet special criteria. As in the distinct ordered pairs step, a limit on the percent change in weight for low WT values is added to this step. As in extreme EWMA, the moderate EWMA step is run multiple times, removing the most extreme implausible value in each run until no implausible values are removed.

12W, Weight cap influence: Exclude-possibly-impacted-by-weight-cap

In this step, which only runs if a weight cap was entered, if there are any WTs equal to the weight cap that are still included, all remaining included WTs for the subject will be set to “Exclude – Possibly Impacted by Weight Cap.” This is because a WT equal to the weight cap could mean the WT is the weight cap, or that it is substantially higher. (Please note the following examples only are in pounds.) For example, in a dataset with a weight cap of 400 lbs, in which WTs above 400 are reset to 400, subject M has WTs of 370, 395, 400, 400, 400, 390, 365, 340. If the WTs=400 are included as is, they would give the possibly false impression that M’s WT was stable at 400 during those three middle WTs, when M’s actual WT could have been 370, 395, 405, 420, 435, 390, 365, 340. But if the WTs equal to 400 were just excluded, M’s WT would appear to be 370, 395, 390, 365, 340, which is clearly inaccurate. Therefore, all of subject M’s WTs should be excluded. On the other hand, there are subjects with WTs equal to the WT cap that are out of line with a subject’s other WTs and were excluded earlier, such as if subject N’s WTs were 140, 145, 400, 148, 150, and the 400 were excluded in an earlier step. Therefore, it is okay to exclude the 400 and keep subject N’s other WTs. In order to identify subjects whose WTs were all excluded due to weight cap issues in step 4W, or whose remaining included WTs were all excluded in this step (12W), a variable all-exc-weight-cap is created that is TRUE or FALSE. Subjects with all WTs excluded due to the weight cap likely have severe obesity, or did at one point. The all-exc-weight-cap variable helps analysts identify these subjects in a dataset, even if their WT values cannot be determined precisely due to the weight cap. If desired, analysts can reinclude WTs that were excluded because of weight caps for these subjects.

Final HT and WT Steps Together

13, Distinct 1: Distinct-single

In this step subjects with only one remaining distinct value for WT or HT, or both, are evaluated. BMI is calculated where possible to assist in providing evidence about whether a WT or HT is plausible or not. If a subject has at least one day on which a BMI can be calculated, and at least one BMI is between 16 and 60 (inclusive), then the following limits apply to parameters with only one distinct value: exclude HT<88, HT>231, WT<30, WT>273. If there is at least one BMI and no BMIs are between 16 and 60, then any HT or WT for which there is only one distinct value is excluded. For example, if the only BMIs are ~117, and the HTs are 68, 68, 68 and the WTs are 54, 53, 52, then the HTs would be excluded, and the WTs remain included. But if the only BMIs are ~117 and the HTs are 68, 68, 68 and the WTs are 54, 54, 54, then all HTs and all WTs are excluded. After these two steps, BMIs are recalculated and subjects with one distinct value are reidentified because the initial steps may have changed these results. Then, for subjects with no remaining BMI the following limits apply to parameters with only one distinct value: exclude HT<139, HT>206, WT<40, WT>225. It is very reasonable to consider adjusting these limits for specific projects, depending on your population(s) of interest.
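The limits in this step can be sketched as a single decision function for a parameter that has only one distinct value. This is a simplified illustration: it does not perform the recalculation of BMIs and reidentification of single-value subjects between passes, and the function names are hypothetical. WT is in kg and HT in cm.

```python
LIMITS_WITH_BMI = {"HT": (88, 231), "WT": (30, 273)}
LIMITS_NO_BMI = {"HT": (139, 206), "WT": (40, 225)}


def bmi(wt_kg, ht_cm):
    """Body mass index in kg/m^2."""
    return wt_kg / (ht_cm / 100) ** 2


def exclude_single_value(param, value, bmis):
    """True if a parameter's single distinct value should be excluded.

    bmis lists the subject's available BMIs; an empty list means no BMI
    could be calculated.
    """
    if bmis:
        if not any(16 <= b <= 60 for b in bmis):
            return True  # at least one BMI exists, but none is plausible
        low, high = LIMITS_WITH_BMI[param]
    else:
        low, high = LIMITS_NO_BMI[param]
    return value < low or value > high
```

For the example above, bmi(54, 68) is about 117, so a subject whose only distinct HT is 68 has that HT excluded. A single distinct HT of 100 would stay included if a plausible BMI exists but would be excluded if no BMI can be calculated.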

14H, 14W, Error load: Too-Many-Errors

In this step, subjects with a large proportion of excluded HT or WT values are identified, and all values for that parameter for that subject are excluded, because a very high number of errors makes it difficult to make reliable decisions for all values for that subject. The error ratio is calculated as Errors/Total. Errors are everything that is not included, missing, or same day extraneous. The total is everything that is not missing or same day extraneous. If the error ratio is >0.4, all values for that parameter for that subject are excluded.
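The error ratio can be sketched as below for one subject and one parameter. The code strings are illustrative: "INC" follows the included code used earlier in this document, "Missing" is a hypothetical label, and treating Same-day-identical the same as Same-day-extraneous is an assumption.

```python
# Assumption: these labels are the ones left out of both Errors and Total.
NOT_COUNTED = {"Missing", "Same-day-extraneous", "Same-day-identical"}


def error_ratio(codes):
    """Errors / Total for one subject and parameter."""
    counted = [c for c in codes if c not in NOT_COUNTED]
    if not counted:
        return 0.0
    errors = [c for c in counted if c != "INC"]
    return len(errors) / len(counted)


def too_many_errors(codes, cutoff=0.4):
    """True if all values for this parameter should be excluded."""
    return error_ratio(codes) > cutoff


codes = ["INC", "INC", "Moderate-EWMA", "Hundreds", "Same-day-extraneous"]
```

Here the same day extraneous value is left out of both counts, so the ratio is 2/4 = 0.5 and the subject's remaining values for that parameter would be excluded.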

2026-02-20