Evaluation of regression methods when immunological measurements are constrained by detection limits

Background The statistical analysis of immunological data may be complicated because precise quantitative levels cannot always be determined. Values below a given detection limit may not be observed (nondetects), and data with nondetects are called left-censored. Since nondetects cannot be considered as missing at random, a statistician faced with data containing these nondetects must decide how to combine nondetects with detects. Till now, the common practice is to impute each nondetect with a single value such as a half of the detection limit, and to conduct ordinary regression analysis. The first aim of this paper is to give an overview of methods to analyze, and to provide new methods handling censored data other than an (ordinary) linear regression. The second aim is to compare these methods by simulation studies based on real data. Results We compared six new and existing methods: deletion of nondetects, single substitution, extrapolation by regression on order statistics, multiple imputation using maximum likelihood estimation, tobit regression, and logistic regression. The deletion and extrapolation by regression on order statistics methods gave biased parameter estimates. The single substitution method underestimated variances, and logistic regression suffered loss of power. Based on simulation studies, we found that tobit regression performed well when the proportion of nondetects was less than 30%, and that taken together the multiple imputation method performed best. Conclusion Based on simulation studies, the newly developed multiple imputation method performed consistently well under different scenarios of various proportion of nondetects, sample sizes and even in the presence of heteroscedastic errors.


Background
The number of immunological parameters that can be measured in large scale epidemiological studies has been rapidly increasing. Not all of these quantitative levels can be determined precisely. Reasons for this lack of precision are that the signal produced by the stimulant is too small for the instrumentation to discriminate the signal from the background noise, or a signal is registered, but certain (laboratory) criteria that identify the substance are not met. Values that cannot be quantified are called nondetects (NDs). We assume that all NDs are below a given detection limit (DL), and therefore we are dealing with censored data. Simple solutions such as deletion of NDs and single value substitution are often used, but it is unknown to what extent these methods provide unbiased results and thus would be adequate for the analysis. Applying various approaches yielded different parameter estimates in the environmental studies [1,2].
When the number of NDs is rather small, one approach of dealing with NDs is simply dropping NDs and apply linear regression to the remaining data. A second commonly used approach is to substitute NDs with a certain value smaller than the DL (0, DL/2 or DL) and to use linear regression [3,4]. The validity of these approaches will depend on the number and the unknown range of NDs. A third common practice is to dichotomize the cytokine measurements based on a certain cut-off point (DL or median) and to apply logistic regression to this binary variable [3,5]. A major drawback of this approach is that by dichotomizing much information is lost. Note also that the choice of 0, DL/2 or DL in the single value substitution and the threshold in the logistic regression approach is arbitrary. An important issue is then how to decide which method is optimal for a particular data set. Moreover, more sophisticated statistical methods may be needed for analyzing this type of data. This paper is motivated by a study on the relationship between intensity of parasite infection and cytokines measurements resulting from whole blood assay after stimulation with lipopolysaccharide ( Table 1). One of the cytokine measurements has only a small proportion of NDs (5.5%), whereas the second measurement has a relatively large proportion of NDs (66%). In addition to the presence of censored measurements, the distribution of cytokine measurements is often positively skewed. Skewed distributions in biology often closely fit the lognormal distribution and this characterization can be advantageous in the biological system when many factors act in multiplicative ways [6,7]. Therefore, it can be assumed that the cytokine variables are normally distributed after an appropriate (log-)transformation.
Considering the efforts made by collecting data, it seems worth while to investigate sophisticated and (maybe) time-consuming statistical methods to analyze data appropriately [8]. In this paper we review several com-monly used methods in immunology and more advanced methods used in other fields such as environmetrics and econometrics [1,9,10]. A second goal is to evaluate the performances of these methods via simulation studies [2,11,12]. The validity and precision of simple methods such as deletion and single value substitution will be studied for various scenarios including different proportions on ND's and different error models. In addition the utility of advanced statistical methods will be quantified.

Results of simulation studies
In Table 2, the six methods for analyzing data containing NDs that were considered are summarized: removal of nondetects (DELETION), single substitution of NDs with half of the value of DL (DL/2), extrapolation by regression on order statistics (ROS), multiple imputation using maximum likelihood estimation (MI), tobit regression (TOBIT), and logistic regression (LOGIT). To study the performance of these methods, we simulated data sets of size 200, 400 and 1,000 with proportions of NDs of 10, 30, 50 and 70%.
Regarding RMSE we summarized results in Figure 1. The considered simulation settings were as follows. The plots on the first column represent the scenario with negative effect imitating Cytokine 1, whilst the second to fourth columns show the positive effect of malaria intensity (Cytokine 2). The Columns also show the effect of different thresholds. For the plots of the first column the cut-off points of DL values were relatively large with 14.7, 10, 17, and 29 pg/ml corresponding to 10, 30, 50 and 70% of the whole data sets. In contrast, the second to fourth columns the DL-values were very close to zero, namely 0.7, 1.6, 2.7, and 4.6 pg/ml. The three rows display the results from the three different covariates. For the quantitative covariates generated from three-component mixture (row 1) and two-component mixture (row 2), the simulation results were similar. Therefore, the results from the three-component mixture imitating Cytokine 1 are discussed in details.
In Additional file 1 the results were summarized in terms of bias, root mean square error (RMSE) and coverage probability of 95% confidence interval (CI). Entries in the table are averages of 1,000 replications. In terms of bias, at all levels of NDs and for all sample sizes, the Deletion method produced the least accurate estimates. In contrast, for small data sets containing a small proportion of NDs (size 200 and 10% NDs), the TOBIT model produced nearly unbiased estimates, while the DL/2 method also performed well. With respect to RMSE, the MI method performed best in general. Note that the ROS method produced the smallest RMSE values, although it produced rel-   atively large biases. This indicates that ROS underestimates the variance. For visualization of the results in terms of RMSE, the performance of the two bestperforming methods -TOBIT and MI -were compared for different sample sizes in Figure 2. The advantage of using the TOBIT model with small percentage of NDs seems to disappear with increasing sample sizes.
In Table 3 the averaged variances of parameter estimates were given for the TOBIT, MI, and LOGIT methods. The variances of the TOBIT model increased with increasing proportion of NDs, while the level of variance using the MI method remained stable throughout the different sample sizes and the percentages of NDs. The LOGIT method with the proportion of NDs of 30% and 50% produced smaller variances than with a very small and very large proportions of NDs.
Regarding efficiency of the methods, the right panel of Figure 2 shows the power to detect at the nominal significance level of α = 5% for the TOBIT, MI, and LOGIT methods. The MI method was the best at all proportions of NDs and for all sample sizes. For a small proportion of NDs, the performance of the TOBIT and MI methods was equivalent. Overall, the LOGIT method performed worst.
Additionally we compared the performance of TOBIT, DL/2 and MI approaches under heteroscedastic errors.
The results are depicted in Figure 3. The RMSE of the TOBIT and DL/2 methods increased rapidly with increasing proportion of NDs. The MI approach appeared to be most robust and RMSE was below 0.1 for proportions of ND under 30%. In the last row of Figure 1, the results for the third (microscopic) category with reference to negative category are given. In contrast to quantitative covariates, there were dissimilarities in the behavior. RMSE did not increase as rapidly as with quantitative covariates, when the proportion of NDs becomes large. However, the actual RMSE values were much higher. Although the order of the best methods did not vary much, the TOBIT model gave bad performance when the sample size was small (n = 200).

Application to immunological data
Different choices yielded different results as illustrated for our motivating data as can be seen in Table 4. To determine the optimal method for this particular data set, simulation results are used as a reference. For the two cytokine responses, the residuals of simple linear regression on intensity of parasite infection using DL/2 imputation for NDs are given in Figure 4.

Discussion
In the field of immunology there is great need for specialized methods for analysis of data in order to improve accuracy and power. In this paper we proposed advanced methods to deal with data sets when DL plays a significant role. Via simulation studies we first evaluated performances of several methods. Because NDs are not missing at random, biases can be expected when simply dropping NDs. Even with proportion of NDs of 10%, the bias was unacceptable. For parameter estimation substituting DL/2 in NDs was reasonable, but the variance was underestimated. Furthermore, as illustrated by our data set the choice of the imputed value (0, DL/2, DL) remains an issue. For large proportion of NDs, ROS appeared to yield large biases. Analogous to the DL/2 method, the variance of parameter estimates was underestimated. The TOBIT method appeared to be an elegant method to deal with a small proportion of NDs under the constant error assumption. If possible, the normality assumption should be checked before considering the TOBIT model ( Figure  4) [13]. For larger proportions of NDs (larger than 10%), MI outperforms the other methods in terms of RMSE. Since imputations are multiple, the MI method takes into account the uncertainty about the true values of the NDs. Furthermore, it is rather robust against heteroscedasticity of errors. Figure 2 showed for large sample size the MI method produces more accurate estimates than the TOBIT method. Note that the MI method might be improved by using more sophisticated methods to compute the mean and the standard deviation of a truncated normal distribution [14]. However, diminishing variances by increasing proportion of NDs require the careful use of the MI method when proportion of NDs is greater than 50%.
We also compared results from the different scenarios: (1) whether there is positive relationship between dependent and independent variables, (2) when the characteristics of covariates were changed (three-and two component mixture, or categorized), and (3)whether the closeness of the detection limit to zero will influence the results (Figure 1). In general, the type of included covariates in the model did not influence the findings. Therefore, our findings in this paper can be used as a reference. Nevertheless, careful consideration should be given to what are the appropriate methods for analyzing each specific data.
The limitation of our simulation study lies in skewed error distributions. However, we studied a simple solution of dichotomizing a continuous variable. Although this is an inefficient approach, and determination of cut-off points remains arbitrary [5], for some situations creating a binary outcome variable could be the most sensible option when ββββ measurements can easily be categorized. The method can also be extended to more than two categories by using ordered logistic regression (or proportional odds model).
Note that to reflect the natural ordering of the categories, ordered logistic regression should be preferred to multinomial logistic regression [15,16]. Additional advantage of using ordered logistic regression is that the results can be presented in one parameter. In contrast, in using multinomial logistic regression as in our simulation study, the first (or most common) level will be considered as reference category (negative level), and the inference of remaining two categories compared to the reference will be given. Although making more categories than two might improve performance, the determination of categories remains arbitrary.
When data are very skewed and normality cannot be achieved by the usual transformation, quantile regression could be considered [17]. This is an econometric regression model, in which a specified conditional quantile (or percentile) of the outcome variable is expressed as a linear function of covariates.
Simply the lines split the population into two parts with the proportion of 70, 80 or 90% lying below the line, and the proportion 30, 20, or 10% above the line, respectively. Similar to logistic regression the choice of the quantile is arbitrary. However, it assumes no underlying distribution, and is reported to be robust against heteroscedastic errors. Good performance of quantile (or median) regression method have been reported elsewhere [18]. However, when the proportion of NDs is greater than 50%, median regression is not suitable. Also with normally distributed data (after appropriate transformation), the improvement using median regression would be little. The computation of quantile regression is possible using R, SAS, and Stata.
In this paper we considered a single variable restricted with NDs. Extending to multiple regression, such as multiple cytokine measurements of the same individuals and/ or related cytokine levels within some set, it is very probable that we encounter NDs in more than one covariate and with different DLs. It can be expected that the large number of correlated cytokine variables would enhance the advantage of using the multiple imputation techniques [19]. In fact, using information on other correlated variables such as families would improve the performance of MI. It is not the purpose of this paper to stress that the MI method should be used everywhere in the presence of DL. Nevertheless, we showed that the search for new methods might gain deeper understanding of data, and that simulation studies can contribute to decide the optimal methods for measurement data with NDs.

Conclusion
We showed that a dichotomization of continuous variable generally causes loss of information, hence loss of power. We compared the several linear regression methods to deal with the data containing NDs based on simulation studies. The TOBIT method produced the most accurate estimates with the least bias. When the amount of NDs is relatively small (≤ 30%) and the normality assumption is met as Cytokine 1 in our example data, the use of the TOBIT method is recommended. However, as reported elsewhere [20,21], the TOBIT model is sensitive to the violation of normality assumption. Therefore, when heteroscedastic errors are suspected, and/or the amount of NDs is large, robust statistical methods have to be considered. We proposed to employ multiple imputation technique. The MI method performed consistently well under different scenarios of various proportion of NDs (≤ 50%), sample sizes and even in the presence of heteroscedastic errors.

Methods to compare
The following linear regression model is considered for the outcome y i of subject i with a covariate x i where ε i is random noise and i = 1, ..., n. The error ε is assumed to be uncorrelated with x and to have a mean equal to zero and a constant variance. The parameters α and β denote the intercept and the average change in y with x. By Ordinary Least Squares (OLS) the estimated slope and intercept of the regression line can be computed. However, in immunological data the ys in equation (1) are only partly observed. A lower threshold or detection limit, DL, interferes with measurements of low levels as follows: Since NDs of cytokine measurements reflect levels of exposure, they cannot be considered as missing at random (MAR) [22]. Therefore, deleting the lowest values is expected to produce biased results. Other types of methods to analyze these data are imputation and modelling of NDs. An overview of the available methods is given in Table 1. In environmental statistics a method called robust regression on order statistics (ROS) approach exists [1,9]. This method is often used to compute summary statistics.
To reflect uncertainty about imputation, we propose to employ multiple imputation approach as introduced by Little and Rubin [22,23]. Based on a truncated normal distribution, we first compute the mean and the standard deviation. This can be done using the functions cenmle or ros from the R-package NADA [24]. Then, the values for NDs were generated randomly and m complete data sets are created and each data set is analyzed separately. Rubin Finally, two non-imputation methods for incorporating NDs into regression models are investigated. Without adding uncertainty on the distribution of the NDs, the outcomes can be dichotomized and logistic regression can be applied. However, the relationship between the covariate and the outcome is now on a logit scale instead of a linear one. A more sophisticated approach is to use maximum likelihood estimation (MLE) method for left-censored data, called TOBIT model after the economist James Tobin [26]. The model is written as a combination of The probit part determines whether the outcome variable is below-DL, and the OLS part is a truncated regression model. The TOBIT model estimates a regression model for the data above DL, and assumes that the censored data (below DL) have the same distribution of errors as the observed data. The weakness of this method is that it may be more vulnerable to violation of the assumptions about the error distribution. Many comments can be found in the literature that in the presence of heteroscedasticity the Tobit estimates are inconsistent, and that there is only limited information about the direction of the bias [20,21].

Simulation study
We simulated data sets by drawing samples from a population similar to the example data in the Background section, and by allocating a proportion of observations as NDs.
For the covariate x (infection intensity) we used (1)  Then, based on the characteristic of Cytokine 1, outcome variables were generated using the following regression model, for individual i ∈ {1, ..., n}. Based on Cytokine 2, we generated outcome variables as And, ε were assumed to be standard normally distributed.
Based on biology, the malaria parasite measurements lend to be categorized in three classes: negative, submicroscopic, and microscopic. Instead of looking at the effect of malaria with continuous measurements, we considered the categorical malaria variable, say z. The dummy code z i = (z i1 z i2 z i3 )® denotes a vector of malaria category indicators for the ith subject, with elements z ij = 1 if ith subject has jth category; otherwise z ij = 0. The categorical covariate vector z were then generated following the multinomial distribution of categorized malaria status with proportions of 0.69, 0.14, and 0.17. Based on Cytokine 1, y were generated following the model: while based on Cytokine 2 y i = 0.84 + 0.13z i2 + 0.77z i3 + ε.
Here ε were assumed to be standard normally distributed.
For studying the effect of heteroscedastic errors we used the same model as in (3) but now with a variance depending on the value of x by using ε ~ N(0, ).

Evaluation of methods
In general, accuracy of estimate can be evaluated by bias, which represents the closeness to the true values, and precision measures the ability to repeat a previous estimates (regardless of accuracy). The combination of both accuracy and precision of estimate can be investigated by the root mean square error (RMSE) as follows: Therefore, parameter estimates provided by the various methods were compared in terms of mean bias and RMSE. Also coverage probability was provided, which is the probability that the confidence interval of the estimates contains the value. Additionally, for the unbiased methods performances were also compared for their hypothesis testing abilities in terms of power. The Wald-type statistic /SE( ) was used for testing. It is approximately distributed as a t-distribution with n -2 degrees of freedom for n observations in each sample for continuous outcome.
All computations have been done using the program language R [27].