Comparison of Logistic Regression and Linear Regression in Modeling Percentage Data.
Percentages are widely used to describe results in food microbiology, e.g., probability of microbial growth, percent inactivated, and percent of positive samples. Four sets of percentage data (percent-growth-positive, germination extent, probability for one cell to grow, and maximum fraction of positive tubes) were obtained from our own experiments and the literature. These data were modeled using linear and logistic regression. Five methods were used to compare the goodness of fit of the two models: percentage of predictions closer to observations, range of the differences (predicted value minus observed value), deviation of the model, linear regression between the observed and predicted values, and bias and accuracy factors. Logistic regression was a better predictor of at least 78% of the observations in all four data sets. In all cases, the deviation of logistic models was much smaller. The linear correlation between observations and logistic predictions was always stronger. Validation (accomplished using part of one data set) also demonstrated that the logistic model was more accurate in predicting new data points. Bias and accuracy factors were found to be less informative when evaluating models developed for percentage data, since neither of these indices can compare predictions at zero. Model simplification for the logistic model was demonstrated with one data set. The simplified model was as powerful in making predictions as the full linear model, and it also gave clearer insight into the key experimental factors.

Microbial data expressed as percentages have been modeled for many years. Percentage data may have very different biological meanings and expressions. In 1971, Genigeorgis et al. introduced the concept of the probability for one cell to grow and produce toxin, presented as the ratio R_G/R_I, where R_G is the number of cells initiating growth and R_I is the number of cells in the inoculum (14).
In a time-to-turbidity model, Whiting and Oriente (32) described the maximum probability of growth with the parameter P_max, a value obtained by fitting the growth curve with the logistic equation. Chea et al. modeled the extent of spore germination using the plateau value of the germination curve (6). The percent-growth-positive parameter describes the maximum proportion of wells that exhibited growth under various environmental conditions in a study using microplates inoculated with Clostridium botulinum spores (33).

A conventional approach to modeling percentage data is to use linear regression with polynomial terms. This method usually results in moderate R^2 values and can produce predictions <0 or >1 (6, 26, 31, 33). Generally, all predicted negative values are forced to 0, and those >1 are forced to 1. Even without this modification, it is not meaningful to compare such out-of-range predictions. For example, 120% cannot be interpreted as a higher percent germination than 101%.

Logistic regression has been widely used in medical research (1, 5, 18, 19, 22, 30). In the field of predictive food microbiology, logistic models have been developed to describe the bacterial growth/no growth interface (4, 21, 24, 25). In these models, the data were presented in the 0-1 format, as in a typical binomial data set. Genigeorgis et al. first presented the concept of the probability that one cell could grow in a specific environment (14). Later, this probability was modeled in various systems using logistic regression combined with a linear regression of the lag period (3, 11, 12, 15, 16, 20). Roberts et al. used a similar concept and the regression approach to model toxin production by C. botulinum in pasteurized pork slurry (27). Cole et al. modeled the probability of growth of spoilage yeast in a model fruit drink by directly relating the logit of the probability to the environmental factors (7).
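The difference between truncating linear predictions and modeling on the logit scale can be illustrated with a minimal Python sketch. The numeric values and function names below are hypothetical and are not taken from any of the cited studies.

```python
import math

def clip_to_unit_interval(p):
    """Force a linear-model prediction into [0, 1], as described in the text."""
    return min(1.0, max(0.0, p))

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Map any real-valued linear predictor back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical raw predictions from a polynomial linear model (as fractions).
raw_linear = [-0.15, 0.40, 1.01, 1.20]
clipped = [clip_to_unit_interval(p) for p in raw_linear]
# After clipping, 1.01 and 1.20 both become 1.0, so their order is lost.

# Hypothetical linear predictors on the logit scale.
etas = [-4.0, 0.0, 2.5, 6.0]
probs = [inverse_logit(eta) for eta in etas]
# Every value in probs lies strictly between 0 and 1 without any clipping.
```

Because the inverse logit is monotonic, a larger linear predictor always corresponds to a larger predicted probability, which is not true of clipped linear predictions.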
In these studies, probability (a continuous number between 0 and 1) instead of a dichotomous variable (i.e., 0 or 1) was modeled. As pointed out by Ratkowsky and Ross (25), the response modeled by logistic regression at a given combination of limiting factors can either have a value of 0 or 1 or be a probability. Probability, generally expressed by dividing the number of successes by the total number of trials, is simply a summarization of binomial data and thus can be approximated by a logistic generalized linear model (8).

In this study, we compared the goodness of fit of linear regression to that of logistic regression for modeling percentages. We modeled data from our own research and from the literature (including publications from our group) and developed models using both the logistic and linear approaches in exactly the same manner. Five different approaches were used to compare the goodness of fit of the two models. In almost all cases, the logistic models displayed greater accuracy and resulted in less biased predictions.

MATERIALS AND METHODS

Data collection. Four different sets of percentage data were collected from previous experiments (6, 26, 32, 33). Each set had its own unique biological meaning and was collected with a different method. Weight is the degree of emphasis a model puts on an observation. The weight for a percentage data point is the total number of observations associated with this percentage (2). For example, when 10 of 40 tubes turn turbid, the percentage is 25% (10/40) and the weight for this percentage is 40. The assignment of weights was determined differently for each data set, as described below.

Data set I: data for percent-growth-positive were collected by Zhao et al. (33). This data set contained the exact numbers of wells that showed growth and no growth. The total number of wells in each condition was the same, so the weight assigned to each condition was the same.
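The weighting example above (10 of 40 tubes, weight 40) can be checked with a short Python sketch. It assumes the usual binomial log-likelihood without the constant binomial coefficient. The function names and the trial value of p are illustrative only.

```python
import math

def weighted_proportion_loglik(proportion, weight, p):
    """Log-likelihood kernel for a summarized proportion with its weight.

    The constant binomial coefficient is omitted because it does not
    depend on p.
    """
    return weight * (proportion * math.log(p)
                     + (1.0 - proportion) * math.log(1.0 - p))

def bernoulli_loglik(outcomes, p):
    """Sum of log-likelihoods over individual 0/1 outcomes."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in outcomes)

successes, trials = 10, 40
proportion = successes / trials   # 0.25, i.e., 25%
weight = trials                   # 40 observations stand behind this percentage

# The same information written out as 40 individual tubes (1 = turbid).
tubes = [1] * successes + [0] * (trials - successes)

p_trial = 0.3  # arbitrary candidate probability used for the comparison
summarized = weighted_proportion_loglik(proportion, weight, p_trial)
expanded = bernoulli_loglik(tubes, p_trial)
```

The two values agree up to floating-point error, which is why a percentage together with its weight carries the same information for a logistic fit as the underlying 0/1 records.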
For data set I, the environmental factors studied were pH, sodium chloride concentration, and inoculum size in a complete 3 × 3 × 3 factorial design with a total of 27 different conditions.

Data set II: extent of germination data were collected by Chea et al. (6). The total number of spores studied for each condition was between 200 and 300. The small difference in the total number in each condition was considered negligible, and equal weight for all the data points was assumed in logistic regression. Environmental factors studied were pH, sodium chloride concentration, and temperature in a complete 3 × 3 × 3 factorial design with a total of 27 different conditions.

Data set III: Razavilar and Genigeorgis studied the probability of one cell of Listeria monocytogenes growing, as affected by sodium chloride concentration, time, and temperature (26). Weights were not obtainable, so the weight was assumed to be the same in each case.

Data set IV: P_max was the parameter used to indicate the maximum fraction of positive tubes inoculated with C. botulinum (32). It was obtained by fitting the experimental data with a logistic equation. The total number of tubes varied by condition and was used as the weight in logistic regression. Four environmental factors (pH, sodium chloride concentration, temperature, and inoculum size) were studied in a total of 103 different conditions. A subset containing 22 data points at 19°C was not used to develop models; instead, these data points were used later to validate the models developed from the remaining 81 points.

Modeling with linear and logistic regression. Both linear and logistic models were developed in S-plus (MathSoft, Inc., Seattle, Wash.) for an objective comparison. The generalized linear modeling (“glm”) function was used for both methods.
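As an illustration of fitting both model types to the same data with the same terms, the following is a minimal pure-Python sketch, not the S-plus code used in the study. It assumes a single predictor and made-up proportions, and it fits the logistic model by iteratively reweighted least squares, the standard fitting approach for generalized linear models. All names and data values are hypothetical.

```python
import math

def solve_2x2(a11, a12, a22, b1, b2):
    """Solve the symmetric system [[a11, a12], [a12, a22]] x = [b1, b2]."""
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

def weighted_least_squares(x, y, w):
    """Fit y ~ c0 + c1*x by weighted least squares via the normal equations."""
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, x))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    s_wy = sum(wi * yi for wi, yi in zip(w, y))
    s_wxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    return solve_2x2(s_w, s_wx, s_wxx, s_wy, s_wxy)

def inverse_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic(x, prop, n, iterations=25):
    """Fit logit(p) = b0 + b1*x to proportions with trial counts n (IRLS)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]
        mu = [inverse_logit(e) for e in eta]
        # Working weights and working response for the binomial/logit case.
        w = [ni * m * (1.0 - m) for ni, m in zip(n, mu)]
        z = [e + (p - m) / (m * (1.0 - m)) for e, p, m in zip(eta, prop, mu)]
        b0, b1 = weighted_least_squares(x, z, w)
    return b0, b1

# Hypothetical data: fraction of positive tubes at five pH values, 40 tubes each.
x = [4.5, 5.0, 5.5, 6.0, 6.5]
prop = [0.02, 0.10, 0.45, 0.85, 0.98]
n = [40, 40, 40, 40, 40]

# Linear model (gaussian family, identity link) with the same single term.
a0, a1 = weighted_least_squares(x, prop, n)
linear_pred = [a0 + a1 * xi for xi in x]

# Logistic model (binomial family, logit link) with the trial counts as weights.
b0, b1 = fit_logistic(x, prop, n)
logistic_pred = [inverse_logit(b0 + b1 * xi) for xi in x]
```

With these made-up data, the linear fit predicts a value below 0 at the lowest pH and above 1 at the highest, while every logistic prediction stays between 0 and 1.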
The family specified for logistic regression was “binomial” (with its default logit link), and that for linear regression was “gaussian” (with its default identity link). The full models generated by each approach, with the same number of terms in the same format, were used to ensure the validity of the comparison. The linear model with three predictor variables has the following general format: