Wednesday, September 25, 2019

What’s the Best R-Squared for Logistic Regression? | Statistical Horizons

One of the most frequent questions I get about logistic regression is “How can I tell if my model fits the data?” There are two general approaches to answering this question. One is to get a measure of how well you can predict the dependent variable based on the independent variables. The other is to test whether the model needs to be more complex, specifically, whether it needs additional nonlinearities and interactions to satisfactorily represent the data.

In a later post, I’ll discuss the second approach to model fit, and I’ll explain why I don’t like the Hosmer-Lemeshow goodness-of-fit test. In this post, I’m going to focus on R² measures of predictive power. Along the way, I’m going to retract one of my long-standing recommendations regarding these measures.

Unfortunately, there are many different ways to calculate an R² for logistic regression, and no consensus on which one is best. Mittlbock and Schemper (1996) reviewed 12 different measures; Menard (2000) considered several others. The two measures most often reported in statistical software appear to be one proposed by McFadden (1974) and another that is usually attributed to Cox and Snell (1989), along with its “corrected” version (see below). However, the Cox-Snell R² (both corrected and uncorrected) was actually discussed earlier by Maddala (1983) and by Cragg and Uhler (1970).

Among the statistical packages that I’m familiar with, SAS and Statistica report the Cox-Snell measures. JMP and SYSTAT report both McFadden and Cox-Snell. SPSS reports the Cox-Snell measures for binary logistic regression but McFadden’s measure for multinomial and ordered logit.

For years, I’ve been recommending the Cox and Snell R² over the McFadden R², but I’ve recently concluded that that was a mistake. I now believe that McFadden’s R² is a better choice. However, I’ve also learned about another R² that has good properties, a lot of intuitive appeal, and is easily calculated. At the moment, I like it better than the McFadden R², but I’m not going to make a definite recommendation until I get more experience with it. Here are the details.

Logistic regression is, of course, estimated by maximizing the likelihood function. Let L0 be the value of the likelihood function for a model with no predictors, and let LM be the likelihood for the model being estimated. McFadden’s R² is defined as

    R²_McF = 1 - ln(LM) / ln(L0)

where ln(.) is the natural logarithm. The rationale for this formula is that ln(L0) plays a role analogous to the residual sum of squares in linear regression. Consequently, this formula corresponds to a proportional reduction in “error variance.” It’s sometimes referred to as a “pseudo” R².

The Cox and Snell R² is

    R²_C&S = 1 - (L0 / LM)^(2/n)

where n is the sample size. The rationale for this formula is that, for normal-theory linear regression, it’s an identity. In other words, the usual R² for linear regression depends on the likelihoods for the models with and without predictors by precisely this formula. It’s appropriate, then, to describe this as a “generalized” R² rather than a pseudo R². By contrast, the McFadden R² does not have the OLS R² as a special case. I’ve always found this property of the Cox-Snell R² very attractive, especially because the formula extends naturally to other kinds of regression estimated by maximum likelihood, like negative binomial regression for count data or Weibull regression for survival data.
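For readers who want to see these two formulas in action, here is a minimal sketch (not from the original post) of how both measures can be computed in a Stata do-file from the log-likelihoods that Stata stores after logit or logistic. The variables y, x1, and x2 are hypothetical placeholders; e(ll), e(ll_0), and e(N) are the fitted log-likelihood, the null log-likelihood, and the sample size.

    * Sketch: McFadden and Cox-Snell R2 from stored log-likelihoods.
    * y, x1, x2 are placeholder variable names; substitute your own.
    logit y x1 x2
    display "McFadden R2  = " 1 - e(ll)/e(ll_0)
    display "Cox-Snell R2 = " 1 - exp((2/e(N))*(e(ll_0) - e(ll)))
    * e(r2_p), stored by -logit-, is the same McFadden measure
    display "Stata's stored pseudo-R2 (McFadden) = " e(r2_p)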
It’s well known, however, that the big problem with the Cox-Snell R² is that it has an upper bound that is less than 1.0. Specifically, the upper bound is 1 - L0^(2/n). This can be a lot less than 1.0, and it depends only on p, the marginal proportion of cases with events:

    upper bound = 1 - [p^p (1-p)^(1-p)]²

This has a maximum of .75 when p = .5. By contrast, when p = .9 (or .1), the upper bound is only .48. For those who want an R² that behaves like a linear-model R², this is deeply unsettling.

There is a simple correction, and that is to divide R²_C&S by its upper bound, which produces the R² attributed to Nagelkerke (1991). But this correction is purely ad hoc, and it greatly reduces the theoretical appeal of the original R²_C&S. I also think that the values it typically produces are misleadingly high, especially compared with what you get from a linear probability model. (Some might view this as a feature, however.)

So, with some reluctance, I’ve decided to cross over to the McFadden camp. As Menard (2000) argued, it satisfies almost all of Kvalseth’s (1985) eight criteria for a good R². When the marginal proportion is around .5, the McFadden R² tends to be a little smaller than the uncorrected Cox-Snell R². When the marginal proportion is nearer to 0 or 1, the McFadden R² tends to be larger.

But there’s another R², recently proposed by Tjur (2009), that I’m inclined to prefer over McFadden’s. It has a lot of intuitive appeal, its upper bound is 1.0, and it’s closely related to R² definitions for linear models. It’s also easy to calculate. The definition is very simple: for each of the two categories of the dependent variable, calculate the mean of the predicted probabilities of an event, then take the difference between those two means. That’s it!

The motivation should be clear. If a model makes good predictions, the cases with events should have high predicted values and the cases without events should have low predicted values. Tjur also showed that his R² (which he called the coefficient of discrimination) is equal to the arithmetic mean of two R² formulas based on squared residuals, and equal to the geometric mean of two other R²’s based on squared residuals.

Here’s an example of how to calculate Tjur’s statistic in Stata. I used a well-known data set on labor force participation of 753 married women (Mroz 1987). The dependent variable inlf is coded 1 if a woman was in the labor force, otherwise 0. A logistic regression model was fit with six predictors:

    logistic inlf kidslt6 age educ huswage city exper
    predict yhat if e(sample)
    ttest yhat, by(inlf)

The predict command produces fitted values and stores them in a new variable called yhat. (The if e(sample) restriction prevents predicted values from being calculated for cases that may be excluded from the regression model.) The ttest command is the easiest way to get the difference in the means of the predicted values for the two groups (but you can ignore the p-values).

The mean predicted value for those in the labor force was .680, while the mean predicted value for those not in the labor force was .422. The difference of .258 is the Tjur R². By comparison, the Cox-Snell R² is .248 and the McFadden R² is .208. The corrected Cox-Snell is .332.
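As a rough sketch (not part of the original example), those comparison numbers, the Cox-Snell upper bound, and the Nagelkerke correction can all be obtained from the same fitted model using Stata’s stored results, and summarize gives the Tjur difference in means without running ttest. The scalar and variable names (cs, bound, m1, m0, yhat2) are arbitrary labels introduced here, and the Mroz data are assumed to be in memory.

    * Sketch: Cox-Snell, its upper bound, Nagelkerke, and Tjur R2 for the example above.
    logistic inlf kidslt6 age educ huswage city exper
    scalar cs    = 1 - exp((2/e(N))*(e(ll_0) - e(ll)))
    scalar bound = 1 - exp((2/e(N))*e(ll_0))
    display "Cox-Snell R2 = " cs
    display "upper bound  = " bound
    display "Nagelkerke (corrected Cox-Snell) R2 = " cs/bound
    * Tjur R2: difference in mean predicted probabilities between the two groups
    predict yhat2 if e(sample)
    summarize yhat2 if inlf == 1, meanonly
    scalar m1 = r(mean)
    summarize yhat2 if inlf == 0, meanonly
    scalar m0 = r(mean)
    display "Tjur R2 = " m1 - m0

With the Mroz data, these calculations should reproduce, up to rounding, the .248, .332, and .258 figures quoted above.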
Here’s the SAS equivalent of the Stata ttest example above:

    proc logistic data=my.mroz;
      model inlf(desc) = kidslt6 age educ huswage city exper;
      output out=a pred=yhat;
    proc ttest data=a;
      class inlf;
      var yhat;
    run;

One possible objection to the Tjur R² is that, unlike Cox-Snell and McFadden, it’s not based on the quantity being maximized, namely, the likelihood function.* As a result, it’s possible that adding a variable to the model could reduce the Tjur R². But Kvalseth (1985) argued that it’s actually preferable that R² not be based on a particular estimation method. In that way, it can legitimately be used to compare predictive power for models that generate their predictions using very different methods. For example, one might want to compare predictions based on logistic regression with those based on a classification tree method.

Another potential complaint is that the Tjur R² cannot be easily generalized to ordinal or nominal logistic regression. For McFadden and Cox-Snell, the generalization is straightforward.

If you want to learn more about logistic regression, check out my book Logistic Regression Using SAS: Theory and Application, Second Edition (2012), or try my seminars on Logistic Regression Using SAS or Logistic Regression Using Stata.

* Conjecture: I suspect that the Tjur R² is maximized when logistic regression coefficients are estimated by the linear discriminant function method. I encourage any interested readers to try to prove (or disprove) that. (For background on the relationship between discriminant analysis and logistic regression, see Press and Wilson (1978).)

References:

Cragg, J.G. and R.S. Uhler (1970) “The demand for automobiles.” The Canadian Journal of Economics 3: 386-406.

Cox, D.R. and E.J. Snell (1989) Analysis of Binary Data, Second Edition. Chapman & Hall.

Kvalseth, T.O. (1985) “Cautionary note about R².” The American Statistician 39: 279-285.

McFadden, D. (1974) “Conditional logit analysis of qualitative choice behavior.” Pp. 105-142 in P. Zarembka (ed.), Frontiers in Econometrics. Academic Press.

Nagelkerke, N.J.D. (1991) “A note on a general definition of the coefficient of determination.” Biometrika 78: 691-692.

Maddala, G.S. (1983) Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press.

Menard, S. (2000) “Coefficients of determination for multiple logistic regression analysis.” The American Statistician 54: 17-24.

Mittlbock, M. and M. Schemper (1996) “Explained variation in logistic regression.” Statistics in Medicine 15: 1987-1997.

Mroz, T.A. (1987) “The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions.” Econometrica 55: 765-799.

Press, S.J. and S. Wilson (1978) “Choosing between logistic regression and discriminant analysis.” Journal of the American Statistical Association 73: 699-705.

Tjur, T. (2009) “Coefficients of determination in logistic regression models—A new proposal: The coefficient of discrimination.” The American Statistician 63: 366-372.
