Logistic regression coefficient interpretation

I am currently reading a paper concerning voting location and voting preference in the 2000 and 2004 elections. In it, there is a chart which displays logistic regression coefficients. From courses years back and a little reading up, I understand logistic regression to be a way of describing the relationship between multiple independent variables and a binary response variable. What I'm confused about is this: given the table below, because the South has a logistic regression coefficient of 0.903, does that mean that 90.3% of Southerners vote Republican? I suspect that, because of the logistic nature of the metric, no such direct correspondence exists. Instead, I assume you can only say that the South, with 0.903, votes Republican more readily than the Mountains/Plains, with a coefficient of 0.506. Given that to be the case, how do I know what is significant and what is not, and is it possible to extrapolate a percentage of Republican votes from a logistic regression coefficient? As a side note, please edit my post if anything is stated incorrectly.

That the author has forced someone as thoughtful as you to ask a question like this is a compelling illustration of why the practice -- still way too common -- of confining the reporting of regression model results to a table like this is so unacceptable. You can, as pointed out, try to transform the logit coefficient into some meaningful indication of the effect being estimated for the predictor in question, but that's cumbersome and doesn't convey information about the precision of the prediction, which is usually pretty important in a logistic regression model (on voting in particular). Also, the use of multiple asterisks to report "levels" of significance reinforces the misconception that p-values are some meaningful index of effect size ("wow -- that one has 3 asterisks!!"); for crying out loud, with N's of 10,000 to 20,000, completely trivial differences will be "significant" at conventional p-value thresholds.

The idea here is that in logistic regression, we predict not the actual probability that, say, a Southerner votes Republican, but a transformed version of it, the "log odds". Instead of the probability $p$, we deal with $\log\big(p/(1-p)\big)$ and find linear regression coefficients for the log odds.

So, for example, let's assume that an urban Northeasterner has probability 0.3 of voting for a Republican. (This would of course be part of the regression; I don't see it reported in this table, although I assume it's in the original paper.) The logistic function $f(z) = 1/(1+e^{-z})$ has inverse $f^{-1}(x) = \log\big(x/(1-x)\big)$, the "log odds" corresponding to $x$. These log odds are what behave linearly; the log odds corresponding to $0.3$ are $\log(0.3/0.7) \approx -0.85$.

So the log odds of an urban Southerner voting Republican are this baseline (what Wikipedia calls the intercept, $\beta_0$) plus the logistic regression coefficient for the South, $0.903$ -- that is, $-0.85 + 0.903 \approx 0.05$. But you want an actual probability, so we need to invert the map $p \mapsto \log\big(p/(1-p)\big)$. That gives $f(0.05) = 1/(1+e^{-0.05}) \approx 0.51$. The actual odds have gone from $0.43$ to $1$, to $1.05$ to $1$; the ratio $1.05/0.43$ is $e^{0.903}$, the exponential of the logistic regression coefficient.

Furthermore, the effects for, say, region of the country and urban/suburban/rural status don't interact. So the log odds of a rural Midwesterner voting Republican, say, are $-0.85 + 0.37 + 0.68 = 0.20$ according to this model; the probability is $f(0.20) = 1/(1+e^{-0.20}) \approx 0.55$.
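For concreteness, here is a minimal Python sketch of the conversion above. The 0.3 baseline probability and the 0.37/0.68 coefficients for Midwest and rural are the hypothetical values from the worked example, not figures taken from the paper:

```python
import math

def logit(p):
    """Log odds corresponding to a probability p."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Probability corresponding to log odds z (the logistic function f)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical baseline: an urban Northeasterner (the reference voter)
# votes Republican with probability 0.3, as assumed above.
beta_0 = logit(0.3)                                  # intercept, about -0.85

# South coefficient from the table quoted in the question.
p_urban_south = inv_logit(beta_0 + 0.903)
print(f"urban Southerner:   {p_urban_south:.2f}")    # about 0.51, not 0.903

# Effects add on the log-odds scale: hypothetical Midwest (0.37) and
# rural (0.68) coefficients from the worked example.
p_rural_midwest = inv_logit(beta_0 + 0.37 + 0.68)
print(f"rural Midwesterner: {p_rural_midwest:.2f}")  # about 0.55
```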
The coefficients in the logistic regression represent the tendency of a given region/demographic to vote Republican, compared to a reference category. A positive coefficient means that region is more likely to vote Republican, and vice versa for a negative coefficient; a larger absolute value means a stronger tendency than a smaller one. The reference categories are "Northeast" and "urban voter", so all the coefficients represent contrasts with this particular voter type. In general, there is also no restriction on the coefficients of a logistic regression to lie in $[0, 1]$, even in absolute value; notice that the Wikipedia article itself has an example of a logistic regression with coefficients of $-5$ and $2$.

Let me just stress the importance of what rolando2 and dmk38 both noted: significance is commonly misread, and there is a high risk of that happening with this tabular presentation of results. Paul Schrodt recently offered a nice description of the issue:

"Researchers find it nearly impossible to adhere to the correct interpretation of the significance test. The p-value tells you only the likelihood that you would get a result under the [usually] completely unrealistic conditions of the null hypothesis. Which is not what you want to know -- you usually want to know the magnitude of the effect of an independent variable, given the data. That's a Bayesian question, not a frequentist question. Instead we see -- constantly -- the p-value interpreted as if it gave the strength of association: this is the ubiquitous Mystical Cult of the Stars and P-Values which permeates our journals.(fn) This is not what the p-value says, nor will it ever. In my experience, this mistake is almost impossible to avoid: even very careful analysts who are fully aware of the problem will often switch modes when verbally discussing their results, even if they've avoided the problem in a written exposition. And let's not even speculate on the thousands of hours and gallons of ink we've expended correcting this in graduate papers."

(fn) The footnote also bears on another issue, mentioned by dmk38: "[the ubiquitous Mystical Cult of the Stars and P-Values] supplanted the earlier -- and equally pervasive -- Cult of the Highest R2, demolished… by King (1986)."
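As a quick illustration of reading the table on the odds scale rather than the probability scale, here is a small Python sketch using the two coefficients quoted in the question; $e^{\beta}$ is the multiplicative change in odds versus the reference voter, and it is this magnitude (ideally with a confidence interval), not the asterisks, that speaks to effect size:

```python
import math

# Coefficients quoted in the question; each is a contrast with the
# reference category (an urban Northeastern voter).
coefs = {"South": 0.903, "Mountains/Plains": 0.506}

for region, beta in coefs.items():
    # exp(beta) is the odds ratio versus the reference: a Southerner's odds
    # of voting Republican are roughly 2.5 times the reference odds.
    print(f"{region}: odds ratio = {math.exp(beta):.2f}")
```

Note that an odds ratio of about 2.47 says nothing by itself about the baseline probability; whether it moves a voter from 0.30 to 0.51 or from 0.01 to about 0.02 depends on the intercept.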