пятница, 16 августа 2019 г.

Logistic Regression Coefficients

Logistic regression interpretation of coefficients. I was recently asked to interpret coefficient estimates from a logistic regression model. It turns out, I'd forgotten how to. I knew the log odds were involved, but I couldn't find the words to explain it. Part of that has to do with my recent focus on prediction accuracy rather than inference. Still, it's an important concept to understand and this is a good opportunity to refamiliarize myself with it. Logistic regression models are used when the outcome of interest is binary. (There are ways to handle multi-class classification, too.) The predicted values, which are between zero and one, can be interpreted as probabilities for being in the positive class—the one labeled 1 . Logistic Function to Logit. To model the probability when \(y\) is binary—that is, \(p(X) = p(y=1 \mid X)\)—we use the logistic function defined as: where \(t\) is some function of the covariates, \(X\). Let's define \(t\) using matrix notation such that \(t = X\beta\), where \(\beta\) is actually \(\vec \). This can be rewritten as: This is known as the odds . Finally, we can take the log of both sides to get: The left-hand side is known as the log-odds or logit . Before we consider the coefficient estimates, let's take a moment to discuss odds. The odds of an event is the probability of that event divided by its complement: For an event with probability 0.75, the odds are: This means that the event is three times as likely to occur than not. As another example, consider an event with a 50% chance of happening. In this case, the odds are one to one—there is an equal chance of either event happening, which makes sense given the probability. Coefficients. Let's look at an example using Python. For this, we'll load the ccard data set from Statsmodels. (Note: all of the code for this example can be found here.) In this example, we'll use age and income to predict home ownership. The income variable, INCOME , is in 10,000s of dollars. (Note: we also add an intercept term.) Let's fit the model and view the summary output. The estimated coefficients are the log odds. By exponentiating these values, we can calculate the odds, which are easier to interpret. The odds for both age and income are above one, meaning that they are positively associated with home ownership in this small data set. Let's focus on income. We can interpret this as follows. For a $10,000 increase in income—recall that this corresponds to one unit—we expect the odds of home ownership to increase by almost two times (90%), holding everything else constant. Final Thoughts. Interpreting logistic regression coefficients amounts to calculating the odds, which corresponds to the likelihood that event will occur, relative to it not occurring. Special thanks to UCLA's Institute for Digital Research and Education for the excellent post on this topic.

Комментариев нет:

Отправить комментарий