
Logistic Regression with Stata - Chapter 1: Introduction to Logistic Regression with Stata

We will begin our discussion of binomial logistic regression by comparing it to regular ordinary least squares (OLS) regression. Perhaps the most obvious difference between the two is that in OLS regression the dependent variable is continuous, while in binomial logistic regression it is binary and coded as 0 and 1. Because the dependent variable is binary, different assumptions are made in logistic regression than in OLS regression, and we will discuss these assumptions later. Logistic regression is similar to OLS regression in that it is used to determine which predictor variables are statistically significant, diagnostics are used to check that the assumptions are valid, a test statistic is calculated to indicate whether the overall model is statistically significant, and a coefficient and standard error are calculated for each of the predictor variables.

To illustrate the difference between OLS and logistic regression, let's see what happens when data with a binary outcome variable are analyzed using OLS regression. For the examples in this chapter, we will use a set of data collected by the state of California from 1200 high schools measuring academic achievement. Our dependent variable is called hiqual. This variable was created from a continuous variable (api00) using a cut-off point of 745: values of 744 and below were coded as 0 (with a label of "not_high_qual") and values of 745 and above were coded as 1 (with a label of "high_qual"). Our predictor variable will be a continuous variable called avg_ed, which measures the average education (ranging from 1 to 5) of the parents of the students in the participating high schools. After running the regression, we will obtain the fitted values and then graph them against the observed values.
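In Stata, this comparison might be run with commands along the following lines. This is only a sketch: the data file name and the variable names fit_ols and p_logit are placeholders, not the chapter's own.

    use apilog, clear            // assumed file name for the chapter's data set
    regress hiqual avg_ed        // OLS regression with the 0/1 outcome
    predict fit_ols              // linear fitted values; some fall below 0 or above 1
    twoway (scatter hiqual avg_ed) (line fit_ols avg_ed, sort)

    logit hiqual avg_ed          // logistic regression with the same variables
    predict p_logit, pr          // predicted probabilities, always between 0 and 1
    twoway (scatter hiqual avg_ed) (line p_logit avg_ed, sort)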
NOTE: You will notice that although there are 1200 observations in the data set, only 1158 of them are used in the analysis below. Cases with missing values on any variable used in the analysis have been dropped (listwise deletion). We will discuss this issue further later in the chapter.

In the graph of the OLS fitted values, we have plotted the predicted values (called "fitted values" in the legend, the blue line) along with the observed data values (the red dots). Upon inspecting the graph, you will notice some things that do not make sense. First, there are predicted values that are less than zero and others that are greater than +1. Such values are not possible with our outcome variable. Also, the line does a poor job of "fitting" or "describing" the data points.

Now let's try running the same analysis with a logistic regression. As before, we have calculated the predicted probabilities and have graphed them against the observed values. With the logistic regression, we get predicted probabilities that make sense: no predicted probability is less than zero or greater than one. Also, the logistic regression curve does a much better job of "fitting" or "describing" the data points.

Terminology

Now that we have seen an example of a logistic regression analysis, let's spend a little time discussing the vocabulary involved. We will begin by defining the various terms that are frequently encountered, discuss how these terms are related to one another, and see how they are used to explain the results of a logistic regression.

Probability is defined as the quantitative expression of the chance that an event will occur. More formally, it is the number of times the event "occurs" divided by the number of times the event "could occur". For a simple example, consider tossing a coin. On average, you get heads once out of every two tosses; hence, the probability of getting heads is 1/2 or .5.

Next, let's consider the odds. In common parlance, probability and odds are used interchangeably; in statistics, however, they are not the same. The odds of an event happening are defined as the probability that the event occurs divided by the probability that the event does not occur. To continue with our coin-tossing example, the probability of getting heads is .5 and the probability of not getting heads (i.e., getting tails) is also .5. Hence, the odds are .5/.5 = 1. Note that the probability of an event happening and its complement, the probability of the event not happening, must sum to 1. Now let's pretend that we alter the coin so that the probability of getting heads is .6. The probability of not getting heads is then .4, and the odds of getting heads are .6/.4 = 1.5. If we had altered the coin so that the probability of getting heads was .8, then the odds of getting heads would have been .8/.2 = 4. As you can see, when the odds equal one, the probability of the event happening is equal to the probability of the event not happening. When the odds are greater than one, the probability of the event happening is higher than the probability of the event not happening, and when the odds are less than one, the probability of the event happening is less than the probability of the event not happening. Also note that odds can be converted back into a probability: probability = odds / (1 + odds).

Now let's consider an odds ratio. As the name suggests, it is the ratio of two odds. Say we have males and females who want to join a team, and that 75% of the women and 60% of the men make the team. The odds for women are .75/.25 = 3, and the odds for men are .6/.4 = 1.5. The odds ratio is 3/1.5 = 2, meaning that the odds are 2 to 1 that a woman will make the team compared to a man.

Another term that needs some explaining is log odds, also known as the logit. Log odds are the natural logarithm of the odds. The coefficients in the output of a logistic regression are given in units of log odds: they indicate the amount of change expected in the log odds when there is a one-unit change in the predictor variable, with all of the other variables in the model held constant. In a while we will explain why the coefficients are given in log odds. Please be aware that any time a logarithm is discussed in this chapter, we mean the natural log.

To summarize:

probability: the number of times the event occurs divided by the number of times the event could occur (possible values range from 0 to 1)
odds: the probability that an event will occur divided by the probability that the event will not occur: probability(success) / probability(failure)
odds ratio: the ratio of the odds of success for one group divided by the odds of success for the other group: (probability(success)A / probability(failure)A) / (probability(success)B / probability(failure)B)
log odds: the natural log of the odds

The orcalc command (as in odds ratio calculation) can be used to obtain odds ratios. You will have to download the command by typing search orcalc (see How can I use the search command to search for programs and get additional help? for more information about using search). To use this command, simply provide the two probabilities to be used: the probability of success for group 1 is given first, then the probability of success for group 2.
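For example, the coin and team calculations above can be reproduced with Stata's display command; the last line sketches a call to the user-written orcalc command, whose argument order is assumed from the description just given.

    display .5/.5                // odds when the probability is .5:  1
    display .6/.4                // odds when the probability is .6:  1.5
    display .8/.2                // odds when the probability is .8:  4
    display 1.5/(1 + 1.5)        // converting odds of 1.5 back to a probability:  .6
    display (.75/.25)/(.6/.4)    // odds ratio for the team example:  2
    display ln(2)                // the corresponding log odds ratio (natural log):  about .69
    orcalc .75 .6                // probability of success for group 1, then for group 2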
At this point we need to pause for a brief discussion regarding the coding of data. Logistic regression not only assumes that the dependent variable is dichotomous, it also assumes that it is binary, in other words, coded as 0 and 1. These codes must be numeric (i.e., not string), and it is customary for 0 to indicate that the event did not occur and for 1 to indicate that it did. Many statistical packages, including Stata, will not perform logistic regression unless the dependent variable is coded 0 and 1. Specifically, Stata assumes that all non-zero values of the dependent variable are 1. Therefore, if the dependent variable were coded 3 and 4, which would make it dichotomous, Stata would regard all of the values as 1. This is hard-coded into Stata; there are no options to override it. If your dependent variable is coded in any way other than 0 and 1, you will need to recode it before running the logistic regression. (NOTE: SAS assumes that 0 indicates that the event happened; use the descending option on the proc logistic statement to have SAS model the 1's.) By default, Stata predicts the probability of the event happening.

Stata's logit and logistic commands

Stata has two commands for logistic regression, logit and logistic. The main difference between the two is that the former displays the coefficients and the latter displays the odds ratios. You can also obtain the odds ratios by using the logit command with the or option. Which command you use is a matter of personal preference. Below, we discuss the relationship between the coefficients and the odds ratios and show how one can be converted into the other.

However, before we discuss some examples of logistic regression, we need to take a moment to review some basic math regarding logarithms. In this web book, all logarithms are natural logs. If log(a) = b, then exp(b) = a. For example, log(5) = 1.6094379 and exp(1.6094379) = 5, where "exp" indicates exponentiation. This relationship is critical, because it is exactly the relationship between the coefficients and the odds ratios.

We have created some small data sets to help illustrate the relationship between the logit coefficients (given in the output of the logit command) and the odds ratios (given in the output of the logistic command). We will use the tabulate command to see how the data are distributed, and we will also obtain the predicted values and graph them against x, as we would in OLS regression. We use the expand command here for ease of data entry: on each line we enter the x and y values, and for the variable cnt we enter the number of times we want that line repeated in the data set; the expand command then finishes creating the data set. We can check the result by using the list command. If the list command is issued by itself (i.e., with no variables after it), Stata will list all observations for all variables.
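A sketch of how such a data set could be entered and analyzed follows; the cell counts here are made up for illustration (equal counts give a coefficient of 0 and an odds ratio of 1).

    clear
    input x y cnt
    0 0 25
    0 1 25
    1 0 25
    1 1 25
    end
    expand cnt         // repeat each line cnt times
    tabulate y x       // check how the data are distributed
    list               // with no variable list, all observations are shown
    logit y x          // coefficients, in log odds
    logistic y x       // the same model, reported as odds ratios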
In this example, we compared the output from the logit and the logistic commands. Later in this chapter, we will use probabilities to assist with the interpretation of the findings; many people find probabilities easier to understand than odds ratios. You will notice that the information at the top of the two outputs is the same, and the Wald test values (called z) and the p-values are the same, as is the log likelihood. However, the logit command gives the coefficients and their confidence intervals, while the logistic command gives the odds ratios and their confidence intervals. You will also notice that the logistic command does not give any information regarding the constant, because it does not make much sense to talk about a constant with odds ratios. (The constant (_cons) is displayed with the coefficients because you would use both of these values to write out the equation for the logistic regression model.)

Let's start with the output regarding the variable x. The output from the logit command indicates that the coefficient of x is 0. This means that with a one-unit change in x, you would predict a 0-unit change in the log odds of y. To transform the coefficient into an odds ratio, take the exponential of the coefficient: exp(0) = 1, which is the odds ratio. An odds ratio of 1 means that there is no effect of x on y. Looking at the z test statistic, we see that it is not statistically significant, and the confidence interval of the coefficient includes 0. Note that when there is no effect, the confidence interval of the odds ratio will include 1.

Next, let's try an example where the cell counts are not equal. In this example, we see that the coefficient of x is again 0 (1.70e-15 is approximately 0, with rounding error) and hence the odds ratio is 1. Again, we conclude that x has no statistically significant effect on y. However, in this example the constant is not 0. The constant (also called the intercept) is the predicted log odds of y = 1 when all of the variables in the model are held equal to 0; exponentiating it gives the odds of y = 1 when x = 0.

Now let's look at an example where the odds ratio is not 1. Here we see that the odds ratio is 4, or more precisely, 4 to 1. In other words, the odds for the group coded as 1 are four times the odds for the group coded as 0.

A single dichotomous predictor

Let's use again the data from our first example. Our predictor variable will be a dichotomous variable, yr_rnd, indicating whether the school is on a year-round calendar (coded as 1) or not (coded as 0). First, let's tabulate and then graph the variables to get an idea of what the data look like. Because both of our variables are dichotomous, we have used the jitter option so that the points are not plotted exactly on top of one another.
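The commands for this example might look like the following (a sketch, assuming the chapter's data are in memory):

    tabulate hiqual yr_rnd               // both variables are coded 0/1
    scatter hiqual yr_rnd, jitter(5)     // jitter keeps the points from overlapping
    logit hiqual yr_rnd                  // the logistic regression, coefficients in log odds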
Now let's look at the logistic regression output. While we will briefly discuss the output from the logit and logistic commands, please see our Annotated Output pages for a more complete treatment. Let's start at the top of the output. The meaning of the iteration log will be discussed later. Next, you will notice that the overall model is statistically significant (chi-square = 77.60, p < .001). This means that the model that includes yr_rnd fits the data significantly better than the model without it (i.e., a model with only the constant). We will not try to interpret the meaning of the "pseudo R-squared" here, except to say that the emphasis should be on the term "pseudo" and to note that some authors (including Hosmer and Lemeshow, 2000) discount the usefulness of this statistic. The log likelihood of the fitted model is -718.62623. The likelihood is the probability of observing a given set of observations, given the values of the parameters. The number -718.62623 in and of itself does not have much meaning; rather, it is used in calculations that determine whether a reduced model fits significantly worse than the full model and in comparisons to other models.

The coefficient for yr_rnd is -1.78. This indicates that a decrease of 1.78 is expected in the log odds of hiqual with a one-unit increase in yr_rnd, in other words, for schools on a year-round calendar compared to those that are not. This coefficient is also statistically significant, with a Wald test value (z) of -7.30. Because the Wald test is statistically significant, the confidence interval for the coefficient does not include 0.

As before, the coefficient can be converted into an odds ratio by exponentiating it: exp(-1.78) is approximately .1686. You can obtain the odds ratio from Stata either by issuing the logistic command or by using the or option with the logit command. You will notice that the only difference between these two outputs is that the logit command includes an iteration log at the top. Our point here is that you can use more than one method to get this information, and which one you use is up to you. The odds ratio of .1686011 is the factor by which the odds change when there is a one-unit change in yr_rnd. Notice that this is actually a decrease, because odds ratios less than 1 indicate a decrease (you can't have a negative odds ratio). In other words, as you go from a non-year-round school to a year-round school, the odds of being a high quality school become smaller.

In the previous example, we used a dichotomous independent variable. Traditionally, when researchers and data analysts analyze the relationship between two dichotomous variables, they often think of a chi-square test. Let's take a moment to look at the relationship between logistic regression and chi-square. Chi-square is actually a special case of logistic regression. In a chi-square analysis, both variables must be categorical, and neither variable is treated as an independent or dependent variable (that distinction is not made). In logistic regression, while the dependent variable must be dichotomous, the independent variables can be dichotomous or continuous, and logistic regression is not limited to only one independent variable.

A single continuous predictor

Now let's consider a model with a single continuous predictor. For this example we will use the variable avg_ed, a measure of the educational attainment of the parents of the children in the schools that participated in the study. Let's start off by summarizing and graphing this variable.
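A sketch of the commands for this section (again assuming the chapter's data are loaded):

    summarize avg_ed           // average parental education, on a 1-to-5 scale
    histogram avg_ed           // graph the distribution of the predictor
    logit hiqual avg_ed        // coefficient of about 3.91 in the chapter's data
    logistic hiqual avg_ed     // odds ratio of about 49.88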
Looking at the output from the logit command, we see that the LR chi-square is very high and clearly statistically significant. This means that the model we specified, namely avg_ed predicting hiqual, is significantly better than the model with only the constant. The coefficient for avg_ed is 3.91, meaning that we expect an increase of 3.91 in the log odds of hiqual with every one-unit increase in avg_ed. The value of the Wald statistic indicates that the coefficient is significantly different from 0. However, it is not obvious what a 3.91 increase in the log odds of hiqual really means, so let's look at the output from the logistic command. It tells us that the odds ratio is 49.88: this is the factor by which the odds are expected to change when there is a one-unit change in the predictor variable, with all of the other variables in the model held constant.

If we graph hiqual and avg_ed, you see that, like the graphs with the made-up data at the beginning of this chapter, it is not terribly informative. If you tried to draw a straight line through the points as you would in OLS regression, the line would not do a good job of describing the data. One possible solution to this problem is to work with the predicted probabilities of the dependent variable, as we did when we predicted yhat1 in the example at the beginning of this chapter. If we graph the predicted probabilities of hiqual against avg_ed (storing them in a variable we will call yhatc), we see that a line curved somewhat like an S is formed. This s-shaped curve resembles some statistical distributions and can be used to generate a type of regression equation and its statistical tests. To get from the straight line seen in OLS to the s-shaped curve in logistic regression, we need to do some mathematical transformations: the model is written in terms of the log odds, log(p/(1-p)) = b0 + b1*x, which can be solved for the probability, p = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)). Looking at these formulas, it becomes clear why we need to talk about probabilities, natural logs and exponentials when talking about logistic regression.

Both a dichotomous and a continuous predictor

Now let's try an example with both a dichotomous and a continuous independent variable. Interpreting the output from this logistic regression is not much different from the previous ones. The LR chi-square is very high and is statistically significant. This means that the model we specified is significantly better at predicting hiqual than a model without the predictors yr_rnd and avg_ed. The coefficient for yr_rnd is -1.09, meaning that we would expect a 1.09-unit decrease in the log odds of hiqual for every one-unit increase in yr_rnd, holding all other variables in the model constant. The coefficient for avg_ed is 3.86, meaning that we would expect a 3.86-unit increase in the log odds of hiqual with every one-unit increase in avg_ed, with all other variables held constant. Both of these coefficients are significantly different from 0 according to the Wald test.

Tools to assist with interpretation

In OLS regression, the R-squared statistic indicates the proportion of the variability in the dependent variable that is accounted for by the model (i.e., by all of the independent variables in the model). Unfortunately, creating a statistic that provides the same information for a logistic regression model has proved to be very difficult. Many people have tried, but no approach has been widely accepted by researchers or statisticians. The output from the logit and logistic commands gives a statistic called "pseudo R-squared", and the emphasis is on the term "pseudo". This statistic should be used only to give the most general idea of the proportion of variance being accounted for. The fitstat command gives a listing of various pseudo R-squared statistics. You can download fitstat over the internet (see How can I use the search command to search for programs and get additional help? for more information about using search). As you can see from its output, some statistics indicate that the model fit is relatively good, while others indicate that it is not so good. The values are so different because they measure different things. We will not discuss the items in this output; rather, our point is that there is little agreement regarding an R-squared statistic in logistic regression, and that different approaches lead to very different conclusions. If you use an R-squared statistic at all, use it with great care.
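The graphs and models discussed in this section might be produced along these lines; fitstat is user-written and must be installed (for example via search fitstat) before it can be run.

    logit hiqual avg_ed
    predict yhatc, pr                                           // predicted probabilities
    twoway (scatter hiqual avg_ed) (line yhatc avg_ed, sort)    // the s-shaped curve

    logit hiqual yr_rnd avg_ed        // one dichotomous and one continuous predictor
    fitstat                           // several pseudo R-squared measures for this model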
Next, we will describe some tools that can be used to help you better understand the logistic regressions that you have run. These commands are part of an .ado package called spost9_ado (see How can I use the search command to search for programs and get additional help? for more information about using search). (If you are using Stata 8, you will want the spost .ado files for that version.)

The listcoef command gives you the logistic regression coefficients, the z-statistic from the Wald test and its p-value, the odds ratio, the standardized odds ratio, and the standard deviation of x (i.e., of the independent variables). We have included the help option so that the explanation of each column in the output is provided at the bottom. Two particularly useful columns are e^b, which gives the odds ratios, and e^bStdX, which gives the change in the odds for a one standard deviation increase in x (i.e., in yr_rnd and avg_ed).

The prtab command computes a table of predicted values for specified values of the independent variables listed in the model; other independent variables are held constant at their means by default. This command gives the predicted probability of being a high quality school at the different levels of yr_rnd when avg_ed is held constant at its mean. Hence, when yr_rnd = 0 and avg_ed = 2.75, the predicted probability of being a high quality school is 0.1964, and when yr_rnd = 1 and avg_ed = 2.75, it is 0.0759. Clearly, there is a much higher probability of being a high quality school when the school is not on a year-round schedule than when it is. The "x =" line at the bottom of the output gives the means of the x (i.e., independent) variables.

Let's try the prtab command with a continuous variable to get a better understanding of what this command does and why it is useful. First, we need to run a logistic regression with a new variable and calculate the predicted values; then we will graph the predicted values against that variable. The variable that we will use is called meals, and it indicates the percent of students who receive free meals while at school. Although the resulting graph does not look like the classic s-shaped curve, it is another example of a logistic regression curve. It does not look like the curve formed using avg_ed because there is a positive relationship between avg_ed and hiqual, while there is a negative relationship between meals and hiqual. As you can tell, as the percent of free meals increases, the probability of being a high quality school decreases.

Now let's compare this graph to the output of the prtab command. First you will need to set matsize (the matrix size) to 800; this increases the maximum number of variables that Stata can use in model estimation. If you compare the output with the graph, you will see that they are two representations of the same thing: the pair of numbers given in the first row of the prtab output are the coordinates of the left-most point on the graph, and so on. If you try to make this graph using yr_rnd, you will see that it is not very informative: yr_rnd has only two possible values, so there are only two points on the graph. Note that the values in this output are different from those seen previously because the models are different: in this example, we did not include avg_ed as a predictor, so avg_ed is not being held constant at its mean.
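Putting this section's commands together, a sketch might look like the following (it assumes the spost9_ado package is installed; p_meals is a placeholder variable name):

    logit hiqual yr_rnd avg_ed
    listcoef, help                  // coefficients, odds ratios (e^b), e^bStdX, with notes
    prtab yr_rnd                    // predicted probabilities by yr_rnd, avg_ed at its mean

    logit hiqual meals              // meals: percent of students receiving free meals
    predict p_meals, pr
    twoway (scatter hiqual meals) (line p_meals meals, sort)

    set matsize 800                 // enlarge the matrix size before tabulating many values
    prtab meals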
The prchange command computes the change in the predicted probability as you go from a low value of a predictor to a high value. We are going to use avg_ed for this example (its values range from 1 to 5), because going from the low value to the high value of a 0/1 variable is not very interesting. Let's go through this output item by item to see what it is telling us.

The min->max column indicates the amount of change that we should expect in the predicted probability of hiqual as avg_ed changes from its minimum value to its maximum value. The 0->1 column indicates the amount of change that we should expect in the predicted probability of hiqual as avg_ed changes from 0 to 1; for a variable like avg_ed, whose lowest value is 1, this column is not very useful, as it extrapolates outside of the observed range of avg_ed. The -+1/2 column indicates the amount of change that we should expect in the predicted probability of hiqual as avg_ed changes from half a unit below its mean to half a unit above its mean; in other words, it reflects the slope of the logistic function at the mean (look back at the logistic curve graphed above). The -+sd/2 column gives the same information as the previous column, except that the change is half a standard deviation on either side of the mean. The MargEfct column gives the largest possible change in the slope of the function. The Pr(y|x) part of the output gives the probability that hiqual equals zero and the probability that hiqual equals one, given that the predictors are at their mean values. Hence, when avg_ed is at its mean, the probability of being a not-high-quality school is .8225 and the probability of being a high quality school is .1775. The means and standard deviations of the x variable(s) are given at the bottom of the output.

Comparing models

Now that we have a model with two variables in it, we can ask whether it is "better" than a model with just one of the variables in it. To do this, we use the lrtest command, for likelihood-ratio test. To use this command, you first run the model that you want to use as the basis for comparison (the full model) and save its estimates under a name using the est store command. Next, you run the model that you want to compare to your full model, and then issue the lrtest command with the name of the full model. In our example, we will name our full model full_model. The output is a likelihood-ratio test of the null hypothesis that the coefficients of the variable(s) left out of the reduced model are simultaneously equal to 0; in other words, the null hypothesis is that removing the variable(s) has no effect and does not lead to a poorer-fitting model. To demonstrate how this command works, let's compare a model with both avg_ed and yr_rnd (the full model) to a model with only avg_ed in it (a reduced model). The chi-square statistic equals 11.40, which is statistically significant. This means that removing the variable to produce the reduced model resulted in a model that fits significantly worse, and therefore the variable should be kept in the model.
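A sketch of the prchange call and the model comparison described above (the numbers quoted in the text come from the chapter's own runs):

    logit hiqual yr_rnd avg_ed
    prchange avg_ed                    // discrete and marginal changes in the predicted probability

    logit hiqual avg_ed yr_rnd         // the full model
    est store full_model               // save its estimates under a name
    logit hiqual avg_ed if e(sample)   // the reduced model, restricted to the same cases
    lrtest full_model .                // likelihood-ratio test of the nested models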
Now let's take a moment to make a few comments on the code used above. For the second logit (the reduced model), we have added if e(sample), which tells Stata to use only the cases that were included in the first model. If there were missing data on one of the variables dropped from the full model to make the reduced model, the reduced model would otherwise be estimated on more cases. That exactly the same cases are used in both models is important, because lrtest assumes that the same cases are used in each model. The dot (.) at the end of the lrtest command is not strictly necessary, but we have included it to be explicit about what is being tested: Stata refers to the most recent model as "." if you have not specifically named it.

For our final example, imagine that you have a model with many predictors in it. You could run many variations of the model, dropping one variable at a time or groups of variables at a time. Each time you run a model, you would use the est store command to give that model its own name. In a mini-example of this sort, the results suggest that the variables dropped from the full model to create model c should not be dropped (LR chi2(2) = 14.08, p = 0.0009). The results of the second lrtest are similar: the variables should not be dropped. In other words, it seems that the full model is preferable.

We need to remember that a test of nested models assumes that each model is run on the same sample, in other words, on exactly the same observations; the likelihood-ratio test is not valid otherwise. You may not have exactly the same observations in each model if you have missing data on one or more variables. In that case, you might want to run all of the models on only those observations that are available for all models (i.e., on the sample of the model with the smallest number of observations).

A note about sample size

As we have stated several times in this chapter, logistic regression uses maximum likelihood estimation to obtain the estimates of the coefficients. Many of the desirable properties of maximum likelihood appear only as the sample size increases, and the behavior of maximum likelihood with small sample sizes is not well understood. According to Long (1997, pages 53-54), 100 is a minimum sample size, and you want *at least* 10 observations per predictor. This does not mean that if you have only one predictor you need only 10 observations. If you have categorical predictors, you may need more observations to avoid computational difficulties caused by empty cells. More observations are also needed when the dependent variable is very lopsided, in other words, when there are very few 1's and lots of 0's, or vice versa. Finally, chapter 3 of this web book discusses multicollinearity; when it is present, you will need a larger sample size.

Conclusion

We realize that we have covered quite a bit of material in this chapter. Our main goals were to make you aware of 1) the similarities and differences between OLS regression and logistic regression and 2) how to interpret the output from Stata's logit and logistic commands. We have used both dichotomous and continuous independent variables in the logistic regressions that we have run so far. As in OLS regression, categorical variables require special attention, which they will receive in the next chapter.
