Friday, September 13, 2019

Simple logistic regression - Handbook of Biological Statistics

John H. McDonald

Use simple logistic regression when you have one nominal variable and one measurement variable, and you want to know whether variation in the measurement variable causes variation in the nominal variable.

When to use it

Use simple logistic regression when you have one nominal variable with two values (male/female, dead/alive, etc.) and one measurement variable. The nominal variable is the dependent variable, and the measurement variable is the independent variable.

I'm separating simple logistic regression, with only one independent variable, from multiple logistic regression, which has more than one independent variable. Many people lump all logistic regression together, but I think it's useful to treat simple logistic regression separately, because it's simpler.

Simple logistic regression is analogous to linear regression, except that the dependent variable is nominal, not a measurement. One goal is to see whether the probability of getting a particular value of the nominal variable is associated with the measurement variable; the other goal is to predict the probability of getting a particular value of the nominal variable, given the measurement variable.

As an example of simple logistic regression, Suzuki et al. (2006) measured sand grain size on 28 beaches in Japan and observed the presence or absence of the burrowing wolf spider Lycosa ishikariana on each beach. Sand grain size is a measurement variable, and spider presence or absence is a nominal variable. Spider presence or absence is the dependent variable; if there is a relationship between the two variables, it would be sand grain size affecting spiders, not the presence of spiders affecting the sand.
One goal of this study would be to determine whether there was a relationship between sand grain size and the presence or absence of the species, in hopes of understanding more about the biology of the spiders. Because this species is endangered, another goal would be to find an equation that would predict the probability of a wolf spider population surviving on a beach with a particular sand grain size, to help determine which beaches to reintroduce the spider to.

You can also analyze data with one nominal and one measurement variable using a one-way anova or a Student's t-test, and the distinction can be subtle. One clue is that logistic regression allows you to predict the probability of the nominal variable. For example, imagine that you had measured the cholesterol level in the blood of a large number of 55-year-old women, then followed up ten years later to see who had had a heart attack. You could do a two-sample t-test, comparing the cholesterol levels of the women who did have heart attacks vs. those who didn't, and that would be a perfectly reasonable way to test the null hypothesis that cholesterol level is not associated with heart attacks; if the hypothesis test was all you were interested in, the t-test would probably be better than the less-familiar logistic regression. However, if you wanted to predict the probability that a 55-year-old woman with a particular cholesterol level would have a heart attack in the next ten years, so that doctors could tell their patients "If you reduce your cholesterol by 40 points, you'll reduce your risk of heart attack by X%," you would have to use logistic regression.

Another situation that calls for logistic regression, rather than an anova or t-test, is when you determine the values of the measurement variable, while the values of the nominal variable are free to vary. For example, let's say you are studying the effect of incubation temperature on sex determination in Komodo dragons.
You raise 10 eggs at 30°C, 30 eggs at 32°C, 12 eggs at 34°C, etc., then determine the sex of the hatchlings. It would be silly to compare the mean incubation temperatures between male and female hatchlings, and test the difference using an anova or t-test, because the incubation temperature does not depend on the sex of the offspring; you've set the incubation temperature, and if there is a relationship, it's that the sex of the offspring depends on the temperature.

When there are multiple observations of the nominal variable for each value of the measurement variable, as in the Komodo dragon example, you'll often see the data analyzed using linear regression, with the proportions treated as a second measurement variable. Often the proportions are arc-sine transformed, because that makes the distributions of proportions more normal. This is not horrible, but it's not strictly correct. One problem is that linear regression treats all of the proportions equally, even if they are based on very different sample sizes. If 6 out of 10 Komodo dragon eggs raised at 30°C were female, and 15 out of 30 eggs raised at 32°C were female, the 60% female at 30°C and 50% at 32°C would get equal weight in a linear regression, which is inappropriate. Logistic regression analyzes each observation (in this example, the sex of each Komodo dragon) separately, so the 30 dragons at 32°C would have 3 times the weight of the 10 dragons at 30°C.

While logistic regression with two values of the nominal variable (binary logistic regression) is by far the most common, you can also do logistic regression with more than two values of the nominal variable, called multinomial logistic regression. I'm not going to cover it here at all. Sorry.

You can also do simple logistic regression with nominal variables for both the independent and dependent variables, but to be honest, I don't understand the advantage of this over a chi-squared or G-test of independence.

Null hypothesis
The statistical null hypothesis is that the probability of a particular value of the nominal variable is not associated with the value of the measurement variable; in other words, the line describing the relationship between the measurement variable and the probability of the nominal variable has a slope of zero.

How the test works

Simple logistic regression finds the equation that best predicts the value of the Y variable for each value of the X variable. What makes logistic regression different from linear regression is that you do not measure the Y variable directly; it is instead the probability of obtaining a particular value of a nominal variable. For the spider example, the values of the nominal variable are "spiders present" and "spiders absent." The Y variable used in logistic regression would then be the probability of spiders being present on a beach. This probability could take values from 0 to 1.

The limited range of this probability would present problems if used directly in a regression, so the odds, Y/(1−Y), is used instead. (If the probability of spiders on a beach is 0.25, the odds of having spiders are 0.25/(1−0.25) = 1/3. In gambling terms, this would be expressed as "3 to 1 odds against having spiders on a beach.") Taking the natural log of the odds makes the variable more suitable for a regression, so the result of a logistic regression is an equation that looks like this:

$\ln\left[\frac{Y}{1-Y}\right] = a + bX$

You find the slope (b) and intercept (a) of the best-fitting equation in a logistic regression using the maximum-likelihood method, rather than the least-squares method you use for linear regression. Maximum likelihood is a computer-intensive technique; the basic idea is that it finds the values of the parameters under which you would be most likely to get the observed results.

For the spider example, the equation is

$\ln\left[\frac{Y}{1-Y}\right] = -1.6476 + 5.1215(\text{grain size})$

Rearranging to solve for Y (the probability of spiders on a beach) yields
$Y = \frac{e^{-1.6476 + 5.1215(\text{grain size})}}{1 + e^{-1.6476 + 5.1215(\text{grain size})}}$

where e is the base of natural logs. So if you went to a beach and wanted to predict the probability that spiders would live there, you could measure the sand grain size, plug it into the equation, and get an estimate of Y, the probability of spiders being on the beach.

There are several different ways of estimating the P value. The Wald chi-square is fairly popular, but it may yield inaccurate results with small sample sizes. The likelihood ratio method may be better. It compares the probability of obtaining the observed results under the logistic model with the probability of obtaining the observed results in a model with no relationship between the independent and dependent variables. I recommend you use the likelihood-ratio method; be sure to specify which method you've used when you report your results.

For the spider example, the P value using the likelihood ratio method is 0.033, so you would reject the null hypothesis. The P value for the Wald method is 0.088, which is not quite significant.

Assumptions

Simple logistic regression assumes that the observations are independent; in other words, that one observation does not affect another. In the Komodo dragon example, if all the eggs at 30°C were laid by one mother, and all the eggs at 32°C were laid by a different mother, that would make the observations non-independent. If you design your experiment well, you won't have a problem with this assumption.

Simple logistic regression assumes that the relationship between the natural log of the odds and the measurement variable is linear. If the relationship is not linear, you might be able to fix it with a transformation of your measurement variable, but if the relationship looks like a U or upside-down U, a transformation won't work. For example, Suzuki et al.
(2006) found an increasing probability of spiders with increasing grain size, but I'm sure that if they looked at beaches with even larger sand (in other words, gravel), the probability of spiders would go back down. In that case you couldn't do simple logistic regression; you'd probably want to do multiple logistic regression with an equation including both $X$ and $X^2$ terms, instead.

Simple logistic regression does not assume that the measurement variable is normally distributed.

McDonald (1985) counted allele frequencies at the mannose-6-phosphate isomerase (Mpi) locus in the amphipod crustacean Megalorchestia californiana, which lives on sandy beaches of the Pacific coast of North America. There were two common alleles, Mpi⁹⁰ and Mpi¹⁰⁰. The latitude of each collection location, the count of each of the alleles, and the proportion of the Mpi¹⁰⁰ allele, are shown here:
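
Returning to the spider example from the section on how the test works, the conversions between probability, odds, and log odds can be checked with a few lines of Python. This is only a sketch. The intercept and slope are the values quoted in the text; the function names and the grain sizes in the loop are made up for illustration and are not data from Suzuki et al. (2006).

```python
import math

# Intercept (a) and slope (b) quoted in the text for the spider example.
INTERCEPT = -1.6476
SLOPE = 5.1215


def odds(p):
    """Convert a probability p (0 <= p < 1) to odds, p / (1 - p)."""
    return p / (1.0 - p)


def log_odds(p):
    """Natural log of the odds; this is the left side of ln[Y/(1-Y)] = a + bX."""
    return math.log(odds(p))


def predicted_probability(grain_size, a=INTERCEPT, b=SLOPE):
    """Solve ln[Y/(1-Y)] = a + bX for Y: Y = e^(a+bX) / (1 + e^(a+bX))."""
    z = a + b * grain_size
    return math.exp(z) / (1.0 + math.exp(z))


if __name__ == "__main__":
    # The worked example from the text: probability 0.25 gives odds of 1/3.
    print(f"odds at p = 0.25: {odds(0.25):.4f}")

    # Hypothetical grain sizes, chosen only to show the shape of the curve.
    for size in (0.2, 0.5, 0.8):
        p = predicted_probability(size)
        print(f"grain size {size}: predicted probability {p:.3f}")
```

Because the slope is positive, the predicted probability rises with grain size, and it passes through 0.5 where a + bX = 0, that is, at a grain size of 1.6476/5.1215.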
