Logistic regression and categorical covariates. A short post, for my nonlife insurance course, to get back to the interpretation of the output of a regression when there is a categorical covariate.

Consider a dataset with a binary response y and a categorical covariate x, and let us run a logistic regression on it. Here, one modality is taken as the reference (by default, R uses the first level, in alphabetical order). This means that for someone in the reference class, we predict the probability H(β̂0), where H denotes the cumulative distribution function of the logistic distribution, H(t) = exp(t)/(1 + exp(t)). For someone in another class j, we predict the probability H(β̂0 + β̂j), and so on for each modality. Here, if we accept H0: βj = 0 (against H1: βj ≠ 0), it means that modality j cannot be considered as different from the reference.

A natural idea can be to change the reference modality, and to look at the p-values. If we loop over the possible references and, for each fit, simply record whether each p-value exceeds 5% or not, we get a table: each column is obtained with a different modality as the reference, and we see which parameters should be considered as null. The interpretation is the following: the first three modalities are not significantly different from one another, and neither are the last two.

Note that we only have, here, some kind of intuition, since each column comes from a different fit. So, let us run a more formal test. Let us consider the regression without the intercept, with one coefficient per modality, to get a model that is easier to understand. It is then possible to use a Fisher test to test whether some coefficients are equal, or not (more generally, whether some linear constraints are satisfied). Here, we clearly accept the assumption that the first three coefficients are equal, as well as the last two.

What is the next step? Well, if we believe that there are mainly two categories, let us create that factor, merging the first three modalities into one class and the last two into another. Here, all the categories are significant. So we do have a proper model.
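Since the original R code did not survive here, the fit and the predictions can be sketched in Python instead. Everything below is made up for illustration: the level names A–E, the 100 observations per level, and the success counts are a hypothetical stand-in for the post's dataset; the fit itself is plain Newton-Raphson rather than R's glm.

```python
import numpy as np

def fit_logistic(X, y, n_iter=30):
    """Logistic regression by Newton-Raphson; returns (beta_hat, cov_hat)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # H(x'beta), the logistic cdf
        F = X.T @ ((p * (1 - p))[:, None] * X)     # Fisher information
        beta = beta + np.linalg.solve(F, X.T @ (y - p))
    return beta, np.linalg.inv(F)

# Hypothetical data: 100 observations per modality, with these success counts
levels, ones = ["A", "B", "C", "D", "E"], [20, 25, 22, 60, 65]
x = np.repeat(levels, 100)
y = np.concatenate([np.r_[np.ones(k), np.zeros(100 - k)] for k in ones])

# Reference (dummy) coding: intercept + one dummy per non-reference level,
# with "A" as the reference modality
X = np.column_stack([np.ones(len(x))] + [(x == lv).astype(float) for lv in levels[1:]])
beta, cov = fit_logistic(X, y)

H_cdf = lambda t: 1.0 / (1.0 + np.exp(-t))
print("P(Y=1 | A) =", H_cdf(beta[0]))                        # reference: H(beta0)
for j, lv in enumerate(levels[1:], start=1):
    print(f"P(Y=1 | {lv}) =", H_cdf(beta[0] + beta[j]))      # H(beta0 + beta_j)
```

With a single categorical covariate this model is saturated, so the predicted probabilities simply reproduce the empirical frequencies of each class (0.20, 0.25, 0.22, 0.60, 0.65 here).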
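The loop over reference modalities can be sketched the same way, still on the hypothetical A–E data above: one refit per reference, Wald p-values for each dummy coefficient, and a table recording whether each p-value exceeds 5%.

```python
import numpy as np
from math import erf, sqrt

def fit_logistic(X, y, n_iter=30):
    """Logistic regression by Newton-Raphson; returns (beta_hat, cov_hat)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        F = X.T @ ((p * (1 - p))[:, None] * X)
        beta = beta + np.linalg.solve(F, X.T @ (y - p))
    return beta, np.linalg.inv(F)

# Same hypothetical data as before (made-up levels and counts)
levels, ones = ["A", "B", "C", "D", "E"], [20, 25, 22, 60, 65]
x = np.repeat(levels, 100)
y = np.concatenate([np.r_[np.ones(k), np.zeros(100 - k)] for k in ones])

# One fit per choice of reference; True means "p-value > 5%", i.e. the
# modality cannot be considered as different from the reference
not_signif = {}
for ref in levels:
    others = [lv for lv in levels if lv != ref]
    X = np.column_stack([np.ones(len(x))] + [(x == lv).astype(float) for lv in others])
    beta, cov = fit_logistic(X, y)
    z = beta[1:] / np.sqrt(np.diag(cov)[1:])                       # Wald statistics
    pval = np.array([2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2)))) for t in z])
    not_signif[ref] = dict(zip(others, pval > 0.05))

for ref in levels:
    print(ref, not_signif[ref])
```

On these made-up frequencies the table shows the pattern described in the post: A, B, C are mutually indistinguishable, so are D and E, and every cross pair is significant.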
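The formal test of linear constraints (what the post does with a Fisher test in R, e.g. via car's linearHypothesis) can be approximated by a Wald chi-square test on the no-intercept model, again on the hypothetical A–E data: the statistic (Rβ̂)ᵀ(R V̂ Rᵀ)⁻¹(Rβ̂) is compared to the 5% critical value of a chi-square with as many degrees of freedom as constraints (7.815 for 3 df).

```python
import numpy as np

def fit_logistic(X, y, n_iter=30):
    """Logistic regression by Newton-Raphson; returns (beta_hat, cov_hat)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        F = X.T @ ((p * (1 - p))[:, None] * X)
        beta = beta + np.linalg.solve(F, X.T @ (y - p))
    return beta, np.linalg.inv(F)

# Same hypothetical data as before (made-up levels and counts)
levels, ones = ["A", "B", "C", "D", "E"], [20, 25, 22, 60, 65]
x = np.repeat(levels, 100)
y = np.concatenate([np.r_[np.ones(k), np.zeros(100 - k)] for k in ones])

# Model without the intercept: one coefficient per modality
X = np.column_stack([(x == lv).astype(float) for lv in levels])
beta, cov = fit_logistic(X, y)

# H0 (R beta = 0): beta_A = beta_B = beta_C and beta_D = beta_E
R = np.array([[1., -1.,  0., 0.,  0.],
              [0.,  1., -1., 0.,  0.],
              [0.,  0.,  0., 1., -1.]])
r = R @ beta
wald = r @ np.linalg.solve(R @ cov @ R.T, r)    # ~ chi2(3) under H0
print("Wald statistic:", wald, "(5% critical value with 3 df: 7.815)")
```

Here the statistic is small, so the assumption that the first three coefficients are equal, as well as the last two, is clearly accepted; a false constraint such as beta_A = beta_D would instead give a statistic far above the critical value.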
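Finally, the two-category factor can be sketched by merging the levels of the hypothetical data (the group names "low" and "high" are made up) and refitting without the intercept; both coefficients should then be clearly significant.

```python
import numpy as np

def fit_logistic(X, y, n_iter=30):
    """Logistic regression by Newton-Raphson; returns (beta_hat, cov_hat)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        F = X.T @ ((p * (1 - p))[:, None] * X)
        beta = beta + np.linalg.solve(F, X.T @ (y - p))
    return beta, np.linalg.inv(F)

# Same hypothetical data as before (made-up levels and counts)
levels, ones = ["A", "B", "C", "D", "E"], [20, 25, 22, 60, 65]
x = np.repeat(levels, 100)
y = np.concatenate([np.r_[np.ones(k), np.zeros(100 - k)] for k in ones])

# Merge the first three modalities and the last two into a new factor
group = np.where(np.isin(x, ["A", "B", "C"]), "low", "high")
X2 = np.column_stack([(group == g).astype(float) for g in ["low", "high"]])
beta2, cov2 = fit_logistic(X2, y)

z2 = beta2 / np.sqrt(np.diag(cov2))             # Wald statistics per category
print(dict(zip(["low", "high"], z2)))
```

Both |z| values are far above 1.96, so in this two-level model all the categories are significant.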