Friday, August 16, 2019

Logistic Regression With R

If linear regression serves to predict continuous Y variables, logistic regression is used for binary classification. If we use linear regression to model a dichotomous variable (as Y), the resulting model might not restrict the predicted Ys within 0 and 1. Besides, other assumptions of linear regression, such as normality of errors, may get violated. So instead we model the log odds of the event,

$\ln\left(\frac{P}{1-P}\right)$,

where P is the probability of the event. This equation can be modeled using glm() by setting the family argument to "binomial". But we are more interested in the probability of the event than in its log odds, so the predicted values from the model, i.e. the log odds $z$, can be converted to the probability of the event as $P = \frac{e^{z}}{1+e^{z}}$. This conversion is exactly what the plogis() function computes, as shown below when we build the logit models and predict.

Example Problem

Let's try to predict whether an individual will earn more than $50K, using logistic regression based on demographic variables available in the adult data. In this process, we will:

- Import the data
- Check for class bias
- Create training and test samples
- Compute information value to find out important variables
- Build logit models and predict on test data
- Do model diagnostics

Import data

The first step is to read in the adult data.

Check Class bias

Ideally, the proportion of events and non-events in the Y variable should be approximately the same. So let's first check the proportion of classes in the dependent variable, ABOVE50K. Clearly, there is a class bias, a condition observed when the proportion of events is much smaller than the proportion of non-events. So we must sample the observations in approximately equal proportions to get better models.

Create Training and Test Samples

One way to address the problem of class bias is to draw the 0's and 1's for the trainingData (development sample) in equal proportions. In doing so, we put the rest of the inputData not included in training into testData (validation sample). As a result, the development sample will be smaller than the validation sample, which is okay, because there is a large number of observations (>10K).

Next, it is desirable to compute the information value of the variables to get an idea of how valuable they are in explaining the dependent variable (ABOVE50K).

Create WOE for categorical variables (optional)

Optionally, we can create WOE (weight of evidence) equivalents for all categorical variables. This is only an optional step, and for simplicity it is not run in this analysis.

Compute Information Values

The smbinning::smbinning function converts a continuous variable into a categorical variable using recursive partitioning. We first convert the continuous variables to categorical and then capture the information values for all variables in iv_df.

Build Logit Models and Predict

A quick note about the plogis() function: the glm() procedure with family = "binomial" builds a regression model on the given formula. When we use the predict function on this model, it predicts the log odds of the Y variable. This is not what we ultimately want, because the predicted values may not lie within the 0 to 1 range as expected. So, to convert them into prediction probability scores bound between 0 and 1, we use plogis().
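Below is a minimal, runnable sketch of the steps so far, assuming the adult data is available as a CSV with an ABOVE50K column coded 0/1. The file path, the 70% sampling fraction, and the predictors in the glm() formula (AGE, EDUCATIONNUM, HOURSPERWEEK) are illustrative assumptions; in the full analysis the predictors are chosen from the computed information values.

    library(smbinning)  # for information value; install.packages("smbinning")

    # Import data (the path is an assumption; any copy of the adult data works)
    inputData <- read.csv("adult.csv")

    # Check class bias: counts of 0's and 1's in the response
    table(inputData$ABOVE50K)

    # Create training and test samples: draw 0's and 1's in equal proportion
    input_ones  <- inputData[inputData$ABOVE50K == 1, ]
    input_zeros <- inputData[inputData$ABOVE50K == 0, ]
    set.seed(100)
    n_train   <- floor(0.7 * nrow(input_ones))
    ones_idx  <- sample(1:nrow(input_ones),  n_train)
    zeros_idx <- sample(1:nrow(input_zeros), n_train)  # as many 0's as 1's
    trainingData <- rbind(input_ones[ones_idx, ],  input_zeros[zeros_idx, ])
    testData     <- rbind(input_ones[-ones_idx, ], input_zeros[-zeros_idx, ])

    # Information value of one continuous variable via recursive partitioning
    # (returns a list with $iv when meaningful splits exist)
    ivAge <- smbinning(trainingData, y = "ABOVE50K", x = "AGE")
    ivAge$iv

    # Build the logit model (illustrative predictors) and predict on test data
    logitMod <- glm(ABOVE50K ~ AGE + EDUCATIONNUM + HOURSPERWEEK,
                    data = trainingData, family = binomial(link = "logit"))
    predicted <- plogis(predict(logitMod, testData))  # log odds -> probabilities
    # equivalently: predicted <- predict(logitMod, testData, type = "response")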
Decide on optimal prediction probability cutoff for the model

The default cutoff prediction probability score is 0.5, or the ratio of 1's to 0's in the training data. But sometimes, tuning the probability cutoff can improve the accuracy in both the development and validation samples. The InformationValue::optimalCutoff function provides ways to find the cutoff that improves the prediction of 1's, of 0's, of both 1's and 0's, or that reduces the misclassification error. Let's compute the optimal score that minimizes the misclassification error for the above model.

Model Diagnostics

summary(logitMod) gives the beta coefficients, the standard error, the z value and the p value. If your model has categorical variables with multiple levels, you will find a row entry for each category of that variable, because each individual category is treated as an independent binary variable by glm(). In this case it is okay if a few of the categories in a multi-category variable do not turn out to be significant in the model (i.e., their p value turns out greater than the significance level of 0.05). As in linear regression, we should check for multicollinearity in the model. As seen below, all X variables in the model have VIFs well below 4.

Misclassification Error

Misclassification error is the percentage mismatch of predicted vs. actual values, irrespective of 1's or 0's. The lower the misclassification error, the better the model.

ROC

The receiver operating characteristics (ROC) curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, as the cutoff is lowered, it should mark more of the actual 1's as positives and fewer of the actual 0's as 1's. So the curve should rise steeply, indicating that the TPR (Y axis) increases faster than the FPR (X axis) as the cutoff score decreases. The greater the area under the ROC curve, the better the predictive ability of the model. The above model has an area under the ROC curve of 88.78%, which is pretty good.

Concordance

Ideally, the model-calculated probability scores of all actual positives (the 1's) should be greater than the model-calculated probability scores of all actual negatives (the 0's). Such a model is said to be perfectly concordant, and a highly reliable one. This quality can be measured by concordance and discordance. In simpler words, of all combinations of 1-0 pairs (actuals), concordance is the percentage of pairs whose score for the actual positive is greater than the score for the actual negative. For a perfect model, this is 100%. So, the higher the concordance, the better the quality of the model. The above model, with a concordance of 89.2%, is indeed a good quality model.

Specificity and Sensitivity

Sensitivity (or true positive rate) is the percentage of actual 1's correctly predicted by the model, while specificity is the percentage of actual 0's correctly predicted. Specificity can also be calculated as $1 - \text{False Positive Rate}$. The above numbers are calculated on the validation sample, which was not used to train the model. So, a truth detection rate of 31% on test data is considered good.

Confusion Matrix

Finally, the confusion matrix tabulates the predicted classes against the actual classes at the chosen cutoff.
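Here is a minimal sketch of these diagnostics, reusing logitMod, testData and predicted from the earlier sketch. The calls are based on the interfaces of the InformationValue package (which may need to be installed from the CRAN archive or GitHub) and of car::vif(); treat them as an assumed outline rather than the exact code of the original analysis.

    library(InformationValue)  # optimalCutoff, misClassError, plotROC, ...
    library(car)               # vif() for multicollinearity

    actuals <- testData$ABOVE50K

    # Optimal cutoff that minimizes the misclassification error (the default
    # criterion of optimalCutoff)
    optCutOff <- optimalCutoff(actuals, predicted)[1]

    # Multicollinearity check: VIFs should stay well below 4
    vif(logitMod)

    # Misclassification error at the chosen cutoff (lower is better)
    misClassError(actuals, predicted, threshold = optCutOff)

    # ROC curve; the area under it summarizes predictive ability
    plotROC(actuals, predicted)

    # Concordance: share of 1-0 pairs where the 1 outscores the 0
    Concordance(actuals, predicted)

    # Sensitivity (true positive rate) and specificity on the validation sample
    sensitivity(actuals, predicted, threshold = optCutOff)
    specificity(actuals, predicted, threshold = optCutOff)

    # Confusion matrix of predicted vs. actual classes at the cutoff
    confusionMatrix(actuals, predicted, threshold = optCutOff)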

© 2016-17 Selva Prabhakaran. This work is licensed under the Creative Commons License.