Logistic regression explained. I wrote this article some time back; I hope it is useful.

Introduction to Logistic Regression

Logistic regression is generally used where the dependent variable is binary or dichotomous, meaning it can take only two possible values such as "Yes or No", "Default or No Default", "Living or Dead", "Responder or Non Responder", etc. The independent factors or variables can be categorical or numerical. Please note that even though logistic (logit) regression is frequently used for binary dependent variables (2 classes), it can also be used for categorical dependent variables with more than 2 classes; in that case it is called Multinomial Logistic Regression. Here we will focus on logistic regression with binary dependent variables, as that is the most common use.

Applications of Logistic Regression

Logistic regression is used to predict a binary output, as stated above. For example, if a credit card company is building a model to decide whether to issue a credit card to a customer, it will model whether the customer is going to "Default" or "Not Default" on the card. This is called "Default Propensity Modeling" in banking lingo. Similarly, an ecommerce company sending out costly advertisement or promotional offer mails will want to know whether a particular customer is likely to respond to the offer, in other words whether the customer will be a "Responder" or "Non Responder". This is called "Propensity to Respond Modeling". Using insights generated from the logistic regression output, companies can optimize their business strategies to achieve their business goals, such as minimizing expenses and losses or maximizing return on investment (ROI) in marketing campaigns.

Underlying Algorithm and Assumptions

The underlying algorithm of Maximum Likelihood Estimation (MLE) determines the regression coefficients that make the observed values of the binary dependent variable most probable. The algorithm stops when the convergence criterion is met or the maximum number of iterations is reached. Since the probability of any event lies between 0 and 1 (or 0% and 100%), plotting the probability of the dependent variable against an independent factor produces an 'S'-shaped curve.

Let's take an example: predicting the probability that a candidate gets admission into a school of his or her choice based on the score the candidate receives in the admission test. Since the dependent variable is binary/dichotomous ("Admission" or "No Admission"), we can use a logistic regression model to predict the probability of "Admission". Let's first plot the data and analyse the shape to confirm that it follows an 'S' curve. Since the relationship between the score and the probability of selection is not linear but S-shaped, we can't use a linear model to predict the probability of selection from the score. We need to apply the logit transformation to the dependent variable to make the relationship between the predictor and the dependent variable linear. The logit transformation is defined as follows:

Logit = log(p / (1 - p)) = log(probability of event happening / probability of event not happening) = log(odds)

With this transformation in place, we can use regression to model the probability of a given outcome of the dependent variable.
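The following is a minimal R sketch of this idea, using made-up admission data (the scores and probabilities below are invented purely for illustration and are not taken from the article's dataset):

# Hypothetical admission-test scores and observed admission probabilities
score <- seq(20, 80, by = 5)
p <- 1 / (1 + exp(-0.15 * (score - 50)))   # assumed S-shaped relationship

# Probability vs. score is S-shaped, not linear
plot(score, p, type = "b", ylab = "Probability of admission")

# Logit transformation: log(odds)
logit <- log(p / (1 - p))

# After the transformation, the relationship with score is linear
plot(score, logit, type = "b", ylab = "log(odds) of admission")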
The regression equation that the model estimates is:

log(p / (1 - p)) = b0 + b1*x1 + b2*x2 + e

where b0 is the Y intercept, e is the error in the model, b1 is the coefficient (slope) for independent factor x1, b2 is the coefficient (slope) for independent factor x2, and so on. In the admission example above, the regression equation looks like this:

log(p / (1 - p)) = b0 + b1*Score + e

The model generates the coefficients b0 and b1 that give us the best model in terms of the key metrics we will discuss later.

Tools to Build Logistic Regression

R: the glm() function with family = binomial is frequently used (the logit link is the default for the binomial family).
SAS: PROC LOGISTIC is a dedicated procedure for running logistic regression with several different options.

Key Metrics and Interpretation

Key metrics enable comparison among different models and provide indicators of model performance.

Lorenz Curve and Gini Index: captures the discriminatory power of the model in separating "Good" from "Bad" versus random selection. The Gini Index is the ratio of the area between the model's Lorenz curve and the random-selection diagonal to the total area [A/(A+B)]; it measures how much better the model performs than random selection. The Gini can range between 0% and 100%, and the higher the Gini Index, the better the model (higher separation between good and bad). A Gini of 0% indicates a model no better than random, in other words one with no predictive power; on the other hand, a Gini of 100% indicates a perfect model that separates good from bad with complete accuracy.

Kolmogorov-Smirnov statistic (KS): similar to Gini, KS captures the discriminatory power of the model in separating "Good" from "Bad". It is the maximum separation between the cumulative good rate and the cumulative bad rate. The higher the KS, the better the model. KS values range between 0% and 100%; values greater than 20% are generally considered acceptable for a model.

Receiver Operating Characteristic (ROC) curve / Area Under the Curve (AUC): gauges the model's performance in identifying true positives as opposed to false positives. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), and the area under this curve summarizes the metric. For a useful model the AUC ranges from 50% (random) to 100% (perfect); the higher the AUC, the better the model's predictive power.

Other commonly used metrics include Lift, the Confusion Matrix (actual vs. predicted), the Characteristic Stability Index, % Concordance, etc.

Here is the complete code and output from a logistic regression model built on the famous German Credit data: German Credit Risk | Kaggle.
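As a minimal sketch of the R workflow (not the complete Kaggle code linked above), the snippet below fits a logistic regression with glm(). The file name and the column names 'default', 'duration', and 'amount' are assumptions made for illustration; the actual German Credit file may name them differently:

# Assumed CSV export of the German Credit data; path and columns are placeholders
german <- read.csv("german_credit.csv")

# family = binomial fits a logistic regression (logit is the default link)
model <- glm(default ~ duration + amount, data = german, family = binomial)
summary(model)   # intercept b0 and slopes b1, b2 with significance tests

# Predicted probabilities of default for each customer
p_hat <- predict(model, type = "response")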
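Continuing from the sketch above (still assuming the hypothetical 'default' column and the fitted p_hat), here is one way to compute KS and Gini/AUC by hand so the metric definitions are concrete; in practice, packages such as pROC do this more robustly:

y <- german$default                       # 1 = bad, 0 = good (assumed coding)
ord <- order(p_hat, decreasing = TRUE)    # sort customers by predicted risk
y_sorted <- y[ord]

# Cumulative bad and good rates down the risk-sorted list
cum_bad  <- cumsum(y_sorted) / sum(y_sorted)
cum_good <- cumsum(1 - y_sorted) / sum(1 - y_sorted)

# KS: maximum separation between the two cumulative rates
ks <- max(abs(cum_bad - cum_good))

# AUC via the rank (Mann-Whitney) formulation; Gini = 2*AUC - 1
r <- rank(p_hat)
n_bad  <- sum(y)
n_good <- sum(1 - y)
auc  <- (sum(r[y == 1]) - n_bad * (n_bad + 1) / 2) / (n_bad * n_good)
gini <- 2 * auc - 1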