пятница, 13 сентября 2019 г.

Simple Guide to Logistic Regression in R

Simple Guide to Logistic Regression in R. Introduction. Every machine learning algorithm works best under a given set of conditions. Making sure your algorithm fits the assumptions / requirements ensures superior performance. You can’t use any algorithm in any condition. For example: Have you ever tried using linear regression on a categorical dependent variable? Don’t even try! Because you won’t be appreciated for getting extremely low values of adjusted R² and F statistic. Instead, in such situations, you should try using algorithms such as Logistic Regression, Decision Trees, SVM, Random Forest etc. To get a quick overview of these algorithms, I’ll recommend Essentials of Machine Learning Algorithms. With this post, I give you useful knowledge on Logistic Regression in R. After you’ve mastered linear regression, this comes as the natural following step in your journey. It’s also easy to learn and implement, but you must know the science behind this algorithm. I’ve tried to explain these concepts in simplest possible manner. Let’s get started. What is Logistic Regression ? Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent binary / categorical outcome, we use dummy variables. You can also think of logistic regression as a special case of linear regression when the outcome variable is categorical, where we are using log of odds as dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function. Derivation of Logistic Regression Equation. Logistic Regression is part of a larger class of algorithms known as Generalized Linear Model (glm). In 1972, Nelder and Wedderburn proposed this model with an effort to provide a means of using linear regression to the problems which were not directly suited for application of linear regression. Infact, they proposed a class of different models (linear regression, ANOVA, Poisson Regression etc) which included logistic regression as a special case. The fundamental equation of generalized linear model is: Here, g() is the link function, E(y) is the expectation of target variable and α + βx1 + γx2 is the linear predictor ( α,β,γ to be predicted). The role of link function is to ‘link’ the expectation of y to linear predictor. GLM does not assume a linear relationship between dependent and independent variables. However, it assumes a linear relationship between link function and independent variables in logit model. The dependent variable need not to be normally distributed. It does not uses OLS (Ordinary Least Square) for parameter estimation. Instead, it uses maximum likelihood estimation (MLE). Errors need to be independent but not normally distributed. Let’s understand it further using an example: We are provided a sample of 1000 customers. We need to predict the probability whether a customer will buy (y) a particular magazine or not. As you can see, we’ve a categorical outcome variable, we’ll use logistic regression. To start with logistic regression, I’ll first write the simple linear regression equation with dependent variable enclosed in a link function: Note: For ease of understanding, I’ve considered ‘Age’ as independent variable. In logistic regression, we are only concerned about the probability of outcome dependent variable ( success or failure). As described above, g() is the link function. This function is established using two things: Probability of Success(p) and Probability of Failure(1-p). p should meet following criteria: It must always be positive (since p >= 0) It must always be less than equals to 1 (since p 0.5 since we are more concerned about success rate. ROC summarizes the predictive power for all possible values of p > 0.5. The area under curve (AUC), referred to as index of accuracy(A) or concordance index, is a perfect performance metric for ROC curve. Higher the area under curve, better the prediction power of the model. Below is a sample ROC curve. The ROC of a perfect predictive model has TP equals 1 and FP equals 0. This curve will touch the top left corner of the graph. Note: For model performance, you can also consider likelihood function. It is called so, because it selects the coefficient values which maximizes the likelihood of explaining the observed data. It indicates goodness of fit as its value approaches one, and a poor fit of the data as its value approaches zero. Logistic Regression Model in R. Considering the availability, I’ve build this model on our practice problem – Dressify data set. You can download it here. Without going deep into feature engineering, here’s the script of simple logistic regression model: This data require lots of cleaning and feature engineering. The scope of this article restricted me to keep the example focused on building logistic regression model. This data is available for practice. I’d recommend you to work on this problem. There’s a lot to learn. By now, you would know the science behind logistic regression. I’ve seen many times that people know the use of this algorithm without actually having knowledge about its core concepts. I’ve tried my best to explain this part in simplest possible manner. The example above only shows the skeleton of using logistic regression in R. Before actually approaching to this stage, you must invest your crucial time in feature engineering. Furthermore, I’d recommend you to work on this problem set. You’d explore things which you might haven’t faced before. Did I miss out on anything important ? Did you find this article helpful? Please share your opinions / thoughts in the comments section below.

Комментариев нет:

Отправить комментарий