Building A Logistic Regression in Python, Step by Step.

Logistic regression is a machine learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.

Logistic Regression Assumptions.

* Binary logistic regression requires the dependent variable to be binary.
* For a binary regression, factor level 1 of the dependent variable should represent the desired outcome.
* Only the meaningful variables should be included.
* The independent variables should be independent of each other; that is, the model should have little or no multicollinearity.
* The independent variables are linearly related to the log odds.
* Logistic regression requires quite large sample sizes.

Keeping the above assumptions in mind, let's look at our dataset.

Data Exploration.

The dataset comes from the UCI Machine Learning Repository, and it is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y). The dataset can be downloaded from here. It provides the bank customers' information and includes 41,188 records and 21 fields.

Input variables.

1. age (numeric)
2. job: type of job (categorical: "admin", "blue-collar", "entrepreneur", "housemaid", "management", "retired", "self-employed", "services", "student", "technician", "unemployed", "unknown")
3. marital: marital status (categorical: "divorced", "married", "single", "unknown")
4. education (categorical: "basic.4y", "basic.6y", "basic.9y", "high.school", "illiterate", "professional.course", "university.degree", "unknown")
5. default: has credit in default? (categorical: "no", "yes", "unknown")
6. housing: has housing loan? (categorical: "no", "yes", "unknown")
7. loan: has personal loan? (categorical: "no", "yes", "unknown")
8. contact: contact communication type (categorical: "cellular", "telephone")
9. month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")
10. day_of_week: last contact day of the week (categorical: "mon", "tue", "wed", "thu", "fri")
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y="no"). The duration is not known before a call is performed; also, after the end of the call, y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means the client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: "failure", "nonexistent", "success")
16. emp.var.rate: employment variation rate (numeric)
17. cons.price.idx: consumer price index (numeric)
18. cons.conf.idx: consumer confidence index (numeric)
19. euribor3m: euribor 3 month rate (numeric)
20. nr.employed: number of employees (numeric)

Predict variable (desired target).

y: has the client subscribed to a term deposit? (binary: "1" means "yes", "0" means "no")

We explore the data with a few plots:

* Barplot for the dependent variable.
* Check the missing values.
* Customer job distribution.
* Customer marital status distribution.
* Barplot for credit in default.
* Barplot for housing loan.
* Barplot for personal loan.
* Barplot for previous marketing campaign outcome.

Our prediction will be based on the customer's job, marital status, whether he or she has credit in default, whether he or she has a housing loan, whether he or she has a personal loan, and the outcome of the previous marketing campaigns. So, we will drop the variables that we do not need.

Data Preprocessing.

Create dummy variables, that is, variables with only two values, zero and one.
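The dummy-variable step can be sketched with pandas' `get_dummies` on a toy frame; the rows below are hypothetical stand-ins for the bank data, not the real records:

```python
import pandas as pd

# Toy stand-in for the bank data; values are hypothetical.
df = pd.DataFrame({
    "job": ["admin", "retired", "unknown", "student"],
    "marital": ["married", "single", "married", "unknown"],
    "y": [0, 1, 0, 1],
})

cat_vars = ["job", "marital"]

# One 0/1 column per category level: job_admin, job_retired, ...
dummies = pd.get_dummies(df[cat_vars])
df_final = df.drop(columns=cat_vars).join(dummies)

# Drop the dummy columns built from the "unknown" level.
df_final = df_final.drop(
    columns=[c for c in df_final.columns if c.endswith("_unknown")]
)
```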
In logistic regression models, encoding all of the independent variables as dummy variables allows easy interpretation and calculation of the odds ratios, and increases the stability and significance of the coefficients.

Drop the "unknown" columns. Perfect! Exactly what we need for the next steps.

Check the independence between the independent variables. Looks good: there is little correlation among the predictors.

Split the data into training and test sets.
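The independence check and the split can be sketched as follows; this is a minimal example on synthetic features (the names f1-f3 are made up for illustration), not the bank columns:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y = (X["f1"] + rng.normal(size=200) > 0).astype(int)

# Pairwise correlations; off-diagonal values near 0 suggest little
# multicollinearity (seaborn.heatmap(corr) would render the plot).
corr = X.corr()

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
```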

Check that the training data is sufficient. Great! Now we can start building our logistic regression model.

Logistic Regression Model.

Fit logistic regression to the training set. Then predict the test set results and create the confusion matrix. The confusion_matrix() function calculates a confusion matrix and returns the result as an array. The result tells us that we have 9046+229 correct predictions and 912+110 incorrect predictions.

Compute precision, recall, F-measure and support. To quote from scikit-learn: the precision is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label a sample as positive if it is negative. The recall is the ratio tp / (tp + fn), where fn is the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples. The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and its worst score at 0. The F-beta score weights recall more than precision by a factor of beta; beta = 1.0 means recall and precision are equally important. The support is the number of occurrences of each class in y_test.

Interpretation: of the entire test set, 88% of the promoted term deposits were term deposits that the customers liked, and 90% of the customers' preferred term deposits were promoted.

Classifier visualization playground.

The purpose of this section is to visualize the logistic regression classifier's decision boundaries. In order to better visualize the decision boundaries, we'll perform Principal Component Analysis (PCA) on the data to reduce the dimensionality to 2 dimensions. As you can see, PCA has reduced the accuracy of our logistic regression model: by using PCA to reduce the number of dimensions, we have removed information from our data. We will cover PCA in a future post.

The Jupyter notebook used to make this post is available here. I would be pleased to receive feedback or questions on any of the above.
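As a recap, the whole modeling pipeline above can be sketched end to end. Since the bank dataset is not bundled here, this sketch uses synthetic data from make_classification, so its scores will differ from the numbers quoted in the post:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank data.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit logistic regression to the training set.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Confusion matrix: diagonal = correct, off-diagonal = incorrect.
cm = confusion_matrix(y_test, y_pred)

# Precision, recall, F-measure and support per class.
print(classification_report(y_test, y_pred))

# Reduce to 2 dimensions for decision-boundary plots; the discarded
# information usually costs some accuracy.
pca = PCA(n_components=2)
X_train_2d = pca.fit_transform(X_train)
X_test_2d = pca.transform(X_test)
clf_2d = LogisticRegression(max_iter=1000).fit(X_train_2d, y_train)

acc_full = clf.score(X_test, y_test)
acc_2d = clf_2d.score(X_test_2d, y_test)
```

Plotting the 2-D decision boundary itself (e.g. with matplotlib's contourf over a mesh grid) is omitted here for brevity.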