Friday, August 16, 2019

Logistic Regression using Python (scikit-learn) – Towards Data Science

One of the most amazing things about Python's scikit-learn library is that it has a 4-step modeling pattern that makes it easy to code a machine learning classifier. While this tutorial uses a classifier called Logistic Regression, the coding process applies to other classifiers in sklearn (Decision Tree, K-Nearest Neighbors, etc.). In this tutorial, we use Logistic Regression to predict digit labels based on images. The image above shows a set of training digits (observations) from the MNIST dataset whose category membership is known (labels 0–9). After training a model with logistic regression, it can be used to predict an image label (labels 0–9) given an image.

The first part of this tutorial uses a toy dataset (the digits dataset) to quickly illustrate scikit-learn's 4-step modeling pattern and show the behavior of the logistic regression algorithm. The second part goes over a more realistic dataset (the MNIST dataset) to briefly show how changing a model's default parameters can affect performance (both in training time and accuracy). With that, let's get started. If you get lost, I recommend opening the video above in a separate tab. The code used in this tutorial is available below: MNIST Logistic Regression (second part of tutorial code).

Getting Started (Prerequisites)

If you already have anaconda installed, skip to the next section. I recommend having anaconda installed (either Python 2 or 3 works well for this tutorial) so you won't have any issues importing libraries. You can either download anaconda from the official site and install it on your own, or you can follow one of the anaconda installation tutorials below to set it up on your operating system.

Install Anaconda on Windows: Link.
Install Anaconda on Mac: Link.
Install Anaconda on Ubuntu (Linux): Link.

Logistic Regression on Digits Dataset

Loading the Data (Digits Dataset)

The digits dataset is one of the datasets that comes bundled with scikit-learn and does not require downloading any file from an external website. The code for this whole digits workflow, from loading through measuring accuracy, appears in the consolidated sketch at the end of this section. Once you have the dataset loaded, you can check that there are 1797 images and 1797 labels in the dataset.

Showing the Images and the Labels (Digits Dataset)

This section is really just to show what the images and labels look like. It usually helps to visualize your data to see what you are working with.

Splitting Data into Training and Test Sets (Digits Dataset)

We make training and test sets to make sure that after we train our classification algorithm, it is able to generalize well to new data.

Scikit-learn 4-Step Modeling Pattern (Digits Dataset)

Step 1. Import the model you want to use. In sklearn, all machine learning models are implemented as Python classes.
Step 2. Make an instance of the model.
Step 3. Train the model on the data, storing the information learned from the data. The model is learning the relationship between digits (x_train) and labels (y_train).
Step 4. Predict labels for new data (new images), using the information the model learned during the training process. Predictions can also be made for multiple observations (images) at once, that is, on the entire test set.

Measuring Model Performance (Digits Dataset)

While there are other ways of measuring model performance (precision, recall, F1 score, ROC curve, etc.), we are going to keep this simple and use accuracy as our metric. To do this, we are going to see how the model performs on the new data (the test set). Accuracy is defined as the fraction of correct predictions: correct predictions / total number of data points. Our accuracy was 95.3%.
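Since the code cells did not survive in this copy of the post, here is a minimal end-to-end sketch of the digits workflow described above. The variable names mirror the tutorial's style, but the split parameters (test_size=0.25, random_state=0) and the plotting details are my assumptions, so your exact numbers may differ slightly.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Loading the data: 1797 images (8x8, flattened to 64 features) and 1797 labels
digits = load_digits()
print(digits.data.shape)    # (1797, 64)
print(digits.target.shape)  # (1797,)

# Showing the images and the labels
plt.figure(figsize=(20, 4))
for index, (image, label) in enumerate(zip(digits.data[0:5], digits.target[0:5])):
    plt.subplot(1, 5, index + 1)
    plt.imshow(np.reshape(image, (8, 8)), cmap=plt.cm.gray)
    plt.title('Training: %i' % label, fontsize=20)
plt.show()

# Splitting data into training and test sets (split parameters are an assumption)
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Step 1 was importing LogisticRegression above.
# Step 2: make an instance of the model (default parameters; newer scikit-learn
# versions may emit a convergence warning here)
logisticRegr = LogisticRegression()

# Step 3: train the model on the data
logisticRegr.fit(x_train, y_train)

# Step 4: predict labels for new data
print(logisticRegr.predict(x_test[0].reshape(1, -1)))  # one observation
predictions = logisticRegr.predict(x_test)             # entire test set

# Accuracy = correct predictions / total number of data points
score = logisticRegr.score(x_test, y_test)
print(score)
```

On this split, the score should land close to the 95.3% reported above, though the exact value depends on your scikit-learn version.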
Confusion Matrix (Digits Dataset)

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. In this section, I am just showing two Python packages (Seaborn and Matplotlib) for making confusion matrices more understandable and visually appealing. The plain confusion matrix that scikit-learn prints is not very informative or visually appealing on its own.

Method 1 (Seaborn). As you can see below, this method produces a more understandable and visually readable confusion matrix using seaborn.

Method 2 (Matplotlib). This method is clearly a lot more code. I just wanted to show people how to do it in matplotlib as well.
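The post's confusion-matrix code is also missing from this copy. Below is a minimal sketch of Method 1 (Seaborn), continuing from the y_test, predictions, and score variables of the digits sketch above; the styling choices (figure size, colormap, annotation format) are my own assumptions, not necessarily the original post's.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics

# Plain confusion matrix: rows are actual labels, columns are predicted labels
cm = metrics.confusion_matrix(y_test, predictions)
print(cm)

# Method 1 (Seaborn): render the matrix as an annotated heatmap
plt.figure(figsize=(9, 9))
sns.heatmap(cm, annot=True, fmt='d', linewidths=0.5, square=True, cmap='Blues_r')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.title('Accuracy Score: {0}'.format(score), size=15)
plt.show()
```

The Matplotlib-only version (Method 2) follows the same idea but draws the grid and the per-cell text annotations by hand, which is why it takes a lot more code.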
Logistic Regression (MNIST)

One important point to emphasize is that the digits dataset contained in sklearn is too small to be representative of a real-world machine learning task. We are going to use the MNIST dataset because it is intended for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. One of the things we will notice is that parameter tuning can greatly speed up a machine learning algorithm's training time.

Downloading the Data (MNIST)

The MNIST dataset doesn't come from within scikit-learn, so it has to be downloaded. Once you have the dataset loaded, you can use the commands in the consolidated sketch at the end of this section to see that there are 70000 images and 70000 labels in the dataset.

Splitting Data into Training and Test Sets (MNIST)

The code below splits the data into training and test sets. Setting test_size=1/7.0 makes the training set 60,000 images and the test set 10,000 images.

Showing the Images and Labels (MNIST)

As with the digits dataset, this step just visualizes a few images and their labels to see what we are working with.

Scikit-learn 4-Step Modeling Pattern (MNIST)

One thing I like to mention is the importance of parameter tuning. While it may not have mattered much for the smaller digits dataset, it makes a bigger difference on larger and more complex datasets. While one usually adjusts parameters for the sake of accuracy, in the case below we are adjusting the solver parameter to speed up the fitting of the model.

Step 1. Import the model you want to use. In sklearn, all machine learning models are implemented as Python classes.
Step 2. Make an instance of the model. Please see the documentation if you are curious what changing solver does; essentially, we are changing the optimization algorithm.
Step 3. Train the model on the data, storing the information learned from the data. The model is learning the relationship between x (digits) and y (labels).
Step 4. Predict the labels of new data (new images), using the information the model learned during the training process. Predictions can also be made for multiple observations (images) at once, that is, on the entire test set.

Measuring Model Performance (MNIST)

While there are other ways of measuring model performance (precision, recall, F1 score, ROC curve, etc.), we are going to keep this simple and use accuracy as our metric. To do this, we are going to see how the model performs on the new data (the test set). Accuracy is defined as the fraction of correct predictions: correct predictions / total number of data points. One thing I briefly want to mention is that the default optimization algorithm parameter was solver = liblinear; it took 2893.1 seconds to run, with an accuracy of 91.45%. When I set solver = lbfgs, it took 52.86 seconds to run, with an accuracy of 91.3%. Changing the solver had a minor effect on accuracy, but at least it was a lot faster.

Display Misclassified Images with Predicted Labels (MNIST)

While I could show another confusion matrix, I figured people would rather see misclassified images, on the off chance someone finds it interesting. The steps are getting the misclassified images' indices, then showing the misclassified images and their image labels using matplotlib; both appear in the sketch below.
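Here is a condensed sketch of the whole MNIST pipeline plus the misclassified-image display. In recent scikit-learn versions, fetch_openml('mnist_784') is a common way to download this dataset; whether that matches the loader the original post used is an assumption on my part, as are the variable names and random_state. Timings and exact accuracies will differ on your machine.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Downloading the data: 70000 images and 70000 labels
# (fetch_openml is an assumption; the labels come back as strings '0'-'9')
mnist = fetch_openml('mnist_784', as_frame=False)
print(mnist.data.shape)    # (70000, 784)
print(mnist.target.shape)  # (70000,)

# Splitting: test_size=1/7.0 gives 60,000 training and 10,000 test images
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)

# Steps 1-2: import (above) and make an instance of the model.
# solver='lbfgs' changes the optimization algorithm to speed up fitting;
# you may need to raise max_iter to silence a convergence warning.
logisticRegr = LogisticRegression(solver='lbfgs')

# Step 3: train the model on the data
logisticRegr.fit(train_img, train_lbl)

# Step 4: predict on the entire test set, then measure accuracy
predictions = logisticRegr.predict(test_img)
score = logisticRegr.score(test_img, test_lbl)
print(score)

# Getting the misclassified images' indices
misclassifiedIndexes = []
for index, (label, predict) in enumerate(zip(test_lbl, predictions)):
    if label != predict:
        misclassifiedIndexes.append(index)

# Showing the first few misclassified images with predicted and actual labels
plt.figure(figsize=(20, 4))
for plotIndex, badIndex in enumerate(misclassifiedIndexes[0:5]):
    plt.subplot(1, 5, plotIndex + 1)
    plt.imshow(np.reshape(test_img[badIndex], (28, 28)), cmap=plt.cm.gray)
    plt.title('Predicted: {}, Actual: {}'.format(
        predictions[badIndex], test_lbl[badIndex]), fontsize=15)
plt.show()
```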
Closing Thoughts

The important thing to note here is that making a machine learning model in scikit-learn is not a lot of work. I hope this post helps you with whatever you are working on. My next machine learning tutorial goes over PCA using Python. If you have any questions or thoughts on the tutorial, feel free to reach out in the comments below, through the YouTube video page, or through Twitter!