LogisticRegression vs. SGDClassifier

Week 3 of Andrew Ng's ML course on Coursera focuses on classification using the standard sigmoid hypothesis and the log-loss cost function, with multiclass classification covered under the one-vs-rest approach. Here I use the homework data set to learn about the relevant python tools.

Tools Covered:¶

- SGDClassifier for general classification problems (like logistic regression), specifying a loss and penalty
- LogisticRegression for classification using log-loss and specifying a penalty (regularization)
- PolynomialFeatures for automated creation of higher-order feature terms

Logistic Regression¶

Suppose that you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant's scores on the two exams and the admissions decision. Your task is to build a classification model that estimates an applicant's probability of admission based on the scores from those two exams.

Plotting the Data¶

SGDClassifier in sklearn for Classification Problems¶

In the Week #2 homework we used the sklearn class SGDRegressor, which does gradient descent for linear regression using a specified loss (and optional penalty). There is a sister class SGDClassifier which serves the same purpose but for classification problems like logistic regression. The syntax for using this guy is pretty much identical to what we did with SGDRegressor, so that's nice! Recall though that for logistic regression we don't use a least-squares loss, as it is non-convex and doesn't capture the intuition of what we want our penalty to be. Instead a standard loss function, often called "log loss", is used, which has these properties:

- when the true $y$ is 1, the cost increases from 0 to infinity logarithmically as $h$ goes from 1 to 0
- identical but reversed behavior when the true $y$ is 0
- thus zero cost when $h(x) = y$ for either 0 or 1, and some moderate cost when $h(x) = 0.5$ for either label.

For binary (label is either 1 or 0) logistic regression this can be captured by

\begin{equation}
\textrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x)),
\end{equation}

where again the cost function we seek to minimize is

\begin{equation}
J(\theta) = \frac{1}{m} \sum_{i=1}^m \textrm{Cost}(h_\theta(x_i), y_i).
\end{equation}

Recall there were some niceties to be observed in passing data to the fitting objects: for the features it expects a matrix-like object X of the appropriate dimensions, and for the labels it expects a vector-like object y with the same number of rows as X. [A quick but important aside: in class when Andrew writes $X$ for the data matrix he usually means it to include a column of all ones for $\theta_0$ (the intercept term). When using all these lovely sklearn tools it's not necessary to include this column explicitly in your $X$; it will be created for you.]

Remember that a good rule of thumb for the number of passes over the training set is n_iter = np.ceil(10**6 / n) to get convergence. So let's set that parameter and then try fitting. Remember that calling the fit method doesn't return a vector of coefficients; rather, the fit information is stored as attributes of our object.

We probably should have expected similar weights on the two exams, since the data look pretty symmetric around the diagonal y=x. Let's see how well the fitted model does on the training set, i.e. how many of the training set admissions it correctly predicts.
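Before moving on, here is a minimal sketch of the fitting step described above. The file name ex2data1.txt and the column names are my own labels for the homework data, and I standardize the exam scores first (which helps stochastic gradient descent converge; the original notebook may have handled this differently):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Load the homework data (file name and column names assumed)
df = pd.read_csv('ex2data1.txt', names=['exam1', 'exam2', 'admitted'])
X = df[['exam1', 'exam2']].values   # matrix-like features
y = df['admitted'].values           # vector-like labels

# Standardize the exam scores so gradient descent behaves nicely
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Rule of thumb: roughly 10**6 / n passes over the training set
n_passes = int(np.ceil(10**6 / len(X_scaled)))

# loss='log_loss' gives logistic regression; older sklearn versions
# spell it loss='log' and call the iterations parameter n_iter
clf = SGDClassifier(loss='log_loss', max_iter=n_passes)
clf.fit(X_scaled, y)

# fit() doesn't return the coefficients; they live on the object
print(clf.coef_)       # weights theta_1, theta_2 on the two exams
print(clf.intercept_)  # the intercept theta_0
```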
How Well Does the First Order Model Do?¶

Just as with SGDRegressor, predicting new values is simplified by the predict method. But remember that with logistic regression the typical hypothesis form $h_\theta(x)$ returns some value between 0 and 1, which we interpret as the probability that data point $x$ has corresponding label $y$. The "prediction" usually then consists of assigning the label based on whether that probability is greater or less than one half. We can access both these results with predict and a new method predict_proba. Let's numerically count the number of samples where the prediction differs from the actual $y$ value, and use that to compute the success percentage of the hypothesis that our classifier landed on (sketched at the end of this section).

That's doing pretty well, but we'd like to have a clearer picture of where the model is falling down. We can take a look at all the training data points that were incorrectly predicted by copying them to their own dataframe, and then we can visualize them on top of the full dataset on our scatterplot.

Visualizing the Decision Boundary¶

Unsurprisingly the hypothesis has trouble with observations that are near what a human would eyeball as the sensible decision boundary. We probably don't expect to get any improvement from more iterations, because we know that using the standard sigmoid hypothesis for classification, a data matrix X of two features, exam1 and exam2, will give a decision boundary that must be a straight line in that two-dimensional feature space. From inspecting the scatter plot it doesn't look like there exists a straight line in feature space that completely divides our two labels, so we can't expect any miracles from this algorithm.

We can plot the decision boundary of the fit if we like. Remember, for typical logistic regression our hypothesis takes the sigmoid form

\begin{equation}
h_\theta(x) = g(z) = \frac{1}{1 + e^{-z}},
\end{equation}

with $z = \theta^Tx$, and we will predict 1 (0) where the hypothesis $g$ is greater (less) than 0.5. This will occur when the linear model $z$ is greater versus less than 0, respectively. So our decision boundary in feature space is the line $\theta^Tx=0$. The vector of parameters $\theta$ is stored as an attribute of our classifier object (though the constant term $\theta_0$ needs to be accessed separately as the "intercept"). The hypothesis then involves computing $z=\theta_0 + \theta_1x_1 + \theta_2x_2$, and the decision boundary of the hypothesis in feature space is where $z=0$. Let's plot this line on top of everything else (also sketched below).

Improving on the First Order Model With Quadratic Terms¶

By eye I would say permitting some curvature to the decision boundary would do a somewhat better job. To get a curved decision boundary in this exam1 x exam2 plane we need to add some additional features to the hypothesis that are not linear in the exam scores. To simplify, I'll first write a function that does the fitting and then just returns the success rate, so that we can test various models more easily. Let's try it: we'll create two new features which are quadratic in our original exam scores (see the last of the sketches below).

So it's really not doing any better (actually worse) on the metric of predicting training examples. Just as a reminder though, we know the model will always do better at minimizing the cost function the more complex you make it. This means that even though we seem to have a worse success rate with categorizing the training data, with our new quadratic terms the hypothesis will be miscategorizing these points less "confidently".
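Here is a sketch of the prediction check from "How Well Does the First Order Model Do?", continuing from the fitting sketch above (df, X_scaled, y, and clf as defined there):

```python
import matplotlib.pyplot as plt

# Continuing from the earlier fitting sketch: df, X_scaled, y, clf

# Hard 0/1 predictions and the underlying probabilities
predictions = clf.predict(X_scaled)          # label 1 wherever P(y=1|x) > 0.5
probabilities = clf.predict_proba(X_scaled)  # columns: P(y=0|x), P(y=1|x)

# Count the training samples where the prediction differs from the actual y
n_wrong = (predictions != y).sum()
success_rate = 100 * (1 - n_wrong / len(y))
print(f"{n_wrong} misclassified, success rate {success_rate:.1f}%")

# Copy the misclassified points to their own dataframe and overlay them
missed = df[predictions != y]
plt.scatter(df['exam1'], df['exam2'], c=df['admitted'], cmap='bwr', alpha=0.5)
plt.scatter(missed['exam1'], missed['exam2'], facecolors='none',
            edgecolors='k', s=120, label='misclassified')
plt.xlabel('exam1'); plt.ylabel('exam2'); plt.legend(); plt.show()
```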
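Next, a sketch of the decision boundary plot. Because my fitting sketch standardized the features, $\theta^Tx = 0$ lives in the scaled feature space, so I map the boundary back into the original exam-score units before plotting (if you fit on the raw scores this back-transformation isn't needed):

```python
import numpy as np
import matplotlib.pyplot as plt

# Continuing from the sketches above: df, scaler, clf
theta0 = clf.intercept_[0]
theta1, theta2 = clf.coef_[0]

# Boundary in *scaled* coordinates: theta0 + theta1*z1 + theta2*z2 = 0
x1 = np.linspace(df['exam1'].min(), df['exam1'].max(), 100)
z1 = (x1 - scaler.mean_[0]) / scaler.scale_[0]
z2 = -(theta0 + theta1 * z1) / theta2
x2 = scaler.mean_[1] + scaler.scale_[1] * z2   # map z2 back to exam2 units

plt.scatter(df['exam1'], df['exam2'], c=df['admitted'], cmap='bwr', alpha=0.5)
plt.plot(x1, x2, 'k--', label=r'$\theta^T x = 0$')
plt.xlabel('exam1'); plt.ylabel('exam2'); plt.legend(); plt.show()
```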
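Finally, a sketch of the helper-function-plus-quadratic-features idea from the last subsection. fit_and_score is my own convenience wrapper (not an sklearn function), and it reuses df and y from the earlier sketches:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

def fit_and_score(features, y):
    """Scale the given feature matrix, fit SGDClassifier with log loss,
    and return the training-set success rate (fraction correct)."""
    X_scaled = StandardScaler().fit_transform(features)
    n_passes = int(np.ceil(10**6 / len(features)))
    clf = SGDClassifier(loss='log_loss', max_iter=n_passes)
    clf.fit(X_scaled, y)
    return (clf.predict(X_scaled) == y).mean()

# First-order model: just the two exam scores
print(fit_and_score(df[['exam1', 'exam2']].values, y))

# Add the two pure quadratic terms exam1**2 and exam2**2
df['exam1_sq'] = df['exam1'] ** 2
df['exam2_sq'] = df['exam2'] ** 2
print(fit_and_score(df[['exam1', 'exam2', 'exam1_sq', 'exam2_sq']].values, y))
```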
What if we add in the final second order term, which is the cross-term $x_1x_2$? In this case it does make some serious gains.

Automate Higher Order Terms with PolynomialFeatures¶

As you might imagine, if you want to build a much higher order model then creating each feature individually would be tedious (remember, it's not just the pure terms but all the cross or "interaction" terms as well!). A nice tool for automating this is provided in sklearn called PolynomialFeatures. Just like the StandardScaler discussed last week, this helper tool has a transform method. In this case it takes in your existing features and returns new terms which are higher order transformations of those inputs (see the sketch below).

Alternative sklearn Tool LogisticRegression¶

Just like in linear regression last week, there is a more specific class in sklearn that you can use for basic logistic regression. In contrast to SGDClassifier, with this tool the loss is fixed as "log-loss" but you get to specify the "solver", i.e. the algorithm used to do the fitting. Stochastic Gradient Descent is one option for the solver (sag), but the default solver liblinear actually uses something called coordinate descent, which is an algorithm that doesn't involve computing the derivative of the cost function, only its values. The documentation has a neat little table for choosing the right solver for your problem.

This class handles multiclass classification (more than two labels) and offers you the one-versus-rest approach that Andrew discussed in class, as well as a different method called multinomial. Since this homework set doesn't address multiclass we'll leave it at that for now. Bizarrely, the use of the LogisticRegression class without a "penalty" is not permitted, so let's refresh exactly what is meant by penalty and why we use it.

Regularized Logistic Regression¶

As Andrew discussed, if you continue adding features left and right to obtain better and better fits to the training data (as you are guaranteed to do) you are likely overfitting: your model won't capture the underlying relationships in the system and thus won't generalize well to new data points. A very useful way to combat this is to allow your form of hypothesis to be complex, but to fit that hypothesis using regularized regression. With regularization, the loss function includes a penalty term which dampens the influence of the features by preferring small parameter values in the fit. Formally the cost function looks like

\begin{equation}
J_{\textrm{reg}}(\theta) = \frac{1}{2m}\bigg[\sum_{i=1}^m (h_\theta(x_i)-y_i)^2 + \lambda\sum_{j=1}^n\theta_j^2\bigg].
\end{equation}

The two terms in the expression capture two competing goals: to fit the training set well (cost function term), and to find a simple, smooth relationship that will generalize to new data well (penalty term). But note that if your $\lambda$ is too large you will wind up underfitting ($h_\theta(x) \approx \theta_0$). The nature of how you penalize the "largeness" of your $\theta$ values actually matters a decent bit - the two common penalties are $L1$ and $L2$, and this quora response gives a breathtakingly clear explanation of the difference. The default penalty for sklearn's LogisticRegression is $L2$, so let's take a look at how this guy does on the same problem, now that we know what it does (see the final sketch below).
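A quick sketch of PolynomialFeatures on the two exam-score columns (again assuming the df and the fit_and_score helper from the earlier sketches):

```python
from sklearn.preprocessing import PolynomialFeatures

# degree=2 generates x1, x2, x1^2, x1*x2, x2^2 for each row;
# include_bias=False drops the leading column of ones, since sklearn's
# estimators add their own intercept
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['exam1', 'exam2']].values)

print(X_poly.shape)  # (n_samples, 5)
print(fit_and_score(X_poly, y))  # full second-order model, cross-term included
```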
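And a minimal sketch of LogisticRegression on the same problem, using df and y from the earlier sketches. C is the inverse of the regularization strength $\lambda$, and the L2 penalty is the default:

```python
from sklearn.linear_model import LogisticRegression

# penalty='l2' is the default; C is the *inverse* regularization strength,
# so smaller C means a larger lambda.  liblinear was the default solver in
# older sklearn versions (newer ones default to 'lbfgs').
logreg = LogisticRegression(penalty='l2', C=1.0, solver='liblinear')
logreg.fit(df[['exam1', 'exam2']].values, y)

print(logreg.coef_, logreg.intercept_)                  # theta_1, theta_2 and theta_0
print(logreg.score(df[['exam1', 'exam2']].values, y))   # training-set accuracy
```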