Friday, August 16, 2019

Logistic Regression with TensorFlow - Stats on Steroids!

In this tutorial we will go through the basics of logistic regression and then build a classifier using TensorFlow. Despite its name, logistic regression is actually a classification method. It is the most common statistical method after linear regression and a basic building block of other methods such as neural networks, so whenever we have data with a discrete response variable, logistic regression is a good starting point.

We will start by generating some data with a binary response variable. Then we will define the model and go through the fitting process. Finally, we will build the logistic regression classifier from scratch in TensorFlow and use it to predict the classes of the training dataset.

We start by creating a single feature \(x\) that is distributed uniformly. In order to produce the response variable we construct a simple linear model of the form \(y^* = 2 + 3x + \epsilon\) and then assign the observations with above-average values of \(y^*\) to the first class and the rest to the second class, like a step function. We can now plot the generated data.

The response variable in our case is binary, i.e. it is a Bernoulli variable: it takes the value \(1\) with probability \(p\) and the value \(0\) with probability \(1-p\). Our strategy will be to predict the probability of the value \(1\) and then use some decision rule to classify the data point into the first or the second class. For example, we could say that if the predicted probability of class \(1\) is greater than 50%, we classify the data point in class \(1\); otherwise it is classified in class \(0\).

There are various ways to model the relationship between \(P(y=1|x)\) and the features. For example, we could use the linear probability model, which assumes a linear relationship of the following form: $$P(y=1|x)=\beta_0+\beta_1 x$$ While this model is quite simple and easy to fit, there are some issues. In particular, we cannot interpret the predictions as probabilities, since they can take values greater than \(1\) and less than \(0\), as shown in the figure below.

So instead of assuming a linear relationship, we will use a nonlinear transformation of the linear model that is bounded between zero and one. One such function is the logistic function: $$g(x)=\frac{1}{1+e^{-x}}$$ The predictions are now bounded in \((0,1)\), converging smoothly to \(0\) as the argument goes to minus infinity and to \(1\) as the argument goes to plus infinity. So if \(g(x)>0.5\) we assign the data point to class \(1\); otherwise we assign it to class \(0\).

Question: When does \(g(x)\) become greater than \(0.5\)? Answer: Since \(g(0)=0.5\), the rule \(g(x)>0.5\) translates to \(x>0\). In our case the argument is the linear model \(\beta_0+\beta_1 x\), so the decision rule can be written as: $$\beta_0+\beta_1 x > 0 \;\Rightarrow\; x > -\frac{\beta_0}{\beta_1}$$ If we include an additional feature and repeat the process, the new decision boundary is an affine function of the form \(x_1 = a x_2 + b\). In general, logistic regression provides an affine decision boundary.

Finally, we can express our model in the following way: $$P(y=1|x)=g(\beta_0+\beta_1 x)$$ $$P(y=0|x)=1-g(\beta_0+\beta_1 x)$$ or, equivalently: $$P(y|x)=g(\beta_0+\beta_1 x)^{y}\,\big(1-g(\beta_0+\beta_1 x)\big)^{1-y}$$ The two sketches below illustrate the data-generating process and the logistic decision rule described so far.
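First, a minimal sketch of the data-generating step. The sample size, the range of the uniform feature, the noise scale, and the random seed are illustrative assumptions, since the text does not pin them down:

```python
import numpy as np

np.random.seed(0)  # fixed seed for reproducibility (an assumption; the post does not mention one)

n = 200                                  # illustrative sample size
x = np.random.uniform(0, 1, size=n)      # a single uniformly distributed feature
eps = np.random.normal(0, 1, size=n)     # Gaussian noise
y_star = 2 + 3 * x + eps                 # the latent linear model y* = 2 + 3x + eps

# Observations with above-average y* go to class 1, the rest to class 0.
y = (y_star > y_star.mean()).astype(np.float32)
```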
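Second, a sketch of the logistic function and the 50% decision rule. The coefficients here are hypothetical values chosen only to illustrate the rule:

```python
import numpy as np

def logistic(z):
    # g(z) = 1 / (1 + exp(-z)), bounded in (0, 1) with g(0) = 0.5
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = 2.0, 3.0  # hypothetical coefficients for illustration

x_new = np.array([-2.0, 0.0, 1.0])
p = logistic(b0 + b1 * x_new)   # P(y = 1 | x)
labels = (p > 0.5).astype(int)  # same as checking b0 + b1*x > 0, i.e. x > -b0/b1
print(p, labels)
```

Note that thresholding the probability at 0.5 and checking the sign of the linear part \(\beta_0+\beta_1 x\) give identical labels, which is exactly the affine decision boundary discussed above.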
The learning process.

We will use the maximum likelihood method to estimate the coefficients. The likelihood of the data is their joint density. Assuming our sample is i.i.d., we can write the likelihood function as: $$L(b)=f(y_1|x_1;b)\,f(y_2|x_2;b)\dots f(y_n|x_n;b)=\prod_{i=1}^{n} f(y_i\,|\,x_i;b)$$ Equivalently, we can maximize the loglikelihood: $$l(b)=\log(L(b))=\sum_{i=1}^{n} y_i\log\big(g(\beta_0+\beta_1 x_i)\big)+(1-y_i)\log\big(1-g(\beta_0+\beta_1 x_i)\big)$$ We will use the gradient ascent algorithm. The only difference from the gradient descent algorithm is a '+' in the update step instead of a '-', since this is a maximization problem and the coefficients should move in the same direction as the gradient. You can find a detailed presentation of the gradient descent algorithm here. So the updates are given by this formula: $$\beta_j \leftarrow \beta_j + \alpha\,\frac{\partial l(b)}{\partial \beta_j}$$ where the partial derivatives can be computed by the formulas below: $$\frac{\partial l(b)}{\partial \beta_0}=\sum_{i=1}^{n}\big(y_i-g(\beta_0+\beta_1 x_i)\big), \qquad \frac{\partial l(b)}{\partial \beta_1}=\sum_{i=1}^{n}\big(y_i-g(\beta_0+\beta_1 x_i)\big)\,x_i$$

Implementation.

What is TensorFlow? Quoting the official website: “TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.”

In terms of usage, the high-level idea of TensorFlow is that you first create a graph containing all the operations to be executed, and then you run the graph in a session. No operation is executed outside of a session, and that holds for tensors too. For details you can visit the official API.

Building the graph. Variables and placeholders.
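Here is a sketch of how the graph could be built, assuming the TensorFlow 1.x API (placeholders, variables, sessions) that this post describes, and reusing the `x` and `y` arrays from the data-generation sketch above. Minimizing the negative loglikelihood with gradient descent performs exactly the gradient ascent updates derived earlier; the learning rate and iteration count are illustrative choices:

```python
import numpy as np
import tensorflow as tf  # assuming the TensorFlow 1.x API described above

# Placeholders: entry points of the graph, fed with data when the graph runs.
X = tf.placeholder(tf.float32, shape=[None], name="X")
Y = tf.placeholder(tf.float32, shape=[None], name="Y")

# Variables: the coefficients the optimizer updates during training.
beta0 = tf.Variable(0.0, name="beta0")
beta1 = tf.Variable(0.0, name="beta1")

# The model: P(y = 1 | x) = g(beta0 + beta1 * x)
p = tf.sigmoid(beta0 + beta1 * X)

# Loglikelihood l(b); minimizing -l(b) with gradient descent is equivalent
# to gradient ascent on l(b). (In practice,
# tf.nn.sigmoid_cross_entropy_with_logits is the numerically safer choice.)
log_likelihood = tf.reduce_sum(Y * tf.log(p) + (1 - Y) * tf.log(1 - p))
train_step = tf.train.GradientDescentOptimizer(0.001).minimize(-log_likelihood)

# Nothing above has been executed yet; operations only run inside a session.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):  # illustrative number of iterations
        sess.run(train_step, feed_dict={X: x, Y: y})
    b0_hat, b1_hat = sess.run([beta0, beta1])
    print(b0_hat, b1_hat)
```

The placeholders carry no data of their own; they are filled through `feed_dict` at run time, which is what separates defining the graph from running it in a session.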
