Logistic regression loss function.

I read about two versions of the loss function for logistic regression; which of them is correct, and why?

1. $l(\beta) = \sum\limits_{i=1}^{m}\Big(-y_i\beta^Tx_i+\ln(1+e^{\beta^Tx_i})\Big)$, where $\beta = (w, b)$ and $\beta^Tx=w^Tx +b$.

2. $L(z_i)=\log(1+e^{-z_i})$, where, from my college course, $z_i = y_if(x_i)=y_i(w^Tx_i + b)$.

I know that the first one is a sum over all samples and the second one is for a single sample, but I am more curious about the difference in the form of the two loss functions. Somehow I have a feeling that they are equivalent.

The relationship is as follows: $l(\beta) = \sum_i L(z_i)$.

Define the logistic function as $f(z) = \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}$. It possesses the property that $f(-z) = 1-f(z)$, or in other words,
$$\frac{1}{1+e^{z}} = \frac{e^{-z}}{1+e^{-z}}.$$
If you take the reciprocal of both sides and then take the log, you get
$$\ln(1+e^{z}) = \ln(1+e^{-z}) + z.$$
Subtract $z$ from both sides and you should see this:
$$-z_i + \ln(1+e^{z_i}) = \ln(1+e^{-z_i}) = L(z_i).$$

At the moment I am re-reading this answer and am confused about how I got $-y_i\beta^Tx_i+\ln(1+e^{y_i\beta^Tx_i})$ to be equal to $-y_i\beta^Tx_i+\ln(1+e^{\beta^Tx_i})$. Perhaps there's a typo in the original question. In the case that there wasn't a typo, @ManelMorales appears to be correct to draw attention to the fact that, when $y \in \{-1,1\}$, the probability mass function can be written as $P(Y_i=y_i) = f(y_i\beta^Tx_i)$, due to the property that $f(-z) = 1 - f(z)$. I am re-writing it differently here, because he introduces a new equivocation on the notation $z_i$. The rest follows by taking the negative log-likelihood for each $y$ coding. See his answer below for more details.

The OP mistakenly believes the relationship between these two functions is due to the number of samples (i.e. single vs. all). However, the actual difference is simply how we select our training labels. In binary classification we may assign the labels $y=\pm1$ or $y=0,1$.

As has already been stated, the logistic function $\sigma(z)$ is a good choice since it has the form of a probability, i.e. $\sigma(-z)=1-\sigma(z)$ and $\sigma(z)\in (0,1)$, with limits $0$ and $1$ as $z\rightarrow \mp\infty$. If we pick the labels $y=0,1$ we may assign
$$\mathbb{P}(y=1\mid z) = \sigma(z)=\frac{1}{1+e^{-z}}, \qquad \mathbb{P}(y=0\mid z) = 1-\sigma(z)=\frac{1}{1+e^{z}},$$
which can be written more compactly as $\mathbb{P}(y\mid z) =\sigma(z)^y(1-\sigma(z))^{1-y}$.

As always, it is easier to maximize the log-likelihood; as a loss function, this corresponds to minimizing the negative log-likelihood. For $m$ samples $\{x_i, y_i\}$, with $z_i=\beta^Tx_i=w^Tx_i+b$ (note that here $y_i$ is not folded into $z_i$, unlike in the course notation), after taking the natural logarithm and some simplification we find
$$l(z)=-\log\Big(\prod_i^m\mathbb{P}(y_i\mid z_i)\Big)=-\sum_i^m\log\big(\mathbb{P}(y_i\mid z_i)\big)=\sum_i^m\Big(-y_iz_i+\log(1+e^{z_i})\Big).$$
A full derivation and additional information can be found in this jupyter notebook.

On the other hand, we may instead have used the labels $y=\pm 1$. It is clear then that we can assign
$$\mathbb{P}(y\mid z)=\sigma(yz).$$
It is also clear that $\mathbb{P}(y=0\mid z)=\mathbb{P}(y=-1\mid z)=\sigma(-z)$. Following the same steps as before, we minimize in this case the loss function
$$L(z)=-\log\Big(\prod_j^m\mathbb{P}(y_j\mid z_j)\Big)=-\sum_j^m\log\big(\mathbb{P}(y_j\mid z_j)\big)=\sum_j^m\log(1+e^{-y_jz_j}),$$
where the last step follows after we take the reciprocal, which is induced by the negative sign. While we should not equate these two forms, given that in each form $y$ takes different values, nevertheless the two are equivalent:
$$-y_iz_i+\log(1+e^{z_i}) \equiv \log(1+e^{-y_iz_i}).$$
The case $y_i=1$ is trivial to show. If $y_i \neq 1$, then $y_i=0$ on the left-hand side and $y_i=-1$ on the right-hand side.
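To see the equivalence claimed above numerically, here is a minimal sketch in Python/NumPy (the toy data and variable names are made up for illustration, and the bias $b$ is absorbed into $\beta$; this is not the code from the linked notebook). It evaluates the $y\in\{0,1\}$ loss and the $y\in\{\pm1\}$ loss on the same samples and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: m samples, d features, a fixed parameter vector beta
# (hypothetical values, chosen only to check the identity numerically).
m, d = 100, 3
X = rng.normal(size=(m, d))
beta = rng.normal(size=d)
z = X @ beta                       # z_i = beta^T x_i (bias absorbed into beta)

y01 = rng.integers(0, 2, size=m)   # labels coded as {0, 1}
ypm = 2 * y01 - 1                  # the same labels re-coded as {-1, +1}

# Negative log-likelihood with the {0, 1} coding:
#   sum_i  -y_i z_i + log(1 + exp(z_i))
loss_01 = np.sum(-y01 * z + np.log1p(np.exp(z)))

# Negative log-likelihood with the {-1, +1} coding:
#   sum_i  log(1 + exp(-y_i z_i))
loss_pm = np.sum(np.log1p(np.exp(-ypm * z)))

print(loss_01, loss_pm)            # equal up to floating-point rounding
assert np.isclose(loss_01, loss_pm)
```

For large $|z_i|$ the term `np.log1p(np.exp(z))` can overflow; `np.logaddexp(0, z)` computes the same quantity $\log(1+e^{z})$ in a numerically stable way.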