пятница, 16 августа 2019 г.

Logistic regression in Excel 1

Just another DEV blog. about software development and all that extremely funny and interesting stuff. Logistic regression in Excel. Yes, that is weird :) If you need to deal with statistics you have to use a special software like Mathlab or Statistica. But if you are limited in your choice and Excel is the only instrument you have, this manual is for you :) Linear regression. First of all, Excel already has the "Regression" add-in which allows you to perform a simple lineral regression analysis: Unfortunately, logistic regression isn't supported by that add-in. But there is always a way to workaround a problem! We're starting our journey from an another add-in which name is "Solver". This add-in allows us to solve different minimization/maximization tasks. Here is an example of a simple maximization problem: we have a furniture factory which produces 2 models of cabinet (A and B). Each model requires a different quantity of resources (wood, time) and generates different income. Also the factory has a limited quantity of available woods and (of course) time. Using simplex-like methods Solver allows us to maximise a target function (the sum of potential income, in current case) by determining the quantity of products for each model that should be produced. The same idea (solving a maximization task) could be used to compute logistic regression. Let's see how it can be done. Logistic regression. Imagine that you are an owner of a company and you have a database of clients. Some of them terminated the contract with your company during the last year and you want to predict which clients are thinking about leaving you right now :) Let's say that you think that decision to stay with your company depends on client's sex and age. Then you can extract the follwing data for the previous year (Trainig Set): Our next step makes a proposal about how the objective function should look like. In common case it looks like $z = \theta^T * x = \theta_1 * x_1 + .. + \theta_n * x_n$. We assume that there is a linear dependency between decision to leave and sex/age of a client: $z = A_0 + A_1 * Age + A_2 * Gender$. In other words, let's just put $\theta_i = A_i$ (theta is a canonical letter for this case, but using it in Excel is a little bit difficult). The calculation below requires some "initial" values for $A_i$ values. For the moment, let's put them eqauls 1 (the values will differ from the ones you see here). After that we have to calculate the logit function ; The main advantage of this function is tending to 0 for x 0. So, we can say that $P(y=1|x)=f(z(x))$ and $P(y=0|x)=1-f(z(x))$. Taking into account that y belongs to , $P(y|x)=f(z(x))^ *(1-f(z(x)))^ $ (Bernoulli distribution). Hence, our task has been reduced to the selection of the theta ($A_i$) parameters of the objective function Z to maximize P probability. $$argmax_\theta \sum P(y|x) = argmax_\theta \sum P(y=y_i|x=x_i)$$ There are two important moments: the Solver can search for a local maximum only, so you have to guess "valid" initial values of $A_i$ variables; the boundary conditions (Y=1,X=1 and X=0,Y=0) must be treated separately (=IF(OR(AND(A5=1,L5=1),AND(A5=0,L5=0)),1,(L5^A5)*(1-L5)^(1-A5))); Finally, when we found $A_i$ values, we can resote the objective function Z = -25 + 0.586549634 * Age - 1.66748138716445 * Gender . Let's check the result on our training set: Where the "Has Terminated" column above is the logit function (ROUND(1/(1+EXP(-T5)),0)) . Now we're sure that our parameters are correct and we can use them to "predict" the future. So, let's do that! Conclusion. As you can see, the way was a little bit tricky. But anyway, we've managed to pass it! :)

Комментариев нет:

Отправить комментарий