Logistic Regression (with Gradient Ascent)
Most of this information (especially notation) comes from Stanford’s CS229 lecture notes and the rest comes from Bishop’s “Pattern Recognition and Machine Learning”. If I skim over the explanation in places, such as maximum likelihood, you can refer to either of those resources.
At its core, logistic regression is a binary classifier. In the past, you’ve most likely seen regression models in the simplest case, where $h(x) = \beta^Tx + \beta_0$ and the model output is a real number. In classification problems, however, we wish to predict discrete class labels. More precisely, the model outputs the probability of a categorical outcome (a value in $[0,1]$). To do this, we use a generalization of the model in which we transform the linear function of $\beta$ with a non-linear function $f(\cdot)$. In machine learning, this is called an activation function.
You can think of logistic regression as predicting which of two categories an example belongs to, say blue/green or 0/1, but it won’t be used to output a continuous value such as a house price of $1.31M.
As stated above, logistic regression outputs the probability of a categorical variable, which means the output is bounded within $[0,1]$. Why is it bounded? Because you cannot have a -3.20% chance of success or a 140% chance of failure.
How do we force our outputs to be within that 0 to 1 range? With the sigmoid function :)
Sigmoid function: $g(z) = \frac{1}{1+e^{-z}}$, with the curve:
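Since the implementation below already uses NumPy and Matplotlib, here is a minimal sketch that plots this curve (the plotting range of $[-10, 10]$ is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the sigmoid g(z) = 1 / (1 + e^{-z}) over an arbitrary range
z = np.linspace(-10, 10, 200)
g = 1 / (1 + np.exp(-z))

plt.plot(z, g)
plt.xlabel("z")
plt.ylabel("g(z)")
plt.title("Sigmoid function")
plt.show()
```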

Technically, our hypothesis equation should then be:
\[h_\theta(x) = g(\theta^Tx) = \frac{1}{1+e^{-\theta^Tx}}\] \[\text{where } g(z) = \frac{1}{1+e^{-z}}\]Now that we have our logistic regression function, we need to learn the parameters $\theta$.
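As a quick illustration with made-up numbers (the values of $\theta$ and $x$ below are purely hypothetical), the score $\theta^Tx$ can be any real number, but the sigmoid squashes it into a probability:

```python
import numpy as np

theta = np.array([0.5, -1.2, 2.0])   # hypothetical parameters
x = np.array([1.0, 3.0, 0.5])        # hypothetical feature vector

score = theta @ x                    # theta^T x = -2.1, can be any real number
prob = 1 / (1 + np.exp(-score))      # h_theta(x) ~= 0.109, always in (0, 1)
print(score, prob)
```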
This is done through maximum likelihood. The idea behind maximum likelihood is that we should choose $\theta$ so as to make the observed data as probable as possible.
\[L(\theta) = \prod_{i=1}^m p(y^{(i)} | x^{(i)}; \theta)\] \[L(\theta) = \prod_{i=1}^m (h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{(1-y^{(i)})}\]Maximizing $L(\theta)$ directly is awkward because of the product, so instead we maximize a strictly increasing function of it: the log likelihood.
\[\ell(\theta) = \log(L(\theta)) = \sum_{i=1}^{m}y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\]We maximize this via gradient ascent.
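To see what this objective rewards, look at a single term of the sum (a quick sanity check, not part of the CS229 excerpt above):

\[y^{(i)} = 1: \quad y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) = \log(h_\theta(x^{(i)}))\] \[y^{(i)} = 0: \quad y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) = \log(1-h_\theta(x^{(i)}))\]Either way, the term is largest when the model assigns high probability to the label that was actually observed.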
Our update will be as follows (the plus sign is correct – this is gradient ascent, remember, not gradient descent):
\[\theta := \theta + \alpha \nabla_\theta \ell(\theta)\] \[\text{where, for a single training example, } \frac{\partial}{\partial \theta_j}\ell(\theta) = (y-h_\theta(x))x_j\] \[\text{[derived in CS229 notes]}\]
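For completeness, here is a short sketch of that derivative for a single training example, using the sigmoid identity $g'(z) = g(z)(1-g(z))$:

\[\frac{\partial}{\partial \theta_j}\ell(\theta) = \left(\frac{y}{g(\theta^Tx)} - \frac{1-y}{1-g(\theta^Tx)}\right)\frac{\partial}{\partial \theta_j}g(\theta^Tx)\] \[= \left(\frac{y}{g(\theta^Tx)} - \frac{1-y}{1-g(\theta^Tx)}\right)g(\theta^Tx)\left(1-g(\theta^Tx)\right)x_j\] \[= \left(y\left(1-g(\theta^Tx)\right) - (1-y)\,g(\theta^Tx)\right)x_j = \left(y-h_\theta(x)\right)x_j\]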
So, finally, the per-example update is:
\[\theta_j := \theta_j + \alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\]Here is the whole thing in NumPy:

```python
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def log_likelihood(features, target, weights):
    # ell(theta) = sum over examples of y*log(h(x)) + (1-y)*log(1-h(x))
    scores = np.dot(features, weights)
    return np.sum(target * np.log(sigmoid(scores))
                  + (1 - target) * np.log(1 - sigmoid(scores)))


def logistic_regression(features, target, iterations, alpha,
                        fit_intercept=True, show_weights=True):
    if fit_intercept:
        # prepend a column of ones so the first weight acts as the intercept
        intercept = np.ones((target.shape[0], 1))
        features = np.concatenate((intercept, features), 1)

    m = features.shape[1]        # number of parameters (including intercept)
    weights = np.zeros(m)

    for step in range(iterations):
        curr = np.dot(features, weights)       # theta^T x for every example
        preds = sigmoid(curr)                  # h_theta(x)
        error = target - preds                 # y - h_theta(x)
        gradient = np.dot(features.T, error)   # gradient of the log likelihood
        weights += alpha * gradient            # gradient ascent step

        # show the log likelihood every 1000 steps (it should increase)
        if (step % 1000) == 0:
            print(log_likelihood(features, target, weights))

    if show_weights:
        print("weights: {}".format(weights))

    return weights
```
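Until the post gets real data and results, here is one way to sanity-check the implementation on synthetic data, assuming the code above has already been run (the cluster centers, learning rate, and iteration count below are arbitrary choices, not tuned values):

```python
np.random.seed(0)

# Two well-separated Gaussian clusters, labelled 0 and 1 (synthetic data)
n = 500
class0 = np.random.multivariate_normal([0, 0], np.eye(2), n)
class1 = np.random.multivariate_normal([3, 3], np.eye(2), n)
features = np.vstack((class0, class1))
target = np.hstack((np.zeros(n), np.ones(n)))

weights = logistic_regression(features, target, iterations=10000, alpha=5e-5)

# Training accuracy: predict 1 whenever h_theta(x) >= 0.5
with_intercept = np.hstack((np.ones((features.shape[0], 1)), features))
preds = sigmoid(np.dot(with_intercept, weights)) >= 0.5
print("accuracy: {:.3f}".format((preds == target).mean()))
```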
TO DO: Update this post with some data and results.