The Classification Cookbook Chapter 1

Serving Up Logistic Regression From Scratch

Posted by James Rigby on November 21, 2019

Welcome to Chapter 1 of The Classification Cookbook. This post focuses on the inner workings of logistic regression. The first portion of the post breaks the logistic model into small elements, writing a function for each chunk. This is meant to illustrate how logistic regression is actually estimated. Stay tuned for Chapter 2, which will illustrate logistic regression from a Bayesian perspective, including how to speed up Bayesian analyses using approximations to the posterior. This post is intended for readers who really want to know how logistic regression is estimated under the hood, and it may be less practical than some of my other posts.

Data Overview

Before beginning, let’s motivate this example with a real-world problem. Your consulting firm was recently hired to build an analytics tool that integrates with the company’s Business Intelligence (BI) software to understand the factors that influence employee recruitment. The primary criterion of interest is job acceptances - candidates' decisions to accept or reject an offer made by the organization. You decide to begin this project by exploring a logit model (logistic regression) because you hear that it is easily scalable and pretty robust!

The BI platform provides you with information from recent job analyses, salary surveys, labor market research, and job offer information. Here's a peek at the data.

job_id applicant_id   salary job_alt difficulty flexibility market_sal decision
    12            1 25.00807       1          1           1   23.42748        1
    39            2 28.52853       2          2           1   27.61328        1
     8            3 27.26684       1          4           1   21.40710        0
    37            4 19.09980       4          3           4   17.79660        1
     1            5 23.96467       1          2           1   24.36497        0
    49            6 27.38086       4          4           3   18.26968        1

Approaching Classification from a Linear Perspective

Why is linear regression not the most appropriate technique for categorical outcomes? If we code job acceptance as 0 or 1, we can predict the outcome using a linear regression model. Furthermore, if the goal of our analysis is to make predictions about who will accept a job offer and who will reject it, we could create a decision boundary to classify applicants. For example, we could classify job applicants with a predicted value above .5 as accepters and classify those with a score below .5 as rejecters.

What does .5 mean though? Some people may be tempted to say that .5 is the probability that an applicant accepts a job offer. That interpretation is incorrect. Imagine that we tried to generalize this model to make predictions for a new set of jobs whose average salary is a little bit higher. When we make a prediction for the new point (depicted as a cross), the predicted value is 1.08. Given that probabilities are bounded by 0 and 1, the linear model’s prediction is clearly not a probability. In fact, the predicted values are relatively meaningless.

Logistic Regression: A Probabilistic Approach to Classification

We can still approach classification from a linear perspective; however, we need to transform the model to accommodate the binary data using a link function. Logistic regression does this by assuming that the log odds of the categorical outcome are linearly related to the independent variables.

\[log\frac {p(y=1)}{p(y=0)} = X\beta+e\]

We can rearrange this function to depict the model in probabilistic terms.

\[\frac {p(y=1)}{p(y=0)}= e^{X\beta+e}\]

\[\frac {p(y=1)}{1-p(y=1)}= e^{X\beta+e}\]

\[p(y=1)= (e^{X\beta+e}-e^{X\beta+e}p(y=1)) \]

\[p(y=1)+e^{X\beta+e}p(y=1)= (e^{X\beta+e}) \]

\[p(y=1)(1+e^{X\beta+e})= (e^{X\beta+e}) \]

\[p(y=1) = \frac{(e^{X\beta+e})}{(1+e^{X\beta+e})} \]

This can be written in a slightly more compact form

\[p(y=1) = \frac{1}{(1+e^{-(X\beta+e)})}\]

Notice that there is still a linear component \(X\beta + e\), however, it is transformed to ensure that the predicted values are bounded by 0 and 1.

How does this new function look? Unlike the linear classifier (red line), the logistic model’s (blue line) predicted values can be directly interpreted as the probability of an applicant accepting a job offer given the predictors.

Zooming out provides a better illustration of how the two functions differ. The logit model asymptotes at 0 and 1 while the linear model is, well, linear. Interestingly, the decision boundaries for the two models seem to align at a .5 threshold. However, if we adjusted the decision boundary above or below .5, the two models would diverge.
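To see why the .5 threshold plays a special role for the logistic model, set the predicted probability equal to .5 and solve for the linear component:

\[\frac{1}{1+e^{-X\beta}} = .5 \iff e^{-X\beta} = 1 \iff X\beta = 0\]

In other words, the logistic model classifies an observation as an accepter at the .5 threshold whenever its linear component is positive.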

Let’s write functions that define the functional form of logistic regression. To make its relation to the linear model explicit, I define a function called v that computes the linear component of the model (i.e., \(X\beta\)), the portion that gets exponentiated above. I then define a function called p_hat that transforms the linear output into estimated probabilities.

# Linear combination of predictors
v<-function(X, beta){
  v<-X%*%beta
  v
}

# Estimated Probability through sigmoid transformation
p_hat<-function(v){

  p<-1/(1+exp(-v))
  
  p
}
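As a quick sanity check, here is what these functions return for some made-up values (the toy design matrix and coefficients below are purely illustrative and are not estimates from the job-offer data):

# Toy design matrix (intercept plus one made-up predictor) and made-up coefficients
X_toy    <- cbind(1, c(-2, 0, 2))
beta_toy <- c(0.5, 1)

p_hat(v(X = X_toy, beta = beta_toy))
# Roughly 0.18, 0.62, and 0.92 -- all strictly between 0 and 1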

Maximum Likelihood Estimation

Just like linear regression selects the regression coefficients that minimize the sum of squared errors, logistic regression has an error criterion, often referred to as an objective function. Logistic regression is fit using maximum likelihood estimation. Broadly speaking, maximum likelihood tries to find the parameters (\(\beta\)) that maximize the conditional probability of the outcome (i.e., \(p(y|\beta, X)\)). In practice, the negative log-likelihood is typically minimized, but maximum likelihood sounds better than minimum negative log-likelihood, right? The negative log-likelihood function for logistic regression is written below.

\[L(\beta) = -\sum_{i=1}^n\ y_i\ ln(\ \hat p_i(y = 1)\ )+(1-y_i)\ ln(\ 1-\hat p_i(y = 1)\ )\]

What does the likelihood function mean? Let’s break the function into two pieces by splitting it at the plus sign.

Part 1

\[y_i\ ln(\ \hat p_i(y = 1)\ )\]

This portion of the likelihood function multiplies the observed outcome (\(y_i\)) by the log of the predicted probability. Because probabilities fall between 0 and 1 and the log of numbers closer to 0 becomes more negative, this portion of the likelihood is largest when 1) the predicted probability is high and 2) the observed outcome is 1. In other words, this portion of the equation is largest when observations are correctly classified as belonging to class 1. Observations that are 0 do not contribute to this part of the likelihood. The plot below illustrates how this function behaves across classes as the predicted probability increases.

Part 2

\[(1-y_i)\ ln(\ 1-\hat p_i(y = 1)\ )\]

The second half of the likelihood formula follows the same logic. The major difference is that it incorporates the other class (\(y_i = 0\)). This portion of the likelihood is maximized when 1) the predicted probability is close to 0 and 2) the observed class is 0. Thus, this portion is maximized when the model accurately predicts class 0. The plot below depicts this function.

Taken together, the likelihood is largest (and the negative log-likelihood smallest) for the set of \(\beta\) that do the best job of discriminating between the classes. The function below implements the negative log-likelihood. Its arguments are the observed values y and the predicted probabilities p, which can be generated by p_hat().

# Implementation of the negative log-likelihood
ll<-function(y, p){
  -sum(y*log(p)+(1-y)*log(1-p))
}
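To see the behavior described above in action, here is a small illustration with made-up outcomes and predicted probabilities:

# Made-up outcomes and predictions, purely for illustration
y_toy <- c(1, 1, 0, 0)

# Confident, correct predictions give a small negative log-likelihood (about 0.66)
ll(y_toy, c(0.9, 0.8, 0.2, 0.1))

# Confident, incorrect predictions give a large negative log-likelihood (about 7.8)
ll(y_toy, c(0.1, 0.2, 0.8, 0.9))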

Newtonian Optimization to Solve for Beta

How do we solve for the \(\beta\) that maximize the likelihood of our observed data? One method is to use Newtonian optimization to solve for the maximum of our likelihood function. Newtonian methods rely on a second-order Taylor expansion that approximates the likelihood function. This is just a fancy way of saying that we build a function that approximates the likelihood function but is easier to maximize or minimize. So long as the objective function is twice differentiable, we can iteratively update our estimates for \(\beta\) using the following equation:

\[\beta_{new}= \beta_{old}-H_\beta^{-1}\Delta\beta\]

where \(\Delta\beta\) is the vector of first derivatives of the likelihood function with respect to \(\beta\) and \(H_\beta\) is the Hessian matrix, or matrix of second derivatives, of the likelihood function with respect to \(\beta\).

This update sets the derivatives of the quadratic approximation to 0, thereby solving for the approximation’s minimum or maximum. The better the approximation maps onto the likelihood function, the quicker the model converges!

To visually illustrate what Newtonian optimization does, let’s try to optimize the following equation.

\(y = .25x^2\)

To begin finding the minimum of this function, we need to initialize the algorithm by picking a starting value. Let’s pick a point that’s really far from the stationary point (the minimum), say \(x = -20\).

After doing some simple calculus, I found that the derivative of the objective function in this example is equal to \(.5x\), while the second derivative is equal to \(.5\). To verify that my calculations are correct, I plot the objective below. As a reminder, the first derivative gives the slope of the tangent line, or the instantaneous slope of the objective function, at the point \(x_n\).

Let’s take a look at the Taylor expansion of this equation about our starting point. A Taylor expansion of an arbitrary function \(f(x)\) at the point \(x_n\) is equal to

\[f_T(x)= f(x_n)+f'(x_n) \Delta x+ \frac{1}{2} f''(x_n) \Delta x^2\]

We can plug in our starting value and the derivatives to define and plot the quadratic approximation.

\[f_T(x) = 100 + (-10 \times (x+20)) + \frac{1}{2} (.5 \times (x+20)^2)\]

Plotting this function over the objective function reveals that it is a fantastic approximation, which may not be surprising given that the true form of the objective is quadratic. In fact, if you combine like terms in the Taylor expansion, you will find that the approximation is the objective function.

We can then apply the update rule using the following formula

\[x_{new} = x_{old}-\frac{f'(x_{old})}{f''(x_{old})}\] \[x_{new} = -20-\frac{-10}{.5} = 0\]

Given the perfect congruence between the Taylor expansion and the objective function, this optimization problem converges in one iteration. This only occurs when the function you are trying to maximize or minimize is perfectly represented by the Taylor expansion. Unfortunately, the likelihood function requires multiple Taylor expansions to converge, but this isn’t a huge deal. We just need to know a little bit of programming to get the job done!
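For readers who want to see this in code, here is a minimal sketch of the single Newton step for the toy problem above (the function names are mine and not part of the estimator we are building):

# Toy Newton update for f(x) = 0.25 * x^2
f_prime  <- function(x) 0.5 * x  # first derivative
f_dprime <- function(x) 0.5      # second derivative (constant)

x_old <- -20  # the starting value used above
x_new <- x_old - f_prime(x_old) / f_dprime(x_old)
x_new
# [1] 0  -- the minimum, reached in a single step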

Because the likelihood in logistic regression is twice differentiable, we can apply Newtonian methods to solve for \(\beta\)! The first derivative of the likelihood function with respect to \(\beta\) can be calculated using the following formula:

\[\Delta\beta = -X^\top (y-p(y=1))\]

I define a function, deriv(), that implements this quantity up to its sign: the code returns \(X^\top(y - p)\), the gradient of the (positive) log-likelihood. Because the derivative only enters the convergence check below through an absolute value, the sign has no effect on the estimates.

# Gradient with respect to beta: X'(y - p)
deriv<-function(X, y, p){
  t(X)%*%(y-p)
}
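As a quick check that deriv() behaves as expected, we can compare it to a finite-difference slope of the log-likelihood on made-up toy data (X_toy, y_toy, and beta_toy below are illustrative values only):

# Made-up toy data and coefficients for checking the gradient
X_toy    <- cbind(1, c(-1, 0, 1, 2))
y_toy    <- c(0, 0, 1, 1)
beta_toy <- c(0.1, 0.2)

deriv(X_toy, y_toy, p_hat(v(X_toy, beta_toy)))

# Nudging the second coefficient and differencing the log-likelihood (-ll)
# should roughly reproduce the second element of the gradient above
eps <- 1e-6
(ll(y_toy, p_hat(v(X_toy, beta_toy))) -
   ll(y_toy, p_hat(v(X_toy, beta_toy + c(0, eps))))) / eps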

The second derivative’s scalar representation may seem a little tricky, but the matrix calculation (how I implement my R code below) is actually quite easy! I’ll show both derivations below. The scalar representation can be calculated as follows:

\[\frac{\delta^2L(\beta)}{\delta\beta_j\delta\beta_l} = \sum_{i = 1}^n\ x_{ij}\ x_{il}\ p_i(y = 1)\ (1-p_i(y = 1))\]

This solves for element \(j, l\) of the hessian matrix.

The entire hessian can be solved using the following matrix algebra.

\[H_{\beta} = X^\top S X\] where S is a diagonal matrix with \(p(y=1)(1-p(y=1))\) along the diagonal.

# Diagonal matrix S
S_mat<-function(p){
  var<-p*(1-p)
  dim<-length(var)
  s<-matrix(0, nrow = dim, ncol = dim)

  diag(s)<-var
  
  s
}


# Second derivative of the negative log-likelihood (Hessian): X'SX
# For logistic regression this also equals the Fisher information
second_deriv<-function(X, S){
  t(X)%*%S%*%X
}

Iteratively Reweighted Least Squares

If you read many methodological papers on logistic regression, it is only a matter of time before you come across the term “Iteratively Reweighted Least Squares”. An interesting property of the Newton step described above is that it simplifies to another optimization method called iteratively reweighted least squares. We could really get into the weeds with this method, but there is a fantastic post on Stack Exchange that explains its derivation and relation to Newtonian methods. You can find the link here. The author, jld (Joseph), has a fantastic blog.

Assuming that those interested have followed up by reading that post, I will just show the update rule below.

Formulaically, the iterative least-squares step is equal to:

\[\beta_{new} = H_\beta^{-1}X^\top S z\] where z is defined as \(X \beta+S^{-1}(y-p(y=1))\).

Those who read the blog post may be confused by the notation; this derivation is equivalent to the one written by jld, except that my notation is based on Hastie, Tibshirani, and Friedman’s (2001) derivation in The Elements of Statistical Learning. Both are accurate but use different notational conventions! The code below defines the necessary functions.

# Define z, the working response
z<-function(X, theta, S, y, p){
  X%*%theta+solve(S)%*%(y-p)
}

# Update the regression weights
# Note: i_mat is the negative of X'SX (see information_mat() below), so -i_mat recovers X'SX
update_beta<-function(i_mat, X, S, z){
  solve(-i_mat)%*%t(X)%*%S%*%z
}
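Before wrapping everything in a loop, it may help to see a single IRLS step on its own. The sketch below reuses the made-up toy data from the gradient check and starts from \(\beta = (0, 0)\); none of these values come from the job-offer data:

# One IRLS step from beta = (0, 0) on the made-up toy data
X_toy  <- cbind(1, c(-1, 0, 1, 2))
y_toy  <- c(0, 0, 1, 1)
beta_0 <- c(0, 0)

p_0 <- p_hat(v(X = X_toy, beta = beta_0))
S_0 <- S_mat(p = p_0)
h_0 <- second_deriv(X = X_toy, S = S_0)  # X'SX
z_0 <- z(X = X_toy, theta = beta_0, S = S_0, y = y_toy, p = p_0)

# update_beta() expects the negative of X'SX (see information_mat() below),
# so the sign is flipped when passing it in
update_beta(i_mat = -h_0, X = X_toy, S = S_0, z = z_0)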

When Do We Stop?

Iterative methods require a stopping criterion. The default for many proprietary statistical packages is the relative gradient convergence criterion. To ensure that the model doesn’t get stuck in a perpetual loop, I also include a maximum-iterations option as a safeguard, but models typically converge quickly.

relative_grad_crit<-function(deriv, i_mat, ll){
  abs((t(deriv)%*%solve(i_mat)%*%deriv)/(abs(ll)+1e-6))
}

Hypothesis Testing

If you are interested in making inferences about the relationships between the predictors and the outcome, as well as predictions, deriving the information matrix is the final piece of the puzzle. The information matrix is important because, based on asymptotic normal theory, \(\hat \beta \sim MVN(\beta, \hat\Sigma_\beta)\), where \(\hat\Sigma_\beta\) is equal to the inverse of the information matrix. This means that the standard errors of \(\hat\beta\), which are used in Wald tests for significance, are equal to the square root of the diagonal of \(\hat\Sigma_{\beta}\).

# Note: because second_deriv() returns X'SX, this returns -X'SX (the Hessian of
# the log-likelihood) rather than the information matrix itself; downstream code
# flips the sign again wherever X'SX is needed
information_mat<-function(sec_deriv){
  -sec_deriv
}
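Concretely, for coefficient \(j\), the standard error and Wald statistic follow directly from the diagonal of \(\hat\Sigma_\beta\):

\[SE(\hat\beta_j) = \sqrt{[\hat\Sigma_\beta]_{jj}}, \qquad z_j = \frac{\hat\beta_j}{SE(\hat\beta_j)}\]

Under the null hypothesis that \(\beta_j = 0\), \(z_j\) is compared against a standard normal distribution.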

Putting it All Together

Below, I create a function that pulls together all of the pieces outlined above. I include comments that explain the rationale behind the different pieces of code.

log_reg<-function(X, y, max_iter = 20, tol = 1e-8){
  
  # Initialize a counter, i, that tracks how many iterations have run (used in tandem with the max_iter criterion)
  i<-1
  
  # Provide starting values for the regression weights, which initialize the optimization
  theta<-rep(0, times = ncol(X))
  
  # Before starting the iterative loop, initialize all of the important values at the starting weights
  
  # Solve for the predicted probabilities given the starting values
  v_i<-v(X = X, beta = theta)
  p<-p_hat(v = v_i)
  
  # Solve for the first and second derivatives of the likelihood function given the starting values
  deriv_i<-deriv(X, y, p)
  S<-S_mat(p = p)
  s_deriv<-second_deriv(X = X, S = S)
  i_mat<-information_mat(sec_deriv = s_deriv)
  
  # Initialize the stopping criterion
  grad_crit<-relative_grad_crit(deriv_i, i_mat, ll(y, p))
  
  # Implement iteratively reweighted least squares
  # This iterative component runs until a stopping criterion is reached
  while(i < max_iter & grad_crit > tol){
    
    # Update beta using the IRLS step
    z_i<-z(X = X, theta = theta, S = S, y = y, p = p)
    theta<-update_beta(i_mat = i_mat, X = X, S = S, z = z_i)
    
    # Solve for v_i and p with the new values
    v_i<-v(X = X, beta = theta)
    p<-p_hat(v = v_i)
    
    # Solve for the first and second derivatives
    deriv_i<-deriv(X, y, p)
    S<-S_mat(p = p)
    s_deriv<-second_deriv(X = X, S = S)
    i_mat<-information_mat(sec_deriv = s_deriv)
    
    # Update the negative log-likelihood
    ll_new<-ll(y = y, p = p)
    
    # Print a message with the iteration number and current negative log-likelihood
    message("Iteration: ", i, " | Negative LL: ", ll_new)
    
    # Move one iteration forward
    i<-i+1
    
    # Recalculate the stopping criterion
    grad_crit<-relative_grad_crit(deriv_i, i_mat, ll_new)
  }
  
  # Store the final values in the model
  final_ll<-ll(y = y, p = p)
  final_beta<-theta
  
  # Format the betas so they have names
  names(final_beta)<-colnames(X)
  
  # Return the values
  list(betas = final_beta, 
       LL = -final_ll, # ll() returns the negative log-likelihood, so flip the sign to report the log-likelihood
       vcov = -solve(i_mat)) # i_mat is -X'SX, so -solve(i_mat) = (X'SX)^{-1}, the asymptotic covariance of beta
}

How Does Our Model Compare with glm()?

Let's fit a model using the standard glm function and compare it to our results. Below, I output the coefficients and standard errors from both models. Not bad, right?

# Our model ----------------------------------------
library(dplyr) # for %>% and select()
library(knitr) # for kable()

# Prep Matrices
X<-dec[3:7]
y<-dec["decision"]

X<-cbind(1, X)
X<-as.matrix(X)
y<-as.matrix(y)

colnames(X)[1]<-"(Intercept)"

our_reg<-log_reg(X = X, y = y, max_iter = 100)

# Their model --------------------------------------
their_reg<-glm(decision~., data = dec%>%select(salary:decision), family = "binomial")

data.frame(our_coefs = as.vector(our_reg$betas),
           our_se = sqrt(diag(our_reg$vcov)),
           their_coefs = coef(their_reg),
           their_se = sqrt(diag(vcov(their_reg))))%>%
  kable(format = "html")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")
              our_coefs    our_se their_coefs  their_se
(Intercept)  -0.1935129 1.2099556  -0.1935129 1.2099512
salary        0.3384038 0.0666611   0.3384038 0.0666605
job_alt       1.0078720 0.1899510   1.0078720 0.1899494
difficulty   -1.8396173 0.2718959  -1.8396174 0.2718920
flexibility   0.2497007 0.1358562   0.2497007 0.1358555
market_sal   -0.2805268 0.0524514  -0.2805268 0.0524508
logLik(their_reg)
## 'log Lik.' -115.5387 (df=6)
our_reg$LL
## [1] -115.5387
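Finally, to close the loop on the hypothesis-testing discussion, here is a minimal sketch of the Wald tests computed from our model's coefficients and variance-covariance matrix (it assumes the our_reg object fit above):

# Wald z statistics and two-sided p-values from our fitted model
est    <- as.vector(our_reg$betas)
se     <- sqrt(diag(our_reg$vcov))
z_stat <- est / se
p_val  <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)

round(data.frame(estimate = est, se = se, z = z_stat, p = p_val,
                 row.names = names(our_reg$betas)), 4)
# These should line up closely with summary(their_reg)$coefficients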

Text by James Rigby. Photo by Element5 Digital from Pexels