The Classification Cookbook: Chapter 3

Identifying Latent Categories using Latent Class Analysis

Posted by James Rigby on March 21, 2020

In chapters 1 and 2 of the classification cookbook, I illustrated how to apply algorithms to solve for the conditional probabilities of an observed categorical outcome. In these situations, our outcome of interest was known. In other words, for each training example we were able to link our predictors to a labeled category. What about situations where the class of interest is unknown? For example, imagine that a practitioner was trying to evaluate the effectiveness of a training exercise. To do this, they administered a knowledge test to 1000 participants. The test was made up of 20 true/false items. You can find the script used to generate this data at the bottom of this post.

Sample of Participant Data
id item_1 item_10 item_11 item_12 item_13 item_14 item_15 item_16 item_17
1 1 0 1 1 1 0 1 0 1
2 1 1 1 0 0 0 1 0 0
3 1 1 0 1 1 0 0 0 0
4 0 1 1 0 1 1 1 1 1
5 1 0 1 0 1 0 1 0 1
6 0 0 1 1 0 1 1 1 0

After some creative data checks, the researcher isn't convinced that all participants were paying attention while taking the test. 330 respondents selected the FALSE response option (0) for every single item. Furthermore, a large portion of respondents scored below the rate expected by chance (the shaded area below). Together, this suggests that our researcher has some inattentive respondents! How can we classify inattentive respondents based on their response patterns?

##### Data Exploration #####
# Loading the packages used throughout this post
library(tidyverse)
library(plotly)

g_test%>%
  select(-id)%>%
  rowSums()%>%
  table()%>%
  as.data.frame()%>%
  rename(n_true = ".")%>%
  knitr::kable(caption = "Number of Trues")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")
Number of Trues
n_true Freq
0 330
4 1
5 2
6 12
7 24
8 34
9 56
10 65
11 73
12 99
13 122
14 110
15 57
16 12
17 3
# Plotting the distribution of test scores, shading the below-chance region
p<-g_test_long%>%
  mutate(ans_correct = correct == answer)%>%
  group_by(id)%>%
  summarise(score = sum(ans_correct))%>%
  count(score)%>%
  ggplot(aes(x = score, y = n))+
  geom_histogram(stat = "identity")+
  geom_rect(xmin =0, xmax = 10.5, ymin = -50, ymax = 370, alpha = .3)+
  labs(x = "Test Score",
       y = "Frequency",
       title = "Distribution of Scores",
       caption = "Shaded Area is Less than Chance")+
  theme_bw()+
  theme(panel.grid = element_blank())

ggplotly(p)

Enter Latent Class Analysis

One approach would be to treat “Inattentive Respondent” as a latent categorical variable. For those unfamiliar with latent variable modeling, a latent variable is simply one that is unobserved. There are several prepackaged solutions available in R for conducting latent class analysis. The process of fitting these models is relatively straightforward:

  1. Fit several models with different numbers of latent classes (model enumeration)
  2. Evaluate the fit of the different models using comparative fit indices (AIC, BIC)
  3. Select and interpret the best-fitting model

For those of you here for the practical “how to,” I illustrate those steps in the annotated code chunk below. For those interested in the nitty-gritty math, don’t worry: I will show how to conduct a latent class analysis from scratch later in the post.

##### Prepackaged LCA #####
# The following LCA is conducted with the poLCA package
library(poLCA)

# poLCA requires the data to be either non-zero integers or factors
# This code converts the test data into factors
poLCA_data<-g_test%>%
  mutate_at(
    vars(
      starts_with("item")
      ), 
    as.factor
    )


# Defining formula
# poLCA uses a formula interface to define model parameters
# This formula says latent classes will be defined by varying intercepts on the 20 TRUE and FALSE Items
f<-cbind(item_1, item_2, item_3, item_4, item_5, item_6, item_7, item_8, item_9, item_10,
         item_11, item_12, item_13, item_14, item_15, item_16, item_17, item_18, item_19, item_20)~1

# This code iterates over different class solutions estimating between 1 and 5 latent classes
# The verbose argument suppresses output until we explicitly call for it to be printed
# The nrep argument specifies the number of random starts for each class solution
# Latent class models are prone to converging to local maxima of the likelihood
# Models must be run multiple times with different start values to have confidence in the solution
model_list<-map(1:5, ~poLCA(f, poLCA_data, nclass=., verbose = FALSE, nrep = 10))


# The code below extracts fit statistics from each model
fit<-map_dfr(1:5,.f = function(x){
  data.frame(Classes = x,
             AIC = model_list[[x]]$aic,
             BIC = model_list[[x]]$bic,
             G_sq = model_list[[x]]$Gsq,
             Chi_sq = model_list[[x]]$Chisq)
})

# Printing fit statistics
fit%>%
  knitr::kable()%>%
  kableExtra::kable_styling(bootstrap_options = "striped")
Classes AIC BIC G_sq Chi_sq
1 25523.94 25622.10 16181.018 9202103
2 17616.44 17817.66 8231.521 16474600
3 15713.90 16018.18 6286.972 1047107
4 15702.93 16110.27 6234.002 1047076
5 15704.57 16214.98 6193.646 1046711
# Plotting fit statistics
p<-fit%>%
  pivot_longer(-Classes, names_to = "measure")%>%
  ggplot(aes(x=Classes, y = value))+
  geom_line()+
  facet_wrap(~measure, scales = "free")+
  labs(x = "# of Latent Classes",
       y = "Fit Statistic",
       title = "Class Enumeration Statistics"
       )+
  theme_bw()+
  theme(panel.grid = element_blank())

ggplotly(p)

Taken together, the model fit statistics suggest that a three-class solution is the most appropriate for our data: BIC is lowest for the three-class model, and the improvement in AIC beyond three classes is marginal. Apparently, simple non-attentiveness doesn’t sufficiently describe the participants’ response patterns. Let’s plot the conditional item response probabilities. These can be interpreted as the probability of endorsing response option r on item j, given membership in latent class k. The conditional probabilities can give insight into what latent categories we are recovering.

##### Conditional Response Probabilities #####
# Reshaping the class-conditional response probabilities from the three-class
# model into a long data frame for plotting
cond_prob<-map_dfr(names(model_list[[3]]$probs), function(x){
  as.data.frame(
    model_list[[3]]$probs[[x]]
    )%>%rownames_to_column(var = "Class")%>%
    mutate(item = x)
  })%>%
  left_join(key%>%
              mutate(item = paste("item", item, sep = "_")))%>%
  select(item, Class, prob_0 = `0`, prob_1 = `1`, correct)%>%
  mutate(item = if_else(correct == 0, paste0(item, "*"), item),
         Class = str_replace(str_remove(Class, ":"), "c", "C"), 
         Class_guide = case_when(Class == "Class 1 " ~ paste0(Class, "p = ", round(model_list[[3]]$P[1], 2)), 
                           Class == "Class 2 " ~ paste0(Class, "p = ", round(model_list[[3]]$P[2], 2)),
                           Class == "Class 3 " ~ paste0(Class, "p = ", round(model_list[[3]]$P[3], 2))),
         p = round(prob_1, digits = 2))


p<-cond_prob%>%
  ggplot(aes(x = Class, y = p, fill = Class_guide))+
  geom_bar(position = "dodge", stat = "identity")+
  facet_wrap(~item, ncol = 4)+
  labs(y = "Class Conditional Probability",
       x = "",
       title = "Class Conditional Probabilities Across Items",
       subtitle = "Asterisk denote items where response option 1 was incorrect")+
  ylim(c(0,1))+
  theme_bw()+
  guides(fill = guide_legend(title = ""))+
  theme(panel.grid = element_blank())

ggplotly(p, tooltip = c("x", "y"))

The conditional item response probabilities suggest that the first class is associated with attentive respondents: they have a high probability of selecting 1 when 1 is in fact the correct answer and a low probability of selecting 1 when it is not. The second class is associated with random responders, who have roughly a 50/50 chance of selecting either 1 or 0 on every item. Finally, Class 3 is associated with respondents who straight-lined the test, always selecting 0.

Respondents can be classified into one of these three groups based on their posterior responsibilities for the latent classes. These are the conditional probabilities of belonging to latent class k, given the respondent’s response pattern. In this case, class separation is high, so it is pretty clear which group each respondent belongs to.
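
As a quick illustration, here is a minimal sketch of how you might pull those classifications from the fitted poLCA object (this assumes the three-class model is still stored in model_list[[3]]; poLCA returns the posterior probabilities in the posterior component and modal class assignments in predclass):

##### Classifying Respondents (sketch) #####
# Posterior probabilities of class membership (one row per respondent)
head(model_list[[3]]$posterior)

# Modal assignment: each respondent is placed in the class with the largest posterior
table(model_list[[3]]$predclass)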

What Did We Just Do?

This will take some unpacking. In short, we fit a series of models that made differing assumptions about the number of unobserved groups that characterize the observed data. In this example we have binary (true/false) data, which is appropriately characterized by a binomial distribution. For each of our models, we allowed the parameters of that distribution to vary between groups.

The binomial distribution is fully parameterized by \(p\) - the probability of observing an event. In our case \(p\) represents the probability of a respondent answering “yes” to an item. Therefore, we allowed \(p\) to vary across groups.
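
As a quick sanity check on that parameterization, the binomial density with a single trial reduces to \(p^{y}(1-p)^{1-y}\); here is a tiny illustration with a hypothetical \(p\) of .7:

# Binomial (single-trial) likelihood of one true/false response with a hypothetical p = .7
dbinom(1, size = 1, prob = .7)  # probability of a 1 (TRUE): 0.7
dbinom(0, size = 1, prob = .7)  # probability of a 0 (FALSE): 0.3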

Let’s take a look at how we can recover \(p\) using maximum likelihood estimation in the one group case. We can solve for the vector of parameters \(p\) by evaluating the likelihood of observing our data given the \(p\) parameters, assessing the gradient, and then updating. Thus, the optimization problem is

\[\max_{\hat p} \prod_{i=1}^N\prod_{j=1}^J\hat p_j^{\,y_{ij}}(1-\hat p_j)^{1-y_{ij}}\] \[s.t.\ 0\le \hat p_j \le 1\] where j indexes the items and i indexes the respondents.

The constraint makes the optimization problem slightly harder. One way to drop it is to reparameterize the problem on the log-odds scale: we optimize over an unconstrained \(\theta_j\) and map it back to a probability with the inverse logit (logistic) function. This is identical to how we accommodated discrete outcomes in my previous chapter on logistic regression.

\[\max_{\hat \theta} \prod_{i=1}^N\prod_{j=1}^J \left(\frac {1}{1+e^{-\theta_j}}\right)^{y_{ij}} \left(1-\frac {1}{1+e^{-\theta_j}}\right)^{1-y_{ij}}\]

To illustrate that the maximum likelihood approach does a reasonable job recovering proportions, I also calculate \(p\) using the closed form solution.

\[p = \frac{\#\ of\ 1s}{total\ n}\]

##### Maximum likelihood approach to proportions #####
# Defining the inverse logit (logistic) function to map unconstrained theta back to probabilities
logit<-Vectorize(
  function(x){
  1/(1+exp(-x))
  }
)

# Defining LL Function
binom_ll<-function(theta, item_mat){
  
  # Initializing ll_vec object for likelihood storage
  ll_vec<-c()
  
  for(i in 1:ncol(item_mat)){
    
    # storing the likelihood of item i's responses (across respondents) given theta in ll_vec
    ll_vec[i]<-prod(dbinom(x = item_mat[,i],size =  1, prob = logit(theta[i])))
  }
  
  # returning -2*ll for optim
  -2*sum(log(ll_vec))
}

# Converting Dataframe to matrix for convenience
g_test_mat<-as.matrix(g_test[2:21])

# Initializing theta
theta<-rep(.1, 20)

# Optimizing over theta
solutions<-optim(theta, fn = binom_ll, item_mat = g_test_mat, method = "BFGS")

# Verification using a normal person's approach to probability
standard<-colSums(g_test_mat)/nrow(g_test_mat)

# Solving probabilities
probabilities<-data.frame(ML_Estimates = logit(solutions$par), 
                          Observed_Proportion = standard)

# Printing table
probabilities%>%
  knitr::kable(digits = 2)%>%
  kableExtra::kable_styling(bootstrap_options = "striped")
ML_Estimates Observed_Proportion
item_1 0.47 0.47
item_10 0.48 0.48
item_11 0.47 0.47
item_12 0.20 0.20
item_13 0.48 0.48
item_14 0.47 0.47
item_15 0.47 0.47
item_16 0.47 0.47
item_17 0.49 0.48
item_18 0.20 0.20
item_19 0.20 0.20
item_2 0.21 0.21
item_20 0.48 0.48
item_3 0.48 0.48
item_4 0.47 0.47
item_5 0.47 0.47
item_6 0.48 0.48
item_7 0.49 0.48
item_8 0.20 0.20
item_9 0.22 0.22

In the one-class case, there is no need to use maximum likelihood; proportions are quicker and easier! However, when there are multiple unobserved latent classes, we rely on the Expectation Maximization (EM) algorithm to solve the maximum likelihood problem. The EM algorithm was developed by Dempster and colleagues in the 1970s as a means of solving maximum likelihood problems with missing data. Latent class models can be conceptualized as a missing data problem in which we did not observe the latent class variable.

The EM algorithm consists of two steps. The expectation step evaluates the likelihood function using the current best guess of the model parameters. The maximization step then updates the parameters to maximize the likelihood just calculated. In practice, the maximization step works with the relative likelihood of each respondent belonging to class k (the posterior responsibilities) rather than the log-likelihood itself. These steps may seem abstract, but my hope is that things will become more concrete as we apply the algorithm to the latent class problem.

In quasi-code the EM looks something like this.

Input x, n_class, eps;

  Initialize p_jk numeric;
  Initialize lambda numeric;
  Initialize old_ll = -Inf;
  
  Repeat Until diff < eps;
    
    # Expectation Step
    ind_ll = expectation_ind_ll(x, p_jk, lambda);
    
    # Maximization Step
    post = maximize_posterior(ind_ll);
    lambda = maximize_lambda(post);
    p_jk = maximize_p(post, x);
    
    new_ll = sum(log(rowSums(ind_ll)));
    
    diff = new_ll - old_ll;
    old_ll = new_ll;
  end;

Return post, lambda, p_jk, new_ll;

Defining the Likelihood Function

The likelihood function is not terribly different from the one-group case! The biggest difference is that, instead of estimating \(j\) parameters, we are estimating \(j\times k + k\) parameters. This is because we are estimating a conditional response probability for every item, for every group. Furthermore, we are estimating mixing proportions (\(\lambda_k\)) that weight the likelihoods (only \(k-1\) of these are free, since they sum to one). These \(\lambda_k\)s represent the proportion of the sample that belongs to latent class \(k\). Thus, the likelihood function simply uses the standard binomial likelihood for each latent class but weights each group’s contribution to the overall likelihood by its representation within the sample.

\[LogLik = \sum_{i=1}^N \log\sum_{k=1}^K \lambda_k\prod_{j=1}^J\hat p(y_j|k)^{y_{ij}} (1-\hat p(y_j|k))^{1-y_{ij}}\]

Latent class analysis is one member of a broader family of mixture models. It is interesting to note that the above likelihood formula defines a template across different mixture models! For example, latent profile analysis replaces the binomial density with a normal density. Recognizing this principle will help you as you develop more general clustering algorithms.
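
To make that template concrete, here is a minimal illustrative sketch (not the estimation code used later) of one respondent's weighted likelihood contribution under made-up parameter values; swapping dbinom() for dnorm() would give the latent profile analysis version.

##### Mixture Likelihood Template (illustrative sketch) #####
# Hypothetical values for one respondent answering 3 items, with 2 latent classes
y      <- c(1, 0, 1)                       # observed binary responses
lambda <- c(.6, .4)                        # mixing proportions
p_kj   <- list(c(.8, .2, .7),              # class 1 conditional response probabilities
               c(.5, .5, .5))              # class 2 conditional response probabilities

# Weighted likelihood of this respondent's responses under each class
class_lik <- sapply(1:2, function(k) lambda[k] * prod(dbinom(y, size = 1, prob = p_kj[[k]])))

# This respondent's contribution to the log-likelihood
log(sum(class_lik))

# Latent profile analysis would swap dbinom() for dnorm(y, mean = mu_k, sd = sigma_k)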

Let’s define a function to help us evaluate the likelihood. Many of the updates in the maximization step utilize the term inside the summation over classes, specifically

\[ind\_ll_{ik} = \lambda_k\prod_{j=1}^J\hat p(y_j|k)^{y_{ij}} (1-\hat p(y_j|k))^{1-y_{ij}}\] These values denote the relative likelihood of each respondent belonging to latent class k. Because of their usefulness, for now, I will only define a function that stores these values in an \(n\) by \(k\) matrix. We can handle the remaining sums of the likelihood within the body of the main latent class function.

# Returns an n x k data frame of each respondent's weighted likelihood under
# each latent class: lambda_k * prod_j dbinom(y_ij | p_jk)
expectation_ind_ll<-function(k, n, x, p_kj, lambda){
  map_dfc(1:k, function(j) {
    map_dfr(1:n, function(i){
      tmp<-data.frame(id = 1)
      # Weighted likelihood of respondent i's responses under class j
      tmp[paste0("class_",j)] <- lambda[j] * prod(dbinom(x[i,], size = 1, prob = p_kj[[j]]))
      tmp[paste0("class_",j)]
    })
  })
}

Updating Parameters

The section above provides the foundation for solving the expectation step of the EM algorithm. How can we use this information to solve the maximization step? The intuition behind updating the mixing proportions (\(\lambda_k\)) is relatively straightforward, so let’s start with that.

The mixing proportions are the proportion of the sample that belongs to latent class k. If we had observed the latent classes, we could simply use fractions to solve for the mixing proportion.

\[\lambda_k = \frac{n_k}{total\ n}\]

Unfortunately, we don’t know which participants belong to each latent class. This makes things a bit difficult! Since it’s unclear what group each participant belongs to, we can instead use the conditional probabilities of class membership to estimate the expected number of observations in latent class k.

\[\lambda_k = \frac{\sum_{i=1}^n p(k|y_i)}{total\ n}\]

Luckily, we have all the information we need to solve for this in our likelihood formula. Remember the unnormalized likelihoods I mentioned above? These are the relative likelihoods of an individual belonging to class k given their responses! All we need to do to solve for \(p(k|y_i)\) is normalize these values.

\[p(k|y_i) = \frac{\lambda_k\prod_{j=1}^J \hat p(y_j|k)^{y_{ij}} (1- \hat p(y_j|k))^{1-y_{ij}}}{\sum_{k'=1}^K\lambda_{k'}\prod_{j=1}^J \hat p(y_j|k')^{y_{ij}} (1- \hat p(y_j|k'))^{1-y_{ij}}}\]

Since these conditional probabilities are interesting from a substantive perspective, I will write two functions to solve for lambda and the posterior responsibilities.

# Normalizing each respondent's weighted likelihoods into posterior responsibilities
maximization_post<-function(ind_ll){
  ind_ll/rowSums(ind_ll)
}

# Mixing proportions: the average posterior responsibility for each class
maximization_lambda<-function(post){
  colSums(post)/nrow(post)
}

Finally, we can update the conditional item response probabilities. Given that each individual has some probability of belonging to each latent class, to update \(p(y_j|k)\) we sum the individual responses, weighting each by the conditional probability that the person belongs to the group, \(p(k|y_i)\). Next, we divide that value by the expected number of people in the group. This expectation can be obtained by summing the posterior responsibilities (or, equivalently, by multiplying the mixing proportion by n).

\[\hat p(y_j|k) = \frac{\sum_{i=1}^n p(k|y_i)\,y_{ij}}{\sum _{i = 1}^np(k|y_i)}\]

# Weighted proportion of 1s for each item within each class
maximization_p<-function(post, x, k){
  tmp<-list()
  
  for(i in 1:k){
    tmp[[i]]<- colSums(post[,i]*x)/sum(post[,i])
  }
  tmp
}

Calculating Standard Errors

The last thing we need to do is estimate standard errors. This is rather tricky, but we can rely on some of our previous knowledge to derive the standard error matrix. First, it is important to note the model’s relationship to logistic regression. Recall that the formula for logistic regression looks like this:

\[p(y|x)=\frac{1}{1+e^{-X^\top\beta}}\]

We can rewrite the latent class model in terms of log odds so that it takes a form similar to the logistic regression model.

\[p(y_j|k) = \frac{1}{1+e^{-\theta_{jk}}}\]
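
In R terms this is just the usual log-odds transformation; a quick sketch of the mapping using base R's qlogis() and plogis() with a hypothetical probability:

p_jk     <- 0.85          # hypothetical conditional response probability
theta_jk <- qlogis(p_jk)  # log-odds form: log(p / (1 - p))
plogis(theta_jk)          # maps back to 0.85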

This means that the same partial derivatives apply to this model as to the logistic regression model! All we need to do is apply a transformation to the resulting covariance matrix to rescale the standard errors to the probability metric. As a reminder, the gradient for the logistic model was:

\[\Delta \beta = X^\top(y-p(y|X))\]

We can approximate the information matrix (the negative Hessian) by taking the inner product of the matrix of each individual’s contributions to the first derivative of each parameter with itself. This matrix of individual contributions is sometimes referred to as the score matrix. In the latent class model, X is an implicit 1, and each observation’s contribution is weighted by its posterior probability of belonging to the latent class associated with the parameter. Formulaically, the derivative with respect to theta is:

\[\Delta \theta_{jk} = \sum_{i=1}^N p(k|y_i)(y_{ij}-\hat p(y_j|k))\] Thus each individual's score is

\[s_{ijk} = p(k|y_i)(y_{ij}-\hat p(y_j|k))\]

We also need the scores for the mixing proportions, which in log-odds form end up being:

\[s_{ik} = p(k|y_i)- \lambda_k\]

This is all stored in an N by (K x J + (K-1)) matrix.

Following Machlachlan & Peel (2000), the Hessian matrix of parameter estimates is equal to

\[H = [S^\top S]^{-1}\]

vcov_log_odds<-function(x, p_kj, lambda, post){
  N<-nrow(x)
  K<-length(p_kj)
  J<-ncol(x)
  
  score_mat<-matrix(nrow = N, ncol = K*J+K-1)
  
  # Scores for the conditional response probabilities (log-odds scale)
  for(k in 1:K){
    for(j in 1:J){
      score_mat[,(J*(k-1)+j)]<-post[,k]*(x[,j]-p_kj[[k]][j])
    }
  }
  
  # Scores for the K-1 free mixing proportions
  if(K > 1){
    for(k in 2:K){
      score_mat[,(J*K+k-1)]<-post[,k]-lambda[k]
    }
  }
  
  # Covariance matrix: generalized inverse (MASS::ginv) of the empirical information S'S
  MASS::ginv(t(score_mat)%*%score_mat)
}

This is useful if we are interested in the standard errors of \(\theta_{jk}\); however, users may be more interested in the conditional response probabilities themselves. To rescale the covariance matrix to the raw probability metric, we use the delta method, which recovers the variance of a transformed parameter. A full description of the delta method is beyond the scope of this article (it would take another couple of pages), but I would point you to the poLCA user manual and McLachlan and Peel’s book on finite mixture models. Both are excellent resources.

vcov_p_kj<-function(p_kj, vcov.log_odds, K, J){
  
  vcov.log_odds<- vcov.log_odds[1:(K*J), 1:(J*K)]
  
  
  # Empty block diagonal Jacobian (two rows per item-class combination)
  jac<-matrix(0, nrow = 2*(K*J), ncol = (K*J))
  
  # Defining counters for moving through the block diagonal matrix
  rpos <- cpos <- 1
  
  for (k in 1:K){
    for (j in 1:J){

      # Defining the block diagonal entries for this item-class combination
      p_kj_vec<-c(p_kj[[k]][j], 1-c(p_kj[[k]][j]))
      Jsub <- -(p_kj_vec %*% t(p_kj_vec))
      diag(Jsub)<-p_kj_vec *(1-p_kj_vec)
      Jsub<- Jsub[,-1]
      
      jac[rpos:(rpos+1), cpos]<- Jsub
      
      rpos<-rpos+2
      cpos<-cpos+1
    }
  }
  jac %*% vcov.log_odds %*% t(jac)
  
}

Putting Everything Together

my_lca<-function(x, k = 2, epsilon = 1e-08, maxit = 10000, verb = TRUE){
  #Ensuring x is matrix
  x<-as.matrix(x)
  
  if(!all(apply(x, 2, function(col){length(unique(col))})==2)){
    stop("This function was not written for non-binary LCA. It needs a few more tweaks to generalize the algorithm.")
  }
  
  n <- nrow(x) #observations
  p <- ncol(x) #variables
  
  #Initializing Values
  lambda<-rep(1/k, times = k)
  
  p_kj<-list()
  
  for(i in 1:k){
  p_kj[[i]]<-runif(p, min = .01, max = .99)
  }
  
  
  # Initializing while() loop values
  diff <- 1
  iter <- 0 
  ll<-0
  
  restarts<-0
  
  while((diff>epsilon| iter<=5 )& iter < maxit){
    
    if(verb & iter != 0){
          cat("iteration=", iter, "diff=", diff, "log-likelihood",
        ll, "\n")
    }
    
  # Expectation step: individual weighted likelihoods and overall log-likelihood
  ind_ll<-expectation_ind_ll(k, n, x, p_kj, lambda)
  new_ll<-sum(log(rowSums(ind_ll)))

  
  # Maximization step: posterior responsibilities, mixing proportions, and response probabilities
  post<-maximization_post(ind_ll)
  lambda<-maximization_lambda(post)
  p_kj<-maximization_p(post, x, k)
  
  # Checking the change in log-likelihood for convergence
  diff<-new_ll-ll
  ll<-new_ll
  iter<-iter+1
  }
  
  if(diff<epsilon){
    cov_code <- 0
  } else{
    cov_code <- 1
  }
  
  vcov.log_odds<-vcov_log_odds(x, p_kj, lambda, post)
  
  vcov.p_kj<-vcov_p_kj(p_kj, vcov.log_odds, k, p)
  
  vcov_diag<-diag(vcov.p_kj)
  
  # Near-perfect class separation can result in small negative variances; truncate at zero
  vcov_diag[vcov_diag<0]<-0
  
  
  se_p_kj<-sqrt(vcov_diag)[seq(1, length(vcov_diag)-1, by = 2)]
  
  se_p_kj<-matrix(se_p_kj, ncol = k)
  
  
  p_kj<-as.data.frame(p_kj)%>%
    rownames_to_column(var = "item")

  colnames(p_kj)[2:(k+1)]<-paste0("class_", 1:k)
  
  se_p_kj<-cbind(p_kj$item, se_p_kj)
  
  colnames(se_p_kj)<-colnames(p_kj)

    list(run_stats = data.frame(likelihood = ll,
                              iterations = iter,
                              convergence = cov_code),
       item_probs = p_kj,
       item_prob_se = se_p_kj,
       lambda = lambda,
       post = post)
}

Taking the Function for a Spin

Let’s compare results! The mixing proportions and conditional probabilities from our function and poLCA are nearly identical!

m1<-my_lca(g_test_mat, k = 3, verb = TRUE)
## iteration= 1 diff= -16487.78 log-likelihood -16487.78 
## iteration= 2 diff= 6633.942 log-likelihood -9853.839 
## iteration= 3 diff= 1551.991 log-likelihood -8301.848 
## iteration= 4 diff= 476.2723 log-likelihood -7825.576 
## iteration= 5 diff= 28.64074 log-likelihood -7796.935 
## iteration= 6 diff= 1.814964 log-likelihood -7795.12 
## iteration= 7 diff= 0.1572329 log-likelihood -7794.963 
## iteration= 8 diff= 0.01354843 log-likelihood -7794.949 
## iteration= 9 diff= 0.001167651 log-likelihood -7794.948 
## iteration= 10 diff= 0.000100642 log-likelihood -7794.948 
## iteration= 11 diff= 8.674428e-06 log-likelihood -7794.948 
## iteration= 12 diff= 7.476474e-07 log-likelihood -7794.948 
## iteration= 13 diff= 6.443952e-08 log-likelihood -7794.948
p<-m1$item_probs%>%
  pivot_longer(-item, names_to =  "class")%>%
  mutate(class_fill = case_when(
    class == "class_1" ~ paste("Class 1 p =", round(m1$lambda[1], 2)),
    class == "class_2" ~ paste("Class 2 p =", round(m1$lambda[2], 2)),
    class == "class_3" ~ paste("Class 3 p =", round(m1$lambda[3], 2))
  ))%>%
  left_join(key%>%
              mutate(item = paste("item", item, sep = "_")))%>%
  mutate(item = if_else(correct == 0, paste0(item, "*"), item),
         p = round(value, digits = 2))%>%
  ggplot(aes(x = class, y = p, fill = class_fill))+
  geom_bar(stat = "identity")+
  facet_wrap(~item, ncol = 4)+
  labs(y = "Class Conditional Probability Under From Scratch Model",
       x = "",
       title = "Class Conditional Probabilities Across Items",
       subtitle = "Asterisk denote items where response option 1 was incorrect")+
  ylim(c(0,1))+
  theme_bw()+
  guides(fill = guide_legend(title = ""))+
  theme(panel.grid = element_blank())

ggplotly(p, tooltip = c("x", "y"))

The home-brewed standard errors are pretty close to the poLCA standard errors as well!

m1$item_prob_se
##       item      class_1              class_2              class_3
##  [1,] "item_1"  "0.0301122072762775" "0.0168658464002584" "0"    
##  [2,] "item_10" "0.0289089264472541" "0.0171157940312791" "0"    
##  [3,] "item_11" "0.0288477263987862" "0.0171292904718211" "0"    
##  [4,] "item_12" "0.0295584393782964" "0.0166568529786356" "0"    
##  [5,] "item_13" "0.029075914881287"  "0.0173250529884975" "0"    
##  [6,] "item_14" "0.0289488107050378" "0.015992255495603"  "0"    
##  [7,] "item_15" "0.0287712645232088" "0.0162425333245048" "0"    
##  [8,] "item_16" "0.0289753210802592" "0.0173659155025247" "0"    
##  [9,] "item_17" "0.0288426871042576" "0.0169143839218697" "0"    
## [10,] "item_18" "0.0290942711768932" "0.0177866465896614" "0"    
## [11,] "item_19" "0.0289900933887822" "0.0179809173917378" "0"    
## [12,] "item_2"  "0.0294455072695228" "0.0178123926819375" "0"    
## [13,] "item_20" "0.0291951647520383" "0.0162715512820208" "0"    
## [14,] "item_3"  "0.0292978275057935" "0.0154512825733154" "0"    
## [15,] "item_4"  "0.0291433069984354" "0.0168189138391922" "0"    
## [16,] "item_5"  "0.0288202049345414" "0.0174157177522282" "0"    
## [17,] "item_6"  "0.0291948658639602" "0.0161727970653175" "0"    
## [18,] "item_7"  "0.0286634713766607" "0.015620835049537"  "0"    
## [19,] "item_8"  "0.0285864482877933" "0.0171355661331623" "0"    
## [20,] "item_9"  "0.0290307466123001" "0.0189758819553674" "0"

Conclusion

I hope this post has provided you with a useful illustration of latent class analysis. Come back later and check out my other work! I have grand plans for my next few posts. Recently, I have been doing some work with naive Bayes models, a simple yet powerful classification algorithm that easily scales to massive data sets. The underlying principles behind latent class analysis and naive Bayes are remarkably similar, so perhaps I will cover naive Bayes next time. I am also hoping to cover the AdaBoost algorithm and introduce the concept of boosting; variations of AdaBoost are often considered among the best off-the-shelf classifiers available. Whether it ends up being naive Bayes or AdaBoost, I look forward to working on my next post!

Data Generation Code

set.seed(12345)

n<-20000

key<- data.frame(item = 1:20, 
                 correct = sample(c(0,1), size = 10, replace = TRUE))

g_test_long<-expand.grid(id = 1:1000, 
                    item = 1:20)%>%
  left_join(key)%>%
  mutate(answer = case_when(
      id<=330 ~ sample(c(0,1), size = n, replace = TRUE),
      id<=660 ~ 0,
      id>660&correct == 1 ~ sample(c(1, 0), size =n, prob = c(.9, .1), replace = TRUE), # overspecified logic but illustrative
      TRUE ~ sample(c(1, 0), size =n, prob = c(.1, .9), replace = TRUE)
    ),
    item = paste0("item_", item))

