In this homework we will write a Gaussian mixture estimator and apply it to data from the US. As in the other homeworks we will be using R. The way this will work is that I will provide pieces of code that you will need to put together to solve the overall task.

The goal in this homework is to estimate a model of earnings using three consecutive observations \((Y_1,Y_2,Y_3)\). As we covered in class, we are going to use a latent heterogeneity model. We then write down a model for \(\{Y_t,\eta_t\}_{t=1,\dots,3}\).

In the first part of the homework we consider a normal mixture model. The latent variable \(\eta_i\) is drawn from a discrete distribution with probabilities \(p_k\); the wages are then drawn independently according to time-specific normal distributions centered at \(\mu_{kt}\) with variance \(\sigma^2_{kt}\). As we covered in class, the likelihood is given by:

\[ L(Y_{1},Y_{2},Y_{3};\theta)=Pr[Y_{1}{=}y_{1},Y_{2}{=}y_{2},Y_{3}{=}y_{3};\theta]=\sum_{k=1}^{K}p_{k}\prod_{t=1}^{3}\phi(Y_{t};\mu_{kt},\sigma_{kt}) \]

# some imports
require(gtools)       # rdirichlet
require(data.table)
require(ggplot2)
require(reshape)      # melt
require(SparseM)      # as.matrix.csr, slm.wfit (used in the maximization step)
require(readstata13)

Expectation-maximization algorithm for Gaussian Mixture

The algorithm consists of an expectation step and a maximization step. I provide some guidance on how to code them. We are going to write an EM estimator for a Gaussian mixture with K components. Each component has its own mean \(\mu_k\) and variance \(\sigma^2_k\). Each component also has a proportion that we will call \(p_k\). The data will be a sequence of wages; we only need 3 consecutive observations, so we will focus on those.

The Expectation step

We can write the posterior probability of a given component \(k\) given \(Y_1,Y_2,Y_3\), taking our current parameters \(p_k,\mu_{kt},\sigma_{kt}\) as given. The posterior probabilities \(\omega_{ik}\) are given by

\[ \omega_{ik} = Pr[k|Y_{i1},Y_{i2},Y_{i3}] = \frac{p_k \prod_t \phi(Y_{it};\mu_{kt},\sigma_{kt}) }{ \sum_{l} p_{l} \prod_t \phi(Y_{it};\mu_{lt},\sigma_{lt})} \]

This gives us the posterior probabilities that we can use in the maximization step. Here is some guidance on computing the likelihood on the computer:

  • It is usually better to compute everything in logs, and to compute the sum over \(k\) using a logsumexp function that subtracts the highest value. This avoids underflow.
  • I recommend using the closed-form expression of the log normal density directly.
lognormpdf <- function(Y,mu=0,sigma=1)  {
  -0.5 * (  (Y-mu) / sigma )^2   - 0.5 * log(2.0*pi) - log(sigma)  
}

logsumexp <- function(v) {
  vm = max(v)
  log(sum(exp(v-vm))) + vm
}
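
As a quick check of why this matters: a naive computation of \(\log \sum_k \exp(v_k)\) underflows for large negative values, while the logsumexp version does not:

  # naive version underflows to -Inf
  log(sum(exp(c(-1000,-1001))))
  # logsumexp subtracts the max first and returns roughly -999.69
  logsumexp(c(-1000,-1001))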

And here is an example of how I would implement it, where A holds the means and S the standard deviations:

   tau = array(0,c(N,nk))
   lpm = array(0,c(N,nk))
   lik = 0
   for (i in 1:N) {
      ltau = log(pk)
      lnorm1 = lognormpdf(Y1[i], A[1,], S[1,])
      lnorm2 = lognormpdf(Y2[i], A[2,], S[2,])
      lnorm3 = lognormpdf(Y3[i], A[3,], S[3,])
      lall = ltau + lnorm1 + lnorm2 + lnorm3
      lpm[i,] = lall
      lik = lik + logsumexp(lall)
      tau[i,] = exp(lall - logsumexp(lall))
   }

The maximization step

Given our \(\omega_{ik}\) we can proceed to update our parameters using the first order conditions on the \(Q(\theta | \theta^{(t)})\) function. I will let you write the code to update the \(p_k\) term. For the means and variances, my favorite way of implementing it is to stack up the \(Y_{it}\) and duplicate them for each \(k\). Something along these lines:

  DY1     = as.matrix(kronecker(Y1 ,rep(1,nk)))
  DY2     = as.matrix(kronecker(Y2 ,rep(1,nk)))
  DY3     = as.matrix(kronecker(Y3 ,rep(1,nk)))
  Dkj1    = as.matrix.csr(kronecker(rep(1,N),diag(nk)))
  Dkj2    = as.matrix.csr(kronecker(rep(1,N),diag(nk)))  
  Dkj3    = as.matrix.csr(kronecker(rep(1,N),diag(nk)))  

Then you can easily recover the means and variances using the posterior weights with the following expressions:

  rw     = c(t(tau))
  fit    = slm.wfit(Dkj1,DY1,rw)
  A[1,]  = coef(fit)[1:nk]
  fit_v  = slm.wfit(Dkj1,resid(fit)^2/rw,rw)
  S[1,] = sqrt(coef(fit_v)[1:nk])

where slm.wfit is in the SparseM package. Note how you have to scale the residuals when using this function. You can edit this code to recover all means and variances at once!

Question 1: Write a function that takes the data in and estimates the parameters of the mixture model. To that end, you can use the supplied code if you want. You need to write a loop that alternates between the expectation and the maximization steps, and then add a termination condition. You can for instance check that the likelihood changes by very little.
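
To fix ideas, here is a minimal sketch of such a loop, assuming you have wrapped the expectation step in a function estep and the maximization step in a function mstep (both names, and the returned fields, are placeholders for your own code, not supplied functions):

  em.mixture <- function(Y1, Y2, Y3, model, tol = 1e-6, maxiter = 1000) {
    lik.old = -Inf
    for (iter in 1:maxiter) {
      # E step: posterior weights tau and log-likelihood at the current parameters
      res   = estep(Y1, Y2, Y3, model)            # placeholder for your E step
      # M step: update pk, A and S given the posterior weights
      model = mstep(Y1, Y2, Y3, res$tau, model)   # placeholder for your M step
      # termination: stop when the likelihood barely moves
      if (abs(res$lik - lik.old) < tol) break
      lik.old = res$lik
    }
    return(model)
  }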

Checking things with H and Q functions

We now want to make sure that the code is working. However, as usual, there are many possible sources of mistakes when writing such code. We know that the likelihood will always increase when running the EM. However, it is often the case that the likelihood increases even when the code is incorrect. A stronger check is to implement the \(Q\) and \(H\) functions of the EM algorithm, which both increase at each EM step.

Question 2: Extend your EM function to include the computation of the \(Q\) and \(H\) functions at every step. Note that this function needs to take in both the previous and the new parameters (or at least the last \(\omega_{ik}\)). They are very simple expressions of lpm and taum. If you compute them under \(\theta^{(t+1)}\) and \(\theta^{(t)}\) they are given by:

  Q1 = sum( ( (res1$taum) * res1$lpm ))
  Q2 = sum( ( (res1$taum) * res2$lpm ))
  H1 = - sum( (res1$taum) * log(res1$taum))
  H2 = - sum( (res1$taum) * log(res2$taum))
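
With these in hand, a simple sanity check at every iteration is that both quantities weakly increase, and that the observed log-likelihood equals their sum. A minimal sketch, assuming res1 and res2 hold the lpm and taum of two consecutive EM iterations as above:

  # both inequalities should hold (up to numerical noise) at every EM step
  stopifnot(Q2 >= Q1 - 1e-8)
  stopifnot(H2 >= H1 - 1e-8)
  # the observed log-likelihood under the new parameters should satisfy lik = Q2 + H2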

Testing your code

Here is a simple function that will generate random data for you to estimate from. Your estimation code should take such a model structure as a starting point and update its parameters. This way we can easily check whether it matches in the end.

model.mixture.new <-function(nk) {
  
  model = list()
  # model for Y1,Y2,Y3|k 
  model$A     = array(3*(1 + 0.8*runif(3*nk)),c(3,nk))
  model$S     = array(1,c(3,nk))
  model$pk    = rdirichlet(1,rep(1,nk))
  model$nk    = nk
  return(model)
}

and here is code that will simulate from it:

model.mixture.simulate <-function(model,N,sd.scale=1) {
  
  Y1 = array(0,sum(N)) 
  Y2 = array(0,sum(N)) 
  Y3 = array(0,sum(N)) 
  K  = array(0,sum(N)) 
  
  A   = model$A
  S   = model$S
  pk  = model$pk
  nk  = model$nk

  # draw K
  K = sample.int(nk,N,TRUE,pk)

  # draw Y1, Y2, Y3
  Y1  = A[1,K] + S[1,K] * rnorm(N) *sd.scale
  Y2  = A[2,K] + S[2,K] * rnorm(N) *sd.scale
  Y3  = A[3,K] + S[3,K] * rnorm(N) *sd.scale

  data.sim = data.table(k=K,y1=Y1,y2=Y2,y3=Y3)
  return(data.sim)  
}

Here is code that simulates a data set:

  model = model.mixture.new(3)
  data = model.mixture.simulate(model,10000,sd.scale=0.5) # simulating with lower sd to see separation
  datal = melt(data,id="k")
  ggplot(datal,aes(x=value,group=k,fill=factor(k))) + geom_density() + facet_grid(~variable) + theme_bw()

Question 3: Simulate from this model and use your function to estimate the parameters from the data. Show that you do recover all the parameters (plot estimated values versus true values). Finally, also report the sequence of values of the likelihood, the \(H\) function and the \(Q\) function. When your estimator updates the model, make sure it returns a list structure similar to the one presented above; this way you can use the simulation code right away.

Estimating on PSID data

Get the prepared data from Blundell, Pistaferri and Saporta-Eksten. To load this data you will need to install the package readstata13. You can do that by running:

install.packages('readstata13')

then you can load the data

  require(readstata13)
  data = data.table(read.dta13("~/Dropbox/Documents/Teaching/ECON-24030/lectures-laborsupply/homeworks/data/AER_2012_1549_data/output/data4estimation.dta"))

we start by computing the wage residuals

  fit = lm(log(log_y) ~ year + marit + state_st ,data ,na.action=na.exclude)
  data[, log_yr := residuals(fit)]

we then want to create a data-set in the same format as before. We can do this by selecting some given years and reshaping the data; below we do it by extracting lags with a keyed join.

  # extract lags
  setkey(data,person,year)
  data[, log_yr_l1 := data[J(person,year-2),log_yr]]
  data[, log_yr_l2 := data[J(person,year-4),log_yr]]

  # keep individuals with three consecutive observations
  fdata = data[!is.na(log_yr*log_yr_l1*log_yr_l2)][,list(y1=log_yr_l2,y2=log_yr_l1,y3=log_yr)]

This gives 4,941 observations to estimate your model!

Question 4: Use your function to estimate the mixture model on this data. Try estimating for different numbers of mixture components (3, 4, 5). For each set of parameters, report how much of the cross-sectional dispersion in wages can be attributed to permanent heterogeneity \(\eta_i\) and how much to the rest. Finally, we want to assess the fit of the model; to do that, simulate from your estimated model, then plot the quantiles of your simulated data in the cross-section versus those in the data.
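
For the decomposition part, one possibility is to use the law of total variance period by period, \(Var(Y_t) = Var_k(\mu_{kt}) + E_k[\sigma^2_{kt}]\), and to report the share of the first term. A minimal sketch, assuming model holds the estimated A, S and pk in the same format as the simulation code above:

  # share of cross-sectional variance attributed to the latent type, per period
  pk    = as.numeric(model$pk)
  mbar  = as.numeric(model$A   %*% pk)             # E_k[mu_kt]
  vb    = as.numeric(model$A^2 %*% pk) - mbar^2    # between-type variance Var_k(mu_kt)
  vw    = as.numeric(model$S^2 %*% pk)             # within-type variance E_k[sigma_kt^2]
  share = vb / (vb + vw)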

Estimating a model with auto-correlation

Here we want to make things more interesting and consider the following model:

\[ Y_{it} - \rho Y_{it-1} | k \sim N(\mu_{kt},\sigma^2_{kt}) \]

The code does not need any modification if we do things conditional on \(\rho\): you just need to feed in the quasi-differenced wages \(Y_t - \rho Y_{t-1}\). You run your algorithm on \(Y_2 - \rho Y_1\), \(Y_3 - \rho Y_2\) and \(Y_4 - \rho Y_3\). Here we use one additional time period. Here is the code to prepare your data:

  # extract lags
  setkey(data,person,year)
  data[, log_yr_l3 := data[J(person,year-6),log_yr]]

  # keep individuals with four consecutive observations
  fdata = data[!is.na(log_yr*log_yr_l1*log_yr_l2*log_yr_l3)][,list(y1=log_yr_l3,y2=log_yr_l2,y3=log_yr_l1,y4=log_yr)]

Question 5: Use your function to estimate the mixture model in differences using \(\rho=0.6\). Next, run your code on a grid for \(\rho\) between 0 and 1. Report the likelihood plot over the values of \(\rho\) and report the maximum likelihood estimator of \(\rho\).
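
A minimal sketch of the grid search, assuming your estimator is wrapped in a function em.mixture.diff (a placeholder name) that takes the three quasi-differenced series and returns the maximized log-likelihood:

  rho_grid = seq(0, 1, by = 0.05)
  liks = sapply(rho_grid, function(rho) {
    # quasi-difference the wages at this value of rho
    D1 = fdata$y2 - rho * fdata$y1
    D2 = fdata$y3 - rho * fdata$y2
    D3 = fdata$y4 - rho * fdata$y3
    em.mixture.diff(D1, D2, D3)$lik   # placeholder for your EM estimator
  })
  rho_hat = rho_grid[which.max(liks)]
  plot(rho_grid, liks, type = "b", xlab = "rho", ylab = "log-likelihood")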

Bonus question: nonparametric estimation using joint diagonalization

In class we saw that one could try to directly estimate the mixture components using SVD and eigenvalue decompositions. Multiple values of \(y_3\) can be pooled together into a joint diagonalization. The first step is to construct the \(A(y_3)\) matrix for several values of \(y_3\). Cut the values of \(y_3\) into groups and construct the matrix within each group (use 10 groups). You also need to discretize \(y_1\) and \(y_2\); use 20 points of support. This should give you 10 different matrices.

Then compute the \(A(\infty)\) matrix, which is the joint distribution of \((Y_1,Y_2)\) unconditional on \(y_3\). Compute the SVD decomposition of this matrix such that \(A(\infty) = U S V^\intercal\). We also have that \(A(\infty) = F D(\infty) F^\intercal\).

Next construct the matrices \(\tilde{A}(y_3) = S^{-1/2} U^\intercal A(y_3) V S^{-1/2}\). However, only keep a few of the vectors in \(U\) and \(V\): select the \(K\) vectors associated with the highest values in \(S\). The \(\tilde{A}(y_3)\) matrices should then be \(K\times K\).

Note that if the model is stationary, then \[ \begin{align} \tilde{A}(y_3) & = S^{-1/2} U^\intercal F D(y_3) F^\intercal V S^{-1/2} \\ & = S^{-1/2} U^\intercal F D(\infty) D(\infty)^{-1} D(y_3) F^\intercal V S^{-1/2} \\ & = Q D(\infty)^{-1} D(y_3) Q^{-1} \end{align} \]

And so at this point, to gain efficiency, we can use a joint diagonalization of all the \(\tilde{A}(y_3)\) matrices at once. The ffdiag command in the jointDiag package can perform this decomposition given a set of \(\tilde{A}(y_3)\) matrices; it will return the \(Q\) matrix.
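
A minimal sketch of the whitening step, assuming A_inf is the \(20\times 20\) joint distribution of the discretized \((Y_1,Y_2)\), A_list is the list of the 10 \(A(y_3)\) matrices, and K is the number of components (all of these names are placeholders):

  sv = svd(A_inf)
  # keep the K leading singular vectors and whiten each A(y3)
  W1 = diag(1/sqrt(sv$d[1:K])) %*% t(sv$u[,1:K])
  W2 = sv$v[,1:K] %*% diag(1/sqrt(sv$d[1:K]))
  At = lapply(A_list, function(A) W1 %*% A %*% W2)   # each matrix is K x K
  # stack into a K x K x 10 array for the joint diagonalization
  C  = array(unlist(At), c(K, K, length(At)))
  # require(jointDiag); ffdiag(C)  # check the package documentation for the exact interface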

Bonus question: Implement this procedure, run it on simulated data (imposing stationarity) and plot the true CDF versus the estimated CDF. Finally, run it on the data and report the estimated CDF!
