# Probabilistic Factor Analysis Methods

Joseph H. Sakaya & Suleiman A. Khan

15 - 19 May, 2017

This tutorial presents an overview of probabilistic factor analysis
I cannot conceal the fact here
that in the specific application
of these rules, I foresee many
things happening which can
cause one to be badly mistaken if
he does not proceed cautiously. *James Bernoulli*
methods as used in practice. It uses the Stan programming language
for model specification and inference. We will study the
Bayesian variants of four models viz. PCA, CCA, GFA and MTF in addition
to topic modelling. After corroborating these models on artificial data,
we will apply them on real world data sets and examine their behaviour in terms of
interpretability.

## Contents

## Getting Started

Stan is a Turing-complete probabilistic programming language used for performing
statistical inference of Bayesian models. A minimal
Stan model is defined by program blocks: `data`

, `parameters`

, `transformed parameters`

and `model`

. Of these, all except the `model`

block are optional. Consider, for instance, inferring a Gaussian with an unknown mean and variance, defined as
\begin{align}
\mu &\sim \mathcal{N}(0,10)\\
x &\sim \mathcal{N}(\mu, \sigma^2)
\end{align}
This is easily represented in Stan as follows,

` ````
gaussian <- "
data {
int<lower=1> n;
vector[n] x;
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
mu ~ normal(0, 10);
x ~ normal(mu, sigma);
} "
library("rstan")
n <- 1000
x <- rnorm(n, 5, 10)
data <- list(x=x, n=n)
m <- stan_model(model_code = gaussian)
samples <- sampling(m, data=data, iter=10000, chains=1)
mu <- mean(extract(samples)$mu)
sigma <- mean(extract(samples)$sigma)
```

The structure of this snippet is simple. First, generate or load
existing data; then, initialise and indicate the parameters to
monitor; lastly, run the inference. Stan offers several inference
alternatives including efficient sampling strategies such as
Hamiltonian Monte Carlo (HMC) as well as optimisation-based
inference like variational Bayes (no need to bother with the
details though). Note that we have not specified any prior for
`sigma`

. Stan assumes uniform prior over such parameters
by default, rendering the model non-conjugate. Stan
can perform inference on non-conjugate models.
This allows you to pay more attention to modelling
your data faithfully rather than fretting over the ugly details
of inference that such non-conjugate models otherwise
entail.

### Required reading

In order to familiarise yourself with Stan spend some time with
part two, **II Stan Modeling Language**, of the Stan reference. Most constructs in Stan are similar to programming languages you already know — you need to spend just enough time to be conversant in:

- Primitive data types such as
`int`

and`real`

. - Linear algebra data types such as
`vector`

and`row_vector`

and`matrix`

. - Placing
`upper`

and`lower`

constraints on the data types. - Sampling statements such as
`y ~ normal(mu, sigma)`

or`y ~ gamma(alpha, beta)`

. -
`for`

loops and traversal order of linear algebra data types. - Rewriting code in vectorized form.
- What the various blocks mean as well as the use of optional blocks such as the
`transformed parameters`

block. - Marginalisation of discrete parameters. Read chapter
**12. Finite Mixture**from the Stan reference.

If your answer to the above is in the affirmative, feel free to handle the questions. We provide starter code only in R (although we promise help with Python as well). Feel free to ask for help anytime.

### Using R and Python in the Cubbli machines

R users can simply use the terminal or RStudio. The package `rstan`

comes already installed. Use `require(rstan)`

to import the library.

Python users can use `pystan`

installed in the virtual environment:

$ source /cs/group/r-course/pystan/bin/activate

Alternatively, you're free to install your own versions of `rstan`

and `pystan`

.

### Marking scheme

The exercises sessions held from May 15 to May 18 carry a total of 10 points. Each session gives you 2.5 points.

### Questions

Implement the following models with Stan. Use vectorized code and the
`transformed parameters`

block where appropriate. You will need
at least 10 of 15 points to complete today's exercise.

- Beta-Bernoulli model
*[½ pt]*\begin{align} y &\sim \text{Bern}(\theta)\\ \theta &\sim \text{Beta}(\alpha, \beta) \end{align} Starter code and solution. - Beta-Binomial model
*[½ pt]*\begin{align} y &\sim \text{Bin}(N, \theta) \\ \theta &\sim \text{Beta}(\alpha, \beta) \end{align} Starter code and solution. - Poisson-Gamma model
*[½ pt]*\begin{align} y &\sim \text{Poisson}(\lambda) \\ \lambda &\sim \text{Gamma}(\alpha_0, \beta_0) \\ \end{align} Starter code and solution. - Normal with location
*[½ pt]*\begin{align} y &\sim \mathcal{N}(\mu, \sigma_0^2) \\ \mu &\sim \mathcal{N}(\mu_0, \sigma_0^2) \end{align} Starter code and solution. - Normal with location and scale
*[2 pts]*\begin{align} y &\sim \mathcal{N}(\mu, \sigma^2) \\ \mu &\sim \mathcal{N}(\mu_0, \sigma_0^2) \\ \text{and, }\sigma^2 &\sim \text{InvGam}(\alpha_0, \beta_0) \\ \text{or, }\sigma &\sim \text{logNormal}(\mu_0, \sigma_0^2)\\ \text{or, }\sigma &\sim \text{Cauchy}(\mu_0, \sigma_0) \\ \end{align} Starter code and solution. - Multivariate normal with location and scale
*[2 pts]*\begin{align} \mathbf y &\sim \mathcal{N}(\boldsymbol\mu, \sigma^2 \mathbf I) \\ \boldsymbol\mu &\sim \mathcal{N}(\boldsymbol\mu_0, \sigma_0^2 \mathbf{I}) \\ \sigma^2 &\sim \text{InvGam}(\alpha_0, \beta_0) \\ \end{align} Starter code and solution. - Bayesian linear regression
*[2 pts]*\begin{align} \tau &\sim \text{Gam}(\alpha_0, \beta_0)\\ \sigma^2 &\sim \text{InvGam}(\alpha_0, \beta_0) \\ \mathbf w &\sim \mathcal{N}(0, \tau^{-1} \mathbf I)\\ y &\sim \mathcal{N}(\mathbf w^{\top} \mathbf x, \sigma^2)\\ K &= \text{number of dimensions} \end{align} Starter code and solution. - Bayesian linear regression with ARD prior
*[2 pts]*\begin{align} \alpha_{i=1\ldots K} &\sim \text{Gam}(\alpha_0, \beta_0)\\ \sigma^2 &\sim \text{InvGam}(\alpha_0, \beta_0) \\ \mathbf w &\sim \mathcal{N}(0, \boldsymbol \alpha^{-1} \mathbf I)\\ y &\sim \mathcal{N}(\mathbf w^{\top} \mathbf x, \sigma^2)\\ K &= \text{number of dimensions} \end{align} Starter code and solution. - Bayesian logistic regression with ARD prior
*[2 pts]*\begin{align} \alpha_{i=1\ldots K} &\sim \text{Gam}(\alpha_0, \beta_0)\\ \mathbf w &\sim \mathcal{N}(0, \boldsymbol \alpha^{-1} \mathbf I)\\ y &\sim \text{Bern}(\sigma(\mathbf w^{\top} \mathbf x))\\ K &= \text{number of dimensions} \end{align} Starter code and solution. Take a moment to appreciate the fact that you can solve in a trivial piece of code what once constituted a paper. - Normal mixture with isotropic covariance and marginalisation
*[3 pts]*\begin{align} \mathbf{\pi} & \sim \operatorname{Dir}(\alpha_0) \\ \mathbf{\sigma}^2_{i=1 \dots K} & \sim \text{InvGam}(\alpha_0, \beta_0) \\ \boldsymbol{\mu}_{i=1 \dots K} & \sim \mathcal{N}(\boldsymbol{\mu}_0, \sigma_0^2 \mathbf I) \\ \mathbf{z}_{i = 1 \dots N} & \sim \operatorname{Cat}(\mathbf{\pi}) \\ \mathbf{x}_{i=1 \dots N} & \sim \mathcal{N}(\boldsymbol{\mu}_{z_i}, {\sigma^2_{z_i}} \mathbf I) \\ K &= \text{number of mixing components} \\ N &= \text{number of data points} \end{align} Starter code and solution. Modify your code to replace the isotropic covariance with diagonal covariance.

## Principal Component Analysis

Matrix factorisation is the decomposition of a matrix into a product of two or more matrices,
with a mind to capture the interaction between two entities in terms of lower-dimensional latent
features. It involves modelling correlations in
higher-dimensional data as a lower-dimensional, oriented subspace.
Consider a matrix $\mathbf{X} \in \Re^{n \times d}$
generated by a linear transformation of a zero-mean, unit-variance
Gaussian matrix $\mathbf{Z} \in \Re^{n \times d}$ and some additional
zero mean diagonal covariance Gaussian noise $\boldsymbol{\epsilon}$,
where $k \leq d$,
$$ \mathbf{x}_n = \mathbf{W}\mathbf{z}_n + \boldsymbol{\epsilon}_n$$
$$ \mathbf{z}_n \sim \mathcal{N}(0, \mathbf{I}_k)\qquad \boldsymbol \epsilon_n \sim \mathcal{N}(0, \mathbf{\Psi}) $$
where $\mathbf{W} \in \Re^{k \times d}$ constitutes the linear
transformation, also known as the *factor loadings* or the
*projection* matrix. Marginalising over $\mathbf{z}_n$, it is not
hard to see that $\mathbf{x}_n \sim \mathcal{N}(0,\mathbf{WW}^\top +
\mathbf{\Psi})$. By constraining the noise to be a diagonal
covariance matrix, $\mathbf{x}_n$ becomes conditionally
independent for a given $\mathbf{z}_n$. This results in
the $\boldsymbol \epsilon_n$ capturing the independent variance unique to a axis, while $\mathbf{W}$ captures the rest of the correlations
presumed to be of interest. The components of the resulting latent
vectors $\mathbf{z}_n$ are uncorrelated. Factor analysis differs from
traditional PCA in that it differentiates between the variance and the
correlations. This can be rectified by replacing $\mathbf \Psi$ with an
isotropic covariance, in which case the model simply reduces to
PCA.
Thus, PCA can be easily re-interpreted as a linear-Gaussian latent variable model.
However, a fully Bayesian treatment of PCA involves introducing priors over $\mathbf{W}$ and $\boldsymbol{\alpha}$.

*Figure 1: *Hinton diagrams of the matrix $W$ in which each element is depicted as a square (blue for negative, red for positive) with its intensity proportional to the magnitude of that element. It is seen that the extraneous dimensions are effectively set to zero and the model accurately deduces the true number of components to be $5$.

### Optional reading

Bishop's paper on Bayesian PCA provides an excellent description of the model.

### Questions

To claim 2.5 points today, you have to implement Bayesian PCA with Stan and test it on a gene expression measurement dataset.

The generative model for Bayesian PCA with ARD prior is given as follows: \begin{align} \mathbf{x}_n &\sim \mathcal{N}(\mathbf W\mathbf z_n, \tau^{-1} \mathbf I_D)\\ \mathbf{W}_k|\alpha_k &\sim \mathcal{N}(0, \alpha_k^{-1} \mathbf I_D)\\ \alpha_k &\sim \text{Gam}(\alpha_0, \beta_0)\\ \tau &\sim \text{Gam}(\alpha_0, \beta_0)\\ \mathbf z &\sim \mathcal{N}(0, \mathbf I_K)\\ N &= \text{number of data points}\\ D &= \text{dimension of data point} \\ K &= \text{latent dimension} \\ \end{align}

- Write the Stan snippet of the generative model.
- Run both the NUTS sampler and VB on the toy data set and reproduce Figure 1.
- Vectorize the code.

You will need no more than 2 `transformed parameters`

block and the other in the `model`

block.

*Figure 2: *PCA latent variables identify a well-known biological process in cancer cells

Consider a matrix of gene expression measurements of a set of genes $(D = 1106)$ over a set of drugs $(N = 78)$. Each cell in the matrix indicates if the gene (column) is up-regulated (positive) or down-regulated (negative). It is hypothesised that such data can be used to understand the mechanisms (genes) through with drugs affect the cells. The matrix is the response of a given type of cancer cell and will constitute our input matrix. Run the model on the data set. For R users, access to the data set is already provided in the starter code. Python users can access it here.

To interpret the results, inspect $\boldsymbol \alpha$ and choose a component $k$ , explaining your choice. For the selected component $k$,

- Collect the row indices of the 10 highest and lowest scores of the $k^{th}$ component of $\mathbf Z$ , and
- The column indices of the 20 highest scores of the $k^{th}$ component of $\mathbf W$ .

Subset the original data matrix $\mathbf X$ , using the row and column indices, and plot the heat map of the resulting sub-matrix similar to the one shown in Figure 2. Convince yourself that this particular strategy lends itself well to interpretation.

Starter code and solution.

### Extra (optional)

*Figure 3: * The log of the estimated ARD matrix shown as Hinton diagrams. Blue corresponds to active components while red corresponds to inactive ones.

*Figure 4: * The true latent components used for generating the data. The first two components are shared between the two views, while the last two are specific to one view each.

*Figure 5: * The estimated latent components of our model. Note that the components can be invariant to order

The generative story of Bayesian PCA for continuous data goes as follows,
\begin{align}
\mathbf{W}_k &\sim \mathcal{N}(0, \tau^{-1} \mathbf I_D)\\
\mathbf z &\sim \mathcal{N}(0, \mathbf I_K)\\
\mathbf{x}_n &\sim \mathcal{N}(\mathbf W\mathbf z_n, \tau_m^{-1} \mathbf I_D)\\
\end{align}
Is the claim *Latent Dirichlet Allocation (LDA) is PCA for discrete data* true or false? Use the generative model for LDA to justify your position either way.

## Canonical Correlation Analysis and Group Factor Analysis

For two data vectors $\mathbf x^1$ and $\mathbf x^2$, efficient inference on Bayesian
CCA can be done by introducing a group-wise sparsity prior for
the projection matrix, as was explained in class. Recall that
the IBFA model,
\begin{align}
\mathbf x^{(m)} &\sim \mathcal{N}(\mathbf A^{(m)}\mathbf z + \mathbf B^{(m)}\mathbf z^{(m)}, \tau^{(m)} \mathbf I)\\
\mathbf z^{(m)} &\sim \mathcal{N}(\mathbf 0,\mathbf I)\\
\mathbf z &\sim \mathcal{N}(\mathbf 0,\mathbf I)
\end{align}
captures the shared correlation common to both data sets with
the shared latent variable $\mathbf{z}$ and models the variation within
a data set with the view-specific latent variable $\mathbf z^{(m)}$. Now, by
introducing the group-wise sparsity on $\mathbf W$ as,
$$\mathbf W = \begin{bmatrix}
\mathbf A^{(1)} &\mathbf B^{(1)} & \mathbf 0\\
\mathbf A^{(2)} & \mathbf 0 & \mathbf B^{(1)}
\end{bmatrix} $$
$$\mathbf \Sigma = \begin{bmatrix}
\mathbf (\tau^{(1)})^{-1} \mathbf I & \mathbf 0\\
\mathbf 0 & \mathbf (\tau^{(2)})^{-1} \mathbf I\\
\end{bmatrix} $$
and concatenating $\mathbf z = [\mathbf z^{(1)} ,\mathbf z^{(2)}]$, IBFA can alternatively be written
as,
\begin{align}
\mathbf x &\sim \mathcal{N} (\mathbf{Wz}, \mathbf\Sigma)\\
\mathbf z &\sim \mathcal{N}(\mathbf 0,\mathbf I )
\end{align}
*In other words, we are now effectively analysing the feature-wise concatenation of the data sources with a single latent variable model with diagonal noise covariance, and a projection
matrix with a specific structure, enforced by introducing an
ARD prior for each view. *

Bayesian GFA can be interpreted as an extension of factor models, where the task is to model relationships between data sets (or views), describing dependencies between views rather than the independent variables. Alternatively, it can also be thought of as an extension of Bayesian CCA to more than 2 views.

"The figure above illustrates group factor analysis of three data sets or views. The feature-wise concatenation of the data sets $\mathbf X_m$ is factorised as a product of the latent variables $\mathbf Z$ and factor loadings $\mathbf W$. The factor loadings are group-wise sparse, so that each factor is active only in some subset of views (or all of them).
from *Bayesian Group Factor Analysis*, Virtanen et al.
The factors active in just one of the views model the structured noise, variation independent of all other views, whereas the rest model the dependencies. The nature of each of the factors is learnt automatically, without needing to specify the numbers of different factor types (whose number could be exponential in the number of views) beforehand."

### Optional Reading

You can refer to the Bayesian CCA as well as the Bayesian GFA paper for more details.

*Figure 6: * The ARD matrix $\alpha$ used in data generation is shown. Blue corresponds to active components and red to inactive ones.

*Figure 7: * The log of the estimated ARD matrix. Inactive components are in red.

### Questions

You will have to complete at the very least the Bayesian CCA model to claim 2.5 points for today's exercise session. Bayesian GFA is trivially extendable.

Bayesian CCA is formally expressed as follows: \begin{align} \mathbf x_n^{(m)} &\sim \mathcal{N} ( \mathbf W^{(m)} \mathbf z_n , ( \tau^{( m )} )^{-1}\mathbf I)\\ \mathbf W_k^{(m)}| \alpha_k^{(m)} &\sim \mathcal{N} ( 0, ( \alpha_k )^{-1} \mathbf I)\\ \alpha_k^{(m)} &\sim \text{Gamma}( \alpha_0 , \beta_0 )\\ \tau^{( m )} &\sim \text{Gamma}( \alpha_0 , \beta_0 )\\ \mathbf z_n &\sim \mathcal{N} ( 0, \mathbf I_k ) \end{align}

- Write the Stan snippet for Bayesian CCA. Start with the Stan code for Bayesian PCA (you can access the solution to Bayesian PCA).
- Reproduce the ARD matrix in Figure 3.
- Reproduce the latent components shown in Figure 5.
- If we were to ignore the structure in $\mathbf W$ and assume a single $\tau$ across all views, what would this model reduce to?

Starter code and solution.

The formulation of Bayesian GFA is the same as for CCA, except $m$ is extended to more than two views.

- If your Stan code for Bayesian CCA is correct, it should be easily extendable to Bayesian GFA.
- The data generation uses the ARD matrix in Figure 6.
- Reproduce the ARD matrix in Figure 7. Remember that the model is invariant to order, so you don't need to reproduce it
*exactly*.

Starter code and solution.

## Tensor factorization

Canonical decomposition(CP) factorizes a
tensor into low-dimensional latent variables in all the
modesA tensor of mode 2 is a matrix. For this exercise, we use a tensor of mode 3.
*Figure 8: * Illustration of tensor factorization. The tensor $\mathcal{Y}$ can be factorized into a low-dimensional component space $\mathbf X$, $\mathbf W$ and $\mathbf U$ which represents relationships across the drugs, genes
and cancers.
as shown in Figure 8.
The latent variables are also commonly referred to as factors or components.
For the tensor $\mathcal{Y} \in \mathbb{R}^{N \times D \times L}$, CP identifies latent variables
with $K$ components in all the modes $\mathbf X \in \mathbb{R}^{N \times K}$,
$\mathbf W \in \mathbb{R}^{D \times K}$ and $\mathbf U \in \mathbb{R}^{L \times K}$ as
$$\mathcal{Y} \approx \sum_{k=1}^{K} \mathbf x_k \circ \mathbf w_k \circ \mathbf u_k$$
where $\circ$ is the outer product.

*Since you are not expected to know about tensor factorization methods until tomorrow, you can imagine
tensor factorization to be an extension of Bayesian PCA for tensors. *
*Figure 9: * Illustration of Bayesian tensor factorization with ARD. The latent components of $\mathbf X$ (top), $\mathbf W$ (middle) and $\mathbf U$ (bottom) are shown, with the ARD switching off the extraneous components.

### Questions

Given tensor $\mathcal{Y} \in \mathbb{R}^{N \times D \times L}$, the Bayesian formulation of the CP factorization for a three-mode tensor into corresponding latent variables (or components) $\mathbf X \in \mathbb{R}^{N \times K}$, $\mathbf W \in \mathbb{R}^{D \times K}$ and $\mathbf U \in \mathbb{R}^{L \times K}$, with ARD priors on $\mathbf W$, is given as follows:

\begin{align} {y}_{n,d,l} &\sim \mathcal{N} \left( \sum_{k=1}^K \left(x_{n,k} \cdot w_{d,k} \cdot u_{l,k}\right) , \tau^{-1}\right)\\ {x}_{n,k} &\sim \mathcal{N}(0,1 )\\ {u}_{l,k} &\sim \mathcal{N}(0,1 )\\ {w}_{d,k} &\sim \mathcal{N}(0, \alpha_k^{-1} )\\ \tau &\sim \text{Gam}(\alpha_0, \beta_0)\\ \alpha_k &\sim \text{Gam}(\alpha_0, \beta_0)\\ \end{align} An ideal way to write a Stan solution for this is as follows:

- Rewrite the matrix multiplication $\mathbf Z * \mathbf W'$ in the Bayesian PCA snippet as a sum of outer products over $\sum_{k=1}^{K} \mathbf z_k \circ \mathbf w_k$.
- Then extend the outer product sum to the 3 matrices $\mathbf X$, $\mathbf W$ and $\mathbf U$ as $\sum_{k=1}^K (x_{n,k} \cdot w_{d,k} \cdot u_{l,k})$
- Run the Stan code on the toy data set and reproduce the plots in Figure 9.
- If you are able to, try and optimize the code.

Starter code and solution.

## Epilogue

We trust that the exercises have proven to be of value in furthering your understanding of probabilistic factor models. Naturally, learning a new language from scratch and implementing complicated models is challenging, so some of the exercises might have proven harder than expected — which is why we provided you with rather copious (and admittedly buggy) starter codes.

Do remember to leave feedback about the exercises on the course feedback form. If you are at all interested in probabilistic programming languages and black-box inference methods, now is a great time to work in these areas. There are several such popular languages under active development:

You have mostly been using variational Bayes(VB) for inference. In contrast to MCMC methods, VB is extremely scalable and efficient. If you want to know more about the latest trends in such inference methods, you are welcome to look up the NIPS workshop held every year on approximate inference.

That is all, folks.