Let’s start with 1., because it’s complicated…
The “point” of learning a representation (be it with PCA or scVI or anything else) is that you are saying “I think there is structure in this data, where some observations are more similar to each other than others, which induces correlations among the observed variables.” If you know some samples should be similar, e.g. because they have the same treatment, or they are from the same batch, you can then expand this into “OK, let’s account for those sources of variation; what unknown structure remains in the data?” This is what batch correction / integration does.
It is easier to think about this in linear terms like limma/edgeR does, so let’s do that. Without prior knowledge, you say Y = WX, learn W and X by PCA (just as an example), and typically look at X. If you know e.g. batches B as a design matrix, you can do Y = WX + VB. You learn the weights in V (because you know B), then learn both W and X, and can look at X, which now has the variation not explained by B. In the simple linear case, this is the same as first fitting Y = VB to learn V, creating Y' := Y - VB, then doing the PCA by Y' = WX (this is called the Frisch–Waugh–Lovell theorem).
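The two-step version (regress out B, then PCA on the residuals) can be sketched in NumPy. This is a minimal toy under assumptions that are not from the post: the data is simulated, all names and sizes are made up, and rows are cells and columns are genes, so the products are written `B @ V` rather than VB.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_batches = 300, 50, 3

# Hypothetical toy data: rows are cells, columns are genes.
batch = rng.integers(0, n_batches, size=n_cells)
B = np.eye(n_batches)[batch]                         # one-hot batch design matrix
V_true = rng.normal(0, 3, size=(n_batches, n_genes))
X_true = rng.normal(size=(n_cells, 2))               # unknown structure
W_true = rng.normal(size=(2, n_genes))
Y = X_true @ W_true + B @ V_true + rng.normal(0, 0.1, size=(n_cells, n_genes))

def pca(M, k):
    """Return the first k principal component scores of M, computed by SVD."""
    Mc = M - M.mean(axis=0)
    left, s, _ = np.linalg.svd(Mc, full_matrices=False)
    return left[:, :k] * s[:k]

# Step 1: learn V by least squares, using the known design matrix B.
V_hat, *_ = np.linalg.lstsq(B, Y, rcond=None)

# Step 2: remove the variation explained by B.
Y_prime = Y - B @ V_hat

# Step 3: PCA on the residuals gives a representation without batch means.
X_hat = pca(Y_prime, 2)

# For comparison: PCA on the uncorrected data.
X_naive = pca(Y, 2)
```

In this toy, `X_hat` has the same mean in every batch by construction, while the components in `X_naive` are free to separate the batches.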
So conceptually, what you can do (if for some reason you want to) is say that you have two kinds of design matrices of known information: B for batches and C for conditions. Set up Y = VB + UC, learn V and U, then create Y' := Y - VB, and again do PCA by Y' = WX. This way, X will contain variation that would have been explained by UC (in addition to remaining variation).
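A toy sketch of that recipe, again with simulated data and made-up names. One assumption worth flagging: the batches are one-hot encoded and act as per-batch intercepts, while the condition is coded as a single "treated" indicator with control as the reference level. If both B and C were fully one-hot, the split between VB and UC would not be identifiable.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_batches = 300, 50, 3

# Hypothetical toy data: rows are cells, columns are genes.
batch = rng.integers(0, n_batches, size=n_cells)
condition = rng.integers(0, 2, size=n_cells)         # 0 = control, 1 = treated

B = np.eye(n_batches)[batch]                         # one-hot batches (per-batch intercepts)
C = condition[:, None].astype(float)                 # treated indicator, control is the reference

V_true = rng.normal(0, 3, size=(n_batches, n_genes))
U_true = rng.normal(0, 2, size=(1, n_genes))
Y = B @ V_true + C @ U_true + rng.normal(0, 1, size=(n_cells, n_genes))

# Fit the batch and condition terms jointly by least squares.
design = np.hstack([B, C])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
V_hat, U_hat = coef[:n_batches], coef[n_batches:]

# Subtract only the batch term, so the condition effect stays in the data.
Y_prime = Y - B @ V_hat

# PCA on Y': the first component is free to pick up the condition effect.
Yc = Y_prime - Y_prime.mean(axis=0)
left, s, _ = np.linalg.svd(Yc, full_matrices=False)
pc1 = left[:, 0] * s[0]
corr = np.corrcoef(pc1, condition)[0, 1]
```

In this simulation the condition is the only structure left after removing the batch term, so `pc1` ends up strongly correlated (up to sign) with the condition labels.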
So is this a good idea? I’m not sure. For one, what does the learned UC term actually do? It predicts the mean of your conditions for all your genes. All the cell-cell variation is contained in the unknown variation term ( X ). So it’s not clear what you gain by putting these differences in means into the PCA as opposed to just analyzing the means you learn in the U matrix (which is exactly what a DE analysis gives you).
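To make the "it predicts the mean of your conditions" point concrete: with a one-hot design matrix C and ordinary least squares, the learned weights are exactly the per-condition gene means. A small check on simulated counts (names, sizes, and the Poisson toy data are made up for illustration; rows are cells, columns are genes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_conditions = 200, 10, 3

# Hypothetical toy data.
condition = rng.integers(0, n_conditions, size=n_cells)
C = np.eye(n_conditions)[condition]                  # one-hot condition design matrix
Y = rng.poisson(5.0, size=(n_cells, n_genes)).astype(float)

# Least squares fit of Y = UC.
U_hat, *_ = np.linalg.lstsq(C, Y, rcond=None)

# Plain per-condition averages of each gene.
condition_means = np.vstack(
    [Y[condition == k].mean(axis=0) for k in range(n_conditions)]
)

# The fitted weights and the averages agree up to floating point error.
matches = np.allclose(U_hat, condition_means)
```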
An interesting thing with single cell data is that you might expect the weights in U to be different for different cell types. But of course, you don’t know these cell types. This is where the scVI batch correction differs from linear batch correction: by doing Y = f(X, B) instead of Y = WX + VB, the function f can include interactions between the representation X (which ideally reflects cell types!) and the batches B.
In linear language, you can think of this as Y = WX + XVB + VB , which will be more complicated to learn than the standard non-interacting setting described above (the dimensionalities probably don’t add up right here, but just imagine it conceptually). And similarly with the C parts.
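Here is a toy sketch of why the interaction matters. It cheats in one important way: it assumes the cell types are known labels, which in real data they are not (that is the role X plays in scVI). The data is simulated so that the batch only shifts expression in one cell type, and all names are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_genes = 400, 20

# Hypothetical labels; in real data the cell types are unknown.
cell_type = rng.integers(0, 2, size=n_cells)         # 0 = "other", 1 = "T cell"
batch = rng.integers(0, 2, size=n_cells)

type_means = rng.normal(0, 2, size=(2, n_genes))
batch_shift = rng.normal(0, 2, size=n_genes)

# The batch shifts expression only in cell type 1.
Y = type_means[cell_type] + np.outer(batch * (cell_type == 1), batch_shift)
Y += rng.normal(0, 0.1, size=Y.shape)

def batch_gap(M, ct):
    """Largest absolute between-batch difference in mean expression within one cell type."""
    in_ct = cell_type == ct
    diff = M[in_ct & (batch == 1)].mean(axis=0) - M[in_ct & (batch == 0)].mean(axis=0)
    return np.abs(diff).max()

# Additive correction: one batch effect per gene, shared by all cell types.
B = np.eye(2)[batch]
V_hat, *_ = np.linalg.lstsq(B, Y, rcond=None)
Y_additive = Y - B @ V_hat

# Interaction correction: a separate batch effect per gene and cell type.
T = np.eye(2)[cell_type]                             # one-hot cell types
TB = T * batch[:, None]                              # cell type x batch interaction columns
coef, *_ = np.linalg.lstsq(np.hstack([T, TB]), Y, rcond=None)
Y_interaction = Y - TB @ coef[2:]

for name, M in [("raw", Y), ("additive", Y_additive), ("interaction", Y_interaction)]:
    print(name, batch_gap(M, 0), batch_gap(M, 1))
```

In this simulation the additive correction subtracts roughly an averaged batch shift from every cell, so it under-corrects the affected cell type and introduces a shift in the unaffected one. The interaction version removes the between-batch gap within each cell type.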
Now I will skip to 3.
Explicitly including variation in the representation learned by scVI can hypothetically be done. In fact, this is done in scANVI, the cell label transfer model. Here one representation X_1 is learned so that Y = f_1(X_1, B), and another representation X_2 is learned so that X_1 = f_2(X_2) and C = g(X_2), where the function g is a classifier, and in this case C contains known cell type labels.
Here the utility of the X_2 representation is that it will be better than X_1 for the task of predicting cell types.
Now, if I would put in some other information in C , I’m not clear how I would interpret what the variation in X_2 would represent, or how I would use it. But there might be some interesting things to learn from it.
As a slight aside, I have worked on one related problem. I had a time course dataset with a discrete number of time points. I could have just investigated the gene expression over those time points by averaging the expression in each time point. But I wanted to account for the fact that cells in each time point might be slightly out of sync. So I specifically made a 1-dimensional representation T , but with a strong prior on the known time points, encoding my assumption that “mostly”, cells are in sync, except for a few outliers, which are allowed to “beat” the prior. To avoid writing even more paragraphs, it’s covered briefly in the middle of this talk: Sanger Institute - Valentine Svensson - YouTube. It’s a similar problem with a different solution, and a different motivation for why I would want a latent variable that reflects my known design matrix.
Yeah, I think it works pretty often, and there’s no harm in trying. What made me move from linear methods to scVI for integration / batch correction is that I was seeing that different cell types were differently affected by batch membership. The linear correction would work for most cell types, but not for e.g. T cells. Or vice versa. This is because the linear methods assume that a gene is affected by the ‘batch’ in the same way in all cell types.
Linear methods in general can do it, but the functionality might not be implemented, because it is not clear what you would use the information for. For example, a number of tools (like scTransform or glmPCA) work on Pearson residuals, and for those (in theory) the linear tricks should work.
Hope this is more clarifying than confusing!