scVI uses whatever genes are present in the adata object. Empirically, we tend to see better dimensionality reduction results when adata only contains highly variable genes, which is what we do in our tutorials.
This is also the case for most if not all integration/batch-correction methods. E.g., this paper shows using HVG leads to better integration
You would likely see the same phenomenon with a simple PCA.
I wonder whether there is a possibility to choose which genes will be used for the model. Let’s assume I compute HVGs using scanpy but don’t subset my anndata object (that is, HVGs will be tagged instead of the removal of non-variable genes) the model will use all genes instead of the HVGs. Therefore, it is necessary to subset the anndata before model setup if one wishes to use the HVGs for training. However, it would be desirable not to subset the data and keep the original anndata object intact.
I think if there was a possiblity to make scvi only use the tagged HVGs, that would increase compatibility with scanpy when using multiple layers for raw counts, normalized counts and simplify later sub-clustering approaches which would require to re-compute the HVGs (which is not possible when only keeping the HVGs from level 1 clustering).
If you put all genes the .raw, and do analysis using a subset of genes that results in information stored in .obs or .obsm, then you can recover all genes using adata.raw.to_adata(). This will leave all the new obs and obsm in the anndata.
adata.layers["counts"] = adata.X.copy() # preserve counts
adata.raw = adata # freeze the state in `.raw`
leads to adata.raw.X being log normalized. The counts can’t be retrieved from .raw. raw is invariant to adata being subset at the gene level.
We considered this pattern, but it proved to be difficult to implement as we then needed to keep a gene-level mask around. It also was slower as every minibatch of data that goes through the model needed to be subset. While there is some inconvenience on the user-end for managing data, we felt that a what you see is what you get approach is also more straightforward to use in a pipeline.
I’m fine with ‘losing’ the lowly variable genes for follow up analyses. What I do worry about is that scVI’s library size estimation will next take place on only the remaining variable genes. Is this worry justified or is library size estimation done on adata.raw.X?
I second this question. As I understand the sequencing depth variation is modeled as unobserved random variable l_n. And in Fig. Supp. 13a Lopez et al. 2018 you show that for homogenous cell population the inferred size factor correlates with the sequencing depth. But I am not sure if all genes were used for the analysis.
Would it make a difference (using HVG vs all genes) for size factor estimates in heterogeneous cell population and maybe for zero estimates too?
If we can get some clarification that would be awesome!
Indeed the library size estimation will only consider the genes included in the model. In previous versions of the older scvi package, this was a latent variable with an informative prior. In newer versions we take the observed library size (again, of the genes used in the model).
In a future release, we can include a size_factor_key for manual setting of the size factor if that’s a desired feature.
That sounds like a great feature! I’ve developed a pattern where I manually calculate total counts and store in e.g. .obs['total_counts_all']. Then after I do gene selection I make a new .obs['total_counts_sel']. This way I can use these ratios when comparing predicted gene expressions from scVI with raw data, to see if the model is missing some variation pattern in the data that I can see using unused metadata. If I could pass size_factor_key = 'total_counts_all' to the model I think it would be easier to relate predicted expression to the raw data.
Follow-up question. Normally, when I sub-cluster a single cluster from a whole dataset, I get a better finer resolution of sub-clusters than these same cells in the initial clustering of the whole dataset.
My question is does this logic apply to scVI? So, for instance, could I run scVI on HVGs calculated on whole tissue, then in the latent representation identify cell A, then take these barcodes back to the HVG selection step and re-run scVI on a group of more cell A-specific HVGs?