I am performing data integration with scVI. I concatenated 28 samples with different batch IDs before HVG identification. What is the best way to select HVGs? Is it better to select only the ones that overlap between samples, or to also include batch-specific HVGs?
Before doing any kind of integration, it is crucial to select a feature space that is shared across as many of your samples as possible. In single-cell analysis we do this by finding the HVGs of each sample, assuming that the samples contain similar cell types and that the genes we find will therefore be shared across samples. So, to select HVGs across many samples, first find them for each sample independently, then select the genes that are marked as highly variable in as many samples as possible. Here's some code showing how to do it:
    import scanpy as sc

    # Compute HVG per batch (adata.obs['batch'] holds the batch label)
    sc.pp.highly_variable_genes(adata, batch_key='batch')

    # Select an arbitrary number of features
    num_hvg_genes = 3000

    # Sort by how many times a gene is marked as HVG and select the top genes
    hvg = adata.var.sort_values('highly_variable_nbatches').tail(num_hvg_genes).index

    # Update the genes' metadata
    adata.var['highly_variable'] = adata.var.index.isin(hvg)
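To make it clearer what highly_variable_nbatches counts, here is a minimal scanpy-free sketch on toy data. It uses a plain per-batch variance ranking as a simplified stand-in for scanpy's dispersion-based HVG call, so it illustrates the counting idea only; the names X, batch, top_k and nbatches are made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_cells, n_genes = 600, 50
genes = [f"gene_{i}" for i in range(n_genes)]

# Hypothetical batch label for every cell
batch = pd.Series(rng.choice(["b1", "b2", "b3"], size=n_cells))

# Toy log-expression matrix: the first 10 genes vary strongly in every batch
sd = np.where(np.arange(n_genes) < 10, 2.0, 0.3)
X = pd.DataFrame(rng.normal(0.0, sd, size=(n_cells, n_genes)), columns=genes)

top_k = 10  # genes flagged per batch (arbitrary for this toy example)
flags = {}
for b, Xb in X.groupby(batch.to_numpy()):
    # Simplified stand-in for an HVG call: rank genes by variance within the batch
    variances = Xb.var(axis=0)
    flags[b] = variances.rank(ascending=False) <= top_k

# Number of batches in which each gene was flagged as highly variable
nbatches = pd.DataFrame(flags).sum(axis=1)
print(nbatches.sort_values(ascending=False).head(12))
```

Genes that are flagged in every batch end up at the top of the ranking, which is the same ordering that sorting by highly_variable_nbatches relies on.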
The number of HVGs you select is arbitrary. I usually check the distribution of highly_variable_nbatches and try to select a reasonable number of features. Once you have selected your feature space (the HVGs), you can apply any integration method.
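One way to look at that distribution is sketched below. It assumes a table shaped like adata.var after the per-batch HVG call; here a toy DataFrame with simulated counts stands in for it, and the helper genes_in_at_least and the threshold of 5 batches are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for adata.var after sc.pp.highly_variable_genes(adata, batch_key='batch')
rng = np.random.default_rng(0)
n_genes, n_batches = 2000, 28
var = pd.DataFrame(
    {"highly_variable_nbatches": rng.binomial(n_batches, 0.1, size=n_genes)},
    index=[f"gene_{i}" for i in range(n_genes)],
)

# Number of genes flagged as HVG in exactly k batches
counts = var["highly_variable_nbatches"].value_counts().sort_index()

# Number of genes flagged in at least k batches (cumulative sum from the top)
at_least = counts[::-1].cumsum()[::-1]
print(at_least)

def genes_in_at_least(var, min_batches):
    """Return the genes flagged as HVG in at least `min_batches` batches."""
    return var.index[var["highly_variable_nbatches"] >= min_batches]

# Example threshold; in practice pick it after looking at `at_least`
selected = genes_in_at_least(var, min_batches=5)
print(len(selected))
```

Reading the cumulative table tells you how many features each threshold would give, which can be easier to reason about than fixing the number of genes up front.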
Hope this is helpful!
Oftentimes the results from the integration are used for additional steps in an analysis, and for many tools it is useful to use the full gene set rather than only the HVGs.
As suggested here, if that is the goal, one should run scVI on the full feature space.
Is it sensible to apply HVG selection on scVI-corrected (and normalized) counts?
Maybe, but this depends heavily on what analysis you are doing. For the standard workflow of HVG → PCA → Cluster → Annotate, I would say no: it would not make sense to select genes based on how variable the posterior predicted expression levels from scVI are. In that case the HVG and PCA steps will not include the uncertainty and variation that you are interested in.
In addition, the implemented HVG detection methods are designed to work on raw data and the kind of variation that is inherent in it. The posterior predictions from scVI explicitly do not include these sources of variation, so simply applying the same method to these predictions will not work as expected.
There are cases when you can quantify spread and variation of predicted values to rank and classify different genes in different contexts. For example, you might have done the entire scVI workflow, then you have a cluster that you are studying. It might be interesting to learn which genes are ‘stable’ or ‘variable’ in that cluster, so you can quantify the standard deviation of the log predicted posterior expression levels.
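A minimal sketch of that last idea, assuming you already have a cells x genes table of strictly positive predicted expression values (for example, exported from a trained scVI model) and a cluster label per cell. Here both are simulated, and the names expr, clusters and log_sd are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
genes = [f"gene_{i}" for i in range(5)]

# Hypothetical stand-in for predicted expression (cells x genes), strictly positive.
# Each gene is simulated with a different spread on the log scale.
scale = np.array([0.05, 0.2, 0.5, 1.0, 2.0])
expr = pd.DataFrame(np.exp(rng.normal(0.0, scale, size=(300, 5))), columns=genes)

# Hypothetical cluster assignment for every cell
clusters = pd.Series(rng.choice(["A", "B"], size=300), name="cluster")

# Restrict to the cluster under study
in_cluster = (clusters == "A").to_numpy()

# Per-gene standard deviation of log predicted expression within that cluster,
# sorted from most 'stable' to most 'variable'
log_sd = np.log(expr.loc[in_cluster]).std(axis=0).sort_values()
print(log_sd)
```

Genes at the top of log_sd are the 'stable' ones in the cluster and genes at the bottom are the 'variable' ones; where to draw the line between the two is a judgment call.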