Selection of HVG in scVI

terzoli · November 15, 2022, 3:59pm

I am performing data integration with scVI. I concatenated 28 samples from different batches before HVG identification. What is the best way to select HVGs? Is it better to select only the ones that overlap between samples, or is it better to also include batch-specific HVGs?

PauBadiaM · November 16, 2022, 8:43am

Hi @terzoli

Before doing any kind of integration, it is crucial to select a feature space that is shared across as many of your samples as possible. In single-cell analysis we do this by finding the HVGs of each sample, on the assumption that the samples contain similar cell types and that the genes we find will therefore be shared across them. So, to select HVGs across many samples, first find them for each sample independently, then keep the genes that are marked as highly variable in as many samples as possible. Here’s some code showing how to do it:

import scanpy as sc

# Compute HVGs per batch; this adds 'highly_variable_nbatches' to adata.var
# (the default flavor expects log-normalized data)
sc.pp.highly_variable_genes(adata, batch_key='batch')

# Select an arbitrary number of features
num_hvg_genes = 3000

# Sort by how many batches mark a gene as HVG and keep the top genes
# (genes tied on this count are ordered arbitrarily by the sort)
hvg = adata.var.sort_values('highly_variable_nbatches').tail(num_hvg_genes).index

# Update the gene metadata
adata.var['highly_variable'] = adata.var.index.isin(hvg)

The number of HVGs you select is completely arbitrary. I usually check the distribution of highly_variable_nbatches and try to select a reasonable number of features. Once you have selected your feature space (the HVGs), you can apply any integration method, such as scVI.
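To make the "check the distribution" step concrete, here is a minimal sketch on a toy table standing in for adata.var. The gene names and values are made up, and the dispersions_norm column is an assumption: I believe the default scanpy flavor stores it, but check your own adata.var. It counts how many genes are flagged in at least k batches, and breaks ties in highly_variable_nbatches by normalized dispersion instead of leaving them to the sort order.

```python
import pandas as pd

# Toy stand-in for adata.var after running
# sc.pp.highly_variable_genes(adata, batch_key='batch').
# Gene names and values are invented for illustration.
var = pd.DataFrame(
    {
        "highly_variable_nbatches": [3, 3, 2, 2, 1, 0],
        "dispersions_norm": [1.2, 2.5, 0.4, 0.9, 3.0, -0.1],
    },
    index=["g1", "g2", "g3", "g4", "g5", "g6"],
)

# Number of genes flagged as HVG in exactly k batches
nbatches_counts = var["highly_variable_nbatches"].value_counts().sort_index()

# Number of genes flagged as HVG in at least k batches
at_least_k = nbatches_counts[::-1].cumsum()[::-1]
print(at_least_k)

# Keep the top genes, breaking ties in the batch count by normalized dispersion
num_hvg_genes = 3
hvg = (
    var.sort_values(
        ["highly_variable_nbatches", "dispersions_norm"], ascending=False
    )
    .head(num_hvg_genes)
    .index
)
var["highly_variable"] = var.index.isin(hvg)
```

Reading at_least_k from the top down shows where the gene count crosses the number of features you had in mind, which can help decide whether something like 3000 is a sensible cutoff for your data.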

Hope this is helpful!

zvittorio · December 20, 2022, 4:10pm

Hi @PauBadiaM

Oftentimes the results from the integration are used for additional steps in an analysis, and for many tools it is useful to use the full gene set rather than only the HVGs.
As suggested here, if that is the goal, one should run scVI on the full feature space.

Is it sensible to apply HVG selection to scVI-corrected (and normalized) counts?

Valentine_Svensson · December 20, 2022, 6:50pm

Hi zvittorio,

Maybe, but this depends heavily on what analysis you are doing. For the standard workflow of HVG → PCA → Cluster → Annotate, I would say no: it would not make sense to select genes based on how variable the posterior predicted expression levels from scVI are. In that case the HVG and PCA steps would not include the uncertainty and variation that you are interested in.

In addition, the implemented HVG detection methods are designed to work on raw data and the kind of variation inherent in it. The posterior predictions from scVI explicitly do not include these sources of variation, so simply applying the same method to these predictions will not work as expected.
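A toy simulation can illustrate the point. This is only a sketch under simplifying assumptions: a single gene with the same underlying rate in every cell, Poisson sampling noise rather than the likelihoods scVI actually uses, and the true rate standing in for a model's denoised prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 10_000

# One gene whose underlying expression rate is identical in every cell,
# so there is no biological variation at all (an assumption of this toy).
true_rate = np.full(n_cells, 5.0)

# Raw counts still vary from cell to cell because of sampling noise.
raw_counts = rng.poisson(true_rate)

# A perfectly denoised prediction would return the rate itself.
raw_var = raw_counts.var()
denoised_var = true_rate.var()

print(f"variance of raw counts:      {raw_var:.2f}")
print(f"variance of denoised values: {denoised_var:.2f}")
```

An HVG method calibrated on the mean-variance relationship of raw counts expects the first kind of input, so handing it the second kind changes what its statistics mean.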

There are cases where you can quantify the spread and variation of predicted values to rank and classify genes in different contexts. For example, suppose you have run the entire scVI workflow and have a cluster that you are studying. It might be interesting to learn which genes are ‘stable’ or ‘variable’ in that cluster, which you can do by quantifying the standard deviation of the log posterior predicted expression levels.
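As a rough sketch of that last idea, the snippet below computes the per-gene standard deviation of log expression within one cluster. The helper per_gene_log_sd, the pseudocount, and the toy matrix are all hypothetical; in practice the matrix would be predicted expression from your model (in scvi-tools I would expect this to come from model.get_normalized_expression(), but that is an assumption about your setup).

```python
import numpy as np


def per_gene_log_sd(expr, in_cluster, pseudocount=1e-8):
    """Standard deviation of log expression per gene, over cells in a cluster.

    expr: (n_cells, n_genes) array of predicted expression values.
    in_cluster: boolean mask of length n_cells selecting the cluster.
    pseudocount: small constant to avoid log(0); an arbitrary choice here.
    """
    expr = np.asarray(expr, dtype=float)
    log_expr = np.log(expr[in_cluster] + pseudocount)
    return log_expr.std(axis=0, ddof=1)


# Toy predicted expression: 4 cells x 2 genes (invented values).
expr = np.array(
    [
        [1.0, 1.0],
        [1.0, 10.0],
        [1.0, 100.0],
        [5.0, 5.0],
    ]
)
in_cluster = np.array([True, True, True, False])
gene_names = np.array(["stable_gene", "variable_gene"])

log_sd = per_gene_log_sd(expr, in_cluster)

# Rank genes from most to least variable within the cluster
ranking = gene_names[np.argsort(log_sd)[::-1]]
print(dict(zip(gene_names, log_sd.round(3))))
```

Working on the log scale means the spread reflects fold changes rather than absolute differences, so highly expressed genes do not dominate the ranking simply because their values are larger.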
