Selection of HVG in scVI

I am performing data integration with scVI. I concatenated 28 samples from different batches before identifying HVGs. What is the best way to select HVGs? Is it better to keep only the ones that overlap between samples, or should batch-specific HVGs be included as well?

Hi @terzoli

Before doing any kind of integration, it is crucial to select a feature space that is shared across as many of your samples as possible. In single-cell analysis we do this by finding the highly variable genes (HVGs) of each sample, assuming that the samples contain similar cell types and that the genes we find will therefore be shared across samples. So, to select HVGs across many samples, you first find them for each sample independently, and then keep the genes that are marked as highly variable in as many samples as possible. Here’s some code showing how to do it:

import scanpy as sc

# Compute HVGs per batch; this adds 'highly_variable_nbatches' to adata.var
# (assumes adata.obs['batch'] holds the batch label of each cell)
sc.pp.highly_variable_genes(adata, batch_key='batch')

# Select an arbitrary number of features
num_hvg_genes = 3000

# Sort by how many batches flag each gene as HVG and keep the top genes
hvg = adata.var.sort_values('highly_variable_nbatches').tail(num_hvg_genes).index

# Update the gene metadata
adata.var['highly_variable'] = adata.var.index.isin(hvg)

The number of HVGs you select is somewhat arbitrary. I usually check the distribution of highly_variable_nbatches and pick a reasonable number of features from that. Once you have selected your feature space (the HVGs), you can apply any integration method, such as scVI.
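
To make the "check the distribution" step concrete, here is a minimal sketch of one way to look at it. It uses a toy pandas table as a stand-in for adata.var, with randomly generated counts (not real data), and the min_batches cutoff is a hypothetical choice for illustration only:

```python
import numpy as np
import pandas as pd

# Toy stand-in for adata.var after running highly_variable_genes with batch_key.
# The counts are randomly generated for illustration; they are not real data.
rng = np.random.default_rng(0)
n_genes, n_batches = 2000, 28
var = pd.DataFrame(
    {"highly_variable_nbatches": rng.binomial(n_batches, 0.1, size=n_genes)},
    index=[f"gene_{i}" for i in range(n_genes)],
)

# Number of genes flagged as HVG in exactly k batches, largest k first.
per_k = var["highly_variable_nbatches"].value_counts().sort_index(ascending=False)

# Number of genes flagged in at least k batches (running total from the top).
at_least_k = per_k.cumsum()

summary = pd.DataFrame(
    {"genes_in_exactly_k": per_k, "genes_in_at_least_k": at_least_k}
)
print(summary)

# Hypothetical rule: keep genes flagged in at least `min_batches` batches.
min_batches = 3
selected = var.index[var["highly_variable_nbatches"] >= min_batches]
print(f"{len(selected)} genes are HVG in at least {min_batches} batches")
```

With a real dataset you would replace the toy table with adata.var. Reading the genes_in_at_least_k column tells you how many features each cutoff would give you, which can help you decide between a fixed number of genes and a minimum number of batches.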

Hope this is helpful!