HVG selection with multiple batches

andrewjkwok · April 28, 2021, 3:54pm

Hi,

First, I wanted to say thank you very much for this suite of tools - it looks very powerful and I’m very happy to be analysing my data with it. I had a question about highly variable gene (HVG) selection when it comes to having multiple batches.

In the manuscript, it’s mentioned that:
“In the case where the PBMC datasets are integrated, the 4,000 HVGs are selected by merging HVGs computed on each dataset separately as in the Seurat v3 method.”

However, in the tutorials e.g. the one for “Atlas-level integration and label transfer” (or others), the datasets are concatenated before Seurat v3 HVG selection is performed. Can I clarify what the exact process is that’s recommended?

For example, when I’m analysing my own data, after doing HVG selection on my concatenated anndata object and then run scvi.data.setup_anndata, it registers my anndata object with 4000 vars (I chose to select 4000 HVGs), but looking at the adata.var table, those genes are not necessarily all highly variable for all the batches. Is that what is nevertheless recommended, or is it actually suggested to only use the HVGs which are highly variable in all the batches?

Thanks!

adamgayoso · April 29, 2021, 4:05am

In the mentioned tutorial we see

sc.pp.highly_variable_genes(
    adata,
    flavor="seurat_v3",
    n_top_genes=2000,
    layer="counts",
    batch_key="batch",
    subset=True
)

Where the concatenation happens beforehand, and the “batch” key is given to the HVG function. The way this function merges genes is by looking at how many batches it was considered HVG. If there are 2 batches and only 1000 genes that are HVG in both, this function will then look at genes that are HVG in one batch.

In practice this has typically worked out for us, but you could run this function without subset=True and then inspect the results and filter yourself. We do know that having batch-specific genes can worsen integration.

andrewjkwok · April 29, 2021, 1:11pm

Thanks for the quick reply. That makes sense. I’ll proceed that way in that case and see how it goes!

tkapello · July 4, 2022, 3:58pm

Maybe I can contribute another thought to this discussion. How would you choose the HVGs for batch correction? I would assume that selecting the same number of HVGs per batch would do justice, but I also came across hvg_batch from the scib package which does something similar to what described above, but does guarantee that all batches (if many) will be considered. What are your thoughts here?

Topic		Replies	Views
Selection of HVG in scVI scvi-tools scvi	3	1115	December 20, 2022
Scanpy HVG batch_key scanpy	1	1030	March 16, 2023
Usage of HVG in scVI scvi-tools gene-selection , scvi	12	2243	March 1, 2022
Gene filtering prior to batch correction scRNA-seq scrna-seq , integration	2	738	July 9, 2021
Error in highly variable gene selection scanpy scrna-seq , gene-selection	8	3884	March 21, 2022

HVG selection with multiple batches

Related topics