Hello everyone!
I have a question on scanpy and the selection of the highly variable genes before the downstream integration step with scVI.
In my dataset I have two main variables: “donor” and “batch_ID”. Each donor (X, Y, Z, …) corresponds to more than one sample sequenced (Xa, Xb, Xc, …), so the variable “donor” groups more than one sample. While, instead, the “batch_ID” variable corresponds to each sample sequenced.
In the code of HVG selection, sc.pp.highly_variable_genes, there is the option to set the batch_key. Should I set the batch_key equal to “batch_ID” or to “donor”?
Another question, should I use the same variable of HVG selection as the one on which I perform the batch correction for the integration? I mean, if I decide to select the HVGs on the “donor” instead of on “batch_ID”, should I set the batch correction before the training of the adata on “donor” as well? Or the selection of HVGs is independent from the choice of the variable to correct the batch for the integration?
And last one, if I want to make a subset of clusters to focus the analysis on specific type of cells (like, starting from total PBMCs, I want to focus on myeloid cells == adata_reduced), should I recalculate the HVGs and perform again the training of the adata_reduced? If yes, the batch_key in the HVG code on adata_reduced should be the same as the one chosen for the adata?
Thanks so much!!
Paolo
Hey Paolo,
a partial answer that might help. the scanpy
documentation states:
“If specified, highly-variable genes are selected within each batch separately and merged.
This simple process avoids the selection of batch-specific genes and acts as a
lightweight batch correction method. For all flavors, genes are first sorted
by how many batches they are a HVG. For dispersion-based flavors ties are broken
by normalized dispersion. If flavor = 'seurat_v3'
, ties are broken by the median
(across batches) rank based on within-batch normalized variance.”
meaning the batch_key
parameter is here to balance the HVGs between the different batches - and isn’t really a batch correction scheme. If you want your HVGs to be meaningful across many of the batch/donor combinations, you can create a joint key:
(adata.obs['donor_batch_ID'] = pd.Categorical(adata.obs['donor'] + adata.obs['batch_ID'] )
)
Regarding re-selecting HVGs: it all depends on what you are looking to find. If what you are interested in is genes that are generally variable and how they are differentially expressed in myeloid cells, than keep the HVGs. If you are interested in the myeloid HVGs than recalculating. In my experience, when i don’t recalculate the HVGs I have a lot of genes that are relatively constant throughout the group, which makes the analysis harder and less informative.
One other thing, after batch correction there doesn’t seem to be a reason to use batch_key
.