Excluding Ig and ribosomal genes from HVG selection in scVI, best practice?

Hi everyone,

I’m working with single-cell RNA-seq data from CD45⁺ immune cells (mostly lymphoid lineages) and integrating multiple batches using scVI, which so far has given the best batch correction results.

We’re now reprocessing the data after adjusting QC thresholds, and I came across some recent papers where they state:

“Prior to PCA, nearest neighbor clustering, and UMAP representations, some genes were filtered from inclusion including those associated with Ig loci (Igk, Igl, or Igh), ribosomal proteins (Rps or Rpl), mitochondrial (mt-), sex (Xist),…”

My questions are:

  1. Would it make sense to exclude these genes before computing HVGs, so that they never influence the latent space learned by scVI?
    Or is it better to compute HVGs normally, then remove these specific genes after HVG selection (e.g., set highly_variable=False for them)?

  2. Once we have obtained broad cell type annotations (T cells, B cells, myeloid, etc.),
    is it advisable to subset, recalculate HVGs within one lineage (e.g. T cells), and retrain a new scVI model for finer clustering?
    Or is it acceptable to rely on the latent embeddings from the original full scVI model for the subcluster analysis?

Any insights or examples of good practice would be appreciated.

Thanks in advance!


Hey,

  1. Generally, the latter is the way to go: compute HVGs normally, then set highly_variable=False for the genes you want to exclude. That way the genes stay in the dataset, so the biological signal and other downstream tasks such as DE are not affected, but they do not influence the latent space. That said, the right choice may also depend on the problem you are trying to solve.

  2. I think you can only gain by subsetting to one lineage, recalculating HVGs, and retraining a new scVI model, as long as you handle the excluded genes as in (1).
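
For point 1, here is a minimal sketch of how the exclusion mask could be built. It only uses the gene-name prefixes from the methods text you quoted (mouse symbols); the names `EXCLUDE_PATTERN` and `exclusion_mask` are made up for illustration. Note that prefix matching is coarse (for example, Ighmbp2 and Rps6ka1 would also match), so it is worth reviewing the list of hits before dropping them.

```python
import re

# Prefixes taken from the quoted methods text (mouse gene symbols).
# Assumption: a plain prefix match is good enough; check the hits manually.
EXCLUDE_PATTERN = re.compile(r"^(Igk|Igl|Igh|Rps|Rpl|mt-)|^Xist$")


def exclusion_mask(gene_names):
    """Return one bool per gene: True if the gene should be removed from the HVG set."""
    return [bool(EXCLUDE_PATTERN.search(name)) for name in gene_names]


# Toy gene list standing in for adata.var_names.
genes = ["Cd3e", "Ighm", "Igkc", "Rpl13", "mt-Co1", "Xist", "Cd79a"]
mask = exclusion_mask(genes)

# Genes that would remain eligible as HVGs.
kept = [g for g, m in zip(genes, mask) if not m]
print(kept)
```

On a real object, and assuming the HVG flags are already in `adata.var["highly_variable"]`, the same mask could be applied with something like `adata.var.loc[exclusion_mask(adata.var_names), "highly_variable"] = False` before subsetting to the HVGs for scVI. The full gene set stays available for DE.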

In any case, you can always compare the different strategies by running scib-metrics on the resulting latent space(s) and by checking that the DE results match what you expect.
