Hi everyone,
I’m working with single-cell RNA-seq data from CD45⁺ immune cells (mostly lymphoid lineages) and integrating multiple batches using scVI, which so far has given the best batch correction results.
We’re now reprocessing the data after adjusting QC thresholds, and I came across some recent papers that include statements such as:
“Prior to PCA, nearest neighbor clustering, and UMAP representations, some genes were filtered from inclusion including those associated with Ig loci (Igk, Igl, or Igh), ribosomal proteins (Rps or Rpl), mitochondrial (mt-), sex (Xist),…”
My questions are:
- Would it make sense to exclude these genes before computing HVGs, so that they never influence the latent space learned by scVI? Or is it better to compute HVGs normally, then remove these specific genes after HVG selection (e.g., set `highly_variable=False` for them)?
- Once we have obtained broad cell type annotations (T cells, B cells, myeloid, etc.), is it advisable to subset, recalculate HVGs within one lineage (e.g. T cells), and retrain a new scVI model for finer clustering? Or is it acceptable to rely on the latent embeddings from the original full scVI model for the subcluster analysis?
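
To make the two options in the first question concrete, here is a toy sketch in plain Python. The prefixes come from the quoted text (mouse gene symbols); the gene names and variability scores are made up for illustration, and `is_excluded` and `top_n` are hypothetical helpers, not scanpy or scvi-tools functions. It also assumes the per-gene scores do not change when other genes are dropped, which may not hold for every HVG method.

```python
import re

# Prefixes taken from the quoted methods text (mouse gene symbols).
# Plain prefix matching is an assumption: it can also catch unrelated genes
# (e.g. Rps6ka1, a kinase), so the matched list should be reviewed by hand.
EXCLUDE_PATTERN = re.compile(r"^(Igk|Igl|Igh|Rps|Rpl|mt-)|^Xist$")


def is_excluded(gene: str) -> bool:
    """Return True if the gene symbol matches one of the excluded families."""
    return EXCLUDE_PATTERN.search(gene) is not None


def top_n(scores: dict, n: int) -> list:
    """Return the n gene names with the highest variability score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical variability scores, for illustration only.
scores = {
    "Ighm": 9.0, "Cd3e": 8.0, "Rps27": 7.5, "Cd79a": 7.0,
    "mt-Co1": 6.0, "Nkg7": 5.0, "Xist": 4.0, "Gzmk": 3.0,
}
n_top = 4

# Option 1: exclude first, then select HVGs.
# The freed slots are backfilled by the next most variable genes.
kept = {g: s for g, s in scores.items() if not is_excluded(g)}
hvg_option1 = top_n(kept, n_top)

# Option 2: select HVGs first, then unflag the excluded genes
# (the analogue of setting highly_variable=False for them).
# The final gene set ends up smaller than n_top.
hvg_option2 = [g for g in top_n(scores, n_top) if not is_excluded(g)]

print(hvg_option1)
print(hvg_option2)
```

In this toy case the practical difference is the size of the final gene set: option 1 still yields `n_top` genes, while option 2 yields fewer because the removed genes are not replaced.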
Any insights or examples of good practice would be appreciated.
Thanks in advance!