Hi scvi-tools team,
I would like to ask whether it is methodologically appropriate to train scVI using a feature set that combines highly variable genes with a small number of curated marker genes.
I have been working with multiple adult human skeletal muscle single-nucleus RNA-seq datasets. In our datasets, the majority of nuclei are myonuclei, usually more than 85 percent, which makes the data structure quite different from more heterogeneous datasets such as PBMCs. Some biologically important structures that we know are present in the data, such as fiber-type differences, hybrid myonuclei, NMJ-related myonuclei, MTJ-related myonuclei, and atrophic-like myonuclei, may depend on known skeletal muscle marker genes that are not always selected by a standard HVG workflow.
I noticed that the input feature set used for scVI training can influence the downstream latent representation and UMAP structure. My current strategy is to start from 2,000 HVGs selected by Seurat FindVariableFeatures, and then add a small set of curated skeletal muscle marker genes that are known to be highly variable across the dataset. This is partly motivated by recent benchmarking work showing that feature selection methods can affect the performance of scRNA-seq data integration and querying, for example Zappia et al., Nature Methods 2025.
In my case, the added marker genes are first checked to ensure that they are present in the dataset and show sufficient expression and variability. Examples include MYH7 as a marker for type I myonuclei, RUNX1 for atrophic-like myonuclei, and PAX7 for satellite cells. These genes are sometimes not included among the top 2,000 HVGs, but they are enriched in distinct clusters across samples and are known to be biologically relevant in our tissue context.
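To make the selection step concrete, here is a minimal sketch of how I combine the two gene lists. It is an illustration under stated assumptions, not my exact pipeline: the function name `build_feature_set`, the two thresholds, and all per-gene numbers are hypothetical, and the per-gene statistics (fraction of nuclei with nonzero counts, and a variance measure) are assumed to have been computed beforehand.

```python
def build_feature_set(hvgs, markers, gene_stats,
                      min_detected_frac=0.01, min_variance=0.5):
    """Return HVGs plus curated markers that pass simple QC checks.

    hvgs:       ordered list of HVG names (e.g. exported from Seurat)
    markers:    curated marker gene names
    gene_stats: dict mapping gene -> (detected_fraction, variance);
                genes missing from this dict are treated as absent
    Both thresholds are hypothetical placeholders, not recommendations.
    """
    selected = list(dict.fromkeys(hvgs))  # keep order, drop duplicates
    seen = set(selected)
    added, skipped = [], []
    for gene in markers:
        if gene in seen:
            continue  # already selected as an HVG (or already added)
        detected_frac, variance = gene_stats.get(gene, (0.0, 0.0))
        if detected_frac >= min_detected_frac and variance >= min_variance:
            selected.append(gene)
            seen.add(gene)
            added.append(gene)
        else:
            skipped.append(gene)
    return selected, added, skipped


# Toy inputs: every number below is made up for illustration.
hvgs = ["TTN", "NEB", "MYH1"]
markers = ["MYH7", "RUNX1", "PAX7", "MYH1", "LOW_GENE", "ABSENT_GENE"]
gene_stats = {
    "MYH7": (0.30, 2.5),
    "RUNX1": (0.05, 1.2),
    "PAX7": (0.02, 0.9),
    "MYH1": (0.40, 3.0),
    "LOW_GENE": (0.002, 0.1),  # fails both thresholds
}
features, added, skipped = build_feature_set(hvgs, markers, gene_stats)
```

The returned `features` list is what I would use to subset the count matrix before training, and keeping `added` and `skipped` makes it easy to report exactly which curated genes entered the model.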
The motivation is to preserve tissue-specific biological structure while still using scVI for batch correction and latent representation learning. Compared with using only Seurat-selected HVGs or Scanpy seurat_v3 HVGs, the HVG plus marker-gene strategy appeared to improve the resolution of biologically interpretable populations, such as hybrid myonuclei and atrophic-like myonuclei. I also validated these populations using marker-gene visualization (for example, dot plots), statistical testing, and, where available, independent biological evidence from the same biopsy samples, such as staining results.
However, I am concerned that adding curated marker genes may introduce prior biological bias into the latent space or otherwise make the workflow less methodologically sound.
My questions are:
- Is it acceptable to include curated marker genes together with HVGs for scVI training, especially if these marker genes are detected and highly variable in the dataset?
- Would this be considered a reasonable tissue-informed feature selection strategy for adult skeletal muscle snRNA-seq, especially when the added marker genes are supported by independent biological evidence or prior knowledge from the same tissue context?
- Would benchmarking HVG-only scVI against HVG plus marker-gene scVI be an appropriate way to evaluate this strategy? If so, what tests or metrics would you recommend to show that the added marker genes preserve meaningful biological structure without introducing artifacts?
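To make the benchmarking question concrete, one comparison I have in mind is a simple k-nearest-neighbour agreement score computed on each model's latent space. The sketch below is only an assumption about what such a metric could look like, not an established scvi-tools metric: the function name `knn_same_label_fraction` and the toy coordinates are hypothetical. Scored against cell-type labels, a higher value suggests better-preserved biological structure; scored against batch labels, a lower value suggests better batch mixing.

```python
import math


def knn_same_label_fraction(latent, labels, k=2):
    """Mean fraction of each cell's k nearest neighbours sharing its label.

    latent: list of latent coordinates, one sequence of floats per cell
    labels: list of labels (cell type or batch), same length as latent
    Brute-force Euclidean search, intended for small illustrations only.
    """
    n = len(latent)
    total = 0.0
    for i in range(n):
        # Distances from cell i to every other cell, nearest first.
        dists = sorted(
            (math.dist(latent[i], latent[j]), j) for j in range(n) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        total += sum(labels[j] == labels[i] for j in neighbours) / k
    return total / n


# Toy latent space: two well-separated groups of three cells each.
latent = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
    (10.0, 10.0), (10.1, 10.0), (10.0, 10.2),
]
cell_types = ["type_I", "type_I", "type_I", "type_II", "type_II", "type_II"]
batches = ["a", "b", "a", "a", "b", "a"]

bio_score = knn_same_label_fraction(latent, cell_types, k=2)
batch_score = knn_same_label_fraction(latent, batches, k=2)
```

I would compute both scores on the HVG-only latent space and on the HVG plus marker-gene latent space, using labels that were not derived from the added marker genes themselves, so that the comparison is not circular.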
Any advice would be greatly appreciated.
Best regards,
Brian