Using curated marker genes together with HVGs for scVI feature selection in adult skeletal muscle snRNA-seq

p.l.b.chen · May 11, 2026, 9:51pm

Hi scvi-tools team,

I would like to ask whether it is methodologically appropriate to train scVI using a feature set that combines highly variable genes with a small number of curated marker genes.

I have been working with multiple adult human skeletal muscle single-nucleus RNA-seq datasets. In our datasets, the majority of nuclei are myonuclei, usually more than 85 percent, which makes the data structure quite different from more heterogeneous datasets such as PBMCs. Some biologically important structures that we know are present in the data, such as fiber-type differences, hybrid myonuclei, NMJ-related myonuclei, MTJ-related myonuclei, and atrophic-like myonuclei, may depend on known skeletal muscle marker genes that are not always selected by a standard HVG workflow.

I noticed that the input feature set used for scVI training can influence the downstream latent representation and UMAP structure. My current strategy is to start from 2,000 HVGs selected by Seurat FindVariableFeatures, and then add a small set of curated skeletal muscle marker genes that are known to be highly variable across the dataset. This is partly motivated by recent benchmarking work showing that feature selection methods can affect the performance of scRNA-seq data integration and querying, for example Zappia et al., Nature Methods 2025.

In my case, the added marker genes are first checked to ensure that they are present in the dataset and show sufficient expression and variability. Examples include MYH7 as a marker for type I myonuclei, RUNX1 for atrophic-like myonuclei, and PAX7 for satellite cells. These genes are sometimes not included among the top 2,000 HVGs, but they are enriched in distinct clusters across samples and are known to be biologically relevant in our tissue context.
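A minimal sketch of this kind of check, in plain Python. It covers only presence and detection rate; a variability filter would follow the same pattern. The function name, the `min_fraction` threshold, and the toy detection values are all hypothetical and only illustrate the logic.

```python
def build_feature_set(hvgs, markers, detection_fraction, min_fraction=0.01):
    """Return HVGs plus curated markers that pass a presence/detection check.

    hvgs: gene names already chosen by an HVG method.
    markers: curated marker genes to consider adding.
    detection_fraction: dict mapping gene -> fraction of nuclei with a nonzero
        count; genes absent from the dataset are simply missing from the dict.
    min_fraction: hypothetical detection threshold, to be tuned per dataset.
    """
    selected = list(dict.fromkeys(hvgs))  # deduplicate while keeping order
    seen = set(selected)
    added, rejected = [], []
    for gene in dict.fromkeys(markers):
        if gene in seen:
            continue  # already selected as an HVG
        if detection_fraction.get(gene, 0.0) >= min_fraction:
            added.append(gene)
            seen.add(gene)
        else:
            rejected.append(gene)  # absent or too rarely detected
    return selected + added, added, rejected


# Toy values for illustration only, not measurements from any dataset.
hvgs = ["TTN", "NEB", "MYH2"]
markers = ["MYH7", "RUNX1", "PAX7", "MYH2", "NOT_IN_DATA"]
detection = {"TTN": 0.95, "NEB": 0.90, "MYH2": 0.40,
             "MYH7": 0.30, "RUNX1": 0.02, "PAX7": 0.005}

features, added, rejected = build_feature_set(hvgs, markers, detection)
print(features)   # HVGs first, then the markers that passed
print(rejected)   # markers to inspect manually before dropping them
```

The returned `rejected` list is kept on purpose, so that a marker dropped by the threshold is visible rather than silently lost.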

The motivation is to preserve tissue-specific biological structure while still using scVI for batch correction and latent representation learning. Compared with using only Seurat-selected HVGs or Scanpy seurat_v3 HVGs, the HVG plus marker-gene strategy appeared to improve the resolution of biologically interpretable populations, such as hybrid myonuclei and atrophic-like myonuclei. I also checked these populations with marker-gene visualization (such as dot plots), statistical testing, and, where available, independent biological evidence from the same biopsy samples, such as staining results.
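One possible way to put a number on such a comparison is the adjusted Rand index (ARI) between the cluster assignments from each run and a reference labeling that is trusted independently of the feature set. The sketch below is plain Python; the label vectors are toy values and the variable names are hypothetical, so it only illustrates the calculation, not a result.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same nuclei."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: trivially identical
    # Pairs of nuclei that share a group in both labelings.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / total_pairs
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one group
    return (together - expected) / (max_index - expected)


# Toy labels for illustration only.
reference = ["I", "I", "IIa", "IIa", "hybrid", "hybrid"]
clusters_hvg_only = [0, 0, 0, 0, 1, 1]      # type I and IIa merged
clusters_hvg_markers = [2, 2, 0, 0, 1, 1]   # all three groups separated

ari_hvg_only = adjusted_rand_index(reference, clusters_hvg_only)
ari_hvg_markers = adjusted_rand_index(reference, clusters_hvg_markers)
print(ari_hvg_only, ari_hvg_markers)
```

ARI ignores the cluster names themselves, so the two runs do not need matching cluster IDs. It does depend on clustering resolution, so both runs would need the same clustering settings for the comparison to be fair.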

However, I am concerned that adding curated marker genes may introduce prior biological bias into the latent space, or make the workflow methodologically less appropriate.

My questions are:

1. Is it acceptable to include curated marker genes together with HVGs for scVI training, especially if these marker genes are detected and highly variable in the dataset?
2. Would this be considered a reasonable tissue-informed feature selection strategy for adult skeletal muscle snRNA-seq, especially when the added marker genes are supported by independent biological evidence or prior knowledge from the same tissue context?
3. Would benchmarking HVG-only scVI against HVG plus marker-gene scVI be an appropriate way to evaluate this strategy? If so, what tests or metrics would you recommend to show that the added marker genes preserve meaningful biological structure without introducing artifacts?

Any advice would be greatly appreciated.

Best regards,
Brian

cane11 · May 12, 2026, 5:43am

This is completely fine in my experience. It can be a lot of effort, though, and it might be biased (maybe you missed some cell types). To reduce that bias, wouldn't stratified downsampling followed by traditional HVG selection also be an option here? Take 10k cells from each coarse cluster before running scVI, and run Seurat HVG selection on this balanced subset of the data.
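Roughly, the downsampling step could look like the plain-Python sketch below. It assumes you already have one coarse cluster label per nucleus (for example from a quick first-pass clustering); the function name and toy labels are hypothetical, and the cap is set to 10 here only to keep the example small.

```python
import random
from collections import Counter, defaultdict


def stratified_downsample(labels, max_per_group=10_000, seed=0):
    """Return sorted indices keeping at most max_per_group cells per label."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_group = defaultdict(list)
    for idx, label in enumerate(labels):
        by_group[label].append(idx)
    keep = []
    for indices in by_group.values():
        if len(indices) > max_per_group:
            indices = rng.sample(indices, max_per_group)  # without replacement
        keep.extend(indices)  # small groups are kept in full
    return sorted(keep)


# Toy labels mimicking a myonuclei-dominated dataset (illustration only).
labels = ["myonuclei"] * 90 + ["satellite"] * 6 + ["endothelial"] * 4
keep = stratified_downsample(labels, max_per_group=10)
kept_counts = Counter(labels[i] for i in keep)
print(kept_counts)
```

The returned indices would then be used to subset the data for HVG selection only, so that the dominant population no longer drives which genes look variable.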
