Seed labeling vs Integration and label transfer: highly variable genes


Thanks for a great set of tools and for great documentation.

I am somewhat confused about the difference between the seed labeling tutorial and the one on Integration and label transfer. I noticed that the seed labeling tutorial does not bother with calculating highly variable genes and subsetting to them. Is there a specific reason for this?

As far as I can tell, seed labeling and label transfer are the same process, except for the presence of batches in the latter. Another question on here did use batched data in seed labeling, though. Am I missing some crucial difference between the two processes?

Thanks a bunch,

Hi Anand,

If you can ever get away with not selecting genes, that is great! When you do gene selection on one dataset and want to apply the analysis to a different dataset, there will be issues of ‘overfitting’ the analysis to the first dataset. Of course, single-cell omics data is huge, so to speed things up you often want to reduce the dataset size. If the variation in a gene's expression can be explained by independent noise given all the data you have at the time, dropping that gene is a very reasonable way to reduce the dataset size.

In my work I also have a rule of thumb to never use more genes than I have observations. I haven’t properly benchmarked the ability to learn gene variation from different dataset sizes, but I want to be sure that I’m not having issues due to not enough data to build a usable model.

So there are two factors: if the data is too large, things will be slow with too many genes; if the data is too small, the number of genes you can learn about is limited. So I’m thinking the authors of the tutorials balanced these factors. (If you have the time and the data, use as many genes as you can!)
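
To make the rule of thumb concrete, here is a minimal sketch of how one might pick a gene budget. The helper names (`choose_n_genes`, `top_dispersion_genes`), the optional `max_genes` speed cap, and the simulated counts are all made up for illustration, and the variance/mean ranking is only a crude stand-in for a proper highly variable gene method such as scanpy's `highly_variable_genes`. It is not what the tutorials do.

```python
import numpy as np


def choose_n_genes(n_cells, n_genes, max_genes=None):
    """Hypothetical helper: never keep more genes than observations.

    max_genes is an optional, user-chosen cap for speed on large datasets.
    """
    n_keep = min(n_cells, n_genes)
    if max_genes is not None:
        n_keep = min(n_keep, max_genes)
    return n_keep


def top_dispersion_genes(counts, n_top):
    """Return sorted column indices of the n_top genes by variance/mean.

    A crude stand-in for real highly variable gene selection.
    """
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    # Genes with zero mean get dispersion 0 so they are ranked last.
    dispersion = np.where(mean > 0, var / np.maximum(mean, 1e-12), 0.0)
    order = np.argsort(dispersion)[::-1]
    return np.sort(order[:n_top])


# Simulated small dataset (assumption): 200 cells x 500 genes.
rng = np.random.default_rng(0)
gene_rates = rng.gamma(2.0, 1.0, size=500)
counts = rng.poisson(lam=gene_rates, size=(200, 500))

n_keep = choose_n_genes(counts.shape[0], counts.shape[1], max_genes=4000)
keep = top_dispersion_genes(counts, n_keep)
subset = counts[:, keep]
```

With fewer cells than genes, as in the simulated data, the budget is capped at the number of cells. With more cells than genes and no speed cap, `choose_n_genes` simply returns the full gene count, which is the "use as many genes as you can" case.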


Thanks for the answer, Valentine! I noticed that the number of cells in the seed labeling tutorial is ~43k. It makes sense that they did not need to go for highly variable genes.