Seed labeling vs Integration and label transfer: highly variable genes

Anand · January 19, 2022, 10:50pm

Hello,

Thanks for a great set of tools and for great documentation.

I am somewhat confused about the difference between the seed labeling tutorial and the one on Integration and label transfer. I noticed that the seed labeling tutorial does not bother with calculating and subsetting to highly variable genes. Is there a specific reason for this?

As far as I can tell, seed labeling and label transfer are the same process, except for the presence of batches in the latter. Another question on here did use batched data in seed labeling. Am I missing some crucial difference between the two processes?

Thanks a bunch,
Anand

Valentine_Svensson · January 20, 2022, 7:15am

Hi Anand,

If you can ever get away with not selecting genes, that is great! When you do gene selection on one dataset and want to apply the analysis to a different dataset, there will be issues of ‘overfitting’ the analysis to the first dataset. Of course, single cell omics data is huge, so to speed things up you often want to reduce the dataset size. If the variation in a gene's expression can be explained by independent noise given all the data you have at the time, dropping such genes is a very reasonable way to reduce the dataset size.
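To make the gene selection idea concrete, here is a minimal sketch that ranks genes by a variance-to-mean ratio on a toy count matrix. This is a simplified stand-in, not the method used in the tutorials; the function name `rank_genes_by_dispersion` and the toy values are hypothetical.

```python
from statistics import mean, pvariance


def rank_genes_by_dispersion(counts, gene_names):
    """Rank genes by variance-to-mean ratio, highest first.

    counts: list of per-cell rows, each a list with one count per gene.
    This is an illustrative score only, not a real HVG implementation.
    """
    scores = {}
    for j, gene in enumerate(gene_names):
        column = [row[j] for row in counts]
        mu = mean(column)
        # A gene with zero mean carries no signal here; score it 0.
        scores[gene] = pvariance(column) / mu if mu > 0 else 0.0
    return sorted(gene_names, key=lambda g: scores[g], reverse=True)


# Toy data: 4 cells x 3 genes (hypothetical values).
counts = [
    [5, 0, 1],
    [5, 10, 1],
    [5, 0, 2],
    [5, 10, 0],
]
genes = ["flat", "bimodal", "low"]

ranked = rank_genes_by_dispersion(counts, genes)
selected = ranked[:2]  # keep the two most variable genes
```

Note that `selected` is computed from this one dataset only. Reusing the same list on a second dataset is where the ‘overfitting’ concern comes in, since genes that vary in the second dataset may have been dropped.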

In my work I also have a rule of thumb to never use more genes than I have observations. I haven’t properly benchmarked the ability to learn gene variation from different dataset sizes, but I want to be sure that I’m not having issues due to not enough data to build a usable model.

So there are two factors: if the data is too large, things will be slow with too many genes; if the data is too small, the number of genes you can learn about is limited. So I’m thinking the authors of the tutorials balanced these factors. (If you have the time and the data, use as many genes as you can!)
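As a rough sketch of how these two factors could be balanced, assuming the rule of thumb of never using more genes than observations plus an optional cap for speed (the helper `gene_budget`, its parameters, and the dataset sizes are hypothetical and not part of scvi-tools):

```python
def gene_budget(n_cells, n_genes, max_genes_for_speed=None):
    """Decide how many genes to keep.

    - Never keep more genes than there are cells (rule of thumb).
    - Optionally cap the number of genes to keep training fast.
    """
    budget = min(n_genes, n_cells)
    if max_genes_for_speed is not None:
        budget = min(budget, max_genes_for_speed)
    return budget


# Hypothetical dataset sizes for illustration.
small = gene_budget(n_cells=2_000, n_genes=20_000)    # limited by cell count
large = gene_budget(n_cells=43_000, n_genes=20_000)   # all genes can be kept
capped = gene_budget(n_cells=43_000, n_genes=20_000, max_genes_for_speed=4_000)
```

With enough cells and enough time, the budget is simply the full gene set, which matches the advice to use as many genes as you can.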

Best,
/Valentine

Anand · January 20, 2022, 4:01pm

Thanks for the answer, Valentine! I noticed that the number of cells in the seed labeling tutorial is ~43k. It makes sense that they did not need to go for highly variable genes.

Thanks,
Anand
