Preserving biological variability in scVI sample integration

Hello scVI community!
I am utilizing scVI for the first time to integrate scRNA-seq data from over 50 samples (batches).

Each sample in my dataset also corresponds to a biological variable of interest, namely age. I am concerned that the integration might over-correct and remove age-related biological variation.

  1. Could you advise on how to configure scVI to recognize age as a critical biological variable while still correcting for technical batch effects?

  2. Also, my scRNA-seq data is formatted for Seurat. I noticed that the scVI tutorials typically demonstrate preprocessing with log normalization. However, since scVI takes raw counts rather than normalized data as input, I am curious whether the choice of normalization method (log normalization or SCTransform) influences the scVI integration results at all.

Any insights or suggestions would be greatly appreciated!

Hi, thanks for your questions!

  1. If you’d like the model to account for continuous covariates in your data such as age, you may pass in the corresponding AnnData.obs keys with the continuous_covariate_keys argument in scvi.model.SCVI.setup_anndata. This will, by default, concatenate these continuous values to the latent representation and then pass in that combined vector to the decoder.

  2. Correct, scVI expects raw counts as input. We typically compute log-normalized counts in tutorials for downstream functions such as PCA-UMAP visualization, but you’ll notice that we feed in the raw counts only to scVI. We don’t recommend passing in log-normalized or otherwise transformed data to scVI as these values will be out-of-distribution for the decoder.
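To make both points concrete, here is a minimal sketch rather than a definitive recipe. It assumes the raw counts live in a layer named "counts", that adata.obs has a "sample" column used as the batch and a numeric "age" column; those three names are placeholders, not something scVI requires. The helper looks_like_raw_counts is a hypothetical heuristic (non-negative integer values only) and does not prove that the matrix is untransformed.

```python
import numpy as np


def looks_like_raw_counts(X):
    """Heuristic check: all stored values are non-negative integers.

    Accepts a dense array or a SciPy CSR/CSC matrix (its .data array is used).
    """
    values = X.data if hasattr(X, "tocsr") else X
    values = np.asarray(values)
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


def scvi_setup_kwargs(counts_layer="counts", batch_key="sample", age_key="age"):
    """Keyword arguments for scvi.model.SCVI.setup_anndata.

    The layer and column names are assumptions about how the AnnData is laid out.
    """
    return {
        "layer": counts_layer,                    # raw counts, not log-normalized values
        "batch_key": batch_key,                   # technical batch to correct for
        "continuous_covariate_keys": [age_key],   # age passed as a continuous covariate
    }


def fit_scvi(adata, **setup_kwargs):
    """Register the AnnData, train scVI with default settings, return the latent space."""
    import scvi  # imported here so the helpers above work without scvi-tools installed

    scvi.model.SCVI.setup_anndata(adata, **setup_kwargs)
    model = scvi.model.SCVI(adata)
    model.train()
    return model.get_latent_representation()
```

Typical use would be to call looks_like_raw_counts(adata.layers["counts"]) first, then fit_scvi(adata, **scvi_setup_kwargs()), and compute PCA/UMAP-style visualizations on the returned latent representation or on separately log-normalized data.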

Hi @martinkim0 , thank you so much for the prompt response!

Regarding your suggestion for my first question, the age covariate consists of discrete integer values ranging from 1 to 50. Given that these are distinct age categories rather than continuous measurements, should I represent them as floating-point numbers (e.g., 1.0 to 50.0) to imply continuity? Or is there an alternative approach within scVI that better accommodates discrete variables like age groups?

Thank you again!

I think it makes sense to convert these age values to floats, since they have a natural ordering and we’d like the model to treat them as ordered. What do you think @cane11?
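A minimal sketch of that conversion, using a toy pandas table in place of adata.obs. The column names "age", "age_float" and "age_scaled" are placeholders, and the standardization step is an optional extra I am assuming here, not something scVI requires.

```python
import pandas as pd

# Toy stand-in for adata.obs: one row per cell, integer age per sample.
obs = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s2", "s3"],
        "age": [1, 1, 25, 50],
    }
)

# Cast the integer ages to floats so they are treated as an ordered, numeric covariate.
obs["age_float"] = obs["age"].astype("float64")

# Optional (an assumption, not a requirement): z-score the covariate so it has
# zero mean and unit variance across cells.
obs["age_scaled"] = (obs["age_float"] - obs["age_float"].mean()) / obs["age_float"].std()
```

Either column could then be listed under continuous_covariate_keys in setup_anndata.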

I would recommend using your samples (donors) as the batch_key in scVI and not feeding in additional covariates. In my hands, feeding in additional covariates can lead to overcorrection. For downstream analysis, we used age groups to study age-related differences in our publication "Multimodal profiling reveals tissue-directed signatures of human immune cells altered with age" (available on PubMed), where you can see the downstream analyses we used to distinguish aging features. We used MrVI there, but scVI will likely work for identifying cell types and similar tasks. For DE analysis, I would use pseudobulk DE. ContrastiveVI, available in scvi-tools 1.1, could also be an interesting model for studying aging, but I have no experience with it.
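For the pseudobulk step, a minimal sketch of the aggregation is below: raw counts are summed per sample and cell type, and the resulting table can be handed to a bulk DE tool of your choice with age as a sample-level variable. The function name pseudobulk and the column names "sample" and "cell_type" are hypothetical, and the sketch assumes a dense cells-by-genes count array with exactly two grouping columns.

```python
import numpy as np
import pandas as pd


def pseudobulk(counts, obs, gene_names, group_keys=("sample", "cell_type")):
    """Sum raw counts over all cells that share the same (sample, cell type).

    counts: dense array of shape (n_cells, n_genes) holding raw counts.
    obs: per-cell metadata with the grouping columns, in the same row order as counts.
    """
    # Map each (sample, cell_type) pair to the integer positions of its cells.
    groups = obs.groupby(list(group_keys), observed=True).indices
    keys = list(groups)
    summed = np.vstack([counts[groups[key]].sum(axis=0) for key in keys])
    index = pd.MultiIndex.from_tuples(keys, names=list(group_keys))
    return pd.DataFrame(summed, index=index, columns=gene_names)


# Toy example: 4 cells, 2 genes, 2 samples.
counts = np.array([[1, 0], [2, 1], [0, 3], [4, 4]])
obs = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s2", "s2"],
        "cell_type": ["T", "T", "T", "B"],
    }
)
pb = pseudobulk(counts, obs, gene_names=["g1", "g2"])
```

Summing raw counts (rather than averaging normalized values) keeps the pseudobulk profiles compatible with count-based bulk DE models.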