I have a large scVI-integrated dataset and was wondering what the best way to run SCENIC on it is. This GRN tool takes a raw matrix as input, ideally with all genes. But from what I have read, there seems to be no way to retrieve a corrected (but not normalised) count matrix of all genes from scVI.
Should I use the full raw count matrix, or the HVG-restricted scVI normalised matrix?
Hi, thanks for your question. I'm not familiar with SCENIC, but if you'd like access to scVI's corrected and unnormalized gene counts, you can run the following after training:
X = model.get_normalized_expression(library_size="latent")
Note that this will only return the genes that scVI was trained on.
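For reference, here is a minimal sketch of how that call might be wrapped so the result sits next to the raw counts. The helper name `store_scvi_expression` and the layer name `scvi_expr` are made up for illustration; `model` is assumed to be a trained scvi.model.SCVI and `adata` the AnnData object it was trained on.

```python
import numpy as np


def store_scvi_expression(model, adata, layer_name="scvi_expr"):
    """Store scVI's decoded expression as a layer (hypothetical helper).

    Assumes `model` is a trained scvi.model.SCVI and `adata` is the AnnData
    object it was trained on, so there is one column per trained gene.
    """
    expr = model.get_normalized_expression(adata, library_size="latent")
    # Keep only the values (cells x genes), dropping any row/column labels.
    adata.layers[layer_name] = np.asarray(expr)
    return adata.layers[layer_name]
```

Genes that were filtered out before training (for example non-HVGs) will not appear in this layer, so it cannot stand in for a full-gene matrix.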
Just out of curiosity, does that mean that the values are the same as the input scVI received?
Also, to comment on @jesswhitts' question: running GRN inference on uncorrected data might be interesting anyway, because SCENIC can be robust enough to "correct" for batch effects.
Can these corrected counts be used for DEG analysis (in a pseudobulk fashion)?
I could correct for batch effects in the UMAP and clustering with scVI, but when I run PyDESeq2 on the raw counts, I see very little overlap between the batches. I wonder if the scVI-corrected counts might be better for that purpose.
Is the "transform_batch" parameter required in this instance, or is specifying the latent library size enough? I can't quite figure out what the transform_batch parameter does from the docs.
@dub2s I'm not sure, actually. I have a feeling that it's more appropriate to use uncorrected raw counts for DESeq2. Maybe someone else with more experience with DESeq2 can comment.
If you want to perform differential expression on scVI-corrected counts, I would recommend using the built-in function for it; see scvi.model.SCVI in the scvi-tools documentation.
@jesswhitts Using transform_batch shouldn't be necessary, as it produces counterfactual reconstructions: it decodes the latent representation with a batch index different from the one in the actual data.
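To make the "counterfactual" idea concrete, here is a toy sketch in plain NumPy. It is not scVI's decoder: the weights `W` and `batch_offset` are random stand-ins, and `decode` is a made-up function that only mimics the idea of conditioning the output on a batch index.

```python
import numpy as np

rng = np.random.default_rng(0)
n_latent, n_genes, n_batches = 4, 6, 2

# Toy decoder parameters (random stand-ins, NOT scVI's learned network).
W = rng.normal(size=(n_latent, n_genes))
batch_offset = rng.normal(size=(n_batches, n_genes))


def decode(z, batch_index):
    """Map latent vectors to gene proportions, conditioned on a batch index."""
    logits = z @ W + batch_offset[batch_index]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    expo = np.exp(logits)
    return expo / expo.sum(axis=1, keepdims=True)  # softmax over genes


z = rng.normal(size=(3, n_latent))    # latent representation of 3 cells
observed_batch = np.array([0, 1, 1])  # batch each cell actually came from

# Default behaviour: decode each cell with its own batch index.
own = decode(z, observed_batch)

# Analogue of transform_batch: decode every cell as if it came from batch 0.
counterfactual = decode(z, np.zeros(3, dtype=int))
```

In this toy, the cell that really came from batch 0 gets the same output either way, while the two batch-1 cells change. That is the sense in which the reconstruction is counterfactual.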
When you use model.get_normalized_expression(library_size="latent"), this will by default use the empirical library size of your data. In other words, it will scale the normalized expression generated from the model by the total UMI counts in each cell in your data, so you don't need to explicitly pass in a library size.
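As a rough illustration of that scaling, with made-up numbers rather than real model output: `normalized` stands in for the per-cell gene proportions from the model (each row sums to 1), and `raw_counts` for the observed UMI counts.

```python
import numpy as np

# Toy raw UMI counts: 3 cells x 4 genes (made-up numbers).
raw_counts = np.array([
    [10, 0, 5, 5],
    [2, 2, 2, 2],
    [0, 30, 10, 0],
])

# Made-up proportions standing in for the model's normalized expression.
# Each row sums to 1.
normalized = np.array([
    [0.4, 0.1, 0.3, 0.2],
    [0.25, 0.25, 0.25, 0.25],
    [0.1, 0.6, 0.2, 0.1],
])

library_size = raw_counts.sum(axis=1, keepdims=True)  # total UMIs per cell
scaled = normalized * library_size                    # back on the count scale
```

Each row of `scaled` sums to that cell's total UMI count, but the entries are real-valued expected counts rather than integers.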
To conclude, I would recommend using raw counts for DESeq2. All autoencoder or factor models learn gene-gene correlations, which might lead to false positives. DESeq2 expects unnormalized count data as input, and you won't get this type of data out of scVI.
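If it helps, a minimal sketch of building a pseudobulk matrix from raw counts could look like the following. The helper `pseudobulk_sum` is hypothetical, and it assumes a dense cells x genes array of integer counts plus one sample label per cell; a sparse matrix would need to be handled differently.

```python
import numpy as np


def pseudobulk_sum(counts, sample_ids):
    """Sum raw counts over all cells that share a sample label.

    counts: (n_cells, n_genes) dense array of raw integer counts.
    sample_ids: length n_cells sequence of sample labels.
    Returns (sorted unique labels, (n_samples, n_genes) summed counts).
    """
    counts = np.asarray(counts)
    labels, inverse = np.unique(np.asarray(sample_ids), return_inverse=True)
    summed = np.zeros((labels.size, counts.shape[1]), dtype=counts.dtype)
    np.add.at(summed, inverse, counts)  # accumulate each cell into its sample row
    return labels, summed


counts = np.array([[1, 0, 3], [2, 1, 0], [0, 4, 1], [5, 0, 0]])
samples = ["s1", "s2", "s1", "s2"]
labels, bulk = pseudobulk_sum(counts, samples)
```

The summed matrix stays integer-valued, which is the kind of input DESeq2 expects, and batch can then go into the design as a covariate instead of being corrected beforehand.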