I have a large scVI-integrated dataset and was wondering what the best way to run SCENIC on it is. This GRN tool takes a raw count matrix as input, ideally including all genes. But from what I have read, there seems to be no way to retrieve a corrected (but not normalised) count matrix of all genes from scVI.
Should I use the full raw counts matrix, or the HVG-restricted scVI normalised matrix?
Is it appropriate to run GRN on uncorrected data?
Hi, thanks for your question. I’m not familiar with SCENIC, but if you’d like access to scVI’s corrected and unnormalized gene counts, you can run the following after training:
X = model.get_normalized_expression(library_size="latent")
Note that this will only return the genes that scVI was trained on.
Just out of curiosity, does that mean that the values are the same as the input scVI received?
Also, regarding @jesswhitts’s question: running GRN inference on uncorrected data might be worthwhile anyway, because SCENIC may be robust enough to “correct” for batch effects.
No, these values will be different from the input that scVI receives as they are reconstructed/generated counts from the model.
Can these corrected counts be used for differential expression (DEG) analysis in a pseudobulk fashion?
I could correct for batch effects in the UMAP and clustering with scVI, but when I run PyDESeq2 on the raw counts, I see very little overlap between the batches. I wonder whether the scVI-corrected counts might be better for that purpose.
Thanks for your help @martinkim0 !
Is the ‘transform_batch’ parameter required in this instance, or is specifying the latent library size enough? I can’t quite figure out what the transform_batch parameter does from the docs.
@dub2s I’m not sure actually - I have a feeling that it’s more appropriate to use uncorrected raw counts for DESeq2. Maybe someone else with more experience with DESeq2 can comment.
If you want to perform differential expression on scVI-corrected counts, I would recommend using the built-in function for it: scvi.model.SCVI — scvi-tools
transform_batch shouldn’t be necessary, as it produces counterfactual reconstructions. It just decodes the latent representation with a different batch index than the one the cell actually has in the data.
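To make the idea of a counterfactual reconstruction concrete, here is a minimal toy sketch. It is not scVI's actual decoder: the linear weights, the per-batch offsets and the `decode` helper are all hypothetical, chosen only to show the same latent vector being decoded under two different batch indices.

```python
import math


def softmax(values):
    """Turn logits into proportions that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def decode(z, batch_index, weights, batch_offsets):
    """Toy decoder (hypothetical): gene logits = W z + batch offset, then softmax."""
    logits = [
        sum(w * zi for w, zi in zip(row, z)) + batch_offsets[batch_index][g]
        for g, row in enumerate(weights)
    ]
    return softmax(logits)


# Hypothetical parameters: 3 genes, 2 latent dimensions, 2 batches.
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
batch_offsets = [[0.0, 0.0, 0.0], [1.0, -1.0, 0.0]]

z = [0.3, -0.2]  # latent representation of one cell that was observed in batch 0

observed = decode(z, 0, weights, batch_offsets)        # decoded with its own batch
counterfactual = decode(z, 1, weights, batch_offsets)  # "as if" it came from batch 1

print(observed)
print(counterfactual)
```

The latent vector is identical in both calls; only the batch index passed to the decoder changes, which is the sense in which the second output is counterfactual.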
When you use model.get_normalized_expression(library_size="latent"), this will by default use the empirical library size of your data. In other words, it will scale the normalized expression generated from the model by the total UMI counts in each cell in your data, so you don’t need to explicitly pass in a library size.
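As a rough illustration of the scaling described above, the sketch below multiplies per-cell proportions by each cell's total UMI count. It uses plain Python and made-up numbers rather than scvi-tools; `scale_by_library_size` and the toy matrices are hypothetical, and the proportions stand in for whatever the model would generate.

```python
def scale_by_library_size(normalized, counts):
    """Scale each cell's normalized expression by that cell's total raw counts."""
    scaled = []
    for proportions, raw in zip(normalized, counts):
        library_size = sum(raw)  # total UMI count observed in this cell
        scaled.append([p * library_size for p in proportions])
    return scaled


# Toy data: 2 cells x 3 genes.
raw_counts = [[10, 30, 60], [5, 5, 10]]
# Hypothetical model output: per-cell proportions, each row sums to 1.
normalized = [[0.2, 0.3, 0.5], [0.25, 0.25, 0.5]]

scaled = scale_by_library_size(normalized, raw_counts)
print(scaled)
```

Each scaled row sums to the cell's original library size, but the individual values are model-generated and generally not integers, so they are on the count scale without being raw counts.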
Thanks for your input. I wasn’t aware of the differential expression within scVI. I will be trying it soon!
This makes sense. Thanks for your help!
To conclude, I would recommend using raw counts for DESeq2. All autoencoder and factor models learn gene-gene correlations, and these can carry over into the reconstructed values and lead to false positives. DESeq2 expects unnormalized count data as input, and you won’t get this type of data out of scVI.
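For the pseudobulk route on raw counts, the aggregation step can be sketched as below. This is a dependency-free illustration with toy data, not a PyDESeq2 workflow; the `pseudobulk` helper and the sample labels are hypothetical, and with real data you would more likely aggregate an AnnData or pandas object.

```python
from collections import defaultdict


def pseudobulk(counts, sample_ids):
    """Sum raw integer counts over all cells that share a sample ID."""
    n_genes = len(counts[0])
    totals = defaultdict(lambda: [0] * n_genes)
    for cell_counts, sample in zip(counts, sample_ids):
        row = totals[sample]
        for gene_index, count in enumerate(cell_counts):
            row[gene_index] += count
    return dict(totals)


# Toy data: 4 cells x 3 genes, two cells per sample.
raw_counts = [
    [1, 0, 3],
    [2, 1, 0],
    [0, 4, 1],
    [5, 0, 2],
]
sample_ids = ["s1", "s1", "s2", "s2"]

bulk = pseudobulk(raw_counts, sample_ids)
print(bulk)
```

Because the sums are taken over the uncorrected raw counts, the result stays integer-valued, which is the kind of input DESeq2 expects.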