Inquiry about Data Input and DE Analysis Details in scVI

Dear SCVI team,

I hope this message finds you well. I am currently using scVI for my research and have some questions about data input and the analysis process, particularly differential expression (DE) analysis. Your insights would be very helpful for my understanding and use of the tool.

  1. Data Input for DE Analysis: When conducting DE analysis using scVI, what specific type of data should be inputted into the model? Is it the raw count data, normalized data within the model, or another form of standardized data? I noticed that in tutorials, there is no explicit specification of the data input type, which has led to some uncertainty about what exactly is being used for DE analysis.
  2. Interpretation of DE Results: Regarding the DE results, specifically when a mean_log2FC value is positive and exceeds 0.5, how should this be interpreted in terms of the comparison between two groups, say Group A and Group B? Does a positive value indicate that gene expression in Group A is greater than in Group B, or vice versa?
  3. Inclusion of Batch Parameters: In community discussions, there is often debate over whether to include a ‘batch’ parameter in the model, and how setting it to True or False might affect the outcomes. Could you provide some guidance on when it is advisable to include this parameter and when it might not be necessary?
  4. Data for Visualization: For visualizing markers or DE results, should we stick to the data form used in DE analysis or can we use normalized or scVI-normalized data?
  5. Scanpy Integration: Regarding the integration with scanpy for DE analysis, is normalized data typically used in scanpy, while scVI might use a different form of data?

I apologize for the multitude of questions, but your expertise would greatly clarify these crucial aspects, enabling more accurate application and interpretation of scVI in my work.

Thank you very much for your time and assistance.

Best regards,

I would also like to know whether I should normalize the raw count adata before running DE with scVI. As I understand it, the DE analysis uses the model that was initially trained with scvi.model.SCVI. Is that correct?

I can address a subset of the questions:

  1. Most of the models in scvi-tools (including scVI) require raw count data as input since the generative process parametrizes discrete distributions (negative binomial or Poisson). The generative process also learns normalized expression values (scvi.model.SCVI.get_normalized_expression) that are used for downstream differential expression. You can learn more about that in our user guide.

  2. On the batch parameter: it’s recommended to set a batch key using scvi.model.SCVI.setup_anndata when you expect there to be technical effects in your data (e.g. assay type), and this can help later for differential expression.
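
To make the two points above concrete, here is a minimal sketch rather than an official recipe. It assumes the raw counts are stored in a layer named "counts" and that adata.obs has "batch" and "cell_type" columns; those names, and both function names, are hypothetical. The helper only checks that values look like raw counts; the second function calls scvi-tools and imports it lazily.

```python
def looks_like_raw_counts(values):
    """Return True if every value is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for v in values)


def fit_scvi_and_run_de(adata, group1, group2, groupby="cell_type",
                        counts_layer="counts", batch_key="batch"):
    """Train scVI on raw counts and compare group1 against group2.

    The layer name and obs column names are assumptions for illustration.
    """
    # Imported lazily so the helper above works without scvi-tools installed.
    import numpy as np
    import scvi

    counts = adata.layers[counts_layer]
    # scipy sparse matrices keep their stored values in `.data`.
    stored = counts.data if hasattr(counts, "tocsr") else np.ravel(counts)
    if not looks_like_raw_counts(stored):
        raise ValueError("scVI expects raw counts, not normalized or log values")

    # Register the raw counts and the batch column with the model.
    scvi.model.SCVI.setup_anndata(adata, layer=counts_layer, batch_key=batch_key)
    model = scvi.model.SCVI(adata)
    model.train()

    # DE is computed from the trained model, not from a separately
    # normalized matrix.
    return model.differential_expression(
        groupby=groupby, group1=group1, group2=group2
    )
```

No separate normalization step is applied before training here; the model works from the raw counts directly.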

Thank you for your reply!

Positive LFC means higher expression in group1 than group2.
I prefer validating and plotting results using raw data. This gives you additional insights about the actual measured differences.
Scanpy takes log-transformed values (after library size normalization) as input, while scVI uses raw count data. In scVI, the reported LFC is an arithmetic mean of log fold changes; in scanpy, it is the log fold change of a geometric mean. See "The impact of package selection and versioning on single-cell RNA-seq analysis" (PubMed) for more details.
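
A toy illustration of why the two LFC definitions can disagree, and of the sign convention. This is a simplified sketch, not either package's actual code: the function names and numbers are made up, the paired values stand in for the normalized expression samples scVI compares between groups, and the second function mimics, as I understand it, how scanpy back-transforms the mean of log1p values before taking the ratio.

```python
import math


def mean_of_lfcs(group1, group2, eps=1e-8):
    """Arithmetic mean of per-pair log2 fold changes (scVI-style summary)."""
    lfcs = [math.log2(a + eps) - math.log2(b + eps)
            for a, b in zip(group1, group2)]
    return sum(lfcs) / len(lfcs)


def lfc_of_log_means(group1, group2, eps=1e-9):
    """log2 ratio of back-transformed log1p means (scanpy-style summary)."""
    mean1 = sum(math.log1p(x) for x in group1) / len(group1)
    mean2 = sum(math.log1p(x) for x in group2) / len(group2)
    return math.log2((math.expm1(mean1) + eps) / (math.expm1(mean2) + eps))


# Hypothetical normalized expression of one gene in two groups.
group1 = [4.0, 4.0, 16.0]
group2 = [1.0, 2.0, 4.0]

scvi_like = mean_of_lfcs(group1, group2)
scanpy_like = lfc_of_log_means(group1, group2)

# Both are positive because group1 is higher than group2, but the two
# summaries do not give the same number.
print(scvi_like, scanpy_like)

# Swapping the groups flips the sign.
print(mean_of_lfcs(group2, group1))
```

So the sign tells you the direction (group1 vs. group2), but LFC magnitudes from the two tools are not directly comparable.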