Hi, I have spent a few days learning how totalVI analyzes CITE-seq data, and I am a bit confused by the contrasting steps between Scanpy and totalVI.
I have used Scanpy for 10x scRNA-seq for over 2 years now and I love it.
The typical steps are as follows:
- Read in the data.
- Normalize and log-transform.
- Generate highly variable genes.
- Regress out, scale, PCA, neighbors, UMAP, and Leiden.
For TOTALVI, I realize it is quite different:
- Read in the data.
- Normalize and log-transform.
- Generate highly variable genes. The method in this step is very different! Why?
- vae = scvi.model.TOTALVI, vae.train. This is unique to scvi/TOTALVI.
Please advise: are these the appropriate steps to analyze CITE-seq data using scvi/TOTALVI?
totalVI replaces the regress-out, scale, and PCA steps of Scanpy. How you select genes is up to you (we use the seurat_v3 flavor of the function in Scanpy, which requires the raw counts and not the normalized data). All the other differences you see are there to ensure that the AnnData object has data in the correct slots.
Thanks for the explanation. I have CITE-seq data from donors 1 to 3, and each donor has a control and a treated sample. How can I set up the batch so I can see the clustering results for donors 1, 2, and 3, or for control versus treated?
I would set this up as six batches of [donor] x [treatment condition]. This means the latent representation will describe cell-to-cell variation that is not explained by donor-to-donor variation or by differences due to treatment. This way you can cluster your data and define cell types. Once you have annotated the cells into e.g. 10 cell types, you can do differential expression per cell type between treated and control.
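For example, the combined batch key can be built as a simple string concatenation over the per-cell metadata and then passed as `batch_key` to setup_anndata (the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical per-cell metadata: 3 donors x 2 treatment conditions
obs = pd.DataFrame({
    "donor":     ["donor1", "donor1", "donor2", "donor2", "donor3", "donor3"] * 10,
    "treatment": ["control", "treated"] * 30,
})

# One batch per (donor, treatment) combination -> 6 batches
obs["batch"] = obs["donor"] + "_" + obs["treatment"]
print(sorted(obs["batch"].unique()))

# then, on the real AnnData:
# scvi.model.TOTALVI.setup_anndata(adata, batch_key="batch", ...)
```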
This is a very effective way to learn how various cell types in a given context react differently to a perturbation and, by investigating the DE genes, in what way they may be communicating with each other.
I am also interested in doing DE analysis, but I am not sure which values and methods to use. So far, I have exported raw gene counts and denoised protein values to R for DE analysis on pseudobulked samples, using edgeR and limma, respectively.
Does this approach make sense?
What I do depends on how complex the experimental design is. In straightforward settings I use scVI. If there is a nested or hierarchical design, I pseudobulk each sample and use GLMMs.
I don’t think you should do statistics on denoised values (without incorporating uncertainty about the denoising). The confidence/credible intervals will no longer be related to your observed data.
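The pseudobulk step itself can be sketched with plain pandas: sum the raw counts within each (sample, cell type) group, then export those sums for edgeR/limma. All names here (samples, cell types, genes) are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical raw counts: 300 cells x 50 genes
counts = pd.DataFrame(
    rng.poisson(2, size=(300, 50)),
    columns=[f"gene{i}" for i in range(50)],
)
# Hypothetical per-cell annotations (sample of origin and cell type)
meta = pd.DataFrame({
    "sample":    rng.choice(["d1_ctrl", "d1_trt", "d2_ctrl", "d2_trt"], size=300),
    "cell_type": rng.choice(["T", "B", "NK"], size=300),
})

# Sum raw counts per (sample, cell type): one pseudobulk profile per group
pseudobulk = counts.groupby([meta["sample"], meta["cell_type"]]).sum()
```

Because the aggregation is a plain sum of raw counts, the totals are preserved and the result stays a valid count matrix for count-based methods like edgeR.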
For the sc.pp.highly_variable_genes step, I used to do this:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
What are the best practices for selecting highly variable genes?
I appreciate your insights!