Hi all,
I recently analyzed a scRNA-seq dataset with transcriptome and protein measurements and I would now like to enrich signatures from other datasets on the resulting clusters using the AUCell algorithm. For those not familiar with this tool, genes in each cell are ranked from highest to lowest value and then the signatures are tested for their enrichment in the top X genes for each cell.
However, I realized that because I used the top 2,000 genes to calculate the joint space for transcriptome and protein markers, my denoised matrix is smaller than usual, and I end up with signatures that have less than 20% of their genes present in my dataset when testing for enrichment.
I was wondering whether anyone has run into this before and how they overcame this bottleneck.
Thanks in advance,
Theo
Is it standard to use denoised/smoothed expression for AUCell?
This is a good question. I have not seen an analysis on such a dataset before; I only used these values because of the recommendation in the vignette. Alternatively, I could use the raw data, where I have all the captured genes.
The only other way around this is to run totalVI with all genes. There is nothing wrong with that, but the latent space representation might suffer mildly (the same effect you’d see with PCA when using all genes versus only highly variable genes).
For this particular problem, if you don’t have enough cells to run scVI on all genes but you still want the latent expression levels for some particular genes, you can pre-define a whitelist of genes annotated in the pathways you are interested in scoring cells for with the AUCell strategy.
Run the HVG selection method to identify some highly variable genes, then filter the AnnData object so you select both your top highly variable genes and the genes from your whitelist.
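To make the filtering step concrete, here is a minimal sketch of combining an HVG mask with a whitelist mask. It runs on a toy NumPy matrix, and the variance-to-mean ranking is only a stand-in for a real HVG method; the gene names, `n_top`, and `whitelist` are made up for illustration. With an AnnData object, I would expect the same boolean mask to be applied to the gene axis (something like `adata[:, keep].copy()`), but treat that as an assumption about your setup.

```python
import numpy as np

rng = np.random.default_rng(0)
gene_names = np.array([f"gene{i}" for i in range(100)])
# Toy count matrix (cells x genes) with a different mean per gene.
counts = rng.poisson(lam=rng.uniform(0.5, 5.0, size=100), size=(30, 100))

# Stand-in HVG selection: keep the n_top genes with the largest variance-to-mean ratio.
n_top = 20
dispersion = counts.var(axis=0) / np.maximum(counts.mean(axis=0), 1e-12)
hvg_mask = np.zeros(gene_names.size, dtype=bool)
hvg_mask[np.argsort(-dispersion)[:n_top]] = True

# Hypothetical whitelist of pathway genes; names absent from the data are simply ignored.
whitelist = ["gene3", "gene42", "gene99", "not_in_data"]
whitelist_mask = np.isin(gene_names, whitelist)

# Keep a gene if it is highly variable OR whitelisted.
keep = hvg_mask | whitelist_mask
filtered_counts = counts[:, keep]
filtered_genes = gene_names[keep]

# Worth checking which signature genes were never captured at all.
missing = sorted(set(whitelist) - set(gene_names.tolist()))
```

Checking `missing` before training is useful because whitelisted genes that were never captured cannot be recovered by any filtering choice.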
Thank you Valentine for your answer. I have enough cells (>9,000), but I was not sure which count table to use in downstream analysis: the denoised latent counts or simply the raw counts? In other cases, e.g. for DEG analysis, the vignette uses the 2,000 HVGs, and I initially proceeded like this for AUCell, too. The issue was that I could not run the test because most of the signature genes were not present in this gene list.
I don’t think 9,000 cells is enough to run scVI genome-wide. But if you take the 2,000 HVGs you’re using and also add in the genes present in the pathways you want to apply the AUCell strategy to, I don’t think it will be more than ~3,000 genes in total, which should be fine.
If I understand correctly, AUCell looks for rank-based enrichment of a pathway among the most highly expressed genes in each cell. I would adapt this workflow to scVI like this:
- Sample latent expression level (the ‘denoised’ level that is) for all genes from the posterior.
- Calculate the rank statistics based on these, and store.
- Repeat the two steps above ~1,000 times (this will be very fast with scVI).
- Now you have a distribution of the rank statistics that is consistent with the uncertainty in the data and that accounts for differences in sequencing depth between cells. You can use this distribution to answer questions such as “what is the probability that the rank statistic is above my threshold X?” (where hopefully you can tell what a biologically/analytically meaningful value of X is).
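The steps above can be sketched roughly as follows. This is not the AUCell package itself: `aucell_like_score` is a simplified area-under-the-recovery-curve score written for illustration, and `sample_denoised` is a hypothetical stand-in that adds Gamma noise around a toy mean. In a real analysis the draws would come from the trained model (scvi-tools models expose `get_normalized_expression` for denoised values; check its documentation for how to get individual posterior samples rather than the mean).

```python
import numpy as np

def aucell_like_score(expr, gene_idx, top_n):
    """Simplified AUCell-style score per cell (rows = cells, columns = genes)."""
    # Rank 0 = most highly expressed gene in that cell.
    ranks = np.argsort(np.argsort(-expr, axis=1), axis=1)
    # A signature gene at rank r < top_n adds (top_n - r) to the area under the recovery curve.
    area = np.clip(top_n - ranks[:, gene_idx], 0, None).sum(axis=1)
    # Largest possible area: all signature genes sit at the very top of the ranking.
    best = sum(top_n - i for i in range(min(len(gene_idx), top_n)))
    return area / best

def sample_denoised(mean, rng, concentration=20.0):
    """Hypothetical stand-in for one posterior draw of denoised expression."""
    return rng.gamma(shape=mean * concentration, scale=1.0 / concentration)

rng = np.random.default_rng(0)
n_cells, n_genes, top_n = 4, 200, 20
signature = np.arange(10)  # column indices of the signature genes

# Toy "true" expression: the signature is high in cells 0-1 and at background in cells 2-3.
mean = np.full((n_cells, n_genes), 1.0)
mean[:2, signature] = 10.0

# Sample, score, store; repeated n_draws times.
n_draws = 200
scores = np.stack([
    aucell_like_score(sample_denoised(mean, rng), signature, top_n)
    for _ in range(n_draws)
])  # shape: (n_draws, n_cells)

# "What is the probability that the rank statistic is above my threshold X?"
threshold = 0.5
prob_above = (scores > threshold).mean(axis=0)
```

Here `prob_above` holds one probability per cell; the threshold of 0.5 is arbitrary and would need to be chosen for the signature at hand.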
You could also create some random gene sets that are the same size as the gene sets in the pathways you are analyzing. The distributions of the rank statistics for these random sets can then serve as a negative control against which to compare the distributions for your pathways.
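A minimal sketch of the random gene set control, again on toy data. The statistic used here (fraction of the set among each cell's top genes) is a deliberately simple stand-in for the AUCell score, and the pathway, matrix sizes, and number of random sets are all made up.

```python
import numpy as np

def top_fraction(expr, gene_idx, top_n):
    """Fraction of the gene set found among each cell's top_n expressed genes."""
    ranks = np.argsort(np.argsort(-expr, axis=1), axis=1)  # rank 0 = highest
    return (ranks[:, gene_idx] < top_n).mean(axis=1)

rng = np.random.default_rng(1)
n_cells, n_genes, top_n = 50, 300, 30
expr = rng.gamma(shape=2.0, scale=1.0, size=(n_cells, n_genes))  # toy expression matrix

pathway = np.arange(15)   # hypothetical pathway: the first 15 genes
expr[:, pathway] += 5.0   # make the pathway clearly enriched in the toy data

# Observed statistic, averaged over cells.
observed = top_fraction(expr, pathway, top_n).mean()

# Null distribution: random gene sets of the same size as the pathway.
n_random = 500
null = np.array([
    top_fraction(expr, rng.choice(n_genes, size=len(pathway), replace=False), top_n).mean()
    for _ in range(n_random)
])

# Empirical p-value with the usual +1 correction.
p_value = (1 + np.sum(null >= observed)) / (1 + n_random)
```

Matching the random sets on size matters because rank-based scores depend on how many genes are in the set. In practice you would probably compute this per cluster rather than over all cells.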