Thoughts on a more ~realistic tutorial?

I’m just getting started with scvi-tools — thanks for this amazing package!

Browsing the tutorials, none of them seem applicable to analyzing new scRNA data – is there a workflow showing best practices for processing new data?

For example, I have a new scRNA dataset I’m trying to analyze – just Cell Ranger .mtx output. The tutorials I see all depend on previously known cell type annotations, metadata (e.g. percent_mito), marker matrices, etc.

Do you have a good workflow for processing totally new data (e.g. annotating unknown clusters, etc.)? I’m envisioning something like this (slightly aging) notebook, but with scvi.


I think it’s a good idea generally. I guess I always assumed users would come to scvi-tools after preprocessing with standard procedures in scanpy/Seurat, and I didn’t want to create an overlap in content.

Outside of the preprocessing, it’s probably worth having a tutorial showing cluster annotation.


Thanks – I’m doing scRNA work as very much a side project – I was inspired by the “End-to-end analysis” bullet on your landing page 🙂

As someone very new to scRNA work, it’s unclear to me where scvi fits into a standard analysis — it looks like I’d build an scvi.model in lieu of scran / SCTransform / sc.pp.combat & sc.pp.regress_out — is that correct?

If you have a 10x output folder from a paper (I prefer data that can be cited with a paper DOI so someone’s work gets highlighted), I can make a quick notebook showing how I typically pull in data and fit a model for exploratory analysis with scVI 🙂

I use scanpy’s read_10x_mtx() function, but after that it’s all an scVI model. Depending on the available metadata, you’d iterate on the model design (how do things look with and without integrating out batch? are batches so different that you need more layers? are there a lot of low-quality cells, or just a few? etc.).

Indeed, data scaling and integration of batches are part of the scVI model you build, as is differential expression. The things you need to do “yourself” are e.g. clustering, if you want to do that, or some other interpretation of cell similarities / topology (like ‘pseudotime’, if you have to do something like that for some reason). To quickly get an idea of the biological sources of variation between your cells, I think running scanpy’s sc.tl.leiden() on the scVI latent representation gets you pretty far in terms of seeing whether you have distinct cell types in your data. Then you can start zooming in, add some supervision, or find a reference dataset to annotate with.

/Valentine


Is this something you’d like to add to our docs site? We can help flesh out the text if you pass it off to us.

Sure! If you just provide the dataset, that helps a lot – finding a good dataset is what would take me the longest. I do these exploratory analyses so often I can probably do them in my sleep, haha.

@Valentine_Svensson Thoughts?

https://www.nature.com/articles/s41467-019-14118-w

I think the GEO deposition has the raw count matrix for each donor. How would you go about labeling from scratch? Given the popularity of blood datasets, maybe you could try some label transfer?


Sounds good! I’ll probably get to it either over the weekend or an evening next week some time.

For an exploratory analysis I wouldn’t get into the intricacies of cell type labeling. But could be an interesting follow-up to the tutorial: Tutorial 1) Load data and explore, save fitted model. Tutorial 2) Load the fitted model, convert to scANVI, transfer labels from somewhere. (I don’t have an annotated dataset at hand to transfer from though.)

Ideally you wouldn’t use the study labels, but sort of reproduce the study results to some extent. I see now in the first results section they have the markers for each cell type…

@Valentine_Svensson A good example of ~automated cell cluster labeling would be amazing – it’s my current biggest pain point. I’m trying to have an all-in-one Colab for some analyses – it’s hard to beat the ease of SingleR, where you get ~reasonable annotations on any log-normalized counts matrix with:

reference = celldex::HumanPrimaryCellAtlasData()
SingleR(test = your_single_cell_experiment_object, ref = reference, assay.type.test = 1, labels = reference$label.fine)

Both CellAssign and SCVI/scANVI (transfer from Tabula Sapiens) have a lot more complexity, and haven’t worked well in my hands. Both seem to require that set(training genes) == set(testing genes), which is tricky if I have any genes drop out. CellAssign doesn’t provide a broadly applicable reference matrix; when I build my own (e.g. using cellassign::marker_list_to_mat() with LM22 or similar), I get pretty poor annotations. SCVI has the headache of requiring (from what I can tell) identical obs in training and testing data (e.g. cell_ontology_class, batch_key, etc.).

It’s not a strict requirement; you can always add all-zero genes to the query data to make the gene sets equal.
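A minimal sketch of that padding idea, using plain Python lists as a stand-in for the count matrix (in a real workflow you’d do this on an AnnData object; the function and variable names here are made up for illustration):

```python
def pad_to_reference(counts, genes, ref_genes):
    """Pad a cells-x-genes count matrix with all-zero columns so its
    gene set matches the reference gene list, in reference order.

    counts    : list of per-cell rows, one count per gene in `genes`
    genes     : gene names for the columns of `counts`
    ref_genes : gene order expected by the trained model
    """
    col = {g: i for i, g in enumerate(genes)}
    padded = []
    for row in counts:
        # Genes missing from the query get a zero count;
        # shared genes keep their observed value.
        padded.append([row[col[g]] if g in col else 0 for g in ref_genes])
    return padded

# Tiny example: the query data lacks "CD8A", so it becomes a zero column.
query = [[5, 2], [0, 7]]
padded = pad_to_reference(query, ["CD3E", "MS4A1"], ["CD3E", "CD8A", "MS4A1"])
# padded == [[5, 0, 2], [0, 0, 7]]
```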

There are two approaches here:

  1. De novo integration of your train and test data → classifier on train latent space, predictions on test latent space (tutorial)
  2. scArches (the tutorial covers the totalVI case, but the scVI code is basically identical)

@semenko what tissue are you working with?

I put a draft of a workflow here: Exploratory analysis with scVI · GitHub

I did it from the perspective of not having any idea of tissues or cells. Just how I’d investigate a random dataset starting from mtx files.

We should probably clean up the text and add some more explanations.
The exploratory analysis kind of stops at the step of having 8,000 marker genes to investigate. This particular dataset has a couple of interesting contrasts one can look at: PBMC vs CSF, MS vs Healthy. It doesn’t go into that though, just exploring how gene expression varies between the different kinds of cells in the samples.

@adamgayoso I’m currently working with very poorly differentiated / unknown primary human tumor biopsies – none of the tissues in Tabula Sapiens are particularly closely related. (i.e. where would you put Testes / Adrenal / Parathyroid?)

I’m only trying to get relatively high-level annotations for use with EcoTyper – which requires one of ~12 annotations (B cell, NK cell, Fibroblast, CD4 T cell, etc.).

So far, this has been easy with SingleR (perhaps ~20% of annotations are overly specific – but you can easily pick out the right tissue from its plotScoreHeatmap).

I’ll try to give this another shot with label transfer & Tabula Sapiens and share a Colab next week – I wonder if the lack of a clear Tabula Sapiens reference is a blocker? (Similarly, from my first shots with CellAssign, the annotations didn’t make much sense – perhaps I chose a bad reference matrix, though I had pretty reasonable selections for canonical markers.)

Yes, this is problematic for the “VI” algorithms we have in the codebase. CellAssign should work OK, though. Did you note the tutorial instruction? The library size has to be relative to the mean library size:

import numpy as np

lib_size = adata.X.sum(1)  # total counts per cell
adata.obs["size_factor"] = lib_size / np.mean(lib_size)
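As a quick numeric check of what that computes, here is the same arithmetic in plain Python with made-up per-cell totals (no AnnData involved):

```python
# Per-cell library sizes (total counts) for three hypothetical cells.
lib_size = [1000.0, 2000.0, 3000.0]

mean_lib = sum(lib_size) / len(lib_size)        # 2000.0
size_factor = [s / mean_lib for s in lib_size]  # [0.5, 1.0, 1.5]
```

A cell with exactly average depth gets a size factor of 1.0, which is the scale CellAssign expects.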