Thoughts on a more ~realistic tutorial?

I’m just getting started with scvi-tools — thanks for this amazing package!

Browsing the tutorials, none of them seem applicable to analyzing new scRNA data – is there a workflow showing best practices for processing new data?

For example, I have a new scRNA dataset I’m trying to analyze – just Cell Ranger .mtx output. The tutorials I see all depend on previously known cell type annotations, metadata (e.g. percent_mito), marker matrices, etc.

Do you have a good workflow for processing totally new data (e.g. annotating unknown clusters, etc.)? I’m envisioning something like this (slightly aging) notebook, but with scvi.


I think it’s a good idea generally. I guess I always assumed users would come to scvi-tools after preprocessing with standard procedures in scanpy/Seurat, and I didn’t want to create an overlap in content.

Outside of the preprocessing, it’s probably worth having a tutorial showing cluster annotation.


Thanks – I’m doing scRNA work as very much a side project – I was inspired by the “End-to-end analysis” bullet on your landing page 🙂

As someone very new to scRNA work, it’s unclear to me where scvi fits into a standard analysis — it looks like I’d build an scvi.model in lieu of scran / SCTransform / sc.pp.combat & sc.pp.regress_out — is that correct?

If you have a 10x output folder from a paper (I prefer data that can be cited with a paper DOI so someone’s work gets highlighted), I can make a quick notebook showing how I typically pull in data and fit a model for exploratory analysis with scVI 🙂

I use scanpy’s read_10x_mtx() function, but after that it’s all an scVI model. Depending on the available metadata, you’d iterate on the model design (how do things look with and without integrating out batch? are batches so different that you need more layers? are there a lot of low-quality cells, or just a few? etc.).

Indeed, data scaling and integration of batches are part of the scVI model you build, as is differential expression. The things you need to do “yourself” are e.g. clustering, if you want to do that, or some other interpretation of cell similarities / topology (like ‘pseudotime’, if you have to do something like that for some reason). To quickly get an idea of the biological sources of variation between your cells, I think running scanpy’s sc.tl.leiden() on the scVI latent representation gets you pretty far in terms of seeing whether you have distinct cell types in your data. Then you can start zooming in, add some supervision, or find a reference dataset to annotate with.

/Valentine


Is this something you’d like to add to our docs site? We can help flesh out the text if you pass it off to us.

Sure! If you just provide the dataset, that helps a lot – finding a good dataset is what would take me the longest. I do these exploratory analyses so often I can probably do them in my sleep, haha.

@Valentine_Svensson Thoughts?

https://www.nature.com/articles/s41467-019-14118-w

I think the GEO deposition has the raw count matrix for each donor. How would you go about labeling from scratch? Given the popularity of blood datasets, maybe you could try some label transfer?


Sounds good! I’ll probably get to it either over the weekend or an evening next week some time.

For an exploratory analysis I wouldn’t get into the intricacies of cell type labeling. But could be an interesting follow-up to the tutorial: Tutorial 1) Load data and explore, save fitted model. Tutorial 2) Load the fitted model, convert to scANVI, transfer labels from somewhere. (I don’t have an annotated dataset at hand to transfer from though.)

Ideally you wouldn’t use the study labels, but sort of reproduce the study results to some extent. I see now in the first results section they have the markers for each cell type…

@Valentine_Svensson A good example of ~automated cell cluster labeling would be amazing – it’s my current biggest pain point. I’m trying to have an all-in-one Colab for some analyses – it’s hard to beat the ease of SingleR, where you get ~reasonable annotations on any log-normalized counts matrix with:

reference = celldex::HumanPrimaryCellAtlasData()
SingleR(test = your_single_cell_experiment_object, ref = reference, assay.type.test = 1, labels = reference$label.fine)

Both CellAssign and SCVI/scANVI (transfer from Tabula Sapiens) have a lot more complexity, and haven’t worked well in my hands. Both seem to require that set(training genes) == set(testing genes), which is tricky if I have any genes drop out. CellAssign doesn’t provide a broadly applicable reference matrix; when I build my own (e.g. using cellassign::marker_list_to_mat() with LM22 or similar), I get pretty poor annotations. SCVI has the headache of requiring (from what I can tell) identical obs in training and testing data (e.g. cell_ontology_class, batch_key, etc.).

It’s not a strict requirement; you can always add all-zero genes to the query data to make the gene sets equal.
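A minimal sketch of that padding idea, using plain Python lists as a stand-in for the count matrix (in a real workflow you’d do this on an AnnData object; the function and variable names here are made up for illustration):

```python
def pad_to_reference(counts, genes, ref_genes):
    """Pad a cells-x-genes count matrix with all-zero columns so its
    gene set matches the reference gene list, in reference order.

    counts    : list of per-cell rows, one count per gene in `genes`
    genes     : gene names for the columns of `counts`
    ref_genes : gene order expected by the trained model
    """
    col = {g: i for i, g in enumerate(genes)}
    padded = []
    for row in counts:
        # Genes missing from the query get a zero count;
        # shared genes keep their observed value.
        padded.append([row[col[g]] if g in col else 0 for g in ref_genes])
    return padded

# Tiny example: the query data lacks "CD8A", so it becomes a zero column.
query = [[5, 2], [0, 7]]
padded = pad_to_reference(query, ["CD3E", "MS4A1"], ["CD3E", "CD8A", "MS4A1"])
# padded == [[5, 0, 2], [0, 0, 7]]
```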

There are two approaches here:

  1. De novo integration of your train and test data → classifier on train latent space, predictions on test latent space (tutorial)
  2. scArches (the tutorial covers the totalVI case, but the scVI code is basically identical)

@semenko what tissue are you working with?

I put a draft of a workflow here: Exploratory analysis with scVI · GitHub

I did it from the perspective of not having any idea of tissues or cells. Just how I’d investigate a random dataset starting from mtx files.

We should probably clean up the text and add some more explanations.
The exploratory analysis kind of stops at the step of having 8,000 marker genes to investigate. This particular dataset has a couple of interesting contrasts one can look at: PBMC vs CSF, MS vs Healthy. It doesn’t go into that though, just exploring how gene expression varies between the different kinds of cells in the samples.

@adamgayoso I’m currently working with very poorly differentiated / unknown primary human tumor biopsies – none of the tissues in Tabula Sapiens are particularly closely related. (i.e. where would you put Testes / Adrenal / Parathyroid?)

I’m only trying to get relatively high-level annotations for use with EcoTyper – which requires one of ~12 annotations (B cell, NK cell, Fibroblast, CD4 T cell, etc.).

So far, this has been easy with SingleR (perhaps ~20% of annotations are overly specific – but you can easily pick out the right tissue from its plotScoreHeatmap).

I’ll try to give this another shot with label transfer & Tabula Sapiens and share a Colab next week – I wonder if the lack of a clear Tabula Sapiens reference is a blocker? (Similarly, from my first shots with CellAssign, the annotations didn’t make much sense – perhaps I chose a bad reference matrix, though I had pretty reasonable selections for canonical markers.)

Yes, this is problematic for the “VI” algorithms we have in the codebase. CellAssign should work OK, though. Did you note the tutorial instruction? The library size has to be relative to the mean library size:

import numpy as np

lib_size = adata.X.sum(1)  # total counts per cell
adata.obs["size_factor"] = lib_size / np.mean(lib_size)
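As a quick numeric check of what that computes, here is the same arithmetic in plain Python with made-up per-cell totals (no AnnData involved):

```python
# Per-cell library sizes (total counts) for three hypothetical cells.
lib_size = [1000.0, 2000.0, 3000.0]

mean_lib = sum(lib_size) / len(lib_size)        # 2000.0
size_factor = [s / mean_lib for s in lib_size]  # [0.5, 1.0, 1.5]
```

A cell with exactly average depth gets a size factor of 1.0, which is the scale CellAssign expects.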