Best practices for processing/analyzing large-scale scRNA-seq datasets across multiple days

I’m using 10x for library prep and Python only (kallisto-bustools, anndata, scanpy, scvi-tools) for processing/analysis. I’m trying to figure out best practices for processing samples prepared across multiple days. As an example, say we prepared samples independently across 4 days (a time series), so that after sequencing and processing with kallisto-bustools (I guess it’d be similar for cellranger) we have 4 filtered and 4 unfiltered h5ad count matrices (each filtered matrix holds tens of thousands of cells). I’m interested in everything downstream from here. Some more specific questions:

  • What does your pipeline downstream of kallisto-bustools or cellranger look like if you have samples prepared individually across multiple days? (e.g. do you merge them and do QC filtering, or do you filter each individually?)
  • Do you use tools for ambient RNA removal and doublet detection? If so, which ones have you considered?
  • Do you use scVI or other autoencoder-based tools? How do you calibrate the hyperparameters for each individual run (e.g. latent dimension, number of layers)?
  • If you use scVI, do you make use of any of the categorical/continuous covariate options? E.g. in this tutorial (Introduction to scvi-tools — scvi-tools) they correct for percent_mito and percent_ribo as well as cell source and donor.
  • What are the biggest pitfalls you’ve encountered in processing your count matrices?

Hey @ricomnl,

I can try and answer a few of your questions.

First, I don’t think you’ll need to worry at that dataset size. Under a hundred thousand cells is quite manageable with most of our tools, and very manageable if you are on a workstation or HPC.

Cellranger typically downsamples reads across independent samples to even out the sequencing depths. I would also consider looking into the nf-core scrnaseq pipeline for other best practices. Ambient RNA removal will, I think, vary by protocol. For doublet detection, scanpy has an interface to Scrublet, but you can also check out tools like DoubletDetection.

@giovp or @Marius1311, do you think you’d be able to make a suggestion on current best tooling for time series data?

Hi @ricomnl, for time-series data, a good starting point is Waddington-OT; check out their website and documentation for plenty of explanation and tutorials: wot | A software package for analyzing snapshots of developmental processes

If you’re familiar with CellRank, or if you would like to do follow-up analysis on the WOT output, you can check out CellRank’s RealTimeKernel in a tutorial here: Time series datasets — CellRank master documentation

These two tutorials should help with some of your preprocessing questions. In addition, note that @giovp and I (with others) are currently developing an alternative to WOT that will easily scale to >100k cells and offer many more functions; see e.g. my MIA talk for some background: MIA: Marius Lange, Moscot: A toolbox for OT problems in single cell genomics; Primer by Marco Cuturi - YouTube

The tool is called moscot and will be released publicly on GitHub soon; you can follow us on Twitter to get notified when it’s out.
