Best practices for processing/analyzing large-scale scRNA-seq datasets across multiple days

I’m using 10x for library prep and Python only (kallisto-bustools, anndata, scanpy, scvi-tools) for processing and analysis. I’m trying to figure out best practices for processing samples prepared across multiple days. As an example, let’s say we prepared samples independently on 4 days (a time series), so that after sequencing and processing with kallisto-bustools (I guess it’d be similar for cellranger) we have 4 filtered and 4 unfiltered h5ad count matrices (each filtered matrix has tens of thousands of cells). I’m interested in everything downstream from here. Some more specific questions:

  • What does your pipeline downstream of kallisto-bustools or cellranger look like if you have samples prepared individually across multiple days? (e.g. do you merge them and do QC filtering, or do you filter each individually?)
  • Do you use tools for ambient RNA removal and doublet detection? If yes, which ones have you considered?
  • Do you use scVI or other tools that are based on autoencoders? How do you calibrate the hyperparameters in each individual run (such as latent dimension and number of layers)?
  • If you use scVI, do you make use of any of the categorical/continuous covariate options? E.g. in this tutorial (Introduction to scvi-tools — scvi-tools) they correct for percent_mito and percent_ribo as well as cell source and donor.
  • What are the biggest pitfalls you’ve encountered in processing your count matrices?

Hey @ricomnl,

I can try and answer a few of your questions.

First, I don’t think you need to worry about that dataset size. Fewer than a hundred thousand cells should be quite manageable with most of our tools, and even more so if you are on a workstation or an HPC cluster.

When aggregating independent samples (cellranger aggr), Cell Ranger typically downsamples reads to even out the sequencing depths. I would also consider looking into the nf-core scrnaseq pipeline for other best practices. Ambient RNA removal will, I think, vary by protocol. For doublet detection, scanpy has an interface to Scrublet, but you can also check out tools like DoubletDetection.
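To make the depth-equalization idea concrete, here is a minimal sketch in plain NumPy. Note the assumptions: it thins an already-built count matrix (binomial thinning of counts), which is a simplification and not the read-level subsampling that Cell Ranger performs; the function name `equalize_depth` and the toy data are hypothetical and only there for illustration.

```python
import numpy as np

def equalize_depth(samples, seed=0):
    """Thin each sample so mean counts per cell match the shallowest sample.

    `samples` maps a sample name to a dense (cells x genes) integer count array.
    This is count-level thinning, not Cell Ranger's read-level subsampling.
    """
    rng = np.random.default_rng(seed)
    mean_depth = {name: counts.sum(axis=1).mean() for name, counts in samples.items()}
    target = min(mean_depth.values())
    thinned = {}
    for name, counts in samples.items():
        keep_prob = target / mean_depth[name]  # exactly 1.0 for the shallowest sample
        # Each counted molecule is kept independently with probability keep_prob.
        thinned[name] = rng.binomial(counts, keep_prob)
    return thinned

# Toy data (hypothetical): day2 is sequenced about twice as deeply as day1.
toy_rng = np.random.default_rng(1)
samples = {
    "day1": toy_rng.poisson(2.0, size=(200, 50)),
    "day2": toy_rng.poisson(4.0, size=(200, 50)),
}
balanced = equalize_depth(samples)
for name, counts in balanced.items():
    print(name, round(float(counts.sum(axis=1).mean()), 1))
```

Whether you want this at all is a judgment call: discarding counts costs sensitivity, and many downstream normalization steps already account for per-cell depth, so treat it as an illustration of the idea rather than a recommended step.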

@giovp or @Marius1311, could you suggest the current best tooling for time series data?

Hi @ricomnl, for time series data, a good starting point is Waddington-OT (WOT). Check out their website and documentation for plenty of explanations and tutorials: wot | A software package for analyzing snapshots of developmental processes

If you’re familiar with CellRank, or if you would like to do follow-up analysis with the WOT output, then you can check out CellRank’s RealTimeKernel in a tutorial here: Time series datasets — CellRank master documentation

These two tutorials should help with some of your preprocessing questions. In addition, note that @giovp and I (with others) are currently developing an alternative to WOT which will easily scale to >100k cells and give you many more functions; see e.g. my MIA talk for some background: MIA: Marius Lange, Moscot: A toolbox for OT problems in single cell genomics; Primer by Marco Cuturi - YouTube

The tool is called moscot and will be released publicly on GitHub soon. You can follow us on Twitter to get notified when it’s out.