I’m using 10x for library prep and Python exclusively (kallisto-bustools, anndata, scanpy, scvi-tools) for processing and analysis. I’m trying to figure out best practices for processing samples collected across multiple days. As an example, let’s say we prepared samples independently on 4 days (a time series), so that after sequencing and processing with kallisto-bustools (I guess it’d be similar for cellranger) we have 4 filtered and 4 unfiltered h5ad count matrices (each filtered matrix has tens of thousands of cells). I’m interested in everything downstream from here. Some more specific questions:
What does your pipeline downstream of kallisto-bustools or cellranger look like if you have samples prepared individually across multiple days? (e.g. do you merge them and do QC filtering, or do you filter each individually?)
Do you use tools for ambient RNA removal and doublet detection? If yes, which ones have you considered?
Do you use scVI or other tools that are based on autoencoders? How do you calibrate the hyperparameters in each individual run (such as latent dimension or number of layers)?
If you use scVI do you make use of any of the categorical/continuous covariate options? e.g. in this tutorial (Introduction to scvi-tools — scvi-tools) they correct for percent_mito and percent_ribo as well as cell source and donor
What are the biggest pitfalls you’ve encountered in processing your count matrices?
First, I don’t think you’ll need to worry about that dataset size. Under a hundred thousand cells should be quite manageable with most of our tools, and very manageable if you are on a workstation or an HPC cluster.
When aggregating libraries, Cell Ranger (via cellranger aggr) by default subsamples reads from the independent samples to even out their sequencing depths. I would also consider looking into the nf-core scrnaseq pipeline for other best practices. I think ambient RNA removal will vary by protocol. For doublet detection, scanpy has an interface to Scrublet, but you can also check out tools like DoubletDetection.
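To make the depth-equalization idea concrete, here is a minimal sketch of the arithmetic: each sample keeps only the fraction of reads that brings it down to the depth of the shallowest sample. This is an illustration of the concept, not Cell Ranger's actual implementation; the function name `subsampling_fractions` and the per-day read depths are made up for the example.

```python
# Sketch only: shows depth equalization by subsampling, not Cell Ranger's code.

def subsampling_fractions(mean_reads_per_cell):
    """Return, per sample, the fraction of reads to keep so that every
    sample ends up at the depth of the shallowest sample."""
    if not mean_reads_per_cell:
        raise ValueError("need at least one sample")
    if any(depth <= 0 for depth in mean_reads_per_cell.values()):
        raise ValueError("depths must be positive")
    # The shallowest sample sets the common target depth.
    target = min(mean_reads_per_cell.values())
    return {
        sample: target / depth
        for sample, depth in mean_reads_per_cell.items()
    }


# Hypothetical mean mapped reads per cell for the four collection days.
depths = {"day1": 40_000, "day2": 25_000, "day3": 50_000, "day4": 30_000}
fractions = subsampling_fractions(depths)

for sample, frac in fractions.items():
    kept = depths[sample] * frac
    print(f"{sample}: keep {frac:.3f} of reads ({kept:.0f} reads per cell)")
```

With kallisto-bustools there is no such step by default, so large depth differences between days are worth checking before comparing samples.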
@giovp or @Marius1311, do you think you’d be able to make a suggestion on current best tooling for time series data?
If you’re familiar with CellRank, or if you would like to do follow-up analysis with the WOT output, then you can check out CellRank’s RealTimeKernel in a tutorial here: Time series datasets — CellRank master documentation