Struggling to understand the parameters for clustering and cell type annotation; also confused about the normalization method

Hi,
I’ve been working on microwell-seq data of bone marrow cells from AML patients and healthy donors. I’ve been stuck at the very beginning, trying to cluster and annotate the normal bone marrow cells (about 9k cells) and the combined normal BM plus peripheral blood data (about 80k cells after filtering). It seems to me that the microwell-seq data is of low quality. The paper says they filtered out cells with fewer than 500 transcripts, but I find that only a few hundred cells qualify instead of the several thousand they claimed, even though I redid the upstream processing myself (cell barcode/UMI extraction and alignment). Still, having spent this much time on it, I do want to finish the analysis.

For preprocessing:

  • Cells with fewer than 200 genes were filtered out
  • Genes expressed in fewer than 3 or 10 cells were filtered out
  • Doublets were detected with the SOLO model (run per sample separately)
  • The data was reloaded and the doublets removed
  • Cells with > 10% of counts from mitochondrial genes were removed

For feature selection:

I tried both scanpy's highly_variable_genes and highly deviant genes (HDGs) called by the scry package (recommended in Single-cell best practices).
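To make the thresholds above concrete, here is a minimal numpy-only sketch of the gene-count, cell-count and mitochondrial filters applied to a tiny made-up count matrix. The matrix, the `is_mito` mask and the function name `qc_filter` are all hypothetical stand-ins, not my real data or pipeline, and the SOLO doublet step is not shown.

```python
import numpy as np

# Toy count matrix (cells x genes); a made-up stand-in for adata.X.
counts = np.ones((6, 300), dtype=int)
counts[0, 150:] = 0      # cell 0 expresses only 150 genes
counts[:, 299] = 0
counts[2, 299] = 1       # gene 299 is detected in a single cell
counts[1, :10] = 50      # cell 1 has a high mitochondrial fraction

# Hypothetical mask: pretend the first 10 genes are mitochondrial.
is_mito = np.zeros(counts.shape[1], dtype=bool)
is_mito[:10] = True


def qc_filter(counts, is_mito, min_genes=200, min_cells=3, max_mito_frac=0.10):
    """Return boolean masks (keep_cells, keep_genes) for the listed thresholds."""
    # 1. Cells must express at least `min_genes` genes.
    genes_per_cell = (counts > 0).sum(axis=1)
    cells_ok = genes_per_cell >= min_genes
    # 2. Genes must be detected in at least `min_cells` of those cells.
    cells_per_gene = (counts[cells_ok] > 0).sum(axis=0)
    keep_genes = cells_per_gene >= min_cells
    # 3. Cells must have at most `max_mito_frac` of counts in mitochondrial genes.
    totals = counts.sum(axis=1)
    mito_frac = counts[:, is_mito].sum(axis=1) / np.maximum(totals, 1)
    keep_cells = cells_ok & (mito_frac <= max_mito_frac)
    return keep_cells, keep_genes


keep_cells, keep_genes = qc_filter(counts, is_mito)
filtered = counts[keep_cells][:, keep_genes]
print(filtered.shape)
```

In scanpy the first two steps correspond to sc.pp.filter_cells(adata, min_genes=200) and sc.pp.filter_genes(adata, min_cells=3); the sketch only spells out what those thresholds do.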

Then I just followed the tutorials. The AnnData object is registered with batch_key='Sample', because I noticed a very prominent batch effect among samples. The model is set up with either the default parameters or the ones I got from scvi_tuner.fit.

The dimensionality reduction result is confusing to me. If the Leiden resolution is set lower, there are overlapping clusters. If it is set higher, say 1 or 1.2, some clusters look very similar when I check their marker genes on PanglaoDB.

For cluster annotation, I tried both manual annotation (basically checking the cell type markers provided by Single-cell best practices) and automatic methods, namely celltypist (with majority voting) and label transfer with scANVI. I am confused by the marker genes of some clusters: when I look them up in PanglaoDB, blood cell lineages are ranked in the middle rather than at the top. celltypist gave very few cell types even when I tried a very high resolution such as 150. The label transfer result identified cell types with many overlaps.
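As I understand it, majority voting replaces each cell's predicted label with the most frequent prediction in its cluster. The toy sketch below illustrates that idea only; it is not celltypist's actual implementation, and the cluster IDs and labels are made up.

```python
from collections import Counter, defaultdict


def majority_vote(cluster_ids, predicted_labels):
    """Give every cell the most frequent predicted label within its cluster."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, predicted_labels):
        votes[cluster][label] += 1
    # Pick the dominant label of each cluster.
    winner = {cluster: counter.most_common(1)[0][0] for cluster, counter in votes.items()}
    return [winner[cluster] for cluster in cluster_ids]


# Made-up example: two clusters with a minority label in each.
clusters = ["0", "0", "0", "1", "1", "1", "1"]
per_cell = ["HSC", "HSC", "MPP", "Monocyte", "Monocyte", "Monocyte", "cDC"]
voted = majority_vote(clusters, per_cell)
print(voted)
```

In this toy case the minority labels (MPP, cDC) disappear after voting, which might be related to the small number of cell types I see, but I am not sure.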



For normalization, I set up the AnnData with size_factor_key, switching between size factors calculated manually and size factors from scran, while keeping the hyperparameters from autotune, for example:

import scvi

# Hyperparameters below were taken from autotune
vae1 = scvi.model.SCVI(
    adata_hv,
    n_layers=5,
    n_latent=60,
    gene_likelihood="nb",
)
vae1.train(
    max_epochs=400,
    # early_stopping_patience=5,
    early_stopping=True,
)

If I do adata_hv.layers['scvi_normalized'] = vae1.get_normalized_expression(library_size=1e4), the sums are much bigger than 1e4: array([ 830925.06, 791539.5 , 784703.25, ..., 1036273.4 , 952980.6 , 1025970.56], dtype=float32)
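For reference, this is a minimal sketch of how such sums can be checked and, as one possible workaround, rescaled so that each cell sums to 1e4. It assumes the values above are per-cell (row) sums, and it uses a small made-up array in place of the real output of get_normalized_expression; the helper names are hypothetical. Whether rescaling like this is the right fix is exactly what I am unsure about.

```python
import numpy as np


def per_cell_sums(expr):
    """Sum the normalized expression over genes for each cell (row)."""
    return np.asarray(expr, dtype=np.float64).sum(axis=1)


def rescale_to_library_size(expr, library_size=1e4):
    """Rescale each row so that it sums to `library_size` (all-zero rows stay zero)."""
    expr = np.asarray(expr, dtype=np.float64)
    sums = expr.sum(axis=1, keepdims=True)
    safe_sums = np.where(sums == 0, 1.0, sums)  # avoid division by zero
    return expr / safe_sums * library_size


# Made-up stand-in for np.asarray(adata_hv.layers['scvi_normalized']).
normalized = np.array([
    [400000.0, 300000.0, 130925.0],
    [500000.0, 200000.0, 91539.5],
])

before = per_cell_sums(normalized)
after = per_cell_sums(rescale_to_library_size(normalized))
print(before, after)
```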

I’m not sure how I can get better results. Please give me some advice. Thanks!