Struggling to understand the parameters for clustering and cell type annotation. Also confused about the normalization method

lconan · June 6, 2023, 1:29pm

Hi,
I’ve been working on Microwell-seq data from bone marrow cells of AML patients and healthy donors. I’ve been stuck at the very beginning: clustering and annotating the normal bone marrow cells, and the combined data of normal bone marrow (about 9k cells) and peripheral blood cells (about 80k cells after filtering). It seems to me that the Microwell-seq data is of low quality. The paper says they filtered out cells with fewer than 500 transcripts, but I find that only a few hundred cells pass that threshold instead of the several thousand they reported, even though I redid the upstream processing (cell barcode and UMI extraction, and alignment) myself. Still, since I have already spent this much time, I do want to finish the analysis…

For preprocessing:

Cells with fewer than 200 detected genes were filtered out
Genes expressed in fewer than 3 (or 10) cells were filtered out
Identify doublets with the SOLO model (run on each sample separately)
Reload the data and remove the doublets
Remove cells with > 10% of counts from mitochondrial genes

For feature selection:
I tried both scanpy's highly_variable_genes and highly deviant genes (HDGs) called by the scry package (recommended in Single-cell best practices).
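To make the filtering thresholds above concrete, here is a minimal NumPy sketch on a toy count matrix. It is only an illustration under stated assumptions: the data are simulated, the first 13 columns are a made-up stand-in for mitochondrial genes, and the doublet step is skipped. In scanpy the equivalent cell and gene filters would normally be `sc.pp.filter_cells(adata, min_genes=200)` and `sc.pp.filter_genes(adata, min_cells=3)`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy count matrix (cells x genes); real data would come from adata.X.
n_cells, n_genes = 500, 1000
counts = rng.poisson(0.3, size=(n_cells, n_genes))

# Assumption for illustration: the first 13 columns are mitochondrial genes.
is_mito = np.zeros(n_genes, dtype=bool)
is_mito[:13] = True

# 1) Keep cells with at least 200 detected genes.
genes_per_cell = (counts > 0).sum(axis=1)
counts = counts[genes_per_cell >= 200]

# 2) Keep genes detected in at least 3 of the remaining cells.
cells_per_gene = (counts > 0).sum(axis=0)
keep_genes = cells_per_gene >= 3
counts = counts[:, keep_genes]
mito_kept = is_mito[keep_genes]

# 3) Keep cells with at most 10% of their counts from mitochondrial genes.
total = counts.sum(axis=1)
mito_frac = counts[:, mito_kept].sum(axis=1) / np.maximum(total, 1)
filtered = counts[mito_frac <= 0.10]

print("cells x genes after filtering:", filtered.shape)
```

Printing the shape after each step on the real data is a quick way to see which threshold removes most of the cells.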

Then I just followed the tutorials. The AnnData is registered with batch_key='Sample', because I noticed a very prominent batch effect among samples. The model is set up either with default parameters or with those I got from scvi_tuner.fit…

The dimensionality reduction result is confusing to me. If the Leiden resolution is set lower, there are overlapping clusters. If it is set higher, say 1 or 1.2, some clusters look very similar when I check their marker genes in PanglaoDB.
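One way to put a number on "some clusters are very similar" is to compare the top marker gene sets of each pair of clusters. The sketch below uses the Jaccard index. The marker lists are hypothetical placeholders rather than results from this dataset, and the 0.5 threshold is arbitrary.

```python
from itertools import combinations

# Hypothetical top marker genes per cluster (placeholders, not real results).
markers = {
    "0": {"CD3D", "CD3E", "IL7R", "CCR7"},
    "1": {"CD3D", "CD3E", "IL7R", "LTB"},
    "2": {"CD79A", "MS4A1", "CD74", "CD79B"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Jaccard index for every pair of clusters.
overlap = {
    (c1, c2): jaccard(markers[c1], markers[c2])
    for c1, c2 in combinations(sorted(markers), 2)
}

# Pairs above the (arbitrary) threshold are candidates for merging.
similar_pairs = [pair for pair, score in overlap.items() if score >= 0.5]
print(similar_pairs)
```

Pairs with a high overlap at resolution 1 or 1.2 may simply be one population that was split.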

For cluster annotation: I tried manual annotation (basically checking the cell type markers provided by Single-cell best practices) and automatic methods, namely celltypist (with majority voting) and label transfer with scANVI. I am confused by the marker genes of some clusters: when I check them in PanglaoDB, the blood cell lineages are ranked in the middle rather than at the top. celltypist gave very few cell types even when I tried a resolution as high as 150. The label transfer result identified cell types with many overlaps.
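For reference, the idea behind majority voting can be sketched in a few lines: each cluster takes the most frequent per-cell prediction among its cells. This is a simplified illustration of the concept, not celltypist's actual implementation, and the function name `majority_vote` is made up for this example.

```python
from collections import Counter


def majority_vote(cluster_ids, predicted_labels):
    """Give every cell the most frequent predicted label of its cluster."""
    per_cluster = {}
    for cluster, label in zip(cluster_ids, predicted_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    winners = {c: counts.most_common(1)[0][0] for c, counts in per_cluster.items()}
    return [winners[c] for c in cluster_ids]


# Toy example: two clusters, five cells.
clusters = ["a", "a", "a", "b", "b"]
labels = ["T cell", "T cell", "B cell", "Monocyte", "Monocyte"]
voted = majority_vote(clusters, labels)
print(voted)
```

In this sketch the number of distinct labels after voting can never exceed the number of clusters, and minority labels inside a cluster disappear, so comparing the per-cell predictions before and after voting may show where cell types are being lost.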

For normalization: I set up the AnnData with size_factor_key, switching between size factors calculated manually and those from scran, while keeping the hyperparameters from autotune, like

vae1 = scvi.model.SCVI(adata_hv,
                      n_layers=5,
                      n_latent=60,
                      gene_likelihood="nb")
vae1.train(max_epochs=400,
          #early_stopping_patience=5,
          early_stopping=True)
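For context, a manually calculated size factor is often the total count per cell scaled to have mean 1 across cells. The sketch below shows that convention on a toy matrix. This is only an assumption about what "manual" means here, and other definitions (for example scaling by the median) are also in use.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(100, 50))  # toy cells x genes count matrix

# Library size = total counts per cell.
library_size = counts.sum(axis=1).astype(float)

# Size factor = library size scaled so that the mean across cells is 1.
size_factor = library_size / library_size.mean()

print(size_factor[:5])
```

On real data such a vector would be stored in a column of adata.obs and that column name passed as size_factor_key when registering the AnnData.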

If I run adata_hv.layers['scvi_normalized'] = vae1.get_normalized_expression(library_size=1e4), the per-cell sums are much bigger than 1e4: array([ 830925.06, 791539.5 , 784703.25, ..., 1036273.4 , 952980.6 , 1025970.56], dtype=float32)
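The check described above amounts to summing each row of the normalized matrix. The sketch below reproduces it on a toy array standing in for the output of get_normalized_expression, and shows how rows can be rescaled to a fixed total afterwards. It is a diagnostic only: it does not explain why the sums differ from 1e4, and whether rescaling is appropriate depends on how the model was set up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the normalized expression matrix (cells x genes).
normalized = rng.gamma(shape=2.0, scale=50.0, size=(5, 200))

target = 1e4

# Per-cell sums; compare these against the requested library size.
row_sums = normalized.sum(axis=1)
print(row_sums)

# Rescale each cell so that its values sum to the target.
rescaled = normalized / row_sums[:, None] * target
print(rescaled.sum(axis=1))
```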

I’m not sure how I can get better results… Please give me some advice. Thanks
