How to decide the optimal number of clusters when clustering cell subtypes

Hello All,
I am working on approximately five hundred thousand cells from brain tissue. I want to cluster the neuronal and non-neuronal subtypes.

For example, within the microglia cell type I would like to see homeostatic, proliferating and activated microglia. For this, I have subsetted the clusters that belong to microglia and re-clustered them using various resolution and n_neighbors values.

But I would like to know if there is an elegant way to find the optimal number of clusters for the subset of cells.

Here is the code:

    import anndata
    import matplotlib.pyplot as plt
    import scanpy as sc
    from matplotlib import rc_context

    # Read AnnData (sc.pl.umap below requires a UMAP embedding in adata.obsm["X_umap"])
    adata = anndata.read_h5ad("/home/Akila/integration/harmony/subset/celltype/microglia.h5ad")

    # Known microglia marker genes
    marker = {
        'Microglia-Homeostatic': ['CX3CR1', 'CSF1R', 'APBB1IP'],
        'Microglia-Activated-1': ['CD163', 'CD83'],
        'Microglia-Inflammatory': ['HLA-A', 'HLA-B', 'C3'],
        'Microglia-Proliferative': ['FAM111B'],
    }

    out_dir = "/home/Akila/integration/harmony/subset/celltype/neighbor/"
    resolutions = [0.05, 0.2, 0.4, 0.6]

    # Vary the number of neighbors (3 to 30)
    for n_neighbors in range(3, 31):
        sc.pp.neighbors(adata, use_rep="X_pca_harmony", n_pcs=18,
                        n_neighbors=n_neighbors)

        # Cluster at different resolutions on the current neighbor graph
        for res in resolutions:
            key = f"leiden_{res}"
            sc.tl.leiden(adata, resolution=res, key_added=key)

            # Save cluster plot (show=False keeps the figure open for savefig)
            with rc_context({'figure.figsize': (7, 7)}):
                sc.pl.umap(adata, color=key, add_outline=True, legend_loc='on data',
                           legend_fontsize=10, legend_fontoutline=2, frameon=False,
                           title='clustering of cells', palette='Set1', show=False)
                plt.savefig(f"{out_dir}{n_neighbors}{key}cluster_plot.png")
                plt.close()

            # Save marker dot plot
            sc.pl.dotplot(adata, marker, key, show=False)
            plt.savefig(f"{out_dir}{n_neighbors}{key}dotplot.png")
            plt.close()

But while doing this, I feel I am splitting the data arbitrarily depending on the resolution and the number of neighbors. In this case, should I use k-means and check the elbow plot to obtain the optimal number of clusters?

I have found the "clustree" method, which predicts the optimal clusters in R, but I am looking for Python-compatible methods. Can you please suggest how to proceed?

Thanks
Akila

Hi Akila,

Unfortunately, this is an extremely challenging problem. The more cells you have, the more clusters you can analyze. The question of when to stop clustering is partially philosophical and partially practical (how many microglia sub-types can you work with and describe?).

The strategies depend on the goals of the research. In some cases you know what kind of cell types you expect (for example, here you are expecting three microglia sub-types), and the goal is to estimate the proportions of these, or learn about their gene expression or responses to stimuli. In some cases the goal is to further subdivide known sub-classes to dissect major directions of variability.

Since you are expecting three classes of microglia, it seems to me that there are a couple of strategies to take: 1) use known markers for these sub-classes to divide your cells into them, then analyse them; or 2) do relatively high-resolution clustering, and merge the clusters that appear to share the characteristics of one of these sub-classes.
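To make the second strategy a bit more concrete, here is a minimal sketch of merging fine clusters by marker signal. It is only an illustration under several assumptions: the helper names `score_marker_sets` and `merge_clusters_by_markers` are hypothetical, the expression values are made up, the marker sets are shortened from the ones in your post, and the score is a plain mean over marker genes. With scanpy you would more likely compute per-cell scores with `sc.tl.score_genes` and take the cluster labels from `adata.obs`.

```python
import pandas as pd


def score_marker_sets(expr, markers):
    """Per-cell mean expression of each marker set (a simplified score).

    expr: cells x genes DataFrame; markers: dict of subtype -> gene list.
    Genes missing from expr are skipped.
    """
    scores = {}
    for subtype, genes in markers.items():
        present = [g for g in genes if g in expr.columns]
        if present:
            scores[subtype] = expr[present].mean(axis=1)
        else:
            scores[subtype] = pd.Series(0.0, index=expr.index)
    return pd.DataFrame(scores)


def merge_clusters_by_markers(clusters, scores):
    """Assign each fine cluster to the subtype with the highest mean score."""
    cluster_means = scores.groupby(clusters).mean()
    cluster_to_subtype = cluster_means.idxmax(axis=1)
    return clusters.map(cluster_to_subtype), cluster_to_subtype


# Toy data (made-up values): 6 cells, 4 high-resolution clusters
genes = ["CX3CR1", "CSF1R", "CD163", "CD83", "FAM111B"]
expr = pd.DataFrame(
    [[5, 4, 0, 0, 0],
     [4, 5, 1, 0, 0],
     [5, 5, 0, 1, 0],
     [0, 1, 5, 4, 0],
     [1, 0, 4, 5, 1],
     [0, 0, 1, 0, 5]],
    columns=genes,
    index=[f"cell{i}" for i in range(6)],
    dtype=float,
)
clusters = pd.Series(["0", "0", "1", "2", "2", "3"], index=expr.index)

markers = {
    "Homeostatic": ["CX3CR1", "CSF1R"],
    "Activated": ["CD163", "CD83"],
    "Proliferative": ["FAM111B"],
}

scores = score_marker_sets(expr, markers)
merged, cluster_to_subtype = merge_clusters_by_markers(clusters, scores)
print(cluster_to_subtype.to_dict())
print(merged.value_counts().to_dict())
```

In this toy case fine clusters "0" and "1" both collapse into the homeostatic group, so the final number of groups is set by the biology you expect rather than by the resolution parameter. Clusters where no marker set clearly wins are worth inspecting by hand before merging.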

Hope this helps!
/Valentine