Clustering subsets of cells


I have some datasets I would like to integrate, then select a few cell types that interest me and recluster them. However, I think I might have a problem the second time I select variable genes and train the model, because I’m not sure whether starting from the normalized data is appropriate.

I ran this to normalize the expression, save the normalized values, select variable genes, and cluster downstream.

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
adata.raw = adata  # keep full dimension safe

Then I selected the clusters that interested me to cluster them again, with

adata2 = adata[adata.obs['leiden_0.6'].isin(['1', '5', '4'])]

Because I probably need a new set of variable genes, I used the block below to get all genes back.

adata2 = adata2.raw.to_adata()

These values are normalized:

   (0, 18)	2.37902
   (0, 20)	2.37902
   (0, 47)	2.37902
   (0, 68)	2.37902
   (0, 84)	3.0247393

Finally, I ran this block to cluster these cells of interest again. I commented out the normalization step, as the values are already normalized.

adata2.layers["counts"] = adata2.X.copy()
adata2.raw = adata2  # keep full dimension safe
#sc.pp.normalize_total(adata2, target_sum=1e4)

There are 2 reasons I think something went wrong.
1 - all cells are too overlapped
2 - this warning

UserWarning: Make sure the registered X field in anndata contains unnormalized count data.

I assume the normalization should be performed with all cells present, which is why I decided to save normalized values instead of counts. On the other hand, when I instead save the raw counts by running

adata.raw = adata # keep full dimension safe

before the normalization, the cells are still too overlapped (they are not overlapped in the first clustering step).
Is there anything I am missing?


A few things

  1. You should use the seurat_v3 flavor for HVG selection, especially when giving it the count data.
  2. If I understand correctly, you want to rerun scVI on a subset of your data. Have you tried just subclustering using Scanpy’s API? In many cases I would not expect the result to fundamentally change (subclustering on the full latent space compared with recomputing the model)

Hi Adam

  1. I will make sure to use seurat_v3 flavor, thank you
  2. I assumed it would be adequate to use the same method for subclustering, and I have noticed many articles that looked for subpopulations of a cell type did something similar. Moreover, since those cells were integrated by scVI, won’t scanpy’s clustering keep the batch effect?
  3. If the results won’t change, can I find potential subpopulations by simply increasing leiden resolution?

If memory allows, the following should be possible (please correct me if I’m wrong):

  • after loading your adata with raw counts, make a copy of it
  • use your original adata to find clusters on level 0
  • add your level 0 annotations to the previously saved copy of your original adata
  • subset according to your clusters, and you will have an object with raw counts that only contains a subcluster of your choice. You can then repeat the standard workflow.

Probably not a very elegant way though.