Comparing clusters across different AnnData objects using a dendrogram

Hi there,

I am currently trying to track a cluster across several timepoints of scRNA-seq data stored as AnnData objects. Essentially, given two AnnData objects, A1 and A2, I want to ask: for cluster 1 in A1, which cluster in A2 is it most similar to?

I tried to do this with a dendrogram: first computing the mean expression per cluster for all genes shared between the two timepoints, then calculating the pairwise distances between those cluster means to build a distance matrix, and finally performing a linkage and plotting a dendrogram with the scipy.cluster.hierarchy package.
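For concreteness, here is a minimal sketch of that comparison, restricted to A1-versus-A2 distances only. The toy expression tables, the cluster labels, and the helper name `cluster_means` are made up for illustration; with real data the tables could come from something like `adata.to_df()`. The correlation metric is just one possible choice.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def cluster_means(expr: pd.DataFrame, clusters: pd.Series) -> pd.DataFrame:
    """Mean expression per cluster (rows: clusters, columns: genes)."""
    return expr.groupby(clusters.to_numpy()).mean()

# Toy cells-by-genes tables standing in for A1 and A2 (hypothetical values).
expr1 = pd.DataFrame(
    [[5, 0, 1, 2], [4, 0, 2, 2], [0, 6, 1, 3], [0, 5, 0, 3]],
    columns=["g1", "g2", "g3", "gX"], dtype=float,
)
clusters1 = pd.Series(["1", "1", "2", "2"])
expr2 = pd.DataFrame(
    [[0, 5, 1, 7], [1, 6, 0, 7], [5, 1, 2, 7], [6, 0, 1, 7]],
    columns=["g1", "g2", "g3", "gY"], dtype=float,
)
clusters2 = pd.Series(["1", "1", "2", "2"])

# Keep only the genes present in both timepoints.
shared = expr1.columns.intersection(expr2.columns)
means1 = cluster_means(expr1[shared], clusters1)
means2 = cluster_means(expr2[shared], clusters2)

# cdist compares every A1 cluster with every A2 cluster and nothing else,
# so there are no within-dataset comparisons in the result.
dist = pd.DataFrame(
    cdist(means1.to_numpy(), means2.to_numpy(), metric="correlation"),
    index=[f"C{c}-A1" for c in means1.index],
    columns=[f"C{c}-A2" for c in means2.index],
)

# For each A1 cluster, the closest A2 cluster.
best_match = dist.idxmin(axis=1)
print(dist.round(3))
print(best_match)
```

Reading the rows of `dist` directly answers "which A2 cluster is closest to A1 cluster 1" without building a joint dendrogram.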

However, I do not think this is yielding good results, and I was wondering whether there are any built-in ways to do this in scanpy. I know that scanpy can build a dendrogram of the clusters within a single AnnData object, but how do you do this with multiple AnnData objects? Also, how do you do it such that you only compare clusters between the two AnnData objects and not clusters within the same one? I hope this makes sense. I appreciate the help!

One thing you could try is to integrate them using any of the methods available in scanpy, for example harmony:
https://scanpy.readthedocs.io/en/stable/generated/scanpy.external.pp.harmony_integrate.html

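What I would do is first merge the two objects using AnnData’s concatenate method (with join="outer", otherwise you lose the genes that are not shared). Note that in recent anndata versions this method is deprecated in favor of anndata.concat, which takes the same join argument.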
https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.concatenate.html#anndata.AnnData.concatenate

Then you generate a label for each cluster and AnnData, for example:

  • cluster 1 in AnnData 1: “C1-A1”
  • cluster 1 in AnnData 2: “C1-A2”
  • cluster 2 in AnnData 1: “C2-A1”
  • cluster 2 in AnnData 2: “C2-A2”

You then use these labels as the grouping key when running Harmony.
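As a rough sketch, the combined labels can be built from two per-cell columns. The table below stands in for the `.obs` of the merged object, and the column names `cluster`, `dataset`, and `origin` are hypothetical.

```python
import pandas as pd

# Stand-in for merged.obs: one row per cell (hypothetical column names).
obs = pd.DataFrame({
    "cluster": ["1", "2", "1", "2"],
    "dataset": ["A1", "A1", "A2", "A2"],
})

# Build labels of the form "C<cluster>-<dataset>", e.g. "C1-A1".
obs["origin"] = "C" + obs["cluster"].astype(str) + "-" + obs["dataset"].astype(str)
print(obs["origin"].tolist())

# With scanpy and harmonypy installed, the column name would then be passed
# as the `key` argument of sc.external.pp.harmony_integrate (not run here).
```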

After running harmony, you can recompute the clustering on the new integrated space and check how many cells belong to their original clusters. Example:

  • New cluster 1: 50% of cells coming from C1-A1 and 50% from C1-A2, meaning that this new cluster is most likely the old C1.
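One way to run this check is a cross-tabulation of new cluster against original label. This is only a sketch: the table stands in for `.obs` after re-clustering, and the column names `new_cluster` and `origin` are made up.

```python
import pandas as pd

# Stand-in for the per-cell annotations after re-clustering (hypothetical).
obs = pd.DataFrame({
    "new_cluster": ["0", "0", "0", "0", "1", "1"],
    "origin": ["C1-A1", "C1-A1", "C1-A2", "C1-A2", "C2-A1", "C2-A2"],
})

# Fraction of each new cluster's cells that come from each original label;
# normalize="index" makes every row sum to 1.
composition = pd.crosstab(obs["new_cluster"], obs["origin"], normalize="index")
print(composition)
```

A new cluster whose row is split between C1-A1 and C1-A2 suggests those two original clusters correspond to each other.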

However, since you mention that your data has timepoints, there may be a better way to account for them. In any case, this integration is the first thing I would try.


This is an amazing idea, and thank you so much for providing the means for doing it. I will give it a shot!
