Hi everyone,
When I change the order of anndata objects in anndata concat, like the following:
adata1 = ad.concat([day1_data, day2_data, day3_data], join='outer')
adata2 = ad.concat([day2_data, day1_data, day3_data], join='outer')
And doing basic processing with normalization, PCA, neighbouring graph, UMAP, and leiden clustering, I got the same UMAP (at least no visual differences I can spot) but fairly significant clustering differences. Like some clusters being merged, some new ones pop up, etc. Did anyone else notice a similar thing? Also, I wonder why?
Hi @osmanmerdan, are you using igraph
or leidenalg
? Could you share an environment listing i.e., import scanpy; scanpy.logging.print_versions()
for where this happens? From the igraph R documentation, I could see the algorithm having this property “The Leiden algorithm consists of three phases: (1) local moving of nodes, (2) refinement of the partition and (3) aggregation of the network based on the refined partition.”
Dear @ilan-gold , I have included an example Jupyter notebook using a publicly available dataset. In my case, I was using Scanpy 1.11.0 and anndata 0.11.1, and I saw this behavior using both the leidenalg
and igraph
implementations. In the example I have included, I observed that the PCA analysis yielded different floating-point values for the two datasets, which may be the reason for the differing end results. Google Colab
Hmm @osmanmerdan this strikes me as a bug with scipy’s PCA, no? Maybe a bug report there would be a good start? I think if you cut down your example to just that final cell with arpack
using scipy
directly, you have a great bug report
Hi @ilan-gold, I have opened a discussion here. And it turns out that the np.float32
precision in the normalized count matrix, plus how arpack
works, can cause these kinds of numerical inconsistencies. So, I have converted the adata.X
data type to np.float64
and ran the clustering again.
Here is the interesting part for me: now the PCA values are equal within the 1e-12
limits. However, neighbouring/ clustering is still different. .obsp['distances']
.obsp['connectivities']
entries are different.
I went further and changed the method to compute connectivities from method="umap"
to method="gauss"
. But it also resulted in different connectivity arrays.
Updated notebook: Google Colab
Hi @osmanmerdan I have a hunch that the leiden
algorithm is not stable up to ordering. From wikipedia I see that there is a “Local Moving of Nodes” phase which I appears to be unstable up to ordering. Does this make sense?
While I don’t know a whole lot about the inner workings of the algorithms underlying umap
or neighbors
, I assume given that neighbors
relies on ANN (specifically, the default is using PyNNDescentTransformer
), the ordering again matters because of the approximate nature. For example, I notice the there are a factor of 10 more difference in connectivities than distances, which really signals to me that this is down to some sort of tie-breaking that is getting propagated down from the distances to the connectivities (and then to leiden
and umap
). So distances are slightly different, which means connectivities are too, which means leiden is even worse etc.
Here’s a quick sample that I think highlights this fact:
import numpy as np
from pynndescent import PyNNDescentTransformer
from sklearn.datasets import make_blobs
from sklearn.utils import shuffle
n_samples = 1000
n_features = 10
n_neighbors = 10
random_state = 42
X, _ = make_blobs(n_samples=n_samples, n_features=n_features, random_state=random_state)
transformer1 = PyNNDescentTransformer(n_neighbors=n_neighbors, random_state=42)
X_nn_1 = transformer1.fit_transform(X)
X_shuffled, indices = shuffle(X, np.arange(len(X)), random_state=123)
transformer2 = PyNNDescentTransformer(n_neighbors=n_neighbors, random_state=42)
X_nn_2 = transformer2.fit_transform(X_shuffled)
inverse_indices = np.zeros_like(indices)
inverse_indices[indices] = np.arange(len(indices))
X_nn_2_mapped = X_nn_2[inverse_indices, :][:, inverse_indices] # reorder both rows and columns
np.testing.assert_equal(X, X_shuffled[inverse_indices]) # should be equal
np.testing.assert_almost_equal(X_nn_1.toarray(), X_nn_2_mapped.toarray()) # off by 174 for me
Short of going into the scanpy code to check the output of nndescent output, I think this probably signals the issue really is the ordering, and it starts as soon as you compute neighbors.