Can I use the package scib-metrics on methods that don't output an embedding?

Hello,

I am currently testing several tools to integrate my data, and I would like to compare them using the metrics computed by scIB.

I want to compare outputs of Seurat RPCA, Scanorama and scVI.
As far as I understand, while Scanorama and scVI output a low-dimensional embedding of the data, Seurat RPCA doesn't.

I want to use the scib-metrics package to benchmark the different integrations, but the documentation seems to suggest that this package currently works only on embedding-based methods:

In the tutorial we find:

Here we run a few embedding-based methods. By focusing on embedding-based methods, we can substantially reduce the runtime of the benchmarking metrics.

In principle, graph-based integration methods can also be benchmarked on some of the metrics that have graph inputs. Future work can explore using graph convolutional networks to embed the graph and then using the embedding-based metrics.

In the scib_metrics.benchmark.Benchmarker function documentation:
**embedding_obsm_keys** – List of obsm keys that contain the embeddings to be benchmarked.

This suggests I wouldn't be able to compare the output of RPCA with the others.
So my main question is: can I use the package scib-metrics on methods that don't output an embedding?

The original scib package was used to compare more integration tools than just the embedding-based ones. I tried installing it, but it conflicts with my version of pandas. I'll try to solve that (I guess I'll have to set up proper conda environments), but scib-metrics seemed more straightforward for what I was trying to do.

Any help would be appreciated !

From my understanding, it seems like Seurat RPCA is able to output a lower-dimensional representation of the data, right? If this is the case, you should be able to feed this into scib-metrics just like any embedding method. For reference, by default we compute PCA on the raw counts as a “benchmark” embedding in scib-metrics.
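
As a sketch of what that comparison could look like, assuming the embeddings are already stored in `adata.obsm` (the `.obs` column names and obsm keys below are hypothetical, and the call follows the API shown in the scib-metrics documentation):

```python
def run_benchmark(adata, embedding_keys):
    """Compare several integration embeddings with scib-metrics (sketch)."""
    # Deferred import: requires the scib-metrics package to be installed
    from scib_metrics.benchmark import Benchmarker

    bm = Benchmarker(
        adata,
        batch_key="batch",                    # hypothetical .obs column names
        label_key="cell_type",
        embedding_obsm_keys=embedding_keys,   # e.g. ["X_scvi", "X_scanorama", "X_rpca"]
    )
    bm.benchmark()
    bm.plot_results_table(min_max_scale=False)
    return bm
```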

Hello,

Thank you for your answer!

So, I mistakenly thought that RPCA only output corrected gene expression, but I was wrong: a new cell embedding is indeed computed by Seurat RPCA.

For anyone interested, in Seurat V5 (5.0.1):
After running:

#R
se <- IntegrateLayers(
  object = se, method = RPCAIntegration,
  orig.reduction = "pca", new.reduction = "integrated.rpca",
  verbose = FALSE
)

The integrated embeddings are available in the DimReduc “integrated.rpca” (or whatever name you gave it), and the cell embedding matrix can be extracted easily with:

#R
rpca_embed <- Embeddings(se, reduction = "integrated.rpca")

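To get that matrix into Python, one option (a sketch; the filename and obsm key are hypothetical) is to export it to CSV from R, e.g. with `write.csv(rpca_embed, "rpca_embed.csv")`, and read it back on the Python side:

```python
import csv

def load_embedding(path):
    """Read a cells x dimensions CSV exported from R (first column = barcode)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        barcodes, matrix = [], []
        for row in reader:
            barcodes.append(row[0])
            matrix.append([float(x) for x in row[1:]])
    return barcodes, matrix

# With an AnnData object (requires anndata/numpy; key name is hypothetical):
# import numpy as np
# barcodes, matrix = load_embedding("rpca_embed.csv")
# adata = adata[barcodes].copy()              # align cell order with the embedding
# adata.obsm["X_rpca"] = np.asarray(matrix)
```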
@mlebel Hello, were you able to run the tool? How did you import rpca_embed into the adata object in Python? I tried to use the tool, and the progress bar is not moving at all.

I assume you also created a GitHub issue; let me respond here. Your dataset is rather large (1.3 million cells). The implementation in scib-metrics is more performant than the original one, but computing the metrics is still expensive (NN graphs with a high number of neighbors).
We have GPU support within scib-metrics, which requires installing a GPU version of JAX; you can check that a GPU is actually being used. Otherwise, it is safe to downsample to e.g. 100k cells and compute the metrics on this subset. Above 300k cells, running on CPU will take a very long time.
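Downsampling can be done on the AnnData side after importing the embeddings, which sidesteps the Seurat assay question: subsetting an AnnData by cell indices subsets the `.obsm` embeddings consistently. A minimal sketch of drawing a random subset of cell indices (stdlib only; applying it to `adata` is shown in comments):

```python
import random

def subsample_indices(n_obs, target, seed=0):
    """Pick `target` distinct cell indices out of `n_obs`, without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_obs), target))

# Applied to an AnnData object (requires anndata; sketch):
# idx = subsample_indices(adata.n_obs, 100_000)
# adata_sub = adata[idx].copy()   # matrices in .obsm are subset along with the cells
```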

I see, thank you. What is the expected time for the 1.3M cells to finish on GPU (comparing 5 methods)? I am not sure how downsampling the Seurat object affects the integrated assay. Perhaps the integrated assay will remain the same size while the number of cells in the object becomes smaller than the array (unequal sizes), and it will throw errors. Or I might be mistaken.

It depends a bit on the size of each sample, but expect it to take roughly 30 minutes per embedding.