scGen generates irreproducible output

Hi guys,

I am trying to perform batch correction on my datasets using scGen:
https://scgen.readthedocs.io/en/stable/tutorials/scgen_batch_removal.html

But every time I run the code it generates different UMAP plots, so the training step is not reproducible.
Do you know any line of code that can help fix the randomness?

Do you have pytorch installed with GPU support? If yes, perhaps it is related to this issue I opened in May:

Random state not reproducible. · Issue #2480 · scverse/scanpy (github.com)

The problem is associated with recent versions of scanpy/anndata. Only using pytorch with CPU support solved the problem for me. Not sure yet why this happens.
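If you are not sure which build you have, here is a quick way to check (a hedged sketch, not scGen functionality; the function name `gpu_status` is mine). If PyTorch can see a CUDA device, some GPU kernels are non-deterministic by default, which fits the behavior described above:

```python
import importlib.util

def gpu_status():
    """Report whether PyTorch (if installed) can see a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch not installed"
    import torch
    # True means training will likely run on the GPU.
    return f"CUDA available: {torch.cuda.is_available()}"

print(gpu_status())
```

If this prints `CUDA available: True`, you are on a GPU build and switching to the CPU-only build is one way to rule out GPU non-determinism.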

Hi @ddiez,

Thank you for your response, I tried to add:

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

but it does not seem to solve the problem. It still generates varying UMAP plots.

I think this command should have some option to fix the random seed:
model.train(
    max_epochs=2,
    batch_size=32,
    early_stopping=True,
    early_stopping_patience=25,
)
but I got stuck here. I tried seed_everything(0, workers=True) from PyTorch Lightning, but that does not seem to work either.

Hi @nxquybms,

Actually, what I meant is the opposite. Instead of installing pytorch-cuda you need the cpu variety. For example, this is what I have installed in an environment without the problem I reported in that issue:

pytorch                   2.0.0           cpu_py310hd11e9c7_0    conda-forge
pytorch-lightning         2.0.4              pyhd8ed1ab_0    conda-forge

Since the problem you report is in the construction of the UMAP, it is possible that the problem is not with scGen but with UMAP. This is what I am reporting in the issue above, and I think it might be the same issue here: scGen may be returning consistent results to you, but UMAP (and Leiden) give non-reproducible results.

Another possible reason for your problem is that you are using scvi-tools >= 1.0.0, in which case you need to set the seed yourself directly with:

import scvi
scvi.settings.seed = 0 # Or whatever number you like.
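For illustration, this is the general principle behind setting a seed (a minimal stand-alone sketch using Python's random module, not scvi itself; `sample_with_seed` is a made-up name): the same seed reproduces the same sequence of draws.

```python
import random

def sample_with_seed(seed, n=5):
    # An independent generator seeded explicitly, analogous in spirit
    # to scvi.settings.seed fixing scvi-tools' RNG state.
    rng = random.Random(seed)
    return [rng.randint(0, 100) for _ in range(n)]

# Same seed -> identical draws on every run.
assert sample_with_seed(0) == sample_with_seed(0)
```

The catch is that this only helps if every source of randomness in the pipeline actually respects the seed, which is why GPU non-determinism can still break reproducibility.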

Hi @ddiez,

Thank you for your response and the code.
Can you explain more about installing the CPU variety? (Sorry, this is not my area of expertise.)
I actually tried setting other random states, including scvi.settings.seed = 0, but that does not seem to work either. I don't think the problem is in the construction of the UMAP: when I re-run the UMAP code alone, it keeps the same visualization. However, when I re-run the model training step, the UMAP changes.
I made a small Google Colab notebook that makes the problem easier to see. If you have some spare time, please have a quick look; I highly appreciate your feedback.

(I went a bit overboard here and set the random state in every code cell, just for checking.)

Best,

Hi @nxquybms,

Thanks for the Colab notebook, it was very useful. With it I managed to solve, somehow, the issue I was reporting about scanpy, and I also think I know why you are getting different results after running everything again.

First, about the scanpy issue, FYI: I tested my problem again in Google Colab and could not reproduce it. Then I tested again on my system and found that the issue only occurs when I install things with conda. If I install everything with pip (like in the Colab notebook), then everything works as expected. So I think that issue is completely unrelated to yours. Sorry for the noise, and thank you for giving me a clue to solving it (although I still do not know why it happens with conda…).

About your problem, I tried myself running the perturbation and batch correction workflows several times and was able to get the same UMAP/leiden clusters.

I think in your case it does not work because you are not really repeating everything from scratch: you are reusing the model object that you trained before. scvi will not start training from zero; it will continue training from the previous model's state. You can check this by plotting the training/validation loss per epoch for the first and second runs.

If you completely reinitialize your model object before training, I expect you will get reproducible results:

model = scgen.SCGEN(train)
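To see why this matters, here is a toy illustration (a hypothetical stand-in, not scGen's actual API): calling train() twice on the same object continues from the accumulated state, while a freshly constructed object with the same seed reproduces the first run exactly.

```python
import random

class ToyModel:
    """Minimal stand-in (not scGen's API) showing why reusing a
    trained model object gives different results on re-training."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)  # seeded at construction only
        self.weight = 0.0

    def train(self, steps=3):
        # Each step nudges the weight; state carries over between calls.
        for _ in range(steps):
            self.weight += self.rng.random()
        return self.weight

m = ToyModel(seed=0)
first = m.train()
second = m.train()          # reusing the object: training continues
assert first != second      # so the "re-run" gives a different result

fresh = ToyModel(seed=0).train()  # reinitialize before training
assert fresh == first             # now the run is reproducible
```

The same logic applies here: constructing a new scgen.SCGEN object before each model.train() call resets the state that would otherwise carry over.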

Hi @ddiez,

Oh wow, my plot is reproducible now. It's great that you also solved yours somehow.

Thank you so much!
Have a nice week

Best,