get_normalized_expression causes my kernel to disconnect

Hey everyone! Fairly new user of scvi-tools and scANVI here.

Every time I run get_normalized_expression on my scANVI query model, my kernel disconnects (which I believe is likely a memory issue?).

This is actually my second time running the model; the first time, get_normalized_expression ran with no issues. Between the two runs, I updated scvi-tools from 1.3.1.post1 to 1.3.3. The main changes between the old and new models are:

  • the new model has linear_classifier = True
  • the new model has max_epochs = 100 instead of 20

(I implemented both of these changes in the scANVI reference model, but I am running get_normalized_expression on the query model. I made these changes based on suggestions I had seen, because my prediction accuracy on the reference data was previously quite low.)

Everything else stayed the same, including the number of cells and genes. The anndata object I am working with has ~1.5 million cells and 5000 genes.
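For concreteness, the new reference model was set up roughly like this (adata_ref, "cell_type", "Unknown", and "batch" are placeholders for my actual object and field names, not the exact code):

```python
import scvi

# Register the reference AnnData; key names here are placeholders
scvi.model.SCANVI.setup_anndata(
    adata_ref,
    labels_key="cell_type",
    unlabeled_category="Unknown",
    batch_key="batch",
)

# The two changes mentioned above: linear classifier head and more epochs
model_ref = scvi.model.SCANVI(adata_ref, linear_classifier=True)
model_ref.train(max_epochs=100)
```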

For troubleshooting, I tried changing the batch size from 128 to 32 for get_normalized_expression, which didn’t help. I also tried running it with a limited gene_list; it worked for both 10 and 20 genes. I haven’t tried anything greater than 20, other than the full run on all 5000 genes.
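The calls look roughly like this (model_query and adata_query are placeholders for my actual objects):

```python
# Full run on all genes, which crashes the kernel
norm = model_query.get_normalized_expression(adata_query, batch_size=32)

# Limited gene list (10 or 20 genes), which completes fine
small = model_query.get_normalized_expression(
    adata_query,
    gene_list=adata_query.var_names[:20].tolist(),
    batch_size=32,
)
```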

Please give me suggestions and troubleshooting tips!

Hello,

I understand that the model was trained successfully (on the reference? the query? both?) and that the issue is with memory during get_normalized_expression on the query. Reducing the number of genes helps, which strengthens the case that it's a memory issue.

I wouldn’t reduce it that much, though; a common practice is to select the top ~2,000 highly variable genes.
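For example, a minimal sketch of HVG selection with scanpy (the parameter choices here are only illustrative, and the seurat_v3 flavor expects raw counts):

```python
import scanpy as sc

# Keep only the top ~2,000 highly variable genes before training
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=2000,
    flavor="seurat_v3",
    subset=True,
)
```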

To save memory, first of all you can split your workflow into two (or even three) scripts, separating the training and query parts. Once your model is trained (it seems it is), save it and clear the memory of that script/notebook (and anything else redundant on the machine), then load it back and continue in a fresh script. If you prefer to do it all in the same script, you can delete redundant objects and call gc.collect() along the way; it helps.
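Something along these lines (the path and variable names are placeholders):

```python
import gc
import scvi

# --- end of the training script ---
model_ref.save("scanvi_ref_model", overwrite=True)
del model_ref
gc.collect()

# --- fresh script/notebook for the query part ---
model_ref = scvi.model.SCANVI.load("scanvi_ref_model", adata=adata_ref)
```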

Another thing you can do is remove the validation step, if there is one (make sure check_val_every_n_epoch is None and early_stopping is False). You also mentioned the AnnData size, but is it used for training or for the query? If it's the query, maybe you can process it in smaller chunks.
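For the training call, that would look roughly like this (a sketch only; argument names as in scvi-tools' train()):

```python
# Disable the validation loop and early stopping to avoid the extra pass
model_ref.train(
    max_epochs=100,
    check_val_every_n_epoch=None,
    early_stopping=False,
)
```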

As for get_normalized_expression itself, you can also run it in chunks (you can specify the indices to run on), though this may amount to the same thing as chunking the query data mentioned above.
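A rough sketch of chunking by indices (the chunk size here is arbitrary, and model_query/adata_query are placeholder names):

```python
import numpy as np
import pandas as pd

chunk_size = 100_000
results = []
for start in range(0, adata_query.n_obs, chunk_size):
    idx = np.arange(start, min(start + chunk_size, adata_query.n_obs))
    results.append(
        model_query.get_normalized_expression(
            adata_query, indices=idx, batch_size=128
        )
    )
normalized = pd.concat(results)
```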

All of these actions can help you save memory.

Running 1.5M cells with 5K genes requires a lot of memory, so monitor consumption while your code runs, with htop for example.

Finally, go over the reference-mapping tutorials for scANVI and check that your workflow matches.