Hi! Thank you for the great program. I’ve been using totalVI very often on small datasets (~40 samples) with no problem. Recently we’ve acquired a CITE+RNA dataset of more than 100 samples with 10,000 cells each. The data were generated in pools of 4-5 samples each, in batches of 4 pools (~20 samples). I believe totalVI has to be run on the full data at once, but the size of the dataset is pushing us to our computing limit. If running totalVI by batch is not an option, what would you recommend to make it less challenging?
Any help would be much appreciated! Thank you again for the amazing program.
Hi, thanks for your question. How long does it currently take to train totalVI on the full dataset? Or are you running into memory issues when loading the whole dataset? If your main concern is the total runtime of the algorithm, you could try the following to potentially reduce training time:
- Since `early_stopping` is enabled by default in `TOTALVI`, you can modify its parameters such as `early_stopping_patience`, which controls how many epochs training continues without improvement in the monitored metric before early stopping kicks in. Lowering this value will most likely decrease the number of training epochs. See more here. This should be adjusted after inspecting the validation loss.
- Increase the `batch_size`. The default is set to `256`, which is typically not enough to make full use of GPU memory. Increasing this value and monitoring GPU utilization can potentially improve runtime.
- Increase the learning rate to speed up convergence. This should also be done carefully by inspecting loss curves and making sure the model does not diverge.
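
As a rough sketch, the three suggestions above could be combined into one set of training arguments. This assumes a recent scvi-tools release in which `TOTALVI.train()` accepts `batch_size`, `lr`, `early_stopping`, and `early_stopping_patience`; the specific values and the helper name `train_totalvi` are illustrative placeholders, not recommendations, so check them against your own loss curves and GPU utilization.

```python
# Illustrative starting values (assumptions, not tuned recommendations).
TRAIN_KWARGS = {
    "batch_size": 1024,             # larger than the default of 256
    "lr": 8e-3,                     # hypothetical raised learning rate
    "early_stopping": True,         # already the default for TOTALVI
    "early_stopping_patience": 20,  # hypothetical lowered patience
}


def train_totalvi(model, **overrides):
    """Call model.train() with the settings above; keyword overrides win.

    `model` is expected to be an already set-up scvi.model.TOTALVI instance.
    Returns the keyword arguments that were actually passed to train().
    """
    kwargs = {**TRAIN_KWARGS, **overrides}
    model.train(**kwargs)
    return kwargs
```

With a model created as usual, this would be called as `train_totalvi(model)` or, to try a different batch size, `train_totalvi(model, batch_size=2048)`. After training, the recorded losses (in recent versions, available through `model.history`) are what you would inspect before changing the patience or learning rate further.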