I’m working on a custom implementation with scvi-tools and have a design question regarding the best way to handle specific points selected during training.
In my setup, at the end of each epoch, I select a subset of points that will be used for loss calculation in the subsequent epoch (using the loss method of the VAE module). The selection of these points is implemented through a custom TrainingPlan.
I’m considering two possible approaches to manage these points throughout the training process:

1. Add them as attributes of the VAE module. For instance, during the initialization of the SCVI model, this would look like: self.module.new_points = new_pts.
2. Store them in adata, using something like adata.uns['_scvi_new_points']. In this case, I believe I would also need to modify the __getitem__ method of the AnnTorchDataset class.
Which approach would you recommend? Are there any advantages, drawbacks, or potential pitfalls for each? If there’s a better alternative, I’d love to hear about it.
Hi, you would want to use a custom Lightning callback for this (if I understand correctly, after each epoch you select a new set of cells for the subsequent epoch). This callback has to create a new data split and dataloader using custom indices (passed to the DataSplitter). I would recommend the on_train_epoch_start hook of the Callback class (see the PyTorch Lightning 2.5 Callback documentation), which sounds like the correct handle here.
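Just to make that concrete, below is a minimal sketch (not tested against a specific scvi-tools version). It assumes the DataSplitter rebuilds its train dataloader from its train_idx attribute and that you pass reload_dataloaders_every_n_epochs=1 to the Trainer so the loader is actually recreated each epoch; the indices are set at epoch end so they are in place before the next epoch's reload. Adapt the names to your setup.

```python
import numpy as np
from lightning.pytorch.callbacks import Callback


class ResampleIndicesCallback(Callback):
    """Pick a new set of training cells at the end of each epoch; the train
    dataloader built from these indices is then used for the next epoch."""

    def __init__(self, datasplitter, n_obs_total, n_select):
        self.datasplitter = datasplitter  # the scvi-tools DataSplitter in use
        self.n_obs_total = n_obs_total
        self.n_select = n_select

    def on_train_epoch_end(self, trainer, pl_module):
        # Replace this random draw with whatever selection criterion you use.
        new_idx = np.random.choice(self.n_obs_total, size=self.n_select, replace=False)
        # The dataloader is rebuilt from train_idx when Lightning reloads it
        # (requires reload_dataloaders_every_n_epochs=1 on the Trainer).
        self.datasplitter.train_idx = new_idx
```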
Thank you for your response and suggestion! Let me clarify my use case a bit further:
The selected points are not directly used for training, so they don’t need to be included in the training data loader. Instead, these points are specifically used to calculate some parameters at the beginning of each epoch, which are then utilized in the loss calculation throughout the current epoch. Therefore, I need access to these points at the start of every epoch to compute the necessary parameters.
At the end of each epoch, these points are updated. They are generated using the decoder and exist in the data space (with a shape of (K, n_cells)), but there’s no guarantee that they correspond to actual points in the training set.
Given this setup, do you think the custom Lightning callback is still the best approach? Or would you suggest a different mechanism to manage and update these points efficiently across epochs?
Yes, everything you want to do at the beginning of each epoch (or step) should be handled by a Callback (and by Lightning in general). This creates the smallest overhead, and you don’t want to modify the module or the data during training.
Another option would be to generate them during each step and define them in the decoder call. If you compute them once per epoch, just pass them to the loss function (you can check our current kl_weight schedule for how to implement the Callback).
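A hedged sketch of that per-epoch variant, assuming your custom TrainingPlan forwards its loss_kwargs dict to module.loss() (the same channel the kl_weight schedule uses) and that your loss() accepts a new_points keyword; generate_points is a hypothetical stand-in for however you produce the (K, n_cells) array from the decoder:

```python
import torch
from lightning.pytorch.callbacks import Callback


class UpdateLossPointsCallback(Callback):
    """Recompute the (K, n_cells) points once per epoch and route them to the loss."""

    def __init__(self, generate_points):
        # generate_points(module) -> torch.Tensor is supplied by you and wraps
        # however you draw the points from the decoder.
        self.generate_points = generate_points

    def on_train_epoch_start(self, trainer, pl_module):
        with torch.no_grad():
            new_points = self.generate_points(pl_module.module)
        # The training plan forwards loss_kwargs to the module's loss() at every
        # step, so the freshly computed points are used throughout this epoch.
        pl_module.loss_kwargs.update({"new_points": new_points})
```

Depending on your scvi-tools version, you should be able to pass the callback through model.train(..., callbacks=[...]), since extra keyword arguments are forwarded to the Lightning Trainer.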