Hi, I have a basic question about the procedure to run scvi models.
When a model is created and trained, we run three instructions: 1) setup_anndata, 2) instantiate the model, 3) model.train().
My understanding is that when we run setup_anndata, some fields are registered in the AnnData and an AnnDataManager is stored in the AnnDataManager store.
So, my questions are:
A) Let's say that I run step 1, setup_anndata. Where is the AnnDataManager stored?
B) Can I access the AnnDataManager without going to step 2 and creating a model?
C) After step 1, we have the fields encoded and written to adata.obs, but where is the state registry?
Thanks
Hi,
The AnnDataManager is hidden at this point, but you can retrieve it via the pointer that setup_anndata added to adata (the scvi_uuid, a.k.a. the manager UUID), like so: SCVI._get_most_recent_anndata_manager(adata).
Then, a state registry exists for each field included in the AnnDataManager's registry, under registry['field_registries'], e.g.: AnndataManager.registry['field_registries']['batch']['state_registry'], and so on.
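To make the nesting concrete, here is a sketch of what that registry layout might look like. The keys mirror the path given above, but the values (field names, category labels) are invented for illustration; the actual contents depend on your data and scvi-tools version:

```python
# Hypothetical illustration of the nested registry layout described above.
# Keys mirror AnnDataManager.registry['field_registries'][...]['state_registry'];
# the values here are made up for the example.
registry = {
    "field_registries": {
        "batch": {
            "state_registry": {
                # e.g. the categorical mapping computed during setup_anndata
                "categorical_mapping": ["batch_0", "batch_1"],
                "original_key": "batch",
            },
        },
        "labels": {
            "state_registry": {
                "categorical_mapping": ["cell_type_a", "cell_type_b"],
                "original_key": "labels",
            },
        },
    },
}

# Drill down exactly as in the answer above:
state = registry["field_registries"]["batch"]["state_registry"]
print(state["categorical_mapping"])  # ['batch_0', 'batch_1']
```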
The AnnDataManager store is a dictionary initialized per model type (e.g. SCVI) which holds those UUIDs. See _setup_adata_manager_store and _per_instance_manager_store in the base model class (which every model inherits from). The store is not exposed directly during setup_anndata, only the UUID mappings are (but you can fetch it, see the previous answer). It's a kind of lazy initialization for the model.
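The class-level store pattern can be sketched in plain Python. This is a simplified mock, not the actual scvi-tools code: the real store maps UUIDs stamped into the AnnData during setup_anndata to AnnDataManager instances, and the class name and dict values here are invented.

```python
import uuid

class MockModel:
    # Class variable: it exists and is writable before any instance is
    # created, which is why setup_anndata can register a manager even
    # though no model object exists yet.
    _setup_adata_manager_store = {}

    @classmethod
    def setup_anndata(cls, adata):
        # Stamp the adata with a pointer (a uuid) and file the manager away.
        manager_uuid = str(uuid.uuid4())
        adata["uns"] = {"scvi_uuid": manager_uuid}
        cls._setup_adata_manager_store[manager_uuid] = {"adata": adata}
        return manager_uuid

    @classmethod
    def _get_most_recent_anndata_manager(cls, adata):
        # Follow the pointer stored in the adata back into the class store.
        return cls._setup_adata_manager_store[adata["uns"]["scvi_uuid"]]

adata = {"obs": {}}             # stand-in for an AnnData object
MockModel.setup_anndata(adata)  # no instance of MockModel exists yet
manager = MockModel._get_most_recent_anndata_manager(adata)
print(manager["adata"] is adata)  # True: the store holds a reference
```

This also illustrates the point raised later in the thread: Python class attributes live on the class object itself, so setup_anndata can write to them without instantiating the model.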
So it's there in memory, and yes, the AnnDataManager links to that adata (a reference, not a copy by value). In addition, each instance of the model holds its own copy when initialized. That is a duplication, and the reason is that each model can run its own scenario and add more information on top of the duplicated adata (like a latent layer), to be saved and used elsewhere later.
As for AnnTorchDataset, yes, the adata is also copied by value there, and its structure is changed before being fed into a torch model during data loading. However, this is train-time duplication and does not waste memory.
Having said all of that, I did not design this whole data-registration mechanism. Deeper questions might be better referred to Adam, Can, Martin, and Ilan.
Hope I helped.
Hi Ori,
thank you so much.
A) This is clear now. During my step 1, setup_anndata, scvi writes class variables even when the class is not instantiated. This is a Python behavior I was not aware of.
B) This is also clear. During my step 2, the anndata is duplicated, so in principle I can remove the first one from memory without affecting the model.
Two more questions,
For AnnTorchDataset, why do you say that the data duplication is "train-time duplication" and does not consume memory? Is this because it happens on a batch of data?
Can you give me a really brief overview of how you are thinking of handling the AnnDataManager and the in-class data duplication with a custom dataloader?
Thanks !
Oh, it does consume memory, but only a batch of data at a time, which is then released (unlike the previous duplication).
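A minimal sketch of what "train-time duplication" means here (this is hypothetical illustration, not the AnnTorchDataset code): only one batch is copied and restructured at a time, and each copy becomes garbage as soon as the consumer moves to the next batch, so peak extra memory is one batch, not the whole dataset.

```python
import numpy as np

def iter_batches(data, batch_size):
    """Yield per-batch copies; only one copied batch is alive at a time."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size].copy()  # train-time copy
        # ...restructure the batch for the model here (dtype, layout, etc.)
        yield batch.astype(np.float32)
        # 'batch' is dropped once the consumer requests the next item,
        # so the duplicated memory is freed rather than accumulated.

data = np.arange(10)
shapes = [b.shape[0] for b in iter_batches(data, batch_size=4)]
print(shapes)  # [4, 4, 2]
```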
With custom dataloaders, by which I believe you mean we don't use adata, we don't run setup_anndata, and the model is initialized without adata.
Instead, we use a pre-defined registry that suits the custom dataloader and the model of interest, and it is part of the custom dataloader class initialization. So there is no lazy init here.
Then we use the registry to init the model, and the dataloader to create batches and run the training. It's a parallel, bypass mechanism to the AnnDataManager.
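The bypass mechanism can be sketched like this (all class names, keys, and arguments here are invented for illustration; the real implementation lives in the custom dataloaders branch):

```python
# Hypothetical sketch: the custom dataloader carries a pre-defined registry,
# and the model is initialized from that registry instead of from adata.
class CustomDataloader:
    def __init__(self, registry, batches):
        # The registry is fixed up front: no setup_anndata, no lazy init.
        self.registry = registry
        self._batches = batches

    def __iter__(self):
        # Batches are produced directly, bypassing the AnnDataManager.
        return iter(self._batches)

class Model:
    def __init__(self, registry):
        # Initialized without adata: everything the model needs to size
        # its layers comes from the registry.
        batch_state = registry["field_registries"]["batch"]["state_registry"]
        self.n_batch = len(batch_state["categorical_mapping"])

registry = {
    "field_registries": {
        "batch": {"state_registry": {"categorical_mapping": ["b0", "b1"]}},
    },
}
loader = CustomDataloader(registry, batches=[[1, 2], [3, 4]])
model = Model(loader.registry)   # model init bypasses the AnnDataManager
print(model.n_batch)             # 2
```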
Actually, this is something we are now adding officially to the next release of scvi-tools (we have a custom dataloader for LaminAI and one for Census data based on TileDB).
Hi Ori
Do you have an example of the custom dataloader? Is it in the branch?
Best
Yes, it's in the custom dataloaders registry branch.
Hi Ori,
I gave it a try and made some comments on the branch; there was an error in the tutorial. I want to help with this, so is there any chat between you and Can about it? Also, I compared the TileDB and the regular anndata loaders head to head, and TileDB is 50% slower.
One more thing… how do I call get_latent() with the TileDB data module?
Last thing. I tested the TileDB dataloader in regular and DDP mode, and what is causing the delay is slow data access. The GPU peaks and processes super fast, but between batches there is a long waiting time.