What training datasets are used in the scVI pretrained model (version 2024-07-01)?

Hi there! :slight_smile:
I’m looking to create a test dataset to benchmark a series of integration methods, and I want to avoid using datasets that were included in training. However, I couldn’t find any detailed information about the training datasets of the scVI pretrained model (version 2024-07-01) on the scVI tutorial page. Could anyone help me with this or point me in the right direction?

Hi, CZI, and specifically the CELLxGENE team, is fully responsible for training these models. That said, all data in the LTS release of the specified date are included in training. You can access the dataset ID column, although it’s hard to track datasets outside of the CELLxGENE Census using this information because the ID is a hash. I would recommend computing the diff between the current release and this LTS and using one of the new datasets as the test dataset, as we have done in scvi-hub.
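
A minimal sketch of that diff, in case it helps. It assumes the `cellxgene_census` Python package is installed and that the dataset table lives at `census["census_info"]["datasets"]` with a `dataset_id` column. The helper names (`new_dataset_ids`, `census_dataset_ids`, `held_out_candidates`) are made up for illustration and are not part of scvi-hub or the Census API.

```python
def new_dataset_ids(training_ids, current_ids):
    """Return IDs present in the newer release but absent from the training release."""
    return sorted(set(current_ids) - set(training_ids))


def census_dataset_ids(census_version):
    """Fetch the dataset_id column for one Census release (needs network access)."""
    # Imported lazily so new_dataset_ids() stays usable without the package.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        datasets = (
            census["census_info"]["datasets"]
            .read(column_names=["dataset_id"])
            .concat()
            .to_pandas()
        )
    return datasets["dataset_id"].tolist()


def held_out_candidates(training_version, newer_version):
    """Hypothetical wrapper: datasets added after the training release."""
    return new_dataset_ids(
        census_dataset_ids(training_version),
        census_dataset_ids(newer_version),
    )
```

Calling `held_out_candidates("2024-07-01", "stable")` should then list dataset IDs that were added after the training release, assuming the `"stable"` alias currently points to a release newer than 2024-07-01. Any of those datasets is a candidate for the test set.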

Thanks for your kind response! :smiling_face_with_three_hearts: I’ll follow the steps outlined in scvi-hub.