Training split conditioned on batch_key

Hello SCVI-Team,

I am currently working with public scRNAseq Atlas data (HCA, HLCA, etc.). It is crucial to account for not only patient-specific but also technical effects such as study, sample preparation, and assay type. Typically, I utilize the patient_id as the batch_key and address sources of technical noise with the categorical_covariate_keys parameter during model setup. This approach generally works well with some additional preprocessing steps.

However, I’ve noticed significant variations in the total number of cells per patient across different studies. This variability poses a risk of underrepresentation or absence of certain patients and/or studies (and their respective covariates) in the training and validation datasets.

Would it be advisable to split the data per patient ID into training and validation sets, or could this approach potentially compromise the model training?

I greatly appreciate your insights.

Best wishes, Florian

I have reviewed the splitting indices in model.train_indices and model.validation_indices and confirmed that the random shuffling and splitting distributes cells evenly without needing to consider the patient ID. Each patient (or study) keeps approximately a 0.9/0.1 ratio between the training and validation datasets, which aligns with our expectations. (In hindsight, it was unreasonable to assume otherwise.)

However, would you advise monitoring any additional metrics to ensure proper learning when total cell numbers differ strongly between covariates?

Thank you once again for your support.
Best regards, Florian
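
For anyone who wants to repeat the per-patient check described in the post above, here is a minimal sketch. It assumes the index lists come from model.train_indices and model.validation_indices, and that the labels are the per-cell patient IDs (for example a patient_id column in adata.obs). The helper name and the toy data are made up for illustration and are not part of scvi-tools.

```python
from collections import Counter


def train_fraction_per_group(labels, train_indices, validation_indices):
    """Return {group: share of that group's split cells that ended up in training}.

    `labels` holds one group label per cell (hypothetical: patient IDs).
    """
    train_counts = Counter(labels[i] for i in train_indices)
    val_counts = Counter(labels[i] for i in validation_indices)
    fractions = {}
    for group in set(labels):
        n_train = train_counts.get(group, 0)
        n_val = val_counts.get(group, 0)
        total = n_train + n_val
        # A group with no cells in either split gets NaN instead of a ratio.
        fractions[group] = n_train / total if total else float("nan")
    return fractions


# Toy data (made up): 10 cells from patient P1, 4 cells from patient P2.
labels = ["P1"] * 10 + ["P2"] * 4
train_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12]
validation_indices = [9, 13]

fractions = train_fraction_per_group(labels, train_indices, validation_indices)
for group in sorted(fractions):
    print(group, round(fractions[group], 2))
```

Patients whose fraction sits far from the configured training share, or who are missing from one split entirely, are the ones worth a closer look.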

Hi, we have scvi-criticism for this. One thing you could monitor is whether the coefficient of variation (CV) per cell differs between batches. A batch that stands out could be one that the model fits worse. Without much more information, though, it is hard to tell whether this is due to technical effects (library sizes, ambient genes, etc.) or to model training.
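
As a rough illustration of the suggestion above, the sketch below computes the CV per cell with NumPy and summarizes, per batch, how far the CV of model-generated counts is from the CV of the raw counts. This is one possible reading of the check, not the scvi-criticism API. The function names, the toy arrays, and the choice of mean absolute error as the summary are all assumptions made for illustration. In practice the simulated matrix would come from a posterior predictive sample of the trained model.

```python
import numpy as np


def cv_per_cell(counts):
    """Coefficient of variation (std / mean) across genes for each cell (row)."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    std = counts.std(axis=1)
    # Cells with zero mean expression get NaN instead of a division error.
    return np.divide(std, mean, out=np.full_like(mean, np.nan), where=mean > 0)


def cv_error_per_batch(raw, simulated, batches):
    """Mean absolute difference between raw and simulated per-cell CV, per batch."""
    batches = np.asarray(batches)
    abs_err = np.abs(cv_per_cell(raw) - cv_per_cell(simulated))
    return {str(b): float(np.nanmean(abs_err[batches == b])) for b in np.unique(batches)}


# Toy data (made up): 4 cells x 4 genes, two batches.
raw = np.array([[1, 1, 1, 1],
                [0, 2, 0, 2],
                [2, 2, 2, 2],
                [0, 4, 0, 4]])
# Hypothetical model output: batch A is reproduced exactly, batch B is over-smoothed.
simulated = np.array([[1, 1, 1, 1],
                      [0, 2, 0, 2],
                      [2, 2, 2, 2],
                      [2, 2, 2, 2]])
batches = ["A", "A", "B", "B"]

errors = cv_error_per_batch(raw, simulated, batches)
print(errors)
```

A batch with a clearly larger error than the others would be a candidate for closer inspection, keeping in mind the caveat above that technical effects and training problems can look alike.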


@cane11 Thank you very much! I am only slowly starting to really understand some fundamental concepts of VAEs, so your input is highly appreciated. I will read up on the Posterior Predictive Checks (PPC) section of scvi-criticism; it looks great :)