Training split conditioned on batch_key

Florian_Deckert · May 16, 2024, 3:51pm

Hello SCVI-Team,

I am currently working with public scRNAseq Atlas data (HCA, HLCA, etc.). It is crucial to account for not only patient-specific but also technical effects such as study, sample preparation, and assay type. Typically, I utilize the patient_id as the batch_key and address sources of technical noise with the categorical_covariate_keys parameter during model setup. This approach generally works well with some additional preprocessing steps.

However, I’ve noticed significant variations in the total number of cells per patient across different studies. This variability poses a risk of underrepresentation or absence of certain patients and/or studies (and their respective covariates) in the training and validation datasets.

Would it be advisable to split the data per patient ID into training and validation sets, or could this approach potentially compromise the model training?
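To make the question concrete, a per-patient split could be sketched roughly as below. This is a plain-Python illustration, not scvi-tools functionality: `split_per_group`, the `patient_ids` list, and the 0.9 training fraction are assumptions chosen for the example.

```python
import random
from collections import defaultdict


def split_per_group(groups, train_size=0.9, seed=0):
    """Split cell indices into train/validation separately within each group.

    `groups` holds one label per cell (hypothetically, the patient ID).
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)

    train, validation = [], []
    for members in by_group.values():
        rng.shuffle(members)
        n_train = round(len(members) * train_size)
        # Keep at least one cell on each side when the group has two or more cells.
        if len(members) >= 2:
            n_train = min(max(n_train, 1), len(members) - 1)
        train.extend(members[:n_train])
        validation.extend(members[n_train:])
    return sorted(train), sorted(validation)


# Toy data: three patients with very different cell counts (made up).
patient_ids = ["P1"] * 100 + ["P2"] * 10 + ["P3"] * 3
train_idx, val_idx = split_per_group(patient_ids)
```

With this kind of split even the smallest patient contributes at least one validation cell, which a purely random split does not guarantee for very small groups.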

I greatly appreciate your insights.

Best wishes, Florian

Florian_Deckert · May 16, 2024, 4:30pm

I have reviewed the splitting indices in model.train_indices and model.validation_indices and confirmed that the random shuffling and splitting distribute the data evenly even though they do not consider the patient ID. Each patient (or study) approximately maintains a 0.9/0.1 ratio between the training and validation datasets, which aligns with our expectations (in hindsight, it was unreasonable to assume otherwise).

However, would you advise monitoring any additional parameters to ensure proper learning when total cell numbers differ strongly between covariates?
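A check of this kind can be sketched as follows. It is a minimal plain-Python illustration: the helper `train_fraction_per_group` and the toy labels are hypothetical, and in practice the index lists would come from model.train_indices and model.validation_indices, with the labels taken from whichever column stores the patient ID.

```python
from collections import Counter


def train_fraction_per_group(groups, train_indices, validation_indices):
    """Return the fraction of each group's cells that ended up in training."""
    n_train = Counter(groups[i] for i in train_indices)
    n_val = Counter(groups[i] for i in validation_indices)
    fractions = {}
    for group in sorted(set(n_train) | set(n_val)):
        total = n_train[group] + n_val[group]
        fractions[group] = n_train[group] / total
    return fractions


# Toy example: 10 cells from patient "A", 5 cells from patient "B" (made up).
groups = ["A"] * 10 + ["B"] * 5
train_indices = list(range(9)) + [10, 11, 12, 13]
validation_indices = [9, 14]

fractions = train_fraction_per_group(groups, train_indices, validation_indices)
for group, fraction in fractions.items():
    print(f"{group}: {fraction:.2f} of cells in training")
```

Groups whose fraction sits far from the requested training size, or that are missing from one of the two sets entirely, would be the ones to look at more closely.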

Thank you once again for your support.
Best regards, Florian

cane11 · May 17, 2024, 12:08pm

Hi, we have scvi-criticism. The thing you could monitor is whether the coefficient of variation (CV) per cell differs between batches. A difference could indicate that a batch is being modeled worse. Without much more information, though, it is hard to tell whether this is due to technical effects (library sizes, ambient genes, etc.) or to model training.
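As a rough sketch of that idea, assuming "CV" means the coefficient of variation (standard deviation divided by mean) across genes for each cell, the observed and model-simulated CVs can be compared batch by batch. The code below is plain Python and does not use the scvi-criticism API; `cv_per_cell`, `cv_error_per_batch`, and the toy count matrices are hypothetical, and the simulated counts are assumed to come from posterior predictive sampling done elsewhere.

```python
from collections import defaultdict
from statistics import mean, pstdev


def cv_per_cell(counts):
    """Coefficient of variation (std / mean) across genes for every cell."""
    cvs = []
    for row in counts:
        m = mean(row)
        cvs.append(pstdev(row) / m if m > 0 else 0.0)
    return cvs


def cv_error_per_batch(observed, simulated, batches):
    """Mean absolute difference between observed and simulated per-cell CV, per batch."""
    obs_cv = cv_per_cell(observed)
    sim_cv = cv_per_cell(simulated)
    errors = defaultdict(list)
    for o, s, b in zip(obs_cv, sim_cv, batches):
        errors[b].append(abs(o - s))
    return {b: mean(e) for b, e in errors.items()}


# Toy cells-by-genes count matrices (made up); rows are cells.
observed = [[0, 2, 4], [1, 1, 4], [3, 0, 3], [5, 5, 5]]
simulated = [[0, 2, 4], [1, 1, 4], [2, 2, 2], [5, 5, 5]]
batches = ["b1", "b1", "b2", "b2"]

errors = cv_error_per_batch(observed, simulated, batches)
for batch, err in errors.items():
    print(f"{batch}: mean |CV_obs - CV_sim| = {err:.3f}")
```

A batch with a clearly larger error than the others would be a candidate for closer inspection, with the caveat from the reply above that the cause may be technical rather than a training problem.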

Florian_Deckert · May 22, 2024, 2:26pm

@cane11 Thank you very much! I am only slowly starting to really understand some fundamental concepts of VAEs, and your input is highly appreciated. I will read up on the Posterior Predictive Checks (PPC) section of scvi-criticism; it looks great.
