Integration with scVI

Hi, I have a dataset composed of three patients, and from each of them I have colon, liver and blood samples. I am wondering what the best method is to minimize the loss of biological signal while still correcting for batch effects.

In the “model.SCVI.setup_anndata” command, is it better to set batch_key to “tissue” (tissue of origin), “patients”, or “batch_id”, with batch_id identifying each sample (colon_patient1, colon_patient2, etc.)?


Hi Sara,

It depends on how you want to use the learned representation. With integration, the only thing that changes is what the latent variables represent (and, potentially, the options for normalization in differential expression, DE).

I would use the learned representation to define cell types that are consistent between tissues, patients and batches. In this case, I would set batch_key to each sample. For example, this way you can identify a population of cells you can call ‘Macrophages’, with a large transcriptional program that determines this type. Then you can ask the question ‘how are colon macrophages different from liver macrophages?’ using the .differential_expression() method. If you then find a gene that is different, you can see how much variability there is between donors or batches by using .get_normalized_expression().
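A minimal sketch of that workflow, not a definitive recipe. It assumes adata.X holds raw counts, that adata.obs has columns named "tissue" and "patients" (names taken from the question), and that a "cell_type" column has already been added, for example by clustering the latent space. The helper names (sample_labels, fit_per_sample, colon_vs_liver, expression_by_sample) are made up for illustration.

```python
def sample_labels(tissues, patients):
    """One label per cell naming its sample, e.g. 'colon_patient1'."""
    return [f"{t}_{p}" for t, p in zip(tissues, patients)]


def fit_per_sample(adata):
    """Train scVI with every sample (tissue x patient) treated as its own batch."""
    import scvi  # imported lazily so sample_labels works without scvi-tools

    # Assumed obs columns: "tissue" and "patients"; adata.X must be raw counts.
    adata.obs["batch_id"] = sample_labels(adata.obs["tissue"], adata.obs["patients"])
    scvi.model.SCVI.setup_anndata(adata, batch_key="batch_id")
    model = scvi.model.SCVI(adata)
    model.train()
    # Store the integrated representation for clustering / annotation.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    return model


def colon_vs_liver(model, cell_type="Macrophages"):
    """DE between colon and liver cells of one annotated type.

    "cell_type" is a hypothetical obs column holding your own annotation.
    """
    obs = model.adata.obs
    in_type = (obs["cell_type"] == cell_type).to_numpy()
    colon = in_type & (obs["tissue"] == "colon").to_numpy()
    liver = in_type & (obs["tissue"] == "liver").to_numpy()
    return model.differential_expression(idx1=colon, idx2=liver)


def expression_by_sample(model, gene):
    """Mean normalized expression of one gene within each sample."""
    expr = model.get_normalized_expression(gene_list=[gene])
    return expr.groupby(model.adata.obs["batch_id"].to_numpy()).mean()
```

Calling fit_per_sample(adata), then colon_vs_liver(model), then expression_by_sample(model, gene) for a gene of interest follows the three steps described above. The last table has one row per sample, which gives a rough view of how much a DE hit varies between donors.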

Another option is to say that you think gut macrophages and liver macrophages are so fundamentally different that you want this variation to be reflected in the learned representation. Perhaps you want to use the representation to find cells that have extreme ‘gut macrophage’ or ‘liver macrophage’ phenotypes, but you think these definitions of cells should be consistent between donors. In this case you would set batch_key to 'patients'.

In practice, when I do these sorts of analyses, as an early step I would typically try multiple variations: no integration, integrate patients, integrate patients_tissues, integrate tissues, integrate patient_tissue_batch. Then I investigate qualitatively how the representation of the data changes. This way I can get an idea of which sources of variation contribute to the data.
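One way to set up that comparison is sketched below, under assumptions: the obs columns "patients", "tissue" and "batch" are placeholder names (the "batch" column in particular is hypothetical), adata.X holds raw counts, and the helpers joined_labels and embed_variants are illustrative, not part of scvi-tools. Each variant trains a separate model on a copy of the data and stores its latent space under its own key.

```python
# Which obs columns define the batch for each variant (None = no integration).
VARIANTS = {
    "no_integration": None,
    "patients": ["patients"],
    "patients_tissues": ["patients", "tissue"],
    "tissues": ["tissue"],
    "patient_tissue_batch": ["patients", "tissue", "batch"],
}


def joined_labels(columns):
    """Join several per-cell label sequences row-wise, e.g. 'patient1_colon'."""
    return ["_".join(str(v) for v in row) for row in zip(*columns)]


def embed_variants(adata, variants=VARIANTS):
    """Train one scVI model per variant and keep each latent space in adata.obsm."""
    import scvi  # imported lazily so the helpers above work without scvi-tools

    for name, cols in variants.items():
        ad = adata.copy()  # separate copy so each setup is independent
        key = None
        if cols is not None:
            key = f"scvi_batch_{name}"
            ad.obs[key] = joined_labels([ad.obs[c] for c in cols])
        scvi.model.SCVI.setup_anndata(ad, batch_key=key)
        model = scvi.model.SCVI(ad)
        model.train()
        adata.obsm[f"X_scvi_{name}"] = model.get_latent_representation()
    return adata
```

Each entry in adata.obsm can then be passed to a neighbor graph and UMAP (for example via scanpy's use_rep argument) and colored by tissue, patient and batch to see which sources of variation each setting keeps or removes.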

Hope this is useful!


Thank you very much Valentine.
Your explanation is very clear and useful.