Choosing a Batch Key

I have a very large dataset with data from many animals across 2 species and multiple developmental time points. Each animal replicate has many batches, with 2 reactions per brain region of interest. I’m having trouble choosing a batch key to use for selecting HVGs and to pass on to the scVI model.

As I understand it, the batch key should indicate a variable along which we expect variation that is not due to actual biological differences, i.e. between batches of the same animal and same region. However, the batch variable in our dataset inherently includes information about age, species, and region, so my concern is that our data will be over-integrated along those axes. Another option is to choose the animal ID as the batch key, but that still includes age and species separation.

Does anyone have guidance for how to select a batch key in this scenario? Our goal is to maintain true biological differences along the age, species, and region axes but still be able to cluster broad subtypes of nuclei.

Is this something you could use the ‘gene-label’ dispersion for? I was thinking of using the batch ID as the batch key and creating a custom label for the age/species/region category, but I am not confident that I understand how the dispersion parameter works.

Thank you!

Hi, thank you for your question. Setting dispersion="gene-label" means that the model will learn separate gene-wise negative binomial dispersion parameters for each label in labels_key provided in setup_anndata. I’m not sure how much setting this will help with batch correction, maybe @adamgayoso can provide more insight.

Hi Danamcc,

I think you’re saying you have a structure in your data with (Species → Age → Animal → Region → Batch), where I’m using the → symbol as ‘has several derived samples of’.

You want to look at variation depending on species, age, and region, while accounting for other sources of variation.

If you want to use the scVI latent cell representation for this, you cannot “include” variation present on the left of that hierarchy when accounting for something on the right of the hierarchy. This is because the two are confounded. Adjusting for variation between batches inherently means adjusting for variation between ages, because each batch can only be one possible age (even though any one age can have multiple batches).

The only way to account for ‘batch’ while leaving variation due to ‘age’ or ‘region’ intact would be if each batch contained all the ages and regions.

This is of course impossible, just due to the structure of the world.
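To see which variables are nested in this way, you can check the cell metadata directly. Below is a minimal sketch in plain Python; the field names (batch, animal, age, region) and the toy values are made up for illustration and are not from your dataset. A candidate batch key is confounded with a covariate whenever every level of the key contains only one level of that covariate.

```python
from collections import defaultdict


def levels_per_batch(records, batch_field, covariate_field):
    """Map each batch to the set of covariate levels observed in it."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[batch_field]].add(rec[covariate_field])
    return dict(seen)


def is_nested(records, batch_field, covariate_field):
    """True if every batch contains exactly one covariate level, meaning
    that correcting for the batch also removes covariate differences."""
    per_batch = levels_per_batch(records, batch_field, covariate_field)
    return all(len(levels) == 1 for levels in per_batch.values())


# Hypothetical per-cell metadata (assumed field names and values).
cells = [
    {"batch": "b1", "animal": "a1", "age": "P7", "region": "cortex"},
    {"batch": "b1", "animal": "a1", "age": "P7", "region": "cortex"},
    {"batch": "b2", "animal": "a1", "age": "P7", "region": "striatum"},
    {"batch": "b3", "animal": "a2", "age": "P21", "region": "cortex"},
    {"batch": "b4", "animal": "a2", "age": "P21", "region": "striatum"},
]

print(is_nested(cells, "batch", "age"))      # each batch has a single age
print(is_nested(cells, "animal", "age"))     # each animal has a single age
print(is_nested(cells, "animal", "region"))  # animals span several regions
```

In this toy layout, age is nested in both batch and animal, while region is crossed with animal, which is the structure I describe for your data.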

In other analysis settings, you can learn about differences between regions or species even though there is variation between animals or batches that you are not interested in. This is done by looking at averages in the left groups in the hierarchy across the multiple measurements in the right groups.

In that ‘quantifying averages’ setting you have a lot of structure in your data that I think will be helpful. It seems you have regions that are the same no matter which animal they are from; thus the different regions provide replication for animal-animal variation, and at the same time the animals provide replication for region-region variation.

I think in your situation, it would be best to assume that cell types are the same between ages, species, and regions (this does not mean that the proportions, or even the presence, of cell types will be the same across them). This way you can integrate all your batches, then annotate your cells, and after that do an analysis that averages across replicates to learn about the entities you are interested in.
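As a rough illustration of that last averaging step, here is a small sketch in plain Python. The field names (region, animal, expr) and the numbers are hypothetical, and in practice you would do this per annotated cell type and per gene. The idea is to average cells within each animal first, and then average those per-animal means within each region, so that an animal contributing many cells does not dominate the estimate.

```python
from collections import defaultdict
from statistics import mean


def replicate_means(cells, group_field, replicate_field, value_field):
    """Average cells within each (group, replicate) pair."""
    buckets = defaultdict(list)
    for cell in cells:
        key = (cell[group_field], cell[replicate_field])
        buckets[key].append(cell[value_field])
    return {key: mean(values) for key, values in buckets.items()}


def group_means(cells, group_field, replicate_field, value_field):
    """Average the per-replicate means within each group."""
    per_group = defaultdict(list)
    rep_means = replicate_means(cells, group_field, replicate_field, value_field)
    for (group, _replicate), value in rep_means.items():
        per_group[group].append(value)
    return {group: mean(values) for group, values in per_group.items()}


# Hypothetical expression values for one gene in one cell type.
cells = [
    {"region": "cortex", "animal": "a1", "expr": 2.0},
    {"region": "cortex", "animal": "a1", "expr": 4.0},
    {"region": "cortex", "animal": "a2", "expr": 6.0},
    {"region": "striatum", "animal": "a1", "expr": 1.0},
    {"region": "striatum", "animal": "a2", "expr": 3.0},
]

print(group_means(cells, "region", "animal", "expr"))
```

Here the cortex estimate is the mean of the two animal means (3.0 and 6.0), not the mean over all three cortex cells, so each animal counts once as a replicate.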

Hope this helps,
/Valentine