I have a very large dataset with data from many animals across 2 species and multiple developmental time points. Each animal replicate has many batches, with 2 reactions per brain region of interest. I’m having trouble choosing a batch key to use for HVG selection and to pass on when setting up the scVI model.
As I understand it, the batch key is supposed to indicate a variable along which we expect variation that is not due to actual biological differences, i.e. between batches from the same animal and the same region. However, the batch variable in our dataset is inherently tied to age, species, and region, so my concern is that our data will be over-integrated along those axes. Another option is to use the animal ID as the batch key, but that is still tied to age and species.
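One way to make this concern concrete is to count how many distinct age/species/region combinations occur within each level of a candidate batch key. The following is a minimal sketch on a toy table standing in for `adata.obs`; the column names (`batch_id`, `animal_id`, `species`, `age`, `region`) and all values are made up for illustration and are not our real metadata.

```python
import pandas as pd

# Toy stand-in for adata.obs; column names and values are hypothetical.
obs = pd.DataFrame({
    "batch_id":  ["b1", "b1", "b2", "b2", "b3", "b3", "b4", "b4"],
    "animal_id": ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"],
    "species":   ["mouse"] * 4 + ["rat"] * 4,
    "age":       ["P7"] * 4 + ["P21"] * 4,
    "region":    ["ctx", "ctx", "hip", "hip", "ctx", "ctx", "hip", "hip"],
})

bio_cols = ["species", "age", "region"]


def n_bio_groups_per_level(obs, key, bio_cols):
    """Count distinct biological groups observed within each level of `key`."""
    unique_pairs = obs[[key] + bio_cols].drop_duplicates()
    return unique_pairs.groupby(key).size()


# A count of 1 everywhere means the key is fully nested inside the
# biological variables, so correcting for it could also remove biology.
per_batch = n_bio_groups_per_level(obs, "batch_id", bio_cols)
per_animal = n_bio_groups_per_level(obs, "animal_id", bio_cols)

print(per_batch)
print(per_animal)
```

In this toy layout every batch maps to exactly one species/age/region combination, which is the situation I am worried about, while each animal spans two regions but still only one species and age.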
Does anyone have guidance on how to select a batch key in this scenario? Our goal is to preserve true biological differences along the age, species, and region axes while still being able to cluster broad subtypes of nuclei.
Is this something the ‘gene-label’ dispersion option could be used for? I was thinking of using the batch ID as the batch key and creating a custom label for each age/species/region category, but I am not confident that I understand how the dispersion parameter works.
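For reference, this is roughly the setup I have in mind, written as a small sketch rather than something I have validated. The `obs` table and the `bio_group` column name are hypothetical; `batch_key`, `labels_key`, and `dispersion` are the existing scvi-tools argument names. My current understanding is that `dispersion="gene-label"` only lets the per-gene dispersion of the count likelihood differ between labels, and I am unsure whether that does anything to protect those labels from being integrated away.

```python
import pandas as pd

# Toy stand-in for adata.obs; column names and values are hypothetical.
obs = pd.DataFrame({
    "batch_id": ["b1", "b2", "b3", "b4"],
    "species":  ["mouse", "mouse", "rat", "rat"],
    "age":      ["P7", "P7", "P21", "P21"],
    "region":   ["ctx", "hip", "ctx", "hip"],
})

# Custom label combining the biological axes we want to preserve.
obs["bio_group"] = obs["species"] + "_" + obs["age"] + "_" + obs["region"]

# Arguments as I would pass them to scvi-tools (not executed here).
setup_kwargs = {"batch_key": "batch_id", "labels_key": "bio_group"}
model_kwargs = {"dispersion": "gene-label"}

# With a real AnnData object `adata` whose .obs holds these columns:
#   import scvi
#   scvi.model.SCVI.setup_anndata(adata, **setup_kwargs)
#   model = scvi.model.SCVI(adata, **model_kwargs)

print(obs["bio_group"].tolist())
```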