Nested batch effects with scVI

Hi all,

I’m not exactly sure how to code the batch effect comparison here. I started by including all the levels, but I’m not sure this is ideal. Essentially, this is a 10x FFPE experiment with 15 samples. The 10x FFPE protocol splits the samples into 4 pools, and the pools were sequenced in 2 batches. The setup is something like:

Batch 1 has pools 1 and 2
Batch 2 has pools 3 and 4. One sample in batch 2 was also repeated in batch 1.

I currently have it as:

import scvi

# Register the raw-count layer and all covariates with scVI.
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    categorical_covariate_keys=["pool", "batch", "samples"],
    continuous_covariate_keys=["pct_counts_mt", "S_score", "G2M_score"],
)

I’m not sure this is ideal, however, as I realize some of the categorical variables are highly correlated (each pool belongs to a single batch). Thank you for your help!!

Correlated categorical covariates are not a problem, either in my experience or in the code. scVI corrects for those covariates; it doesn’t try to learn the effect of each covariate separately. However, in my experience, adding continuous covariates can make the latent space significantly worse, because more information ends up encoded through those continuous covariates than through the actual latent space.
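
If it helps to see how the covariates overlap before deciding what to pass in, a cross-tabulation makes the nesting explicit. This is only a minimal sketch: the `obs` table below is a made-up stand-in for `adata.obs`, the sample labels are invented, and only the column names (`pool`, `batch`, `samples`) come from the setup call above.

```python
import pandas as pd

# Hypothetical per-cell metadata standing in for adata.obs. The column names
# follow the thread; the sample labels and row counts are invented.
obs = pd.DataFrame(
    {
        "batch": ["1", "1", "1", "2", "2", "2"],
        "pool": ["1", "2", "2", "3", "4", "4"],
        "samples": ["s01", "s02", "s15", "s08", "s09", "s15"],
    }
)

# Cells per pool/batch combination; zero entries show which pools never
# occur in a given batch.
design = pd.crosstab(obs["pool"], obs["batch"])
print(design)

# A covariate is nested in batch if each of its levels occurs in one batch only.
batches_per_pool = obs.groupby("pool")["batch"].nunique()
batches_per_sample = obs.groupby("samples")["batch"].nunique()

pool_nested = bool((batches_per_pool == 1).all())
bridging_samples = sorted(batches_per_sample[batches_per_sample > 1].index)

print("pool nested in batch:", pool_nested)
print("samples present in both batches:", bridging_samples)
```

On real data, replacing the toy table with `adata.obs` should be enough. A sample that appears in both batches, like the repeated one described above, is worth noting because it is the only direct link between the two sequencing runs.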

Thank you for your advice. Do you usually not include any continuous covariates? Is there any way to quantify how much information I’m removing from the latent space between models? For example, is there a way to estimate how much I lose by including cell cycle scores, or %mt, as continuous covariates?

I generally don’t include them. I run scVI and check the output; if there is a strong gradient for one of them, I add it as a covariate and rerun training. The best way to check that structure is well preserved is prior knowledge (like cell types or a developmental trajectory). Quantifying it is possible with scib-metrics, although those metrics are very focused on cell types.
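
One rough way to put a number on "a strong gradient" is to ask how much of a continuous variable can be predicted linearly from the latent space. The sketch below is an assumption on my part, not part of scVI or scib-metrics: `covariate_r2` is a hypothetical helper, and the latent matrix and covariates are simulated so the example runs on its own.

```python
import numpy as np


def covariate_r2(latent, covariate):
    """In-sample R^2 of an ordinary least-squares fit of `covariate` on `latent`."""
    latent = np.asarray(latent, dtype=float)
    y = np.asarray(covariate, dtype=float)
    # Prepend a column of ones so the fit includes an intercept.
    X = np.column_stack([np.ones(len(y)), latent])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coef
    total = y - y.mean()
    return 1.0 - (residual @ residual) / (total @ total)


# Simulated stand-ins: 500 cells, 10 latent dimensions.
rng = np.random.default_rng(0)
n_cells, n_latent = 500, 10
latent = rng.normal(size=(n_cells, n_latent))

# One covariate that tracks the first latent dimension, one that is pure noise.
graded = 2.0 * latent[:, 0] + rng.normal(scale=0.5, size=n_cells)
unrelated = rng.normal(size=n_cells)

r2_graded = covariate_r2(latent, graded)
r2_unrelated = covariate_r2(latent, unrelated)
print(f"R^2 graded covariate:    {r2_graded:.2f}")
print(f"R^2 unrelated covariate: {r2_unrelated:.2f}")
```

With a trained model, `latent` would come from `model.get_latent_representation()` and the covariate from a column such as `adata.obs["S_score"]`. Comparing the value between a model trained with and without that covariate gives a rough sense of how much of it was pushed out of the latent space. It is an in-sample, linear measure, so it is slightly optimistic and will miss nonlinear gradients; it does not replace checking known cell types or trajectories.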