Batch Integration Parameter Tuning

Hello,

Great tool, thanks for all your efforts. A couple of questions below.

I am integrating ~500K cells from >40 donors, and I am interested in parameter tuning for the model. I’m new to neural networks, but it seems like the output (clustering, markers, UMAP) would be primarily affected by the number of HVGs, the number of layers, and the number of final (latent) dimensions. I was wondering what your thoughts are on toggling these three inputs for model refinement. My understanding is that increasing the number of layers should tease out more hidden interactions, while increasing the number of dimensions essentially allocates more space to store variability/patterns?

Does your model automatically account for variation in sequencing depth, or should users specify the number of UMIs as a continuous covariate?

Thanks for your help.

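Yes, though it’s not always so simple. There are peculiarities when training variational autoencoders related to inactive dimensions of the “bottleneck” layer: some latent dimensions can end up carrying little or no information, so adding dimensions does not always add usable capacity. But your thought process is very reasonable.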
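To make "inactive dimensions" concrete, here is a minimal sketch of one common diagnostic: compute the variance of each latent dimension's posterior mean across cells and flag dimensions that barely move. The function name `active_dimensions`, the 0.01 threshold, and the synthetic data are assumptions for illustration, not something the model provides; in practice the input would be the matrix of latent means for your cells (for example, what `get_latent_representation()` returns).

```python
import numpy as np


def active_dimensions(latent_means, threshold=1e-2):
    """Return a boolean mask: True where a latent dimension varies across cells.

    latent_means: array of shape (n_cells, n_latent) holding posterior means.
    threshold: hypothetical cutoff on the per-dimension variance.
    """
    variances = np.var(latent_means, axis=0)
    return variances > threshold


# Synthetic stand-in for posterior means: two informative dimensions
# and one collapsed dimension that is nearly constant across cells.
rng = np.random.default_rng(0)
n_cells = 1000
latent_means = np.column_stack([
    rng.normal(0.0, 1.0, n_cells),
    rng.normal(0.0, 0.5, n_cells),
    rng.normal(0.0, 0.01, n_cells),  # barely moves, so effectively inactive
])

mask = active_dimensions(latent_means)
print("active dimensions:", int(mask.sum()), "of", mask.size)
```

If many dimensions come out inactive, increasing the latent size further is unlikely to change the result much.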

What you could do is define some metrics that are relevant to you (like those in scIB) and then do hyperparameter optimization with Ray Tune, another package, or a simple grid search.
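As a rough sketch of the simple grid search option, assuming you wrap "select HVGs, train the model, compute your metric" in a single scoring function: the names `grid_search`, `placeholder_score`, and `n_hvgs` are hypothetical, and the scoring function below is a toy stand-in so the example runs on its own. It says nothing about which settings are actually better.

```python
import itertools


def grid_search(param_grid, score_fn):
    """Score every parameter combination; return (score, params) pairs, best first."""
    names = list(param_grid)
    results = []
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score_fn(params), params))
    # Sort on the score only, so ties never fall back to comparing dicts.
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


# Hypothetical grid over the three knobs discussed in the question.
param_grid = {
    "n_hvgs": [2000, 4000],
    "n_layers": [1, 2],
    "n_latent": [10, 30],
}


def placeholder_score(params):
    # Toy stand-in, NOT a real metric. Replace the body with code that
    # subsets to params["n_hvgs"] genes, trains the model with the given
    # n_layers / n_latent, and returns the metric you care about
    # (for example an scIB-style integration score, higher is better).
    return params["n_hvgs"] / 1000 - abs(params["n_latent"] - 30) - params["n_layers"]


results = grid_search(param_grid, placeholder_score)
best_score, best_params = results[0]
print(best_params, best_score)
```

With ~500K cells each training run is expensive, so it may be worth tuning on a subsample of cells first and keeping the grid small.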

The model automatically uses the observed library size of the gene expression data you supply (since the input is counts, it just takes the sum for each cell). In the newest release you can instead provide your own size_factor_key to setup_anndata (on the linear scale, not log-transformed!).
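A small sketch of what that means in practice, using a made-up count matrix. The median-scaled size factor and the column name "size_factor" are assumptions for illustration; any positive per-cell factor on the linear scale would follow the same pattern.

```python
import numpy as np

# Toy raw count matrix: 3 cells x 4 genes (made-up values).
counts = np.array([
    [1, 0, 3, 6],
    [0, 2, 2, 0],
    [5, 5, 0, 10],
])

# Default behaviour described above: library size is the sum of counts per cell.
library_size = counts.sum(axis=1)

# One possible custom size factor, kept on the linear scale (no log):
# library size relative to the median cell.
size_factor = library_size / np.median(library_size)

print(library_size)
print(size_factor)

# With an AnnData object, the values would go into .obs and the column name
# would be passed to setup_anndata, for example:
#   adata.obs["size_factor"] = size_factor
#   scvi.model.SCVI.setup_anndata(adata, size_factor_key="size_factor")
```

So there should be no need to add the number of UMIs as a continuous covariate just to account for sequencing depth.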