Protocol for model optimization (currently focused on MultiVI)

Hi there - first, I am simply blown away by the suite of tools you have assembled. It is truly awe-inspiring and sets the bar for other Python / comp-bio developers.

My question (and please let me know if I have posted this in the wrong place on the forum): Do you have a general protocol for model optimization? In this instance, I am particularly interested in using MultiVI to integrate cells from the same sample measured by scRNA-seq and scATAC-seq. After following the tutorial, I have been playing around with model parameters such as the number of hidden units and latent variables.

Despite some initial attempts at tuning parameters, etc., I have only obtained results that show a sub-optimal co-embedding:
image

'MultiVI Model with INPUTS: n_genes:7213, n_regions:28053\nn_hidden: 187, n_latent: 13, n_layers_encoder: 2, n_layers_decoder: 2 , dropout_rate: 0.1, latent_distribution: normal, deep injection: False, gene_likelihood: zinb'

Unfortunately, since I am a new user, I can only upload one image at the moment but I did try other model implementations and samples below.

'MultiVI Model with INPUTS: n_genes:8863, n_regions:33682\nn_hidden: 206, n_latent: 14, n_layers_encoder: 2, n_layers_decoder: 2 , dropout_rate: 0.1, latent_distribution: normal, deep injection: False, gene_likelihood: zinb'
'MultiVI Model with INPUTS: n_genes:8863, n_regions:33682\nn_hidden: 5, n_latent: 2, n_layers_encoder: 2, n_layers_decoder: 2 , dropout_rate: 0.1, latent_distribution: normal, deep injection: False, gene_likelihood: zinb'

I feel there must be a better latent space to be found; however, I am not sure what procedure you would typically recommend for hyperparameter (HP) tuning and optimization within your framework. Happy to share code / more info, but I basically started from the MultiVI tutorial linked above. Any help is greatly appreciated.
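To make the question concrete, here is a rough sketch of the kind of sweep I have in mind. It is only an illustration, not a recommended protocol: `make_grid`, `grid_search`, and `fit_multivi` are names I made up, and I am assuming that the final validation ELBO in `model.history` is a sensible score to compare runs on (lower is better), which may not be the right criterion for co-embedding quality.

```python
from itertools import product


def make_grid(space):
    """Expand {"name": [values, ...]} into a list of parameter dicts."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def grid_search(space, fit_and_score):
    """Score every combination; return (score, params) pairs, lowest score first."""
    results = [(fit_and_score(**params), params) for params in make_grid(space)]
    # Sort on the score only, so ties never fall back to comparing dicts.
    return sorted(results, key=lambda pair: pair[0])


def fit_multivi(adata, n_genes, n_regions, **params):
    """Hypothetical scorer: train one MultiVI model and return its last validation ELBO.

    Assumes `adata` has already been registered with
    scvi.model.MULTIVI.setup_anndata(...) and that validation metrics are
    logged during training (an assumption about the training defaults).
    """
    import scvi  # imported lazily so the grid helpers work without scvi-tools

    model = scvi.model.MULTIVI(adata, n_genes=n_genes, n_regions=n_regions, **params)
    model.train()
    return float(model.history["elbo_validation"].iloc[-1, 0])


# Example usage (needs scvi-tools and a prepared AnnData object):
# space = {"n_hidden": [64, 128, 256], "n_latent": [10, 20, 30]}
# ranked = grid_search(
#     space,
#     lambda **p: fit_multivi(adata, n_genes=8863, n_regions=33682, **p),
# )
# best_score, best_params = ranked[0]
```

The grid helpers are kept separate from the model-specific scorer so that the scoring criterion can be swapped out, for example for a mixing metric computed on the latent space, without touching the sweep itself.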

Thank you :slight_smile:

MultiVI requires at least some data in which ATAC and RNA are measured simultaneously in the same cells. It doesn’t look like you have that here?

Ah, I see - I misread the intention of the example notebook. You are correct: I do not have data where each modality is measured simultaneously in the same cell (just the same sample). That brings me to two questions:

  1. If I had paired (same-cell) RNA/ATAC data from a model cell line of the same disease, do you think it could serve as an anchor for integrating other samples measured with the same assays, for which we have separate but matched modality measurements? Perhaps this question involves too many unknowns on your end to say yes or no.

  2. Do you have any other recommendations for handling matched (but not paired) scRNA-seq and scATAC-seq samples within your framework? I could use PeakVI and scVI to analyze the modalities independently, but I would be curious to know if there is a preferred solution here…

Thanks again!