Best practices for downstream analysis of Resolvi-corrected data

Dear scverse team,

Thank you for all your efforts in creating this incredible software library. I have a question about resolVI, which I am using to analyze and correct batch effects in a set of Xenium datasets. Variants of this question have been asked before, but after reading the responses I am still a bit unsure.

I have trained my resolVI model, something like this:

import scvi
from scvi.external import RESOLVI

RESOLVI.setup_anndata(adata, layer="counts", batch_key="sample")

resolvi = RESOLVI(adata, n_latent=30, semisupervised=False)

resolvi.train(max_epochs=100, enable_progress_bar=True, accelerator="gpu")

adata.obsm["X_resolVI"] = resolvi.get_latent_representation()

# Create a name for your model folder
model_dir = "resolvi_model_2M_cells"

# Save the model
resolvi.save(model_dir, overwrite=True, save_anndata=False)
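
For completeness, in later sessions I reload the saved model with the standard scvi-tools load call, roughly:

import scvi
from scvi.external import RESOLVI

# Reload the trained model in a new session (standard scvi-tools pattern);
# adata here is the same AnnData that was used for setup_anndata above
resolvi = RESOLVI.load(model_dir, adata=adata)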

I have clustered using the X_resolVI latent space - all fine. Where I get less sure is when I want to generate a dotplot of specific marker gene expression. Ideally this should be appropriately normalized, so that the data is comparable across cells with variable sequencing depth, and should also account for the significant batch effects between samples.

Based on the tutorials I have been deriving ‘generated_expression’ as below, which, based on some of the questions on here, I believe is preferred over ‘get_normalized_expression’.

import numpy as np
import pandas as pd

resolvae = resolvi

samples_corr = resolvae.sample_posterior(
    model=resolvae.module.model_corrected,
    batch_size=65536,
    return_sites=["px_rate"],
    summary_fun={"post_sample_q50": np.median},
    num_samples=3,
    summary_frequency=30,
)

samples_corr = pd.DataFrame(samples_corr).T
adata.layers["generated_expression"] = samples_corr.loc["post_sample_q50", "px_rate"]

Now, if I want to generate a dotplot, can I use ‘generated_expression’ directly, or do I need to perform additional normalization steps, for example sc.pp.normalize_total(adata, layer="generated_expression")? This is implied in the manuscript, e.g.:

Throughout the manuscript, we use generated expression from resolVI. We perform count normalization. Recently, area normalization was suggested for spatial transcriptomics [33]. However, this benchmarking was performed without comparing different segmentation algorithms. We perform square root transformation of normalized counts as we found after log-1p transformation a worse identification of expected cell types.
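
If I am reading that passage correctly, the equivalent scanpy steps on the generated layer would be roughly the following (my own interpretation, not something taken from the tutorials):

import numpy as np
import scanpy as sc

# Count-normalize the generated counts per cell (in place on the layer)...
sc.pp.normalize_total(adata, layer="generated_expression")

# ...then square-root transform the normalized counts, as described in the manuscript
# (assumes the layer is a dense array; a sparse layer would need its .sqrt() method)
adata.layers["generated_expression_sqrt"] = np.sqrt(adata.layers["generated_expression"])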

Currently, my dotplot command looks something like this, using an expression cut-off as discussed elsewhere, as I believe you end up with a few zero-count cells.

sc.pl.dotplot(
    adata, 
    my_genes, 
    groupby='cell_type', 
    layer="generated_expression",   
    standard_scale='var',
    expression_cutoff=0.2
)

As a linked point, when running Hotspot, should I input the ‘generated_expression’ layer, using the ‘danb’ model, and what is the appropriate value for ‘umi_counts_obs_key’?
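
To make the question concrete, the kind of call I have in mind looks roughly like this (using Hotspot's AnnData interface; I am not sure these are the right inputs, and the 'total_counts' column is just something I would compute myself):

import numpy as np
import hotspot

# Per-cell total counts from the raw counts layer (my own choice of obs column name)
adata.obs["total_counts"] = np.asarray(adata.layers["counts"].sum(axis=1)).ravel()

hs = hotspot.Hotspot(
    adata,
    layer_key="generated_expression",   # corrected expression? or raw counts?
    model="danb",
    latent_obsm_key="X_resolVI",
    umi_counts_obs_key="total_counts",
)
hs.create_knn_graph(weighted_graph=False, n_neighbors=30)
autocorr_results = hs.compute_autocorrelations()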

I would really value your advice on the best way to handle downstream analyses of corrected expression data. I am seeing really excellent results with resolVI, which seems to be very substantially reducing cell admixture and allows me to cluster across a large dataset of 2 million ‘cells’. One tangential point I would mention is that the differential expression module runs very slowly with such a large dataset, so I have been randomly subsampling the dataset to get around this.
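
For reference, the subsampling I mention is just something like this (the differential_expression call is my guess at the usual scvi-tools groupby-style signature, so the exact arguments may differ):

import scanpy as sc

# Random 10% subset so differential expression finishes in a reasonable time
adata_sub = sc.pp.subsample(adata, fraction=0.1, copy=True, random_state=0)

# My guess at the call; resolVI's differential_expression arguments may differ
de_results = resolvi.differential_expression(adata_sub, groupby="cell_type")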

Thank you!

Sam Kleeman, Postdoctoral Fellow, CSHL

I don’t see any reason why sc.pl.dotplot on the top genes selected from DE, with the dendrogram computed on X_resolVI and dendrogram=True, wouldn’t work:

sc.tl.dendrogram(ref_adata, groupby="resolvi_predicted", use_rep="X_resolVI")

sc.pl.dotplot(
    adata=ref_adata,
    var_names=sel_markers,
    groupby="resolvi_predicted",
    dendrogram=True,
    color_map="Blues",
    swap_axes=True,
    use_raw=False,
    standard_scale="var",
)

@cane11 Can you add your ideas here?

Hi Sam, to clean things up here: throughout the manuscript we use posterior predictive samples (similar to model.posterior_predictive_sample in scVI). This generated data is count data and therefore requires the same steps as raw data, such as count normalization and square-root transformation.

import numpy as np
import pandas as pd

resolvae = resolvi

samples_corr = resolvae.sample_posterior(
    model=resolvae.module.model_corrected,
    batch_size=65536,
    return_sites=["obs"],
    summary_fun={"post_sample_q50": np.median},
    num_samples=3,
    summary_frequency=30,
)

samples_corr = pd.DataFrame(samples_corr).T
adata.layers["generated_expression"] = samples_corr.loc["post_sample_q50", "obs"]

px_rate, in contrast, contains dense, count-normalized estimates of gene expression (analogous to get_normalized_expression in scVI or resolVI; the only difference is that sample_posterior collects px_rate across multiple samples using Pyro, while get_normalized_expression uses the underlying torch computations to further speed up computation).
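
As a rough sketch, that torch-based route would look something like this (assuming resolVI's get_normalized_expression follows the usual scvi-tools signature; the layer name is just an example):

# Torch-based alternative to sampling px_rate via sample_posterior
# (assumes the usual scvi-tools get_normalized_expression arguments)
normalized = resolvi.get_normalized_expression(
    adata,
    batch_size=65536,
    return_numpy=True,
)
adata.layers["resolvi_normalized"] = normalized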

So if you want a drop-in replacement for count data, generate with return_sites=["obs"]; this is what we used for benchmarking in our manuscript. If you need dense estimates of gene expression (such as for Visium HD) and want to use resolVI for data normalization, use return_sites=["px_rate"]. I will make this clear in the final version of the manuscript and replace ‘generated expression’ with ‘posterior predictive samples’ to sharpen this distinction.