Dear scverse team,
Thank you for all your efforts in creating this incredible software library. I have a question about resolvi, which I am using to analyze and corrected batch effects for a set of Xenium datasets. Variants of this question have been asked before but from reading the responses I am still left a bit unsure.
I have trained my resolvi model, something like this.
import scvi
from scvi.external import RESOLVI
RESOLVI.setup_anndata(adata, layer='counts', batch_key="sample")
resolvi = RESOLVI(adata, n_latent=30, semisupervised=False)
resolvi.train(max_epochs=100, enable_progress_bar=True,accelerator="gpu")
adata.obsm["X_resolVI"] = resolvae.get_latent_representation()
# Create a name for your model folder
model_dir = "resolvi_model_2M_cells"
# Save the model
resolvi.save(model_dir, overwrite=True, save_anndata=False)
I have clustered using the X_resolVI latent space - all fine. Where I get more unsure is if I want to generate a dotplot of specific marker gene expression. Ideally I want this to be in an appropriately normalized fashion so that data is comparable across cells with variable depth of sequencing, and also deals with the significant batch effects between samples.
Based on the tutorials I have been deriving ‘generated_expression’ as below, which I believe based on some of the questions on here is preferred over ‘get_normalized_expression’.
import numpy as np
resolvae=resolvi
samples_corr = resolvae.sample_posterior(
model=resolvae.module.model_corrected,batch_size=65536,
return_sites=["px_rate"],
summary_fun={"post_sample_q50": np.median},
num_samples=3,
summary_frequency=30,
)
import pandas as pd
samples_corr = pd.DataFrame(samples_corr).T
adata.layers["generated_expression"] = samples_corr.loc["post_sample_q50", "px_rate"]
Now, if I want to generate a dotplot, can I use the ‘generated_expression’ directly or do I need to perform additional normalization steps. For example, sc.pp.normalize_total(adata, layer=‘generated_expression’). This is implied in the manuscript e.g.
Throughout the manuscript, we use generated expression from resolVI. We perform count normalization. Recently, area normalization was suggested for spatial transcriptomics [33]. However, this benchmarking was performed without comparing different segmentation algorithms. We perform square root transformation of normalized counts as we found after log-1p transformation a worse identification of expected cell types.
Currently, my dotplot command looks something like this, using an expression cut-off as discussed elsewhere, as I believe you end up with a few zero-count cells.
sc.pl.dotplot(
adata,
my_genes,
groupby='cell_type',
layer="generated_expression",
standard_scale='var',
expression_cutoff=0.2
)
As a linked point, when running hotspot, should I input the ‘generated_expression’ layer, using the ‘danb’ model, what is the appropriate value for ‘umi_counts_obs_key’.
I would really value your advice on the best way to handle downstream analyses of corrected expression data. I am seeing really excellent results with resolvi, which seems to be very substantially improving cell admixture and allows me to cluster across a large dataset of 2 million ‘cells’. One point I would mention tangentially is the differential expression module runs very slowly with such a large dataset and so I have been randomly subsampling the dataset to achieve this.
Thank you!
Sam Kleeman, Postdoctoral Fellow, CSHL