Are there any recommendations on normalizing spatial transcriptomics data before visualization? In the spatial scanpy tutorial, the gene expression is normalized like scRNA-seq data using normalize_total + log1p. In the squidpy visium tutorial, on the other hand, raw counts are plotted.
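For reference, a minimal sketch of the scRNA-seq-style normalization used in the scanpy tutorial (assuming `adata` holds raw Visium counts):

```python
import scanpy as sc

# scRNA-seq-style normalization: scale every spot to the same total count,
# then log-transform
sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
```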
Personally I’m not convinced that normalize_total makes sense for spatial data, as:
- I’d assume there is less technical variability between spots than between droplets, and
- there are biological reasons for different spatial regions having different mRNA content (see also the example below).
But there should probably still be some normalization when comparing between samples. I’m wondering if something like scran makes sense here? As far as I understand, scran takes the different mRNA content of cells into account when computing scaling factors. Alternatively, maybe just normalizing at the sample level (e.g. normalizing each sample to 10k * n_spots_under_tissue) is appropriate?
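A minimal sketch of what that sample-level normalization could look like, assuming raw counts in `adata.X` and a sample label in a hypothetical `adata.obs["sample"]` column:

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# One scaling factor per sample: after scaling, each sample's total count
# equals 10_000 * n_spots, so relative count differences between spots
# within a sample are preserved.
spot_totals = np.asarray(adata.X.sum(axis=1)).ravel()
df = pd.DataFrame({"sample": adata.obs["sample"].to_numpy(), "total": spot_totals})
sample_totals = df.groupby("sample")["total"].transform("sum").to_numpy()
n_spots = df.groupby("sample")["total"].transform("size").to_numpy()
factors = 10_000 * n_spots / sample_totals  # identical for all spots of a sample
adata.X = sp.diags(factors) @ adata.X if sp.issparse(adata.X) else adata.X * factors[:, None]
```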
normalize_total in a tumor sample
In my experience normalize_total seems problematic, at least for the tumor slides I have, because different regions of the slide have vastly different mRNA contents. For instance, the tumor region has a much higher count density than the stromal or immune regions.
When plotting the normalized gene expression, the values appear much higher (although sparser) in the immune region, which is misleading:
[Figure: KRAS expression, log-normalized]
Here is the same plot with just log1p-transformed counts:
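A sketch of how such a plot could be produced, assuming the raw counts were stashed beforehand in a hypothetical `adata.layers["counts"]`:

```python
import scanpy as sc

# plot log1p of the raw counts, skipping normalize_total entirely;
# "counts" is a hypothetical layer holding the raw UMI matrix
adata.layers["log1p_raw"] = sc.pp.log1p(adata.layers["counts"], copy=True)
sc.pl.spatial(adata, color="KRAS", layer="log1p_raw")
```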
I was wondering the same thing.
While most publications do use normalize_total, this also seems a bit strange to me.
I saw Seurat’s sctransform and stLearn’s SME recommended here, as they do not assume that every spot/cell should contain the same total count.
Not sure how to best evaluate what works well and what doesn’t though.
The normalization from Cell2location using detection efficiency behaves well in my hands. For pure display purposes, if you don’t want to train a model, another round of count normalization after the logarithm also works well, as proposed by Lior Pachter (I don’t remember where).
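A minimal sketch of that second normalization round, assuming standard scanpy preprocessing (this is the idea as I understand it, not an official recipe):

```python
import scanpy as sc

# standard normalization, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# second round: rescale the log values so every spot sums to the same total;
# useful for display only, not for downstream statistics
sc.pp.normalize_total(adata, target_sum=1e4)
```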
Thanks for the suggestion! Where do you get that from?
Is this `X / y_s`, where `X` is the raw UMI counts and `y_s` is “a location/observation-specific scaling parameter which adjusts for differences in sensitivity between observations and batches”, obtained using:
```python
def compute_expected(self, samples, adata_manager, ind_x=None):
    r"""Compute expected expression of each gene in each location. Useful for evaluating how well
    the model learned expression pattern of all genes in the data.
    """
    if ind_x is None:
        ind_x = np.arange(adata_manager.adata.n_obs).astype(int)
    else:
        ind_x = ind_x.astype(int)
    obs2sample = adata_manager.get_from_registry(REGISTRY_KEYS.BATCH_KEY)
    obs2sample = pd.get_dummies(obs2sample.flatten()).values[ind_x, :]
    mu = (
        np.dot(samples["w_sf"][ind_x, :], self.cell_state_mat.T) * samples["m_g"]
        + np.dot(obs2sample, samples["s_g_gene_add"])
    ) * samples["detection_y_s"][ind_x, :]
    alpha = np.dot(obs2sample, 1 / np.power(samples["alpha_g_inverse"], 2))
    return {"mu": mu, "alpha": alpha, "ind_x": ind_x}
```
Removing `samples["detection_y_s"][ind_x, :]` from this code would remove the detection efficiency and generate normalized expected values. Similarly, normalizing every observation by the same `samples["detection_y_s"][ind_x, :]` (i.e. dividing the counts by it) would give you normalized counts.
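A minimal sketch of the second option, assuming `samples` holds posterior estimates from a trained cell2location model (as in `compute_expected` above) and `adata.X` holds the raw counts; the layer name is hypothetical:

```python
import numpy as np
import scipy.sparse as sp

# divide raw counts by the learned per-location detection efficiency y_s;
# detection_y_s has shape (n_obs, 1) as used in compute_expected above
y_s = samples["detection_y_s"]
X = adata.X.toarray() if sp.issparse(adata.X) else np.asarray(adata.X)
adata.layers["detection_normalized"] = X / y_s  # broadcasts across genes
```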