Visium normalization best practices

Are there any recommendations on normalizing spatial transcriptomics data before visualization? In the spatial scanpy tutorial, the gene expression is normalized like scRNA-seq data using normalize_total + log1p. In the squidpy visium tutorial, on the other hand, raw counts are plotted.

Personally I’m not convinced that normalize_total makes sense for spatial data, as

  1. I’d assume there is less technical variability between spots than between droplets.
  2. There are biological reasons for different spatial regions having different mRNA content (see also the example below).

But there should probably still be some normalization when comparing between samples. I’m wondering if something like scran makes sense here? As far as I understood scran takes different mRNA content of cells is taken into account when computing scaling factors. Alternatively maybe just normalizing at the sample level (e.g. normalize each sample to 10k * n_spots_under_tissue) is appropriate?


normalize total in a tumor sample

In my experience normalize_total seems problematic at least for the tumor slides I have, because different regions of the slide have vastely different mRNA contents. For instance, the tumor region has a much higher count density than the stromal or immune regions.
image
Fig1: log1p total counts


Fig2: spatial niches

When plotting the normalized gene expression, the normalized values appear to be much higher (although sparser) in the immune region, which is misleading:
image
Fig3: KRAS expression, log-normalized

Here is the same plot with just log1p transformed counts:
image
Fig4: KRAS expression, log1p-transformed

1 Like

Hi @grst ,

I was wondering the same thing.
While most publications do normalize_total this also seems a bit strange to me.
I saw Seurat’s sctransform and stLearn’s SME recommended here as they do not assume that every spot / cell should contain the same count.

Not sure how to best evaluate what works well and what doesn’t though. :man_shrugging:

The normalization from Cell2location using detection efficiency behaves well in my hands. For pure display reasons and if you don’t want to train something, another round of count normalization after logarithm is helpful in my hands as proposed by Lior Pachter (don’t remember where).

1 Like

Hi @cane11,

Thanks for the suggestion! Where do you get that from?
Is this

X / y_s

where X is the raw UMI counts and y_s is the “a location/observation-specific scaling parameter which adjusts for differences in sensitivity between observations and batches” obtained using

mod.samples["post_sample_means"]["detection_y_s"]

?

Hi Gregor, this sounds correct. However, I was speaking about the per-celltype expression after deconvolution (described in the NCEM part of Cell2location - Mapping human lymph node cell types to 10X Visium with Cell2location — cell2location documentation. This one still works, if there are sever differences in library size along the chip. For CELLxGENE session and similar, I had the easiest visual time doing normalize_total->log1p->normalize_total (called PFlog1pPF in https://www.biorxiv.org/content/10.1101/2022.05.06.490859v1.full.pdf).

I don’t think it’s discussed somewhere, but:

    def compute_expected(self, samples, adata_manager, ind_x=None):
        r"""Compute expected expression of each gene in each location. Useful for evaluating how well
        the model learned expression pattern of all genes in the data.
        """
        if ind_x is None:
            ind_x = np.arange(adata_manager.adata.n_obs).astype(int)
        else:
            ind_x = ind_x.astype(int)
        obs2sample = adata_manager.get_from_registry(REGISTRY_KEYS.BATCH_KEY)
        obs2sample = pd.get_dummies(obs2sample.flatten()).values[ind_x, :]
        mu = (
            np.dot(samples["w_sf"][ind_x, :], self.cell_state_mat.T) * samples["m_g"]
            + np.dot(obs2sample, samples["s_g_gene_add"])
        ) * samples["detection_y_s"][ind_x, :]
        alpha = np.dot(obs2sample, 1 / np.power(samples["alpha_g_inverse"], 2))

        return {"mu": mu, "alpha": alpha, "ind_x": ind_x}

removing samples["detection_y_s"][ind_x, :] from this code would remove the detection efficiency and generate normalized data. Similarly normalizing to the same samples["detection_y_s"][ind_x, :] would give you normalized counts.

1 Like