Transferring labels from gene expression to SMI spatial transcriptomics data


We have single-cell RNA sequencing data (10x 5’) and a spatially resolved single-cell dataset from the COSMX SMI technology. I was looking into destVI, which can assign a mixture of cell types to spots (multiple cells), but I am not sure whether that model is directly applicable to single-cell spatial data.

I was leaning towards just running scanVI, marking the cells of the spatial dataset as “unknown” and doing the label transfer. My concern is that the count distributions may differ between the two technologies, and the number of genes measured differs as well (10x 5’: whole transcriptome, COSMX: 990 genes).
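For concreteness, here is a minimal sketch of that scanVI setup. It assumes both datasets are AnnData objects with raw counts in `.X`, that the 10x annotations sit in an `obs` column, and that a recent scvi-tools release is installed. The column name `cell_type`, the batch key `tech`, and both function names are hypothetical.

```python
def mark_unlabeled(labels_10x, n_spatial, unlabeled="Unknown"):
    """Return one label per cell: the 10x annotations, then `unlabeled`
    repeated once for every spatial cell."""
    labels = [str(x) for x in labels_10x]
    if unlabeled in labels:
        raise ValueError(f"{unlabeled!r} is already used as a reference label")
    return labels + [unlabeled] * n_spatial


def transfer_labels(adata_10x, adata_spatial, labels_key="cell_type"):
    """Sketch of scANVI label transfer (needs anndata and scvi-tools)."""
    # Imported here so the helper above can be used without these packages.
    import anndata as ad
    import scvi

    # Stack the cells; join="inner" keeps only genes present in both datasets.
    adata = ad.concat(
        [adata_10x, adata_spatial],
        join="inner",
        label="tech",            # hypothetical batch column name
        keys=["10x", "cosmx"],
        index_unique="-",
    )
    # Reference cells keep their annotation, spatial cells become "Unknown".
    adata.obs[labels_key] = mark_unlabeled(
        adata_10x.obs[labels_key], adata_spatial.n_obs
    )
    scvi.model.SCANVI.setup_anndata(
        adata,
        labels_key=labels_key,
        unlabeled_category="Unknown",
        batch_key="tech",
    )
    model = scvi.model.SCANVI(adata)
    model.train()
    # Predictions are returned for every cell, including the spatial ones.
    adata.obs["predicted_" + labels_key] = model.predict()
    return adata, model
```

Whether the predictions can be trusted still depends on the two technologies being comparable, which is exactly the open concern.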

I would be curious to know your suggestions. Thanks.


My understanding is that COSMX also measures expression by sequencing, and I believe it uses UMIs (please correct me if I’m wrong). If so, I would expect the data to be directly comparable in terms of noise distribution. If you subset the genes in the 10x data to the 990 genes in the COSMX panel, it should then in theory work without too many issues.
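As a rough sketch of the subsetting, assuming gene names are the shared identifier between the two datasets: the function name and the toy gene lists below are made up for illustration, and `CONTROL_PROBE_1` is just a placeholder for any panel entry that has no counterpart in the 10x data.

```python
def shared_panel_genes(genes_10x, panel_genes):
    """Split the panel into genes found in the 10x data and genes missing from it.

    The result follows panel order, so both datasets can be indexed
    with the same gene list.
    """
    available = set(genes_10x)
    shared = [g for g in panel_genes if g in available]
    missing = [g for g in panel_genes if g not in available]
    return shared, missing


# Toy stand-ins for adata_10x.var_names and the COSMX panel.
genes_10x = ["CD3E", "MS4A1", "LYZ", "GAPDH", "EPCAM"]
panel = ["EPCAM", "CD3E", "CONTROL_PROBE_1", "LYZ"]

shared, missing = shared_panel_genes(genes_10x, panel)
print(shared)   # ['EPCAM', 'CD3E', 'LYZ']
print(missing)  # ['CONTROL_PROBE_1']
```

With AnnData objects, `adata_10x[:, shared].copy()` and the same indexing on the COSMX object would then give two matrices with identical gene order. It is worth looking at the `missing` list before moving on, since it may point to gene naming differences rather than genes that are truly absent.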

To convince myself of this, I would do the following:

  1. Label every gene in the 10x data by whether it is part of the COSMX panel.
  2. Individually for both datasets, plot mean vs variance and mean vs fraction of cells with zero counts, for all genes, using the same axis limits for both datasets. In the 10x data I would color all genes grey, except for the genes that overlap with COSMX, which I’d color black. And I’d color all of the COSMX genes black. I’d put these plots next to each other.
  3. If I’m convinced that both technologies have similar mean vs variance and mean vs zeros patterns, I’d go ahead. But first I would pick 4-5 genes in the COSMX panel that I think are important for the tissue, and put text labels in the plots where those fall in both technologies. I would want to make sure that they are not flipped in the 10x vs the COSMX, for example.
  4. Next I’d make a subset anndata that has only the 990 genes, and fit a regular scVI model with batch integration, where the 10x data are in different batches from the COSMX data (if you have multiple samples).
  5. Then I’d check an MDE/tSNE/UMAP plot of the scVI embeddings, colored by batch, to see whether the batches mix at all.
    6a. If they DO mix, then I’d move on to fitting the scanVI model.
    6b. If they DO NOT mix, I’d use the scVI differential expression method between different sets of the data to try to debug why the model doesn’t think some cell populations should overlap.
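Steps 1 to 3 can be sketched roughly as follows. This assumes dense cells x genes count matrices (for a sparse `adata.X`, call `.toarray()` first); the function names, the toy counts, and the output file name are all hypothetical.

```python
import numpy as np


def gene_stats(counts):
    """Per-gene mean, variance, and fraction of cells with zero counts
    for a dense cells x genes count matrix."""
    counts = np.asarray(counts, dtype=float)
    return {
        "mean": counts.mean(axis=0),
        "var": counts.var(axis=0, ddof=1),
        "frac_zero": (counts == 0).mean(axis=0),
    }


def plot_diagnostics(stats_by_tech, in_panel_by_tech, labels_by_tech=None,
                     out_path="gene_diagnostics.png"):
    """One column per technology; top row mean vs variance, bottom row
    mean vs fraction of zeros. Axes are shared so the limits match."""
    import matplotlib.pyplot as plt  # only needed for plotting

    techs = list(stats_by_tech)
    fig, axes = plt.subplots(2, len(techs), sharex=True, sharey="row",
                             squeeze=False, figsize=(5 * len(techs), 8))
    for col, tech in enumerate(techs):
        stats = stats_by_tech[tech]
        panel = np.asarray(in_panel_by_tech[tech], dtype=bool)
        for row, ykey in enumerate(["var", "frac_zero"]):
            ax = axes[row, col]
            # Grey: genes outside the panel. Black: panel genes, drawn on top.
            ax.scatter(stats["mean"][~panel], stats[ykey][~panel], s=4, c="grey")
            ax.scatter(stats["mean"][panel], stats[ykey][panel], s=4, c="black")
            # Text labels for a handful of genes: {gene name: column index}.
            for name, idx in (labels_by_tech or {}).get(tech, {}).items():
                ax.annotate(name, (stats["mean"][idx], stats[ykey][idx]))
            ax.set_xscale("log")
            if ykey == "var":
                ax.set_yscale("log")
            ax.set_xlabel("mean count")
            ax.set_ylabel(ykey)
            ax.set_title(tech)
    fig.savefig(out_path, dpi=150)
    return fig


# Step 1 on toy data: flag which 10x genes are in the panel.
genes_10x = np.array(["CD3E", "MS4A1", "LYZ", "GAPDH"])
panel_genes = ["CD3E", "LYZ"]  # hypothetical panel subset
in_panel = np.isin(genes_10x, panel_genes)

# Step 2 on toy data: per-gene statistics (3 cells x 4 genes).
toy_counts = np.array([[0, 2, 5, 9], [2, 2, 0, 7], [4, 2, 1, 8]])
stats_10x = gene_stats(toy_counts)
```

With a second `stats_cosmx` computed the same way, something like `plot_diagnostics({"10x": stats_10x, "COSMX": stats_cosmx}, {"10x": in_panel, "COSMX": np.ones(n_cosmx_genes, dtype=bool)})` would put the panels next to each other. Note that the log axes silently drop genes with a mean or variance of zero.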

