Mapping 10X Xenium 5k data to scRNA-seq reference

Hello everyone,

I’m desperately trying to build a latent space from my scRNA-seq reference and map my Xenium cells onto it. “Reference mapping with scvi-tools” worked well for each dataset individually, but when I combine them the two datasets do not mix: I basically get both UMAPs sitting next to each other. Does anyone have suggestions on how to embed my Xenium data into my scRNA-seq reference UMAP?

Thank you so much for any suggestions and experiences!

Best regards,

I'm not sure why you would map the Xenium cells as a query onto the scRNA-seq reference rather than train a new model from scratch, but there are several things that might improve your integration (if you haven't done them already):

  • Make sure both datasets use the same gene list. Use SCVI.prepare_query_anndata to align the Xenium gene names to the reference. If the Xenium assay measures far fewer genes than the reference, consider subsetting the reference genes to the intersection to avoid poor performance, and make sure the reference model has converged after that change.
  • When calling SCVI.load_query_data, set unfrozen=True or selectively unfreeze parts of the network (e.g. freeze_expression=False, freeze_batchnorm_encoder=False). This lets more of the network adapt to Xenium-specific effects instead of fitting only the newly added weights.
  • After loading the query, call model.train() again for enough epochs to let the latent space mix, and check convergence.
  • Compute the UMAP on the combined latent space (reference and query together), not on each dataset separately.
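
The steps above can be sketched roughly as follows. This is a minimal sketch, not a tested recipe: the function name `map_query_to_reference`, the `"X_scVI"` key and the `"source"` column with its `"xenium"`/`"scrna"` labels are made up for illustration, and it assumes `ref_model` is an already trained SCVI reference model and that both AnnData objects carry the same batch column and counts layer the reference was set up with.

```python
def map_query_to_reference(adata_ref, adata_query, ref_model, max_epochs=100):
    """Sketch: map query cells onto a trained scVI reference and embed both jointly."""
    # Local imports so the function can be defined even without the packages installed.
    import anndata as ad
    import scanpy as sc
    import scvi

    # Align the query genes to the genes the reference model was trained on.
    scvi.model.SCVI.prepare_query_anndata(adata_query, ref_model)

    # unfrozen=True lets all weights adapt to the query,
    # instead of training only the newly added batch weights.
    query_model = scvi.model.SCVI.load_query_data(
        adata_query, ref_model, unfrozen=True
    )
    query_model.train(max_epochs=max_epochs, plan_kwargs={"weight_decay": 0.0})

    # Embed reference and query with the same (query-adapted) model,
    # then compute a single UMAP on the combined latent space.
    adata_full = ad.concat(
        [adata_query, adata_ref],
        label="source",               # hypothetical column name
        keys=["xenium", "scrna"],     # hypothetical labels
        index_unique="-",
    )
    adata_full.obsm["X_scVI"] = query_model.get_latent_representation(adata_full)
    sc.pp.neighbors(adata_full, use_rep="X_scVI")
    sc.tl.umap(adata_full)
    return query_model, adata_full
```

Colouring the resulting UMAP by the `"source"` column is a quick way to see whether the two technologies overlap or still sit side by side.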

@ori-kron-wis

Thank you!

My goal was to improve the notoriously flawed Xenium annotations and the “dusty” UMAP by using a single-cell RNA-seq reference: annotate the reference first, then map the Xenium cells onto it and annotate them via scANVI or a nearest-neighbor approach. The hope was to get cleaner cluster separation (from scRNA-seq) and finer, more valid cell type and cell state annotations. Here is a summary of the two approaches I have tried so far:

  1. For the first approach, I used scANVI reference mapping with a shared embedding space for Xenium and scRNA-seq (2,000 HVGs, subset to the shared genes).

import anndata as ad
import scanpy as sc
import scvi

# Load the trained scVI reference and turn it into a scANVI model
vae_ref = scvi.model.SCVI.load(X…”scvi_model_scseq", adata=adata_ref)
scanvi_ref = scvi.model.SCANVI.from_scvi_model(vae_ref, labels_key="celltype", unlabeled_category="Unknown")
scanvi_ref.train(accelerator="gpu", max_epochs=50, n_samples_per_label=100)

# Take the latent space from the scANVI model (vae_ref would return the plain scVI latent)
adata_ref.obsm["X_scANVI"] = scanvi_ref.get_latent_representation()
sc.pp.neighbors(adata_ref, use_rep="X_scANVI")
sc.tl.leiden(adata_ref)
sc.tl.umap(adata_ref)

# Map the Xenium query onto the scANVI reference
scvi.model.SCANVI.prepare_query_anndata(adata_query, X…”scanvi_model_scseq")
scanvi_query = scvi.model.SCANVI.load_query_data(adata_query, X…”scanvi_model_scseq")
scanvi_query.train(max_epochs=100, plan_kwargs={"weight_decay": 0.0})
adata_query.obsm["X_scANVI"] = scanvi_query.get_latent_representation()
adata_query.obs["celltype_scanvi"] = scanvi_query.predict()

# Joint embedding of query and reference (ad.concat replaces the deprecated AnnData.concatenate)
adata_full = ad.concat([adata_query, adata_ref], index_unique="-")
adata_full.obsm["X_scANVI"] = scanvi_query.get_latent_representation(adata_full)
sc.pp.neighbors(adata_full, use_rep="X_scANVI")
sc.tl.umap(adata_full)
sc.tl.leiden(adata_full)
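
To put a number on "both UMAPs next to each other", one option is to check, for every Xenium cell, what fraction of its nearest neighbours in the shared latent space are reference cells. The sketch below is only an illustration: `reference_neighbor_fraction` is a made-up helper, it works on plain lists of latent vectors, and its brute-force search is far too slow for a full Xenium dataset, so you would run it on a subsample or swap in an approximate nearest-neighbour library.

```python
import math


def reference_neighbor_fraction(query_latent, ref_latent, k=15):
    """For each query cell, return the fraction of its k nearest neighbours
    (searched in the pooled reference + query latent space) that are reference cells."""
    pooled = [(vec, True) for vec in ref_latent] + [(vec, False) for vec in query_latent]
    fractions = []
    for i, q in enumerate(query_latent):
        self_idx = len(ref_latent) + i  # skip the cell itself
        dists = [
            (math.dist(q, vec), is_ref)
            for j, (vec, is_ref) in enumerate(pooled)
            if j != self_idx
        ]
        dists.sort(key=lambda pair: pair[0])
        nearest = dists[:k]
        fractions.append(sum(is_ref for _, is_ref in nearest) / len(nearest))
    return fractions


# Toy example: a query cloud far away from the reference does not mix at all.
ref = [[0, 0], [0, 1], [1, 0]]
separated_query = [[10, 10], [10, 11], [11, 10]]
print(reference_neighbor_fraction(separated_query, ref, k=2))  # [0.0, 0.0, 0.0]
```

Values near zero for most query cells mean the datasets sit apart in latent space; in a well-mixed embedding the fraction should be closer to the share of reference cells in the pooled data.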

  2. For the second approach, I used the integrated scRNA-seq data as a fixed embedding and mapped the Xenium cells to their nearest reference neighbors. This worked quite well and the marker expression makes sense, also when I plot control_celltype_annotations (Xenium annotations derived solely from integrating the Xenium data and from DGE) on the shared UMAP. However, around 60% of the “close contact phenotypes”, such as dendritic cells or CD4 subtypes that lie very close to tumor cells, are in my opinion mistakenly labeled as tumor cells with this approach. I guess Xenium transcript bleed made them map to the tumor cluster (I had already used proseg to refine the Xenium segmentation).

import scanpy as sc
import scvi
import umap

# Select HVGs on the reference, then restrict both datasets to the shared genes
sc.pp.highly_variable_genes(adata_ref, flavor="seurat_v3", n_top_genes=2000, layer="counts", batch_key="biopsy_sc", subset=True)
shared_genes = adata_ref.var_names.intersection(adata_xenium.var_names)
adata_xenium = adata_xenium[:, shared_genes].copy()
adata_ref = adata_ref[:, shared_genes].copy()

# Train the scVI reference model
scvi.model.SCVI.setup_anndata(adata_ref, layer="counts", batch_key="biopsy_sc")
vae = scvi.model.SCVI(adata_ref, gene_likelihood="nb", n_layers=2, n_latent=30)
vae.train(accelerator="gpu", max_epochs=50, early_stopping=True, early_stopping_patience=10)
adata_ref.obsm["scVI"] = vae.get_latent_representation()
adata_ref.layers["scvi_normalized"] = vae.get_normalized_expression(library_size=1e4)
sc.pp.neighbors(adata_ref, use_rep="scVI")
sc.tl.umap(adata_ref)
sc.tl.leiden(adata_ref, resolution=0.5)

# Map the Xenium cells onto the reference model
vae = scvi.model.SCVI.load(…scvi_model_scseq, adata=adata_ref)
scvi.model.SCVI.prepare_query_anndata(adata_xenium, vae)
vae_query = scvi.model.SCVI.load_query_data(adata_xenium, vae)
vae_query.train(accelerator="gpu", max_epochs=50, plan_kwargs={"weight_decay": 0.0})
adata_xenium.obsm["scVI"] = vae_query.get_latent_representation()

# Fit the UMAP on the reference latent space only, then project the Xenium cells into it
umap_model = umap.UMAP(n_neighbors=15, min_dist=0.5, metric="euclidean")
umap_model.fit(adata_ref.obsm["scVI"])
adata_xenium.obsm["X_umap"] = umap_model.transform(adata_xenium.obsm["scVI"])

Then I ran a KNN annotation transfer from the single-cell annotations to the Xenium cells.
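
For readers who want to see what such a transfer looks like, here is a minimal majority-vote sketch in plain Python. It is not the code I ran: `knn_transfer`, the `min_vote` threshold and the "Unknown" fallback label are illustrative assumptions, and the brute-force search is only practical on small inputs. Keeping the vote fraction per cell will not fix transcript bleed, but it might help flag cells whose neighbours disagree so they can be reviewed instead of being assigned silently.

```python
import math
from collections import Counter


def knn_transfer(ref_latent, ref_labels, query_latent, k=15, min_vote=0.6):
    """Assign each query cell the majority label of its k nearest reference cells.

    Returns one (label, vote_fraction) pair per query cell; the label is
    "Unknown" when the winning label gets less than min_vote of the votes.
    """
    predictions = []
    for q in query_latent:
        nearest = sorted(
            zip(ref_latent, ref_labels),
            key=lambda pair: math.dist(q, pair[0]),
        )[:k]
        votes = Counter(label for _, label in nearest)
        label, count = votes.most_common(1)[0]
        vote_fraction = count / len(nearest)
        predictions.append(
            (label if vote_fraction >= min_vote else "Unknown", vote_fraction)
        )
    return predictions


# Toy example: the second query cell sits halfway between two cell types.
ref_latent = [[0, 0], [0, 1], [5, 5], [5, 6]]
ref_labels = ["T", "T", "Tumor", "Tumor"]
print(knn_transfer(ref_latent, ref_labels, [[0.2, 0.2], [2.5, 3.0]], k=2))
# [('T', 1.0), ('Unknown', 0.5)]
```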

Thank you for any suggestions!

To address these bleed (or “diffusion”) errors, we developed the resolVI model. However, resolVI doesn’t support mapping single-cell data to spatial data, and in my hands this doesn’t work very well. So instead of relying on it, I usually recompute the cell-type annotations directly on the spatial data. I have seen convincing applications that use scArches to map spatial data onto single-cell data, and I guess segmentation errors were minor in those scenarios. However, I did not have access to the raw data there, so I can’t provide good guidance.