scvi-tools and Xenium

Hi,
Can I use scvi-tools to process a 10x Genomics Xenium spatial transcriptomics dataset? In the following code, adata is an AnnData object built from a Xenium dataset.

import scvi

# Register the data: raw counts are read from the "counts" layer, batches from "library_id"
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="library_id")
# Create and train the model
model = scvi.model.SCVI(adata, n_hidden=128, n_latent=30, n_layers=2, dispersion="gene")
model.train()

Hi, yes, it works. However, this data contains fewer counts per cell than standard single-cell data and is noisier (e.g. background signal). We hope to release, later this month, an adapted model that works with image-based spatial transcriptomics and has an improved framework for modeling this data. In the meantime, I would recommend looking into ProSeg for segmentation.

I have tested scVI with a Xenium dataset. When there is only one sample (~250,000 cells, ~100 genes), the UMAP and Leiden clustering look good. But when I integrated two Xenium datasets with scVI, I got a lot of tiny clusters (~40 cells per cluster) in addition to two major clusters! This has never occurred when integrating multiple single-cell datasets. So it looks like there are many differences between Xenium and single-cell datasets, and the data structure of Xenium is not well compatible with scVI.

These isolated islands are a problem with low-count data and UMAP, not so much with scVI. You can mitigate this with stricter filtering to remove low-count cells. I assume your second sample has lower quality, and if you run it separately you will see similar behavior. You can also reduce this behavior by increasing n_neighbors before running UMAP. You get pretty similar behavior if you run scVI on very poor-quality (low library complexity) single-cell data. The reason is simply that in these cases 40 cells can have exactly the same expression values.
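To make the "identical expression values" point concrete, here is a minimal sketch in plain Python. It assumes a small dense cells-by-genes count matrix; the function names and the toy values are hypothetical and are not part of scvi-tools or scanpy. It counts how many cells share exactly the same count vector, before and after a total-count filter.

```python
from collections import Counter


def filter_low_count_cells(counts, min_counts):
    """Keep only cells (rows) whose total count is at least min_counts."""
    return [list(row) for row in counts if sum(row) >= min_counts]


def identical_profile_groups(counts, min_group_size=2):
    """Map each count vector shared by >= min_group_size cells to its cell count."""
    sizes = Counter(tuple(int(v) for v in row) for row in counts)
    return {profile: n for profile, n in sizes.items() if n >= min_group_size}


# Toy matrix (hypothetical values): 5 cells x 3 genes.
# The first three cells have very few counts and identical profiles.
toy_counts = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [4, 2, 7],
    [3, 5, 1],
]

groups_before = identical_profile_groups(toy_counts)
kept = filter_low_count_cells(toy_counts, min_counts=5)
groups_after = identical_profile_groups(kept)

print("duplicate profiles before filtering:", groups_before)
print("cells kept:", len(kept))
print("duplicate profiles after filtering:", groups_after)
```

In a scanpy workflow, the corresponding steps would typically be sc.pp.filter_cells(adata, min_counts=...) before training and a larger n_neighbors in sc.pp.neighbors(...) before sc.tl.umap. Suitable thresholds depend on the panel and tissue, so the value used above is only illustrative.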

Thanks, the isolated islands were indeed caused by low library complexity. In my case, both datasets were of good quality. However, they came from different sources, and too few genes remained after integration, because only the genes expressed in both datasets were kept and a lot of genes were filtered out.
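A quick way to spot this before integrating is to check how many genes the two datasets actually share. The sketch below is only an illustration: the gene lists are hypothetical, and it assumes that integration keeps the intersection of the two gene sets, as described above.

```python
def shared_genes(genes_a, genes_b):
    """Return the genes present in both lists, in the order of genes_a."""
    in_b = set(genes_b)
    return [g for g in genes_a if g in in_b]


# Hypothetical gene panels, for illustration only.
panel_a = ["CD3E", "CD4", "CD8A", "MS4A1", "EPCAM"]
panel_b = ["CD3E", "EPCAM", "KRT8", "PECAM1"]

common = shared_genes(panel_a, panel_b)
dropped_a = len(panel_a) - len(common)
dropped_b = len(panel_b) - len(common)

print(f"{len(common)} shared genes; dropped {dropped_a} from A and {dropped_b} from B")
```

With AnnData objects, the same check can be run on adata.var_names of each dataset. If the shared set is much smaller than either panel, the low library complexity after integration is expected.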

Hi, thanks for a great package and a good discussion here. An adapted model for image-based spatial transcriptomics would be fantastic! Any ETA on the release?