Weird UMAP after running scVI

Hello, there! Thank you for developing such a great tool.

I’m analyzing my single cell RNA sequencing dataset (about 2,000K cells) using scanpy and scVI. After running scVI and plotting UMAP, my plot seems so weird. The code I run is as below:

adata = ad.concat([adata_dict[key] for key in adata_dict.keys()], merge = "same")

# subsampling to reduce computing time
sc.pp.subsample(adata, n_obs = 100000, random_state = 42)

# preprocessing
sc.pp.filter_cells(adata, min_counts = 1000)
sc.pp.filter_cells(adata, min_genes = 10)
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, exclude_highly_expressed = True, target_sum = 1e6)
sc.pp.log1p(adata)
adata.raw = adata.copy()

# selecting HVGs and running scVI
sc.pp.highly_variable_genes(adata, n_top_genes = 3000, layer = "counts", flavor = "seurat_v3", batch_key = "chemistry")
scvi.model.SCVI.setup_anndata(adata, layer = "counts", categorical_covariate_keys = ["sample", "chemistry"])
model_scvi = scvi.model.SCVI(adata, n_hidden = 128, n_latent = 20, n_layers = 2, gene_likelihood = "nb")
model_scvi.train()
adata.obsm["X_scVI"] = model_scvi.get_latent_representation()

# plotting UMAP
sc.pp.neighbors(adata, use_rep = "X_scVI", random_state = 42)
sc.tl.leiden(adata, random_state = 42, key_added = "leiden_scVI", resolution = 1.0)
sc.tl.umap(adata, random_state = 42)
sc.pl.umap(adata, color = "cell type")

and the UMAP looks like
umap

I tried

  1. choosing different number of HVGs (3,000 ~ 6,000)
  2. choosing different number of n_latent when generating model
  3. trying normalization and log-transformation before merging
  4. defining max_epochs as np.min([round((20000 / adata.n_obs) * 400), 400]) which was written in the scVI tutorial and single cell best practices paper
  5. running scVI on both the entire dataset and the subsampled dataset yielded the same result

but neither of them worked. Also, I made sure that gene expression matrix which scVI used was composed of raw counts. Running harmony seems to work well but I have no idea why does my UMAP after running scVI look like this. Any ideas or comments will be appreciated.

Best,

Hi, can you check how many counts the cells contain after highly_variable_gene selection. It might be a problem with UMAP (small outlier cells that are hard to see. Can you add plots for the 20 latent dimensions or a 2D PCA plot of your latent dimensions.

Thank you for your reply.

  1. I’m not sure what you mean by the number of counts the cells contain after the “HVG selection”:
np.max(adata.obs["n_counts"])
192466.0
np.min(adata.obs["n_counts"])
1000.0

Though I didn’t perform QC for this dataset, I removed cells whose total counts were bigger than 99th percentile of the n_counts or n_genes or smaller than 1st percentile when preprocessing the original dataset (this dataset was downsampled to 1000K cells).

image

I’m not sure I got your purpose…

The PCA looks fine, which was my expectation. The model has captured variations in your data. UMAPs can look wildly off if there are disconnected graphs in the neighborhood graph. This is a problem with UMAP. You can increase the number of neighbors for UMAP which usually helps with those wrong UMAP displays.

set the number of pcs to 20 using n_pcs=20 when you are running sc.pp.neighbors.

I’m confused. You don’t need to set n_pcs when using scVI embeddings but should run it on all latent dimensions (see our tutorials).

You are right. In this case, n_lanent =20.