Hello, there! Thank you for developing such a great tool.
I’m analyzing my single-cell RNA sequencing dataset (about 2,000K cells) using scanpy and scVI. After running scVI and plotting the UMAP, the plot looks strange. The code I ran is below:
import anndata as ad
import scanpy as sc
import scvi
# merging the per-sample AnnData objects
adata = ad.concat([adata_dict[key] for key in adata_dict.keys()], merge = "same")
# subsampling to reduce computing time
sc.pp.subsample(adata, n_obs = 100000, random_state = 42)
# preprocessing
sc.pp.filter_cells(adata, min_counts = 1000)
sc.pp.filter_cells(adata, min_genes = 10)
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, exclude_highly_expressed = True, target_sum = 1e6)
sc.pp.log1p(adata)
adata.raw = adata.copy()
# selecting HVGs and running scVI
sc.pp.highly_variable_genes(adata, n_top_genes = 3000, layer = "counts", flavor = "seurat_v3", batch_key = "chemistry")
scvi.model.SCVI.setup_anndata(adata, layer = "counts", categorical_covariate_keys = ["sample", "chemistry"])
model_scvi = scvi.model.SCVI(adata, n_hidden = 128, n_latent = 20, n_layers = 2, gene_likelihood = "nb")
model_scvi.train()
adata.obsm["X_scVI"] = model_scvi.get_latent_representation()
# plotting UMAP
sc.pp.neighbors(adata, use_rep = "X_scVI", random_state = 42)
sc.tl.leiden(adata, random_state = 42, key_added = "leiden_scVI", resolution = 1.0)
sc.tl.umap(adata, random_state = 42)
sc.pl.umap(adata, color = "cell type")
and the UMAP looks like
I tried the following:
- choosing different number of HVGs (3,000 ~ 6,000)
- choosing different number of n_latent when generating model
- trying normalization and log-transformation before merging
- defining max_epochs as np.min([round((20000 / adata.n_obs) * 400), 400]), as described in the scVI tutorial and the single-cell best practices paper
- running scVI on both the entire dataset and the subsampled dataset, which yielded the same result
but none of these helped. I also made sure that the gene expression matrix scVI used contained raw counts. Running Harmony seems to work well, but I have no idea why my UMAP looks like this after running scVI. Any ideas or comments would be appreciated.
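For reference, here is how the max_epochs heuristic mentioned above works out for a few dataset sizes. This is only a minimal sketch in plain Python: the helper name heuristic_max_epochs is my own (not part of scvi-tools), and I use the built-in min instead of np.min so it runs without any imports. In my run, the value was passed as model_scvi.train(max_epochs = ...).

```python
def heuristic_max_epochs(n_obs: int) -> int:
    """Scale epochs down for large datasets, capped at 400.

    Hypothetical helper wrapping the formula
    min(round((20000 / n_obs) * 400), 400).
    """
    return min(round((20000 / n_obs) * 400), 400)


# Small datasets hit the cap of 400 epochs.
print(heuristic_max_epochs(10_000))     # 400
# The subsampled dataset (100,000 cells) gets 80 epochs.
print(heuristic_max_epochs(100_000))    # 80
# A dataset of about 2,000,000 cells gets only 4 epochs.
print(heuristic_max_epochs(2_000_000))  # 4
```

So the subsampled run trains for 80 epochs under this heuristic, while a run on roughly 2,000,000 cells would train for only 4.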
Best,