scVI batch correction clusters all cells from sample in a circle (potential artifact)

When using scVI for batch correction and clustering on my single-cell data, I observed an unusual result: all cells from the same sample are clustered in a circle, separated from the rest. Changes to n_latent , training epochs, and HVG numbers did not resolve the circular artifact. The UMAP plot is attached for reference (see cluster 10)

Below are my codes:

import scanpy as sc
import torch
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random
from collections import defaultdict

torch.set_float32_matmul_precision("high")

pbmc = sc.read_h5ad('/data/work/data/raw_data/ESCC_PBMC_All_Basic_QC.h5ad')

# downsample each sample for hvg calculation
sample_key = 'sample_info'
n_subsample = 5000  
pbmc_list = []

for sample in pbmc.obs[sample_key].unique():
    pbmc_sample = pbmc[pbmc.obs[sample_key] == sample].copy()
    n_cells = pbmc_sample.shape[0]
    if n_cells > n_subsample:
        idx = np.random.choice(n_cells, size=n_subsample, replace=False)
        pbmc_sub = pbmc_sample[idx, :].copy()
        pbmc_list.append(pbmc_sub)
    else:
        pbmc_list.append(pbmc_sample)
        
pbmc_meger = sc.concat(pbmc_list, label='tmp_info')

sc.pp.highly_variable_genes(
    pbmc_meger,
    flavor="seurat_v3",
    n_top_genes=3000,
    batch_key="tmp_info",
    span=0.8
)

hvg_genes = pbmc_meger.var[pbmc_meger.var['highly_variable']].index.tolist()
pbmc.var['highly_variable'] = pbmc.var_names.isin(hvg_genes)

sc.pp.normalize_total(pbmc, target_sum=1e4)
sc.pp.log1p(pbmc)

pbmc.raw = pbmc

import scvi
import tqdm as notebook_tqdm

scvi.model.SCVI.setup_anndata(pbmc, layer="counts", batch_key="sample_info",
                             continuous_covariate_keys=['doublet_score', 'n_genes_by_counts'],
                             categorical_covariate_keys=['patient_num'])

model = scvi.model.SCVI(pbmc, n_layers=2, n_latent=30, gene_likelihood="nb")
model.train(batch_size=512, max_epochs=32)
latent = model.get_latent_representation()
pbmc.obsm['X_scVI'] = latent
latent.shape
sc.pp.neighbors(pbmc, use_rep='X_scVI', random_state=26)
sc.tl.leiden(pbmc, flavor='igraph', n_iterations=-1, resolution=0.3)
sc.tl.umap(pbmc)
sc.pl.umap(pbmc, color='leiden')

There is no explicit error output, but the UMAP result is abnormal (see attached image).

scVI Versions: 1.1.6.post2

as shown in elbo_train and train_loss_epoch figures, the model had learned the data feature.

elbo_train train_loss_epoch

Hi,

Thank you for posting here.

Given you have already performed quality control, including filtering out low-quality cells and genes with low expression levels and that the input matrix is indeed the count matrix before transformation (although I dont see you are preserving counts), I see that the actual selection of HVG’s is not really filtered. Try something like:
pbmc[:, pbmc.var['highly_variable']] before setup anndata etc’..

You can also increase the number of neighbors in scanpy.pp.neighbors

In addition, what we see in the UMAP is the leiden clusters colors, you might also want to compare when you add “sample_info”

sc.pl.umap(
    pbmc,
    color=["leiden", "sample_info","tmp_info"],
    ncols=3,
    frameon=False,
)

Then you should see the integration

Hi,
Thank you for your prompt reply and suggestions.
I follow your suggestion subsetting the HVG and increasing the number of neighbors from 15 to 25. And I see the strange circle still here.

hvg_genes = pbmc_meger.var[pbmc_meger.var['highly_variable']].index.tolist()
pbmc.var['highly_variable'] = pbmc.var_names.isin(hvg_genes)

pbmc = pbmc[:, pbmc.var['highly_variable']]
sc.pp.normalize_total(pbmc, target_sum=1e4)
sc.pp.log1p(pbmc)
pbmc.raw = pbmc

scvi.model.SCVI.setup_anndata(pbmc, layer="counts", batch_key="sample_info",
                             continuous_covariate_keys=['doublet_score', 'n_genes_by_counts'],
                             categorical_covariate_keys=['patient_num'])

model = scvi.model.SCVI(pbmc, n_layers=2, n_latent=30, gene_likelihood="nb")
model.train(batch_size=512, max_epochs=32)
latent = model.get_latent_representation()
pbmc.obsm['X_scVI'] = latent
latent.shape
sc.pp.neighbors(pbmc, use_rep='X_scVI', random_state=26, n_neighbors=25)
sc.tl.leiden(pbmc, flavor='igraph', n_iterations=-1, resolution=0.3)
sc.tl.umap(pbmc)

the circle contains all cells of some sample

OK, so we do see integration of this circle from different samples.
I see that you used the doublet score as a continuous covariate, why is that? usually it used to filter cells. I suspect that in your case those cells are exactly this circle.

Can you repeat without those continuous covariates? (other option to color the UMP by doublet score)

Actually, I’m afraid that the double score will affect my result after filtering out double cells, so I added it as a continuous covariate. But the issue still exists even after I delete all continuous covariates.

scvi.model.SCVI.setup_anndata(pbmc, layer="counts", batch_key="sample_info",
                             categorical_covariate_keys=['patient_num'])



I also tried using Harmony for batch correction, and it yielded good results. However, because of the matched downstream analysis and the rich set of functions, I prefer to use scVI to complete this task. But I truly don’t know how to fix this problem :face_with_spiral_eyes:. I’m looking forward to your reply!

the result of batch correction with Harmony

This is usually an issue with low quality samples (like very few expressed genes or counts after hvg gene filtering). Can you check for this? In my experience, it’s usually about filtering stricter for this specific sample (maybe even excluding it).

yes, I checked the circle samples separatally, and their quality is ok.
here is a sample in the circle

and here is another sample not in the circle

is this circle still appears when changing the random state of scvi, sc.tl.umap and sc.pp.neighbors?

yes, I changed the random state, it still exist

Are these counts before or after highly variable gene filtering? Can you display expression of a couple of cell-type marker genes?

there counts are filtered before highly variable gene.
there are some cell-type marker genes exhibited in UMAP

Indeed very weird. I have only seen strange behavior when using encode_covariate=True, which you are not passing. Could you display the same markers in PCA (harmony space for this sample and all other samples)?
Do you see much lower scVI.criticism scores or reconstruction loss for this specific sample?

I’m sorry for the late reply.
there are the same markers in PCA

harmony
all sample


circle samples

normal samples

scvi
all samples


circle samples

normal samples

examining the reconstruction loss, I noticed that the circle samples with a higher loss value than normal samples, and their values are same.

circle_list = ['S-CTF_2426926-2', 'S-CTF_2426926-1', 'B-YPF_2446426-1', 'B-YPF_2446426-2', 'S-PZY_2426498-1', 'S-PZY_2426498-2', 'B-HDY_2447253-1', 'B-HDY_2447253-2',
               'B-MXZ_1884381-1', 'B-XLP_2446851-2', 'B-XLP_2446851-1', 'B-YGQ_2425678-2', 'B-YGQ_2425678-1']
for i in circle_list:
    sub_pbmc = pbmc[pbmc.obs['sample_info'] == i].copy()
    loss = model.get_elbo(sub_pbmc)
    print(loss)
# output
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
normal_list = ['B-ZAP_2450072-1', 'S-PFX_2068055-2', 'B-YGJ_2428374-2', 'P-LYZ_2410003-2', 'B-LYZ_2410003-2', 'B-CTF_2426926-2']
for i in normal_list:
    sub_pbmc = pbmc[pbmc.obs['sample_info'] == i].copy()
    loss = model.get_elbo(sub_pbmc)
    print(loss)
# output
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-535.7904, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-422.3558, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-380.1718, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-748.8119, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-734.4749, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-369.6415, device='cuda:0')

Hey, I was just curious. Is the scVI run on raw counts? I see that you assign the .raw attribute after normalizing counts.

it is a bit confusing but yes it uses .raw attributes for down stream analysis (usually log normalized etc..), but trains with real raw counts, found in .X attributes (or a custom layer), if using adata.