scVI batch correction clusters all cells from sample in a circle (potential artifact)

CocoConstant · May 21, 2025, 1:45pm

When using scVI for batch correction and clustering on my single-cell data, I observed an unusual result: all cells from the same sample are clustered in a circle, separated from the rest. Changes to n_latent , training epochs, and HVG numbers did not resolve the circular artifact. The UMAP plot is attached for reference (see cluster 10)

Below are my codes:

import scanpy as sc
import torch
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random
from collections import defaultdict

torch.set_float32_matmul_precision("high")

pbmc = sc.read_h5ad('/data/work/data/raw_data/ESCC_PBMC_All_Basic_QC.h5ad')

# downsample each sample for hvg calculation
sample_key = 'sample_info'
n_subsample = 5000  
pbmc_list = []

for sample in pbmc.obs[sample_key].unique():
    pbmc_sample = pbmc[pbmc.obs[sample_key] == sample].copy()
    n_cells = pbmc_sample.shape[0]
    if n_cells > n_subsample:
        idx = np.random.choice(n_cells, size=n_subsample, replace=False)
        pbmc_sub = pbmc_sample[idx, :].copy()
        pbmc_list.append(pbmc_sub)
    else:
        pbmc_list.append(pbmc_sample)
        
pbmc_meger = sc.concat(pbmc_list, label='tmp_info')

sc.pp.highly_variable_genes(
    pbmc_meger,
    flavor="seurat_v3",
    n_top_genes=3000,
    batch_key="tmp_info",
    span=0.8
)

hvg_genes = pbmc_meger.var[pbmc_meger.var['highly_variable']].index.tolist()
pbmc.var['highly_variable'] = pbmc.var_names.isin(hvg_genes)

sc.pp.normalize_total(pbmc, target_sum=1e4)
sc.pp.log1p(pbmc)

pbmc.raw = pbmc

import scvi
import tqdm as notebook_tqdm

scvi.model.SCVI.setup_anndata(pbmc, layer="counts", batch_key="sample_info",
                             continuous_covariate_keys=['doublet_score', 'n_genes_by_counts'],
                             categorical_covariate_keys=['patient_num'])

model = scvi.model.SCVI(pbmc, n_layers=2, n_latent=30, gene_likelihood="nb")
model.train(batch_size=512, max_epochs=32)
latent = model.get_latent_representation()
pbmc.obsm['X_scVI'] = latent
latent.shape
sc.pp.neighbors(pbmc, use_rep='X_scVI', random_state=26)
sc.tl.leiden(pbmc, flavor='igraph', n_iterations=-1, resolution=0.3)
sc.tl.umap(pbmc)
sc.pl.umap(pbmc, color='leiden')

There is no explicit error output, but the UMAP result is abnormal (see attached image).

scVI Versions: 1.1.6.post2

as shown in elbo_train and train_loss_epoch figures, the model had learned the data feature.

elbo_train	train_loss_epoch
image575×431 15.6 KB	image577×429 14 KB

ori-kron-wis · May 21, 2025, 2:43pm

Hi,

Thank you for posting here.

Given you have already performed quality control, including filtering out low-quality cells and genes with low expression levels and that the input matrix is indeed the count matrix before transformation (although I dont see you are preserving counts), I see that the actual selection of HVG’s is not really filtered. Try something like:
pbmc[:, pbmc.var['highly_variable']] before setup anndata etc’..

You can also increase the number of neighbors in scanpy.pp.neighbors

In addition, what we see in the UMAP is the leiden clusters colors, you might also want to compare when you add “sample_info”

sc.pl.umap(
    pbmc,
    color=["leiden", "sample_info","tmp_info"],
    ncols=3,
    frameon=False,
)

Then you should see the integration

CocoConstant · May 22, 2025, 7:49am

Hi,
Thank you for your prompt reply and suggestions.
I follow your suggestion subsetting the HVG and increasing the number of neighbors from 15 to 25. And I see the strange circle still here.

hvg_genes = pbmc_meger.var[pbmc_meger.var['highly_variable']].index.tolist()
pbmc.var['highly_variable'] = pbmc.var_names.isin(hvg_genes)

pbmc = pbmc[:, pbmc.var['highly_variable']]
sc.pp.normalize_total(pbmc, target_sum=1e4)
sc.pp.log1p(pbmc)
pbmc.raw = pbmc

scvi.model.SCVI.setup_anndata(pbmc, layer="counts", batch_key="sample_info",
                             continuous_covariate_keys=['doublet_score', 'n_genes_by_counts'],
                             categorical_covariate_keys=['patient_num'])

model = scvi.model.SCVI(pbmc, n_layers=2, n_latent=30, gene_likelihood="nb")
model.train(batch_size=512, max_epochs=32)
latent = model.get_latent_representation()
pbmc.obsm['X_scVI'] = latent
latent.shape
sc.pp.neighbors(pbmc, use_rep='X_scVI', random_state=26, n_neighbors=25)
sc.tl.leiden(pbmc, flavor='igraph', n_iterations=-1, resolution=0.3)
sc.tl.umap(pbmc)

the circle contains all cells of some sample

ori-kron-wis · May 22, 2025, 9:14am

OK, so we do see integration of this circle from different samples.
I see that you used the doublet score as a continuous covariate, why is that? usually it used to filter cells. I suspect that in your case those cells are exactly this circle.

Can you repeat without those continuous covariates? (other option to color the UMP by doublet score)

CocoConstant · May 22, 2025, 11:51am

Actually, I’m afraid that the double score will affect my result after filtering out double cells, so I added it as a continuous covariate. But the issue still exists even after I delete all continuous covariates.

scvi.model.SCVI.setup_anndata(pbmc, layer="counts", batch_key="sample_info",
                             categorical_covariate_keys=['patient_num'])

I also tried using Harmony for batch correction, and it yielded good results. However, because of the matched downstream analysis and the rich set of functions, I prefer to use scVI to complete this task. But I truly don’t know how to fix this problem

. I’m looking forward to your reply!

the result of batch correction with Harmony

image1150×971 240 KB

cane11 · May 22, 2025, 12:13pm

This is usually an issue with low quality samples (like very few expressed genes or counts after hvg gene filtering). Can you check for this? In my experience, it’s usually about filtering stricter for this specific sample (maybe even excluding it).

CocoConstant · May 22, 2025, 1:11pm

yes, I checked the circle samples separatally, and their quality is ok.
here is a sample in the circle

and here is another sample not in the circle

ori-kron-wis · May 22, 2025, 1:26pm

is this circle still appears when changing the random state of scvi, sc.tl.umap and sc.pp.neighbors?

CocoConstant · May 22, 2025, 2:13pm

yes, I changed the random state, it still exist

cane11 · May 22, 2025, 5:30pm

Are these counts before or after highly variable gene filtering? Can you display expression of a couple of cell-type marker genes?

CocoConstant · May 23, 2025, 3:33pm

there counts are filtered before highly variable gene.
there are some cell-type marker genes exhibited in UMAP

cane11 · May 23, 2025, 5:32pm

Indeed very weird. I have only seen strange behavior when using encode_covariate=True, which you are not passing. Could you display the same markers in PCA (harmony space for this sample and all other samples)?
Do you see much lower scVI.criticism scores or reconstruction loss for this specific sample?

CocoConstant · May 25, 2025, 2:08pm

I’m sorry for the late reply.
there are the same markers in PCA

harmony
all sample

all samples1850×477 153 KB

circle samples

circle samples1849×573 203 KB

normal samples

normal samples1847×526 182 KB

scvi
all samples

all samples1826×484 143 KB

circle samples

circle samples1813×516 155 KB

normal samples

normal samples1829×514 152 KB

examining the reconstruction loss, I noticed that the circle samples with a higher loss value than normal samples, and their values are same.

circle_list = ['S-CTF_2426926-2', 'S-CTF_2426926-1', 'B-YPF_2446426-1', 'B-YPF_2446426-2', 'S-PZY_2426498-1', 'S-PZY_2426498-2', 'B-HDY_2447253-1', 'B-HDY_2447253-2',
               'B-MXZ_1884381-1', 'B-XLP_2446851-2', 'B-XLP_2446851-1', 'B-YGQ_2425678-2', 'B-YGQ_2425678-1']
for i in circle_list:
    sub_pbmc = pbmc[pbmc.obs['sample_info'] == i].copy()
    loss = model.get_elbo(sub_pbmc)
    print(loss)

# output
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-3.4732, device='cuda:0')

normal_list = ['B-ZAP_2450072-1', 'S-PFX_2068055-2', 'B-YGJ_2428374-2', 'P-LYZ_2410003-2', 'B-LYZ_2410003-2', 'B-CTF_2426926-2']
for i in normal_list:
    sub_pbmc = pbmc[pbmc.obs['sample_info'] == i].copy()
    loss = model.get_elbo(sub_pbmc)
    print(loss)

# output
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-535.7904, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-422.3558, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-380.1718, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-748.8119, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-734.4749, device='cuda:0')
INFO     AnnData object appears to be a copy. Attempting to transfer setup.                                        
tensor(-369.6415, device='cuda:0')

abs51295 · June 1, 2025, 5:05am

Hey, I was just curious. Is the scVI run on raw counts? I see that you assign the .raw attribute after normalizing counts.

ori-kron-wis · June 3, 2025, 9:05am

it is a bit confusing but yes it uses .raw attributes for down stream analysis (usually log normalized etc..), but trains with real raw counts, found in .X attributes (or a custom layer), if using adata.

Topic		Replies	Views
Radial shape of cluster on UMAP, why? scanpy	0	107	November 7, 2024
Scvi - denoising single-cell/single-nucleus transcription data scvi-tools scvi	3	228	August 8, 2024
Cell type labeling help scvi-tools	3	539	September 6, 2022
Weird UMAP after running scVI scvi-tools	6	242	September 12, 2024
Suggestions for training on xenium spatial transcriptomics dataset scvi-tools	5	147	March 22, 2025

scVI batch correction clusters all cells from sample in a circle (potential artifact)

There is no explicit error output, but the UMAP result is abnormal (see attached image).

scVI Versions: 1.1.6.post2

Related topics