Doublet removal: SOLO model training error

Hi,
I am performing doublet removal using scvi-tools. The following code works on one of my other datasets; however, it fails on a concatenated dataset. How can I overcome this problem? I checked that there are no NaN values in my input data.

sc.pp.filter_genes(doublet_adata, min_cells=10)
sc.pp.highly_variable_genes(doublet_adata, n_top_genes=2000, subset=True, flavor='seurat_v3')

filtered out 20160 genes that are detected in less than 10 cells
extracting highly variable genes
→ added
'highly_variable', boolean vector (adata.var)
'highly_variable_rank', float vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'variances_norm', float vector (adata.var)

doublet_adata

AnnData object with n_obs × n_vars = 216049 × 2000
obs: 'sample_id', 'diagnosis', 'tissue_location', 'Glioma_grade', '_scvi_batch', '_scvi_labels'
var: 'gene_ids', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
uns: 'hvg', '_scvi_uuid', '_scvi_manager_uuid'

doublet_adata.obs.isnull().any()

sample_id False
diagnosis False
tissue_location False
Glioma_grade False
_scvi_batch False
_scvi_labels False
dtype: bool

doublet_adata.var.isnull().any()

gene_ids False
n_cells False
highly_variable False
highly_variable_rank False
means False
variances False
variances_norm False
dtype: bool

scvi.model.SCVI.setup_anndata(doublet_adata)
vae = scvi.model.SCVI(doublet_adata)
vae.train(batch_size=130)

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Epoch 37/37: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [10:28<00:00, 16.95s/it, v_num=1, train_loss_step=21.7, train_loss_epoch=39.6]

Trainer.fit stopped: max_epochs=37 reached.

solo = scvi.external.SOLO.from_scvi_model(vae)
solo.train(batch_size=130)

INFO Creating doublets, preparing SOLO model.
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 1/400: 0%|▏ | 1/400 [00:08<54:12, 8.15s/it, v_num=1, train_loss_step=nan, train_loss_epoch=nan]
Monitored metric validation_loss = nan is not finite. Previous best value was inf. Signaling Trainer to stop.


Thanks

Hi, you might want to check out our FAQ for NaN errors. They are pretty unusual in SOLO in my experience. I would check that the AnnData contains count data (not normalized data) and filter out cells with fewer than 100 counts after HVG filtering. Otherwise, I would suggest reducing the learning rate in SOLO to 1e-4. I would also suggest using a batch_key and restricting SOLO to single batches.
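
A quick sketch of the first two checks (assuming doublet_adata.X is a scipy sparse matrix of raw counts; adjust if your matrix is dense):

import numpy as np
import scanpy as sc

# Raw counts should be integer-valued; normalized data usually is not.
# (.data assumes a scipy sparse matrix.)
assert np.all(doublet_adata.X.data % 1 == 0), "X does not look like raw counts"

# Drop near-empty cells left over after HVG subsetting.
sc.pp.filter_cells(doublet_adata, min_counts=100)

For the learning rate, recent scvi-tools versions accept it directly in SOLO, e.g. solo.train(lr=1e-4).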

Thanks @cane11. I checked the FAQs.

It's a public dataset. As suggested, I tried all of these options, but the error still appears with the concatenated data. Therefore, I tried running on each sample separately.

By "filter out cells with fewer than 100 counts after HVG filtering", do you mean "filter out cells expressing fewer than 100 genes"? With that understanding, I started processing one sample at a time, and it's working.

doublet_adata 

AnnData object with n_obs × n_vars = 61980 × 40316
obs: 'sample_id', 'diagnosis', 'tissue_location', 'Glioma_grade'
var: 'gene_ids'

sc.pp.filter_genes(doublet_adata, min_cells=100)
sc.pp.highly_variable_genes(doublet_adata, n_top_genes=3000, subset=True, flavor='seurat_v3')
sc.pp.filter_cells(doublet_adata, min_genes=100)

filtered out 25638 genes that are detected in less than 100 cells
extracting highly variable genes
→ added
'highly_variable', boolean vector (adata.var)
'highly_variable_rank', float vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'variances_norm', float vector (adata.var)
filtered out 59131 cells that have less than 100 genes expressed

scvi.model.SCVI.setup_anndata(doublet_adata)
vae = scvi.model.SCVI(doublet_adata)
vae.train(batch_size=130)
    
solo = scvi.external.SOLO.from_scvi_model(vae)
solo.train(batch_size=130)

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 400/400: 100%|██████████████████████████████████████████████████████████████████████████████████| 400/400 [02:59<00:00, 2.30it/s, v_num=1, train_loss_step=1.84e+3, train_loss_epoch=1.78e+3]
Trainer.fit stopped: max_epochs=400 reached.
INFO Creating doublets, preparing SOLO model.
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 400/400: 100%|██████████████████████████████████████████████████████████████████████████████████████| 400/400 [00:44<00:00, 9.27it/s, v_num=1, train_loss_step=0.668, train_loss_epoch=0.426]
Trainer.fit stopped: max_epochs=400 reached.


@cane11: Please comment. I just want to make sure that I am processing the data in the correct manner.

Thanks

I would recommend using the restrict_to_batch argument in from_scvi_model and setting up scVI with a batch key (train a single scVI model across all batches, then one SOLO model per batch). This way the doublet selection is more consistent across batches. You might be able to relax the min_counts filtering to keep more cells. For standard 10x data, I usually see a steep drop in quality for cells with fewer than 1000 counts before gene filtering.
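
In code, the suggested pattern looks roughly like this (a sketch; solo.predict() returns per-cell doublet probabilities in recent scvi-tools versions):

import scvi

# One scVI model trained across all batches...
scvi.model.SCVI.setup_anndata(doublet_adata, batch_key="sample_id")
vae = scvi.model.SCVI(doublet_adata)
vae.train()

# ...then one SOLO model per batch, so doublets are simulated within a batch.
predictions = {}
for batch in doublet_adata.obs["sample_id"].unique():
    solo = scvi.external.SOLO.from_scvi_model(vae, restrict_to_batch=batch)
    solo.train()
    predictions[batch] = solo.predict()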

Although I do not know the details, I noticed this in the output of the first step (sc.pp.filter_genes). That is a very large number of genes removed. How many are left? Might this be the source of the problem?

I assume the reason for both is the same (low reads per cell), which could be due either to low read depth in 10x or to another technology with less efficient capture. Filtering cells will help make training more stable (keeping only cells with at least 30 counts after HVG selection will likely work).
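
(In code, that filter is just sc.pp.filter_cells(doublet_adata, min_counts=30), applied after the HVG subsetting.)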

Hi @cane11,

I tried the following code, but restrict_to_batch is giving an error.

scvi.model.SCVI.setup_anndata(doublet_adata, batch_key="sample_id")
vae = scvi.model.SCVI(doublet_adata)
vae.train(batch_size=130)
    
solo = scvi.external.SOLO.from_scvi_model(vae, restrict_to_batch="SCPCL000001")
solo.train(batch_size=130)

Hi @ddiez

It's a public dataset; removing NaNs from it leaves ~40,000 genes.
After filtering out genes detected in fewer than 100 cells, ~15,000 remain.

Thanks

Please add the error message.

@cane11 Sorry, I forgot to attach the error. Here are the command and the error.
I am using the batch_key and restrict_to_batch options for the first time, so I have no idea what exactly this error means.

sc.pp.filter_genes(doublet_adata, min_cells=10)
sc.pp.highly_variable_genes(doublet_adata, n_top_genes=2000, subset=True, flavor='seurat_v3')
    
scvi.model.SCVI.setup_anndata(doublet_adata, batch_key="sample_id")
vae = scvi.model.SCVI(doublet_adata)
vae.train(batch_size=130)
    
solo = scvi.external.SOLO.from_scvi_model(vae, restrict_to_batch="SCPCL000001")
solo.train(batch_size=130)

Thanks

It is registered with a single batch. You should input the full data (without subsetting to a single batch) to scVI. SOLO then simulates doublets for a single batch and provides results for that batch.
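
It is also worth checking that restrict_to_batch exactly matches one of the batch categories registered with the model; assuming a recent scvi-tools version, you can inspect them with:

vae.view_anndata_setup()  # prints the registered fields, including the batch categories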

Hi @cane11,

Thanks. It's still not working.

adata_file_name_combined_adata = sc.concat([adata_file_name_1, adata_file_name_2], merge='same')
adata_file_name_combined_adata

AnnData object with n_obs × n_vars = 216049 × 40316
obs: 'sample_id', 'diagnosis', 'tissue_location', 'Glioma_grade'
var: 'gene_ids'

doublet_adata = adata_file_name_combined_adata.copy()
sc.pp.filter_genes(doublet_adata, min_cells=10)
sc.pp.highly_variable_genes(doublet_adata, n_top_genes=2000, subset=True, flavor='seurat_v3')

filtered out 20160 genes that are detected in less than 10 cells
extracting highly variable genes
→ added
'highly_variable', boolean vector (adata.var)
'highly_variable_rank', float vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'variances_norm', float vector (adata.var)

doublet_adata.obs.sample_id.unique()

['SCPCS000001', 'SCPCS000026']
Categories (2, object): ['SCPCS000001', 'SCPCS000026']

scvi.model.SCVI.setup_anndata(doublet_adata, batch_key="sample_id")
vae = scvi.model.SCVI(doublet_adata)
vae.train(batch_size=130)

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Epoch 37/37: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [11:40<00:00, 18.82s/it, v_num=1, train_loss_step=38.1, train_loss_epoch=39.1]

Trainer.fit stopped: max_epochs=37 reached.

solo = scvi.external.SOLO.from_scvi_model(vae, restrict_to_batch="SCPCS000001")
solo.train(batch_size=130)

INFO Creating doublets, preparing SOLO model.

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Epoch 1/400: 0%|▏ | 1/400 [00:05<37:27, 5.63s/it, v_num=1, train_loss_step=nan, train_loss_epoch=nan]

Monitored metric validation_loss = nan is not finite. Previous best value was inf. Signaling Trainer to stop.


As per my understanding, I should remove the doublets from every sample (I have 40+) first, and then concatenate and integrate them. Is that correct?

Thanks

Now combine both approaches (my first advice). You still need to filter out cells with only a few counts; you can set min_counts to 100. If you have some knowledge of the cell types, you can also check the minimum counts per cell at which cells cluster by cell type rather than by low counts.
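
The only change to the per-batch sketch above is adding the count filter after the HVG subsetting and before setup_anndata:

sc.pp.filter_cells(doublet_adata, min_counts=100)  # remove near-empty cells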

Thanks, it finally worked!!