Doublet removal: SOLO model training error

Hi,
I am performing doublet removal using scvi-tools. The following code works on one of my other datasets; however, it fails on a concatenated dataset. How can I overcome this problem? I checked that there are no NaN values in my input data.

sc.pp.filter_genes(doublet_adata, min_cells=10)
sc.pp.highly_variable_genes(doublet_adata, n_top_genes=2000, subset=True, flavor='seurat_v3')

filtered out 20160 genes that are detected in less than 10 cells
extracting highly variable genes
→ added
'highly_variable', boolean vector (adata.var)
'highly_variable_rank', float vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'variances_norm', float vector (adata.var)

doublet_adata

AnnData object with n_obs × n_vars = 216049 × 2000
obs: 'sample_id', 'diagnosis', 'tissue_location', 'Glioma_grade', '_scvi_batch', '_scvi_labels'
var: 'gene_ids', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
uns: 'hvg', '_scvi_uuid', '_scvi_manager_uuid'

doublet_adata.obs.isnull().any()

sample_id False
diagnosis False
tissue_location False
Glioma_grade False
_scvi_batch False
_scvi_labels False
dtype: bool

doublet_adata.var.isnull().any()

gene_ids False
n_cells False
highly_variable False
highly_variable_rank False
means False
variances False
variances_norm False
dtype: bool

scvi.model.SCVI.setup_anndata(doublet_adata)
vae = scvi.model.SCVI(doublet_adata)
vae.train(batch_size=130)

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Epoch 37/37: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [10:28<00:00, 16.95s/it, v_num=1, train_loss_step=21.7, train_loss_epoch=39.6]

Trainer.fit stopped: max_epochs=37 reached.

solo = scvi.external.SOLO.from_scvi_model(vae)
solo.train(batch_size=130)

INFO Creating doublets, preparing SOLO model.
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 1/400: 0%|▏ | 1/400 [00:08<54:12, 8.15s/it, v_num=1, train_loss_step=nan, train_loss_epoch=nan]
Monitored metric validation_loss = nan is not finite. Previous best value was inf. Signaling Trainer to stop.


Thanks

Hi, you might want to check out our FAQ for NaN errors. They are pretty unusual in SOLO in my experience. I would check that the AnnData contains count data (not normalized data) and filter out cells with fewer than 100 counts after HVG filtering. Otherwise, I would suggest reducing the learning rate in SOLO to 1e-4. I would also suggest using a batch_key and restricting SOLO to single batches.
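
A quick sketch of the first two checks (assuming doublet_adata.X is a scipy sparse matrix of raw counts; adjust if your matrix is dense):

import numpy as np
import scanpy as sc

# Raw counts should be integer-valued; normalized data usually is not.
# (.data assumes a scipy sparse matrix.)
assert np.all(doublet_adata.X.data % 1 == 0), "X does not look like raw counts"

# Drop near-empty cells left over after HVG subsetting.
sc.pp.filter_cells(doublet_adata, min_counts=100)

For the learning rate, recent scvi-tools versions accept it directly in SOLO, e.g. solo.train(lr=1e-4).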

Thanks @cane11. I checked the FAQs.

It's a public dataset. As suggested, I tried all of these options, but the error still appears with the concatenated data. Therefore, I tried running on each sample separately.

By "filter out cells with fewer than 100 counts after HVG filtering", do you mean "filter out cells expressing fewer than 100 genes"? With that understanding, I started processing one sample at a time, and it's working.

doublet_adata 

AnnData object with n_obs × n_vars = 61980 × 40316
obs: 'sample_id', 'diagnosis', 'tissue_location', 'Glioma_grade'
var: 'gene_ids'

sc.pp.filter_genes(doublet_adata, min_cells=100)
sc.pp.highly_variable_genes(doublet_adata, n_top_genes=3000, subset=True, flavor='seurat_v3')
sc.pp.filter_cells(doublet_adata, min_genes=100)

filtered out 25638 genes that are detected in less than 100 cells
extracting highly variable genes
→ added
'highly_variable', boolean vector (adata.var)
'highly_variable_rank', float vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'variances_norm', float vector (adata.var)
filtered out 59131 cells that have less than 100 genes expressed

scvi.model.SCVI.setup_anndata(doublet_adata)
vae = scvi.model.SCVI(doublet_adata)
vae.train(batch_size=130)
    
solo = scvi.external.SOLO.from_scvi_model(vae)
solo.train(batch_size=130)

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 400/400: 100%|██████████████████████████████████████████████████████████████████████████████████| 400/400 [02:59<00:00, 2.30it/s, v_num=1, train_loss_step=1.84e+3, train_loss_epoch=1.78e+3]
Trainer.fit stopped: max_epochs=400 reached.
INFO Creating doublets, preparing SOLO model.
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 400/400: 100%|██████████████████████████████████████████████████████████████████████████████████████| 400/400 [00:44<00:00, 9.27it/s, v_num=1, train_loss_step=0.668, train_loss_epoch=0.426]
Trainer.fit stopped: max_epochs=400 reached.


@cane11: Please comment. I just want to make sure that I am processing the data in the correct manner.

Thanks

I would recommend using the restrict_to_batch argument in from_scvi_model and setting up scVI with a batch key (train a single scVI model across all batches, then one SOLO model per batch). This way the doublet selection is more consistent across batches. You might be able to relax the min_counts filtering to keep more cells. For standard 10x data, I usually see a steep drop in quality for cells with fewer than 1000 counts before gene filtering.
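
In code, the suggested pattern looks roughly like this (a sketch; solo.predict() returns per-cell doublet probabilities in recent scvi-tools versions):

import scvi

# One scVI model trained across all batches...
scvi.model.SCVI.setup_anndata(doublet_adata, batch_key="sample_id")
vae = scvi.model.SCVI(doublet_adata)
vae.train()

# ...then one SOLO model per batch, so doublets are simulated within a batch.
predictions = {}
for batch in doublet_adata.obs["sample_id"].unique():
    solo = scvi.external.SOLO.from_scvi_model(vae, restrict_to_batch=batch)
    solo.train()
    predictions[batch] = solo.predict()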

Although I do not know the details, I noticed this in the output of the first step (sc.pp.filter_genes). That is a very large number of genes removed. How many are left? Might this be the source of the problem?

I assume the reason for both is the same (low reads per cell), which could be due either to low read depth in 10x or to another technology with less efficient capture. Filtering cells will help make training more stable (keeping only cells with at least 30 counts after HVG selection will likely work).
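
(In code, that filter is just sc.pp.filter_cells(doublet_adata, min_counts=30), applied after the HVG subsetting.)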

Hi @cane11,

I tried the following code, but restrict_to_batch is giving an error.

scvi.model.SCVI.setup_anndata(doublet_adata, batch_key="sample_id")
vae = scvi.model.SCVI(doublet_adata)
vae.train(batch_size=130)
    
solo = scvi.external.SOLO.from_scvi_model(vae, restrict_to_batch="SCPCL000001")
solo.train(batch_size=130)

Hi @ddiez

It's a public dataset; removing NaNs from it leaves ~40,000 genes.
After filtering out genes detected in fewer than 100 cells, ~15,000 remain.

Thanks

Please add the error message.

@cane11 Sorry, I forgot to attach the error. Here are the command and the error.
I am using the batch_key and restrict_to_batch options for the first time, so I have no idea what exactly this error means.

sc.pp.filter_genes(doublet_adata, min_cells=10)
sc.pp.highly_variable_genes(doublet_adata, n_top_genes=2000, subset=True, flavor='seurat_v3')
    
scvi.model.SCVI.setup_anndata(doublet_adata, batch_key="sample_id")
vae = scvi.model.SCVI(doublet_adata)
vae.train(batch_size=130)
    
solo = scvi.external.SOLO.from_scvi_model(vae, restrict_to_batch="SCPCL000001")
solo.train(batch_size=130)

Thanks

It is registered with a single batch. You should input the full data (without subsetting to a single batch) to scVI. SOLO then simulates doublets for a single batch and provides results for that batch.
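
It is also worth checking that restrict_to_batch exactly matches one of the batch categories registered with the model; assuming a recent scvi-tools version, you can inspect them with:

vae.view_anndata_setup()  # prints the registered fields, including the batch categories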

Hi @cane11,

Thanks. It's still not working.

adata_file_name_combined_adata = sc.concat([adata_file_name_1, adata_file_name_2], merge='same')
adata_file_name_combined_adata

AnnData object with n_obs × n_vars = 216049 × 40316
obs: 'sample_id', 'diagnosis', 'tissue_location', 'Glioma_grade'
var: 'gene_ids'

doublet_adata = adata_file_name_combined_adata.copy()
sc.pp.filter_genes(doublet_adata, min_cells=10)
sc.pp.highly_variable_genes(doublet_adata, n_top_genes=2000, subset=True, flavor='seurat_v3')

filtered out 20160 genes that are detected in less than 10 cells
extracting highly variable genes
→ added
'highly_variable', boolean vector (adata.var)
'highly_variable_rank', float vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'variances_norm', float vector (adata.var)

doublet_adata.obs.sample_id.unique()

['SCPCS000001', 'SCPCS000026']
Categories (2, object): ['SCPCS000001', 'SCPCS000026']

scvi.model.SCVI.setup_anndata(doublet_adata, batch_key="sample_id")
vae = scvi.model.SCVI(doublet_adata)
vae.train(batch_size=130)

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Epoch 37/37: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [11:40<00:00, 18.82s/it, v_num=1, train_loss_step=38.1, train_loss_epoch=39.1]

Trainer.fit stopped: max_epochs=37 reached.

solo = scvi.external.SOLO.from_scvi_model(vae, restrict_to_batch="SCPCS000001")
solo.train(batch_size=130)

INFO Creating doublets, preparing SOLO model.

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Epoch 1/400: 0%|▏ | 1/400 [00:05<37:27, 5.63s/it, v_num=1, train_loss_step=nan, train_loss_epoch=nan]

Monitored metric validation_loss = nan is not finite. Previous best value was inf. Signaling Trainer to stop.


As per my understanding, I should remove the doublets from every sample (I have 40+) first, and then concatenate and integrate them. Is that correct?

Thanks

Now combine both approaches (my first advice). You still need to filter out cells with only a few counts; you can set min_counts to 100. If you have some knowledge of the cell types, you can also check the minimum counts per cell at which cells cluster by cell type rather than by low counts.
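
The only change to the per-batch sketch above is adding the count filter after the HVG subsetting and before setup_anndata:

sc.pp.filter_cells(doublet_adata, min_counts=100)  # remove near-empty cells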

Thanks, it finally worked!!