Shared cell types not mixing when integrating datasets from different species

Hello,

I am attempting to integrate datasets of the same tissue from different species. Some of the datasets are single-cell while others are single-nucleus. I thought that scVI would work well for this task. However, as seen in the UMAPs below, there is almost no integration of the different datasets, even those from the same species. Using the consolidated cell-type labels from the original datasets, shared cell types appear to fall in the same region of the UMAP, but cells of the same type from different datasets do not mix (bottom UMAP). Below is the code I used to make the UMAPs, where I used Sample and Method (single-cell vs. single-nucleus) as categorical covariates. However, I have also tried other setups, such as:

scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    categorical_covariate_keys=["Dataset", "Method", "Organism", "Sample"],
    continuous_covariate_keys=["percent.mito", "percent.ribo", "nCount_RNA"],
)

and got similar results. I am using raw counts as input and all datasets have been subjected to the same quality control.

I have also tried scANVI using the consolidated cell-type labels from the original datasets and still did not achieve any overlap. Lastly, I have tried integrating different combinations of the datasets and could not even get cells from the same species and same method to integrate.

Somewhat surprisingly, I was able to achieve pretty good integration by scaling variable genes within each dataset and then running Harmony. However, I would prefer to use scVI and am confused as to why it is not working well here. If anyone has insight into how I might improve my integration, it would be much appreciated. Thank you for your consideration!

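import scanpy as sc
import scvi

# adata: AnnData with raw counts stored in adata.layers["counts"] (loaded elsewhere)

# Select 3,000 highly variable genes from the raw counts, per Dataset
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=3000,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
    batch_key="Dataset",
)

# Register the raw counts and the covariates with scVI
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    categorical_covariate_keys=["Sample", "Method"],
    continuous_covariate_keys=["percent.mito", "percent.ribo"],
)

model = scvi.model.SCVI(adata)

# Train the model
model.train(early_stopping=True)

# Save the model
model.save("./scvi_model", overwrite=True)

# Get the latent representation and store it on the AnnData object
latent = model.get_latent_representation()
adata.obsm["X_scVI"] = latent

# Build the neighbor graph on the scVI latent space, then embed and cluster
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)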
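
For comparison, the "scaling variable genes within each dataset" step that preceded Harmony could be sketched roughly as follows. This is a NumPy-only illustration and not the code used in the post: the helper name scale_within_dataset, the toy data, and the clipping value are assumptions, and Harmony itself is not run here.

```python
import numpy as np


def scale_within_dataset(X, dataset_labels, max_value=10.0):
    """Z-score every gene separately inside each dataset (hypothetical helper).

    X              : (n_cells, n_genes) dense array of log-normalized expression
    dataset_labels : length n_cells array with the dataset of origin of each cell
    max_value      : scaled values are clipped to [-max_value, max_value]
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(dataset_labels)
    scaled = np.empty_like(X)
    for name in np.unique(labels):
        mask = labels == name
        block = X[mask]
        mean = block.mean(axis=0)
        std = block.std(axis=0)
        std[std == 0] = 1.0  # constant genes become 0 instead of dividing by zero
        scaled[mask] = np.clip((block - mean) / std, -max_value, max_value)
    return scaled


# Toy data: two made-up datasets that differ by a strong offset and spread
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 2, (60, 5))])
labels = np.array(["A"] * 50 + ["B"] * 60)
X_scaled = scale_within_dataset(X, labels)
```

Because each dataset is centered and scaled on its own, per-gene offsets between datasets are removed before PCA and Harmony ever see the data, which may partly explain why that route mixes the datasets more readily than a model trained on raw counts.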

Hi! It seems that the current default in scVI is encode_covariates=False (see scvi-tools/src/scvi/module/_vae.py on the main branch of the scverse/scvi-tools GitHub repository), while it is True for the batch key.
Try setting encode_covariates=True so that the encoder also sees the covariates. I hope that solves it for you!
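
To make the suggestion concrete, here is a deliberately simplified sketch of what "the encoder also sees the covariates" means: the one-hot covariate is appended to what the encoder receives. The functions one_hot and encoder_input are hypothetical illustrations, not scvi-tools code, and the way scvi-tools actually feeds covariates into its networks is not reproduced here.

```python
import numpy as np


def one_hot(codes, n_categories):
    """Turn integer category codes into a (n_cells, n_categories) indicator matrix."""
    out = np.zeros((len(codes), n_categories))
    out[np.arange(len(codes)), codes] = 1.0
    return out


def encoder_input(x, covariate_codes, n_categories, encode_covariates):
    """Return what a hypothetical encoder network would receive for each cell."""
    if encode_covariates:
        # Expression plus the covariate, so the encoder can condition on it
        return np.concatenate([x, one_hot(covariate_codes, n_categories)], axis=1)
    # Expression only: the encoder never sees the covariate
    return x


x = np.log1p(np.array([[5.0, 0.0, 2.0], [1.0, 3.0, 0.0]]))  # 2 cells, 3 genes
method = np.array([0, 1])  # made-up codes: 0 = single-cell, 1 = single-nucleus

print(encoder_input(x, method, 2, encode_covariates=False).shape)  # (2, 3)
print(encoder_input(x, method, 2, encode_covariates=True).shape)   # (2, 5)
```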

Hello, thank you for your reply! Interesting that the default for encode_covariates is False. I am surprised this is not mentioned in the scVI integration vignettes.

I tried setting encode_covariates=True. From the documentation, it seems this is set when constructing the model with scvi.model.SCVI. I also tried using Sample as the batch key instead of as a categorical covariate. The updated lines of code are below.

Unfortunately, it seems that the datasets still did not integrate well for the most part. I ran my code on a subset of the datasets this time for simplicity and for speed. These three datasets are all single-cell and mouse, so there is less variation than in the example I posted previously.

Any other suggestions of what could potentially be preventing the successful integration? Feel free to let me know and thank you for your help!

scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="Sample",
    categorical_covariate_keys=["Dataset", "Method"],
    continuous_covariate_keys=["percent.mito", "percent.ribo", "nCount_RNA"],
)

model = scvi.model.SCVI(adata, encode_covariates=True)
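
To compare runs like the ones above with a number rather than by eye on a UMAP, one option is a simple neighbor-mixing score computed on the latent space. The sketch below is a hypothetical helper (neighbor_mixing is not part of scanpy or scvi-tools) that uses brute-force distances on toy data; in practice it could be given adata.obsm["X_scVI"] and a column such as adata.obs["Sample"], ideally one cell type at a time.

```python
import numpy as np


def neighbor_mixing(latent, batches, k=15):
    """Mean fraction of each cell's k nearest neighbors that come from another batch."""
    latent = np.asarray(latent, dtype=float)
    batches = np.asarray(batches)
    # Brute-force squared Euclidean distances; fine for a few thousand cells
    sq = (latent ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * latent @ latent.T
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    nn = np.argsort(dist, axis=1)[:, :k]
    from_other_batch = batches[nn] != batches[:, None]
    return from_other_batch.mean()


# Toy data: the same 200 points, once well mixed and once with batch "b" shifted away
rng = np.random.default_rng(0)
batches = np.array(["a"] * 100 + ["b"] * 100)
mixed = rng.normal(size=(200, 10))
separated = mixed.copy()
separated[100:] += 20.0

score_mixed = neighbor_mixing(mixed, batches)
score_separated = neighbor_mixing(separated, batches)
print(score_mixed, score_separated)
```

A score near 0 means cells sit almost exclusively next to cells from their own batch. For equally sized, well-mixed batches the score approaches the share of cells that belong to other batches. Computing it per cell type avoids penalizing cell types that exist in only one dataset.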

For cross-species and cross-technology integration you might want to check out sysVI. This task is out of scope for regular scVI because of the drastic batch effects. In particular, encoding covariates is not part of the original scVI publication. I have also had bad experiences with adding many continuous covariates, which gave a worse latent space. Apart from that, the training looks fine. Be aware that Harmony directly targets integration and often integrates more strongly when batch effects are heavy, while scVI only integrates cells that are similar to each other.

Hi @cane11,

thank you for this information, it is very helpful! I was checking out sysVI and it looks like it will be better suited for these datasets.

A few follow up questions:

  • With regard to your comment about adding many continuous covariates, are you saying that more continuous covariates sometimes lead to worse integration? Does the same apply to multiple categorical covariates? Basically, is it sometimes more beneficial to go with a simpler model?

  • With regards to your comment about Harmony, would you say that Harmony prioritizes integration at the expense of real variation between batches? If so, would this be considered a bad thing? I would imagine it would depend on what the goals of the integration are, but would love some perspective.

Feel free to let me know your thoughts and thank you for your consideration!