Integrating different tissues with scVI

Hi,

I would like to integrate data from three different studies using scVI.
These data contain only T cells from two tissues (blood and tumour): one dataset has T cells only from blood, another has T cells only from tumour, and the third has T cells from both blood and tumour.

To train the model, I included tissue as a categorical covariate to regress out during integration, using the following code:

import scvi

epochs = 40

scvi.model.SCVI.setup_anndata(ref, layer="raw_counts", batch_key="batch", categorical_covariate_keys=["tissue", "dataset_patient_id"]) # correct for tissue and patient covariates

ref_model = scvi.model.SCVI(ref, n_layers=2, n_latent=30, gene_likelihood="nb") # not default values but work well in the integration task according to scvi people

ref_model.train(check_val_every_n_epoch=1, max_epochs=epochs, early_stopping=True)

I used 40 epochs because this value worked best to avoid overfitting (when plotting the validation loss with the default number of epochs, the model started to overfit after 40-50 epochs).
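One way to read an epoch count off a validation curve can be sketched as follows. This is a minimal, library-independent illustration and not scvi-tools' own early-stopping logic: `best_epoch`, `patience`, and `min_delta` are hypothetical names, and the loss values are made up. In practice the values would come from the trained model's recorded validation metric (I assume something like `ref_model.history["elbo_validation"]`, which should be checked against the installed version).

```python
def best_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    best_epoch: epoch with the lowest validation loss seen before stopping.
    stop_epoch: epoch at which `patience` consecutive non-improving epochs
                had accumulated, or the last epoch if that never happened.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_loss = float("inf")
    best_idx = 0
    waited = 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember it and reset the patience counter.
            best_loss = loss
            best_idx = i
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return best_idx + 1, i + 1
    return best_idx + 1, len(val_losses)


# Toy validation losses (made up): improvement stops after epoch 4.
val_losses = [10.0, 9.0, 8.0, 7.5, 7.6, 7.7, 7.8]
best, stopped = best_epoch(val_losses, patience=3)
print(best, stopped)  # 4 7
```

If the best epoch found this way sits well before `max_epochs`, capping training near that value, as done here with 40, is consistent with what the curve shows.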

By initialising the model with categorical_covariate_keys=["tissue", "dataset_patient_id"] I obtained a UMAP with this result when colouring cells by tissue of origin:

Since I integrated only T cells, I would not expect that much difference between the tumour tissue and the blood tissue, although some differences could be present. However, since I set the categorical_covariate_keys=["tissue", "dataset_patient_id"] I thought this was the reason why I obtained this result.

So I tried to integrate the data without correcting for tissue. This is the code I used (again, 40 epochs was the right threshold for avoiding overfitting):

epochs = 40

scvi.model.SCVI.setup_anndata(ref, layer="raw_counts", batch_key="batch") # correct for batch only; no tissue or patient covariates

ref_model = scvi.model.SCVI(ref, n_layers=2, n_latent=30, gene_likelihood="nb") # not default values but work well in the integration task according to scvi people

ref_model.train(check_val_every_n_epoch=1, max_epochs=epochs, early_stopping=True)

However, when plotting the UMAP for this second model that did not correct for tissue, I obtained a very similar result:

When I integrated the same data with the RPCA algorithm implemented in Seurat, T cells from tumour and blood were mixed and I did not see this strong separation between the two tissues, as shown in the figure below:

Do you think this is real biological variation, or did the scVI models simply fail to integrate the data properly?
Am I training the model in the wrong way? What would you suggest in this situation?

Thank you very much for your help.

Martina

Hi, I would recommend clustering the cells and analyzing differentially expressed genes between the clusters.
Gut feeling: the scVI results look closer to my prior expectation of T cell biology. Most T cell states in cancer can’t be found in healthy blood. The rPCA will likely throw together well-described populations like activated and naive-like Tregs.
Both models should correct for tissue (the categorical covariate key and the batch key do the same thing). The first one looks better. I’m a bit confused by the many clusters in scVI, especially within blood (lower right). This might be bad integration, but it is hard to tell without seeing sample IDs, sample covariates, and DE genes.

Hi,
thank you for your fast reply.

Anyway, these are the clusters for the scVI model using a resolution of 0.5 (the one for which I used the categorical covariate key):

While these are the clusters for the RPCA UMAP:

However, I still haven’t performed DGE between the clusters.

I am a bit confused: what do you mean by saying that the categorical covariate key and the batch key are the same? Do these parameters correct for the same thing?
My understanding was that the batch key accounts for each batch (each experiment/sample) in order to remove batch effects caused by technical variation between experiments, while the categorical covariate key accounts for other variables that introduce biological rather than technical variation, so that the two can be distinguished.
Did I misinterpret these two parameters of the model?

Moreover, the blood samples are not from healthy people but from the same patients who have tumours. These datasets contain tumour tissue and blood samples from patients with cancer, so the blood samples might contain cells that infiltrated the tumour for a certain period of time and then returned to the blood. I would therefore expect some overlap between the two tissues, while in tumour I would also expect cell states that are found only there, since those cells sit in the tumour microenvironment.

Anyway, in case you would like to see whether integration with the first scVI model worked properly, I plotted the UMAP coloured by batch. Some clusters contain only a few batches, because the UMAP separates blood from tumour tissue and, as a consequence, the batches also group by tissue of origin.

Looking at the UMAP to check how the three datasets have been integrated, you can see from the figure below that they are separated, but this is again because one dataset contains only blood samples (Wang et al), one contains only tumour samples (Yost et al), and the third contains both blood and tumour (Luoma et al).
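The dataset/tissue confounding described above, and the batch content of each cluster, can also be checked numerically rather than by eye. The following is a minimal sketch in plain Python, assuming the labels are available as equal-length sequences (for example columns of `ref.obs`); `composition` is a hypothetical helper, and the six toy cells below are made up for illustration.

```python
from collections import Counter, defaultdict


def composition(groups, labels):
    """For each group, return the fraction of its cells carrying each label."""
    counts = defaultdict(Counter)
    for group, label in zip(groups, labels):
        counts[group][label] += 1
    return {
        group: {label: n / sum(c.values()) for label, n in c.items()}
        for group, c in counts.items()
    }


# Toy data (made up): dataset of origin and tissue for six cells.
dataset = ["Wang", "Wang", "Yost", "Yost", "Luoma", "Luoma"]
tissue = ["blood", "blood", "tumour", "tumour", "blood", "tumour"]

comp = composition(dataset, tissue)
for name, fractions in comp.items():
    print(name, fractions)
```

The same helper could be called with cluster labels as `groups` and either tissue or batch as `labels`. A cluster made up almost entirely of one batch within a single tissue would be more worrying than a cluster that is tissue-specific but mixes several batches.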

Based on this, do you have any suggestions for checking whether the model performed well? I am pretty new to scvi-tools and I would like to understand how to improve this analysis.

Thank you again for your help!

I would strongly suggest checking DEGs and potentially using the batch key when selecting highly variable genes, or sticking with this integration if the clusters are meaningful. I would expect few immune cells in peripheral blood from most cancers.
The categorical covariate key and the batch key are both meant to integrate out the respective variable. They are treated similarly within the network.
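To illustrate what "treated similarly" could mean, here is a toy sketch, not scVI's actual code: it assumes that each categorical key is one-hot encoded and the resulting vectors are concatenated into a single conditioning vector for the network. The names `one_hot` and `conditioning_vector` and the category lists are hypothetical.

```python
def one_hot(value, categories):
    """One-hot encode `value` against an ordered list of categories."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def conditioning_vector(cell, keys, categories):
    """Concatenate the one-hot encodings of all conditioning keys for one cell."""
    vec = []
    for key in keys:
        vec.extend(one_hot(cell[key], categories[key]))
    return vec


# Hypothetical categories: three batches and two tissues.
categories = {"batch": ["b1", "b2", "b3"], "tissue": ["blood", "tumour"]}
cell = {"batch": "b2", "tissue": "tumour"}

cond = conditioning_vector(cell, ["batch", "tissue"], categories)
print(cond)  # [0.0, 1.0, 0.0, 0.0, 1.0]
```

In this picture the network has no way of knowing that one block of the vector is "technical" and the other "biological": anything supplied as a conditioning variable is something the latent space no longer needs to explain.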