Hi,
I would like to integrate data from three different studies using scVI.
The data contain only T cells from two tissues (blood and tumour): one dataset has T cells from blood only, another has T cells from tumour only, and the third has T cells from both blood and tumour.
To train the model, I included tissue as a categorical covariate to regress out during integration, using the following code:
epochs = 40
scvi.model.SCVI.setup_anndata(ref, layer="raw_counts", batch_key="batch", categorical_covariate_keys=["tissue", "dataset_patient_id"]) # correct for tissue and patient covariates
ref_model = scvi.model.SCVI(ref, n_layers=2, n_latent=30, gene_likelihood="nb") # not default values but work well in the integration task according to scvi people
ref_model.train(check_val_every_n_epoch=1, max_epochs=epochs, early_stopping=True)
I used 40 epochs because this value worked best for avoiding overfitting (when plotting the validation loss with the default number of epochs, the model started to overfit after 40-50 epochs).
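A minimal sketch of how such a stopping point could be read off a validation curve, in case it helps to make the choice reproducible. The helper `best_epoch` and its `patience` argument are hypothetical names for illustration and are not part of scvi-tools, and the losses in the example are made up rather than taken from my run; in practice the per-epoch values would come from `ref_model.history`.

```python
def best_epoch(val_losses, patience=5):
    """Return the 1-based epoch with the lowest validation loss.

    The scan stops once the loss has failed to improve for `patience`
    consecutive epochs, which mimics a simple early-stopping rule.
    """
    best_idx, best_loss, waited = 0, float("inf"), 0
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New minimum: remember it and reset the patience counter.
            best_idx, best_loss, waited = idx, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx + 1


# Hypothetical per-epoch validation losses (illustrative only).
example_losses = [10.0, 9.0, 8.5, 8.4, 8.6, 8.7, 8.9, 9.1, 9.3, 8.0]
print(best_epoch(example_losses, patience=3))
```

With a small `patience`, a late improvement (the last value above) is ignored because the scan has already stopped, so the result depends on that choice.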
After initialising the model with categorical_covariate_keys=["tissue", "dataset_patient_id"], I obtained the following UMAP when colouring cells by tissue of origin:
Since I integrated only T cells, I would not expect much difference between tumour and blood, although some differences could be present. However, since I set categorical_covariate_keys=["tissue", "dataset_patient_id"], I thought this might be the reason for the result.
So I tried integrating the data without correcting for tissue. This is the code I used (again, 40 epochs was the right threshold for avoiding overfitting):
epochs = 40
scvi.model.SCVI.setup_anndata(ref, layer="raw_counts", batch_key="batch") # batch only: no tissue or patient covariates
ref_model = scvi.model.SCVI(ref, n_layers=2, n_latent=30, gene_likelihood="nb") # not default values but work well in the integration task according to scvi people
ref_model.train(check_val_every_n_epoch=1, max_epochs=epochs, early_stopping=True)
However, when plotting the UMAP for this second model that did not correct for tissue, I obtained a very similar result:
When integrating the same data with the RPCA algorithm implemented in Seurat, T cells from tumour and blood were mixed, and I did not see this strong separation between the two tissues, as shown in the figure below:
Do you think this is real biological variation, or did the scVI models simply fail to integrate the data properly?
Am I training the model incorrectly? What suggestions would you have for this situation?
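To make the comparison between integrations less dependent on how the UMAPs look, the tissue separation could be quantified directly in the latent space. The following is a minimal sketch, assuming the latent coordinates are obtained with `ref_model.get_latent_representation()` and subsampled to a few thousand cells (the pure-Python loop is too slow for a full dataset). The function `same_label_fraction` is a hypothetical helper, not a scvi-tools or Seurat function, and the toy points are made up.

```python
import math


def same_label_fraction(latent, labels, k=15):
    """Mean fraction of each cell's k nearest neighbours (Euclidean distance
    in the latent space) that carry the same label as the cell itself."""
    n = len(latent)
    if n < 2:
        raise ValueError("need at least two cells")
    k = min(k, n - 1)
    total = 0.0
    for i in range(n):
        # Distances from cell i to every other cell, paired with that cell's index.
        dists = sorted(
            (math.dist(latent[i], latent[j]), j) for j in range(n) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        total += sum(labels[j] == labels[i] for j in neighbours) / k
    return total / n


# Toy example: two well-separated groups, one per tissue.
toy_latent = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_tissue = ["blood", "blood", "blood", "tumour", "tumour", "tumour"]
print(same_label_fraction(toy_latent, toy_tissue, k=2))
```

A value close to 1 means that cells sit almost exclusively among cells of their own tissue, while a value close to the overall tissue proportions suggests the tissues are well mixed. Computing it per T cell subtype, where annotations exist, might help to separate biological differences from incomplete integration.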
Thank you very much for your help.
Martina