Trouble integrating my dataset into a reference dataset

joshuak94 · September 12, 2024, 4:55pm

Hello! I’m trying to integrate my query dataset (~13000 cells) into a larger reference dataset (OMG database). I’ve selected cells from E11.5 embryos, matching my sample. I select 5,000 highly variable genes using the pearson residuals. However, my resulting UMAP shows me that they are clearly not integrated at all. I’ve tried downsampling the reference dataset (200,000 cells → 30,000 cells), but it doesn’t help at all. Here’s the UMAP:

Some information about my dataset: it’s generated from a transgenic mouse embryo E11.5 which have been FACSed for mCherry expression. It’s a mix of cell types which I only vaguely know the composition of (should contain primarily brain, heart, limb tissue). Additionally, there are 30 artificial genes which I’ve inserted into the genome (they don’t exist in the WT transcriptome), but they’re promptly removed in the HVG selection.

What I’m wondering is whether this is a problem because SCANVI wasn’t designed for this type of data (my query dataset has a subset of the reference dataset cell types, and potentially in different proportions), or if it’s a problem in how I’ve set up the model/parameters.

Here’s my code:

ref_data = sc.read("/project/omg/adata_JAX_dataset_2_E115.h5ad")
fresh_data = sc.read("/project/fresh_combined-null-1-1_Filtered_ensembl.h5ad")
fresh_data.obs["celltype"] = "Unknown"

# Calculate QC and filter outliers
sc.pp.calculate_qc_metrics(fresh_data, percent_top=None, log1p=False, inplace=True)
fresh_data.obs["outlier_total"] = fresh_data.obs.total_counts > 10000
fresh_data.obs["outlier_ngenes"] = fresh_data.obs.n_genes_by_counts > 1500
sc.pp.filter_genes(fresh_data, min_cells=1)

# Concatenate both datasets
full_data = anndata.concat([ref_data, fresh_data], label="batch", join = "inner")
full_data.obs_names_make_unique()
full_data.obs["batch"] = full_data.obs["batch"].cat.rename_categories(["Ref", "Fresh"])
full_data.layers["counts"] = full_data.X

# Select HVG
sc.experimental.pp.highly_variable_genes(full_data, flavor="pearson_residuals", n_top_genes=5000, batch_key = "batch", subset = True)

# Setup variables
SCVI_LATENT_KEY = "X_scVI"
SCANVI_LATENT_KEY = "X_scANVI"
SCANVI_PREDICTIONS_KEY = "predictions_scanvi"
SCVI_CLUSTERS_KEY = "leiden_scVI"
SCANVI_PREDICTIONS_SCORE = "score_scanvi"

# Train SCVI
sca.models.SCVI.setup_anndata(full_data, layer = "counts", labels_key="celltype")
scvi_full = sca.models.SCVI(
    full_data,
    n_layers=4,
    n_latent=50,
    gene_likelihood="nb",
    encode_covariates=True,
    deeply_inject_covariates=False,
    use_layer_norm="both",
    use_batch_norm="none",
    dropout_rate=0.2,
)
scvi_full.train()
print("Saving SCVI full model.")
scvi_full.save("/project/models/scvi/scvi_omg_115_subsampled30k_integrated_fresh", overwrite=True)

# Train SCANVI
scanvae_full = sca.models.SCANVI.from_scvi_model(scvi_full, unlabeled_category = "Unknown", labels_key = "celltype")
print("Labelled Indices: ", len(scanvae_full._labeled_indices))
print("Unlabelled Indices: ", len(scanvae_full._unlabeled_indices))
scanvae_full.train(max_epochs=10, n_samples_per_label=100)
print("Saving SCANVI full model.")
scanvae_full.save("/project/models/scvi/scanvi_omg_115_subsampled30k_integrated_fresh", overwrite=True)

full_data.obsm[SCANVI_LATENT_KEY] = scanvae_full.get_latent_representation()
full_data.obs[SCANVI_PREDICTIONS_KEY] = scanvae_full.predict()
full_data.obs[SCANVI_PREDICTIONS_SCORE] = scanvae_full.predict(soft = True).max(axis = 1)
full_data.write("/project/omg_115_subsampled30k_integrated_fresh.h5ad", "gzip")

# Get clusters & UMAP
sc.pp.neighbors(full_data, use_rep = SCANVI_LATENT_KEY)
sc.tl.leiden(full_data, key_added=SCVI_CLUSTERS_KEY, flavor = "igraph", n_iterations = 2)
sc.tl.umap(full_data)
sc.pl.umap(full_data, color = ["batch"], frameon = False, ncols = 1)

Thanks for any advice!!

cane11 · September 12, 2024, 6:17pm

Can you share results with other integration methods like BBKNN or harmony (both available in scanpy).
Reducing the number of latent dimensions will likely help with integration. The n_layers is quite high and together with encode_covariates this can create issues. I would turn off encode_covariates, reduce n_latent to 10 and check how well it integrates first.

joshuak94 · September 13, 2024, 7:39am

Hi Can! Thanks for the prompt response. Here are three UMAPS.

Integrated using SCVI, n_layers = 10, encode_covariates = False.

Untitled1668×1384 257 KB
Integrated using Harmony, default parameters.

Untitled-11668×1384 219 KB
Integrated using BBKNN, also default settings.|

Untitled1668×1384 337 KB
Integrated using BBKNN, but with ridge regression on the scaled data prior. As explained here.

Untitled1668×1384 330 KB
Lastly, for thoroughness’ sake, I used scanpy’s ingest function. First did PCA reduction, calculated neighbors and UMAP representation on the reference, then called ingest.

Untitled1616×1384 253 KB

It looks like ingest does the best job, followed by BBKNN with ridge regression, then BBKNN & Harmony tied, then SCVI unable to integrate any of the cells.

Additionally, here’s the UMAP for using BBKNN on the Harmony PCs, instead of the standard PCs. I’m not sure if this is even a valid thing to do, as I’m not completey sure what the changes Harmony applies to the PCs are…

joshuak94 · September 13, 2024, 7:40am

Additionally, I had a quick question. The integration of the two datasets doesn’t have an effect on the downstream SCANVI label transfer, does it? I wasn’t sure if these two tasks were dependent on each other or independent.

yojetsharma · September 13, 2024, 7:26pm

Could ‘flavour’ parameter be the one causing this? Have you tried “seurat_v3” or playing around with this?

cane11 · September 13, 2024, 8:18pm

It’s hard to judge whether BBKNN with ridge regression is overintegrating.Likely you will have the easiest time working with the harmony integration (colored by prior cell-type labels might help to decide on this). Both datasets seem to have strong differences. Harmony performs stronger in batch correction while scVI performs better in cell-type preservation in numerous benchmarks, so I think your results agree with this.
Reducing the number of highly variable genes (maybe 1000) might help to better integrate them. Yes, if your integration is not good. scANVI will be also worse for transfering labels.

joshuak94 · September 15, 2024, 11:48am

Originally I had done my analyses with seurat_v3. I tried switching to pearson_residuals after reading that maybe this captures better biological variance. But at least between these two options I didn’t see any difference using the original parameters (n_latent = 50, 5000 HVG).

yojetsharma · September 15, 2024, 12:23pm

Then, could it because of the non-matching categorical labels in the query wrt your reference?

joshuak94 · September 15, 2024, 2:17pm

Sorry, I’m not sure I know which categorical labels you mean. Do you mean the celltype labels? In the query I set the celltype to “Unknown”, but I thought this is common procedure for doing the label transfer.

joshuak94 · September 15, 2024, 3:18pm

Hmmm I see. Changing the number of HVGs did seem to help a bit with the integration.

Continuing my analyses with the harmony integration would be less than ideal, because as far as I’m aware, harmony does not support label transfer/annotation of the query dataset. I was hoping to use SCANVI to map my query to the reference and get the cell types that way.

cane11 · September 15, 2024, 3:46pm

Have you tried this:

You can run a kNN classifier, see e.g. my PopV package how to perform label annotation with harmony integration: GitHub - YosefLab/PopV.

joshuak94 · September 15, 2024, 8:36pm

I tried reducing the n_latent, the resulting UMAP is the first in the list of UMAPs I uploaded above.

I will check out PopV soon! I noticed however that there are only pretrained models for the tabula sapiens datasets. Is there a tutorial available for training my own models? Since I’m interested in mouse embryonic data, I guess a reference dataset like the Ontogeny of Mouse Graphed, or Mouse organogenesis cell atlas would be better sources.

cane11 · September 15, 2024, 10:13pm

Sorry for the confusion. I don’t think you should run popV here (while you could). It just contains code to classify celltypes based on harmony integration (k-nearest neighbors).

joshuak94 · September 16, 2024, 8:57am

Thanks for the pointers! Is there a reason you don’t think popV should be run in this case? From what I understand, it’s a consensus based method for generating cell type labels using a variety of different models, right? It sounds like it could get the most robust/accurate labeling of my dataset.

cane11 · September 17, 2024, 12:00am

I would assume consensus scores will be low as integration is not working very well and also classifiers on gene expression might have high uncertainty. My approach here would be. A marker gene based approach for cell-typing and then perform (semi)supervised integration like scANVI or semisupervised MrVI to do comparative analysis.

dacho · October 11, 2024, 9:18am

I don‘t know if you have any update on this issue, because I am having similiar issue when integrating my query to the reference. Reducing the HVGs to 1000 only helepd a bit. When I downsampled my reference (using same qc criteria based on my query n_genes_by_counts and total_counts), it also got bit better.

So, I am now trying to approach in consensus way as well. I used rough manual annotation, Cellassign, Celltypist and SCANVI to annotate and compare the output of these methods. I don‘t know if this is something that we could do, but around 70% of cells seems to agree on the cell type.

And I think here you could see how you could create your own reference in popV:

cane11 · October 11, 2024, 7:07pm

Hi @dacho welcome to the forum. Without more information it is difficult to provide a good response - experimental setup, how many cells, do you expect it to be integrated. Running PopV will work for most cases, here I was questioning the approach as it will highlight that it’s difficult to annotate a query using a reference dataset if they are very distinct. PopV should output in this case, that you better annotate it manually and won’t provide much help.

Topic		Replies	Views
Shared cell types not mixing when integrating datasets from different species scvi-tools integration , scvi	4	60	June 19, 2025
[Usage clarification] Should the .obs in query and reference be exactly the same? scvi-tools	2	31	September 13, 2024
Add new data to existing integration scvi-tools	3	65	March 7, 2025
What to do when the query dataset has different genes from the reference? scvi-tools	4	84	September 4, 2024
Workflow to integrate dataset from two different species scvi-tools	1	104	January 21, 2025

Trouble integrating my dataset into a reference dataset

Related topics