Trouble integrating my dataset into a reference dataset

Hello! I’m trying to integrate my query dataset (~13000 cells) into a larger reference dataset (OMG database). I’ve selected cells from E11.5 embryos, matching my sample. I select 5,000 highly variable genes using the pearson residuals. However, my resulting UMAP shows me that they are clearly not integrated at all. I’ve tried downsampling the reference dataset (200,000 cells → 30,000 cells), but it doesn’t help at all. Here’s the UMAP:

Some information about my dataset: it’s generated from a transgenic mouse embryo E11.5 which have been FACSed for mCherry expression. It’s a mix of cell types which I only vaguely know the composition of (should contain primarily brain, heart, limb tissue). Additionally, there are 30 artificial genes which I’ve inserted into the genome (they don’t exist in the WT transcriptome), but they’re promptly removed in the HVG selection.

What I’m wondering is whether this is a problem because SCANVI wasn’t designed for this type of data (my query dataset has a subset of the reference dataset cell types, and potentially in different proportions), or if it’s a problem in how I’ve set up the model/parameters.

Here’s my code:

ref_data = sc.read("/project/omg/adata_JAX_dataset_2_E115.h5ad")
fresh_data = sc.read("/project/fresh_combined-null-1-1_Filtered_ensembl.h5ad")
fresh_data.obs["celltype"] = "Unknown"

# Calculate QC and filter outliers
sc.pp.calculate_qc_metrics(fresh_data, percent_top=None, log1p=False, inplace=True)
fresh_data.obs["outlier_total"] = fresh_data.obs.total_counts > 10000
fresh_data.obs["outlier_ngenes"] = fresh_data.obs.n_genes_by_counts > 1500
sc.pp.filter_genes(fresh_data, min_cells=1)

# Concatenate both datasets
full_data = anndata.concat([ref_data, fresh_data], label="batch", join = "inner")
full_data.obs_names_make_unique()
full_data.obs["batch"] = full_data.obs["batch"].cat.rename_categories(["Ref", "Fresh"])
full_data.layers["counts"] = full_data.X

# Select HVG
sc.experimental.pp.highly_variable_genes(full_data, flavor="pearson_residuals", n_top_genes=5000, batch_key = "batch", subset = True)

# Setup variables
SCVI_LATENT_KEY = "X_scVI"
SCANVI_LATENT_KEY = "X_scANVI"
SCANVI_PREDICTIONS_KEY = "predictions_scanvi"
SCVI_CLUSTERS_KEY = "leiden_scVI"
SCANVI_PREDICTIONS_SCORE = "score_scanvi"

# Train SCVI
sca.models.SCVI.setup_anndata(full_data, layer = "counts", labels_key="celltype")
scvi_full = sca.models.SCVI(
    full_data,
    n_layers=4,
    n_latent=50,
    gene_likelihood="nb",
    encode_covariates=True,
    deeply_inject_covariates=False,
    use_layer_norm="both",
    use_batch_norm="none",
    dropout_rate=0.2,
)
scvi_full.train()
print("Saving SCVI full model.")
scvi_full.save("/project/models/scvi/scvi_omg_115_subsampled30k_integrated_fresh", overwrite=True)

# Train SCANVI
scanvae_full = sca.models.SCANVI.from_scvi_model(scvi_full, unlabeled_category = "Unknown", labels_key = "celltype")
print("Labelled Indices: ", len(scanvae_full._labeled_indices))
print("Unlabelled Indices: ", len(scanvae_full._unlabeled_indices))
scanvae_full.train(max_epochs=10, n_samples_per_label=100)
print("Saving SCANVI full model.")
scanvae_full.save("/project/models/scvi/scanvi_omg_115_subsampled30k_integrated_fresh", overwrite=True)

full_data.obsm[SCANVI_LATENT_KEY] = scanvae_full.get_latent_representation()
full_data.obs[SCANVI_PREDICTIONS_KEY] = scanvae_full.predict()
full_data.obs[SCANVI_PREDICTIONS_SCORE] = scanvae_full.predict(soft = True).max(axis = 1)
full_data.write("/project/omg_115_subsampled30k_integrated_fresh.h5ad", "gzip")

# Get clusters & UMAP
sc.pp.neighbors(full_data, use_rep = SCANVI_LATENT_KEY)
sc.tl.leiden(full_data, key_added=SCVI_CLUSTERS_KEY, flavor = "igraph", n_iterations = 2)
sc.tl.umap(full_data)
sc.pl.umap(full_data, color = ["batch"], frameon = False, ncols = 1)

Thanks for any advice!!

Can you share results with other integration methods like BBKNN or harmony (both available in scanpy).
Reducing the number of latent dimensions will likely help with integration. The n_layers is quite high and together with encode_covariates this can create issues. I would turn off encode_covariates, reduce n_latent to 10 and check how well it integrates first.

Hi Can! Thanks for the prompt response. Here are three UMAPS.

  1. Integrated using SCVI, n_layers = 10, encode_covariates = False.
  2. Integrated using Harmony, default parameters.
  3. Integrated using BBKNN, also default settings.|
  4. Integrated using BBKNN, but with ridge regression on the scaled data prior. As explained here.
  5. Lastly, for thoroughness’ sake, I used scanpy’s ingest function. First did PCA reduction, calculated neighbors and UMAP representation on the reference, then called ingest.

It looks like ingest does the best job, followed by BBKNN with ridge regression, then BBKNN & Harmony tied, then SCVI unable to integrate any of the cells.

Additionally, here’s the UMAP for using BBKNN on the Harmony PCs, instead of the standard PCs. I’m not sure if this is even a valid thing to do, as I’m not completey sure what the changes Harmony applies to the PCs are…

Additionally, I had a quick question. The integration of the two datasets doesn’t have an effect on the downstream SCANVI label transfer, does it? I wasn’t sure if these two tasks were dependent on each other or independent.

Could ‘flavour’ parameter be the one causing this? Have you tried “seurat_v3” or playing around with this?

It’s hard to judge whether BBKNN with ridge regression is overintegrating.Likely you will have the easiest time working with the harmony integration (colored by prior cell-type labels might help to decide on this). Both datasets seem to have strong differences. Harmony performs stronger in batch correction while scVI performs better in cell-type preservation in numerous benchmarks, so I think your results agree with this.
Reducing the number of highly variable genes (maybe 1000) might help to better integrate them. Yes, if your integration is not good. scANVI will be also worse for transfering labels.

Originally I had done my analyses with seurat_v3. I tried switching to pearson_residuals after reading that maybe this captures better biological variance. But at least between these two options I didn’t see any difference using the original parameters (n_latent = 50, 5000 HVG).

Then, could it because of the non-matching categorical labels in the query wrt your reference?

Sorry, I’m not sure I know which categorical labels you mean. Do you mean the celltype labels? In the query I set the celltype to “Unknown”, but I thought this is common procedure for doing the label transfer.

Hmmm I see. Changing the number of HVGs did seem to help a bit with the integration.

Continuing my analyses with the harmony integration would be less than ideal, because as far as I’m aware, harmony does not support label transfer/annotation of the query dataset. I was hoping to use SCANVI to map my query to the reference and get the cell types that way.

Have you tried this:

You can run a kNN classifier, see e.g. my PopV package how to perform label annotation with harmony integration: GitHub - YosefLab/PopV.

I tried reducing the n_latent, the resulting UMAP is the first in the list of UMAPs I uploaded above.

I will check out PopV soon! I noticed however that there are only pretrained models for the tabula sapiens datasets. Is there a tutorial available for training my own models? Since I’m interested in mouse embryonic data, I guess a reference dataset like the Ontogeny of Mouse Graphed, or Mouse organogenesis cell atlas would be better sources.

Sorry for the confusion. I don’t think you should run popV here (while you could). It just contains code to classify celltypes based on harmony integration (k-nearest neighbors).

1 Like

Thanks for the pointers! Is there a reason you don’t think popV should be run in this case? From what I understand, it’s a consensus based method for generating cell type labels using a variety of different models, right? It sounds like it could get the most robust/accurate labeling of my dataset.

I would assume consensus scores will be low as integration is not working very well and also classifiers on gene expression might have high uncertainty. My approach here would be. A marker gene based approach for cell-typing and then perform (semi)supervised integration like scANVI or semisupervised MrVI to do comparative analysis.

I don‘t know if you have any update on this issue, because I am having similiar issue when integrating my query to the reference. Reducing the HVGs to 1000 only helepd a bit. When I downsampled my reference (using same qc criteria based on my query n_genes_by_counts and total_counts), it also got bit better.

So, I am now trying to approach in consensus way as well. I used rough manual annotation, Cellassign, Celltypist and SCANVI to annotate and compare the output of these methods. I don‘t know if this is something that we could do, but around 70% of cells seems to agree on the cell type.

And I think here you could see how you could create your own reference in popV:

Hi @dacho welcome to the forum. Without more information it is difficult to provide a good response - experimental setup, how many cells, do you expect it to be integrated. Running PopV will work for most cases, here I was questioning the approach as it will highlight that it’s difficult to annotate a query using a reference dataset if they are very distinct. PopV should output in this case, that you better annotate it manually and won’t provide much help.