scVI integration set batch_key and poor Umap result

I have a large dataset downloaded from GEO with inherent structural complexities. When integrating data using the scVI model for batch effect correction and conducting DEG analysis on Cancer vs. Normal samples, I need guidance on how to properly set up Anndata and configure the batch_key and categorical_covariate_key.

Currently, I have set batch_key as ‘sample’ and categorical_covariate_keys as ‘study_origin’ and ‘sequencing_method’. The hierarchy of my data is as follows: 12 samples (including Control and Cancer) < 3 sequencing methods < 4 studies from GEO.

Despite these current settings, my UMAP clustering shows poor integration for one cancer sample, with barely any other samples mapping to the same area. Could you help me troubleshoot this issue?

Hi, it is very helpful to provide more information especially including plots and code that you used. To provide some ideas what to look for: Are the other samples/studies well integrated? Is this specific study very different from the other ones (like different sequencing technology or similar drastic differences)? It can help to define a batch_key (here study_id), when identifying highly variable genes in Scanpy and to reduce the number of genes. Have you tried especially harmony and was it giving the correct integration? You can the. use scANVI and provide labels_key that correspond to the good integration.

Hi, thanks for the reply. Here are more details about my problem:
Below is the code I used.

This is the head of my adata object:

||sample|doublet|n_genes_by_counts|total_counts|total_counts_mt|pct_counts_mt|total_counts_ribo|pct_counts_ribo|Study|SequencingMethod|
|---|---|---|---|---|---|---|---|---|---|---|
|AAAGAACAGCAACTCT-1|SRR21008843|False|208|6089.0|7.0|0.114961|15.0|0.246346|38015751|Illumina NovaSeq 6000|
|AAAGAACCAATTCTCT-1|SRR21008843|False|308|12585.0|10.0|0.079460|47.0|0.373460|38015751|Illumina NovaSeq 6000|
|AAAGAACGTCTACATG-1|SRR21008843|False|960|1567.0|45.0|2.871729|4.0|0.255265|38015751|Illumina NovaSeq 6000|
|AAAGAACTCCGTATGA-1|SRR21008843|False|443|768.0|10.0|1.302083|2.0|0.260417|38015751|Illumina NovaSeq 6000|
|AAAGGGCCAATGTCAC-1|SRR21008843|False|1121|2088.0|58.0|2.777778|5.0|0.239464|38015751|Illumina NovaSeq 6000|

import scvi
import scanpy as sc

Setup AnnData for scVI

scvi.model.SCVI.setup_anndata(adata, layer=“counts”,
batch_key=‘sample’,
categorical_covariate_keys=[“Study”, “SequencingMethod”],
continuous_covariate_keys=[‘pct_counts_mt’, ‘total_counts’, ‘pct_counts_ribo’])

Train the model

model = scvi.model.SCVI(adata)
model.train()

Get latent representation and normalized expression

adata.obsm[‘X_scVI’] = model.get_latent_representation()
adata.layers[‘scvi_normalized’] = model.get_normalized_expression(library_size=1e4)

Compute neighbors and UMAP

sc.pp.neighbors(adata, use_rep=‘X_scVI’)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=2)

Convert sample to categorical

adata.obs[‘sample’] = adata.obs[‘sample’].astype(‘category’)

Plot UMAP

sc.pl.umap(adata, color=[‘Study’, ‘sample’], frameon=False)

and from the Umap down there, you can tell that the Beige one is not mixing up well with other samples.
Screenshot 2024-08-07 at 20.13.16

I haven’t tried Harmony yet but I am planning on doing it soon and will do hyper parameter tuning with the current scVI as well.

It looks good in my oppinion. I would perform clustering and look at DE genes and see whether it lines up with any biological signal like e.g. proliferation of cell stress or assay. I would recommend subsetting to highly variable genes (not included in your code).