scVI integration set batch_key and poor Umap result

holjol · August 7, 2024, 8:36am

I have a large dataset downloaded from GEO with inherent structural complexities. When integrating data using the scVI model for batch effect correction and conducting DEG analysis on Cancer vs. Normal samples, I need guidance on how to properly set up Anndata and configure the batch_key and categorical_covariate_key.

Currently, I have set batch_key as ‘sample’ and categorical_covariate_keys as ‘study_origin’ and ‘sequencing_method’. The hierarchy of my data is as follows: 12 samples (including Control and Cancer) < 3 sequencing methods < 4 studies from GEO.

Despite these current settings, my UMAP clustering shows poor integration for one cancer sample, with barely any other samples mapping to the same area. Could you help me troubleshoot this issue?

cane11 · August 7, 2024, 4:55pm

Hi, it is very helpful to provide more information especially including plots and code that you used. To provide some ideas what to look for: Are the other samples/studies well integrated? Is this specific study very different from the other ones (like different sequencing technology or similar drastic differences)? It can help to define a batch_key (here study_id), when identifying highly variable genes in Scanpy and to reduce the number of genes. Have you tried especially harmony and was it giving the correct integration? You can the. use scANVI and provide labels_key that correspond to the good integration.

holjol · August 7, 2024, 6:11pm

Hi, thanks for the reply. Here are more details about my problem:
Below is the code I used.

This is the head of my adata object:

||sample|doublet|n_genes_by_counts|total_counts|total_counts_mt|pct_counts_mt|total_counts_ribo|pct_counts_ribo|Study|SequencingMethod|
|---|---|---|---|---|---|---|---|---|---|---|
|AAAGAACAGCAACTCT-1|SRR21008843|False|208|6089.0|7.0|0.114961|15.0|0.246346|38015751|Illumina NovaSeq 6000|
|AAAGAACCAATTCTCT-1|SRR21008843|False|308|12585.0|10.0|0.079460|47.0|0.373460|38015751|Illumina NovaSeq 6000|
|AAAGAACGTCTACATG-1|SRR21008843|False|960|1567.0|45.0|2.871729|4.0|0.255265|38015751|Illumina NovaSeq 6000|
|AAAGAACTCCGTATGA-1|SRR21008843|False|443|768.0|10.0|1.302083|2.0|0.260417|38015751|Illumina NovaSeq 6000|
|AAAGGGCCAATGTCAC-1|SRR21008843|False|1121|2088.0|58.0|2.777778|5.0|0.239464|38015751|Illumina NovaSeq 6000|

import scvi
import scanpy as sc

Setup AnnData for scVI

scvi.model.SCVI.setup_anndata(adata, layer=“counts”,
batch_key=‘sample’,
categorical_covariate_keys=[“Study”, “SequencingMethod”],
continuous_covariate_keys=[‘pct_counts_mt’, ‘total_counts’, ‘pct_counts_ribo’])

Train the model

model = scvi.model.SCVI(adata)
model.train()

Get latent representation and normalized expression

adata.obsm[‘X_scVI’] = model.get_latent_representation()
adata.layers[‘scvi_normalized’] = model.get_normalized_expression(library_size=1e4)

Compute neighbors and UMAP

sc.pp.neighbors(adata, use_rep=‘X_scVI’)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=2)

Convert sample to categorical

adata.obs[‘sample’] = adata.obs[‘sample’].astype(‘category’)

Plot UMAP

sc.pl.umap(adata, color=[‘Study’, ‘sample’], frameon=False)

and from the Umap down there, you can tell that the Beige one is not mixing up well with other samples.
Screenshot 2024-08-07 at 20.13.16

I haven’t tried Harmony yet but I am planning on doing it soon and will do hyper parameter tuning with the current scVI as well.

cane11 · August 7, 2024, 6:25pm

It looks good in my oppinion. I would perform clustering and look at DE genes and see whether it lines up with any biological signal like e.g. proliferation of cell stress or assay. I would recommend subsetting to highly variable genes (not included in your code).

Topic		Replies	Views
scVI integration using two batch keys scvi-tools	5	1365	October 24, 2023
Integration with scVI scvi-tools scvi	2	1005	November 30, 2022
Semi supervised integration update scvi-tools scanvi	2	387	May 3, 2022
Suggestion on parameters for training scvi model scvi-tools integration , scvi	3	1900	December 4, 2023
Integrating different tissues with scVI scvi-tools integration , scvi	3	182	November 15, 2024

scVI integration set batch_key and poor Umap result

Setup AnnData for scVI

Train the model

Get latent representation and normalized expression

Compute neighbors and UMAP

Convert sample to categorical

Plot UMAP

Related topics