Dataset integration and analysis

Hi everyone,
I am working with multiple GEO datasets on a specific blood cancer. My plan is to download the Cell Ranger-processed raw data (where available) from GEO, then pre-process, integrate, and use the datasets for downstream analysis, including DE, GO, and GSEA steps for comparative analysis. I have a desktop with 67 GB of RAM and 2 GPUs. I plan to preprocess each dataset first, then integrate. The datasets span different modalities, including scRNA-seq, scATAC-seq, and CITE-seq. I need to bridge all of these datasets on one platform using scvi-tools, and I also need to perform cell type annotation.
I would really appreciate any input or suggestions to make this process more comprehensive and standardized using scvi-tools. For example:

  • Which scvi-tools model will work best (totalVI, MultiVI, etc.)?
  • How can the preprocessing steps be better batch corrected?
  • What is the best strategy for comparative analysis of specific cell population clusters across all datasets?

Thank you for your time and consideration.

Hi @mdbabumiamssm, thank you for your question. Currently, MultiVI is the only model that handles all three modalities at the same time, while totalVI will accept paired RNA and protein data. If you plan on subsetting features (genes, regions, proteins) during preprocessing, I would ensure that the same features are being used for all datasets so that concatenation works correctly.
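To make "same features across all datasets" concrete, one simple approach is to intersect the feature names before concatenating. A minimal pure-Python sketch (the gene names below are made up; in practice you would intersect `adata.var_names` across your AnnData objects and then subset each one with `adata[:, shared].copy()`):

```python
# Hypothetical gene lists from three datasets; in practice these would be
# list(adata.var_names) for each preprocessed AnnData object.
gene_lists = [
    ["CD19", "MS4A1", "CD3E", "NKG7"],
    ["CD19", "MS4A1", "CD3E", "GNLY"],
    ["MS4A1", "CD3E", "CD19"],
]

# Keep only features present in every dataset so concatenation aligns columns.
shared = set(gene_lists[0])
for genes in gene_lists[1:]:
    shared &= set(genes)
shared = sorted(shared)  # ["CD19", "CD3E", "MS4A1"]
```

The same idea applies per modality: intersect genes across the RNA datasets, regions across the ATAC datasets, and proteins across the CITE-seq panels.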


Hi Martin, I actually had a specific question. I have around 23 datasets (pre-processed). I concatenated them and then ran the code below*. Training stopped at around epoch 70 and the UMAP came out messy, with 488 clusters. All 32 CPU cores were at 100% while the GPU was barely used. Is there any way to resolve this?
FYI, when I run fewer datasets (n=2), training completes: Epoch 400/400: 100%|███████| 400/400 [00:35<00:00, 11.15it/s, loss=298, v_num=1]
Is there any way I can run many datasets (my target is n=100) at a time using any scvi-tools model?

\*
```python
import scanpy as sc
import scvi

# Register the AnnData object with scVI
scvi.model.SCVI.setup_anndata(adata, layer="counts", categorical_covariate_keys=["batch"])

# Train scVI model
model = scvi.model.SCVI(adata)
model.train()

# Compute latent space and save it to the adata object
adata.obsm["X_scvi"] = model.get_latent_representation()

# Perform dimensionality reduction and visualization
sc.pp.neighbors(adata, use_rep="X_scvi")
sc.tl.umap(adata)

# Visualize the individual datasets on UMAP
sc.pl.umap(adata, color="batch", title="UMAP after scVI batch correction", frameon=False)

# Perform clustering and visualization with adjusted resolution
sc.tl.leiden(adata, key_added="leiden_scvi", resolution=0.5)  # Adjust the resolution value as needed
sc.pl.umap(adata, color="leiden_scvi", title="UMAP after scVI batch correction, Leiden clusters", frameon=False)
```


Thanks in advance
- Babu

Hi, does the model training output indicate that it is using the GPU? If not, you can pass use_gpu=True to model.train. How many observations would there be in total with 100 datasets?
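As a rough way to think about the 100-dataset question: the dominant in-memory cost of the concatenated object is the count matrix, roughly n_obs × n_vars × bytes per entry if stored densely (sparse storage is much smaller). A back-of-envelope sketch with made-up numbers (20,000 cells per dataset, counts subset to 2,000 highly variable genes):

```python
def dense_matrix_gib(n_obs: int, n_vars: int, bytes_per_entry: int = 4) -> float:
    """Approximate size of a dense float32 count matrix, in GiB."""
    return n_obs * n_vars * bytes_per_entry / 2**30

# Hypothetical scale: 100 datasets x ~20,000 cells each, 2,000 HVGs.
print(f"{dense_matrix_gib(100 * 20_000, 2_000):.1f} GiB")  # ~14.9 GiB
```

Numbers like these are only illustrative, but they show why subsetting to highly variable features and keeping counts sparse matters well before training becomes the bottleneck.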