Dataset integration and analysis

Hi everyone,
I am working with multiple GEO datasets on a specific blood cancer. My plan is to download the Cell Ranger-processed raw data (where available) from GEO, then pre-process, integrate, and use the datasets for downstream analysis, including DE, GO, and GSEA steps for comparative analysis. I have a desktop with 67 GB of RAM and 2 GPUs. I plan to preprocess each dataset first, then integrate. The datasets span different modalities, including scRNA-seq, scATAC-seq, and CITE-seq. I need to bridge all of these datasets on one platform using scvi-tools, and I also need to perform cell type annotation.
I would really appreciate any input or suggestions to make this process more comprehensive and standardized using scvi-tools. For example:

  • Which scvi-tools model will work best (totalVI, MultiVI, etc.)?
  • How can the preprocessing steps be better batch corrected?
  • What is the best strategy for comparative analysis of specific cell population clusters across all datasets?

Thank you for your time and consideration.

Hi @mdbabumiamssm, thank you for your question. Currently, MultiVI is the only model that handles all three modalities at the same time, while totalVI will accept paired RNA and protein data. If you plan on subsetting features (genes, regions, proteins) during preprocessing, I would ensure that the same features are being used for all datasets so that concatenation works correctly.
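To make "same features across all datasets" concrete, one simple approach is to intersect the feature names before concatenating. A minimal pure-Python sketch (the gene names below are made up; in practice you would intersect `adata.var_names` across your AnnData objects and then subset each one with `adata[:, shared].copy()`):

```python
# Hypothetical gene lists from three datasets; in practice these would be
# list(adata.var_names) for each preprocessed AnnData object.
gene_lists = [
    ["CD19", "MS4A1", "CD3E", "NKG7"],
    ["CD19", "MS4A1", "CD3E", "GNLY"],
    ["MS4A1", "CD3E", "CD19"],
]

# Keep only features present in every dataset so concatenation aligns columns.
shared = set(gene_lists[0])
for genes in gene_lists[1:]:
    shared &= set(genes)
shared = sorted(shared)  # ["CD19", "CD3E", "MS4A1"]
```

The same idea applies per modality: intersect genes across the RNA datasets, regions across the ATAC datasets, and proteins across the CITE-seq panels.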


Hi Martin, I actually had a specific question. I have around 23 datasets (pre-processed). I concatenated them and then ran the code below*. Training stopped at around epoch 70 and the UMAP came out messy, with 488 clusters. All 32 CPU cores were at 100% while the GPU was barely used. Is there any way to resolve this?
FYI, when I run fewer datasets (n=2), training completes: Epoch 400/400: 100%|███████| 400/400 [00:35<00:00, 11.15it/s, loss=298, v_num=1]
Is there any way I can run many datasets (my target is n=100) at a time using any scvi-tools model?

\*
```python
import scanpy as sc
import scvi

# Register the AnnData object with scVI
scvi.model.SCVI.setup_anndata(adata, layer="counts", categorical_covariate_keys=["batch"])

# Train scVI model
model = scvi.model.SCVI(adata)
model.train()

# Compute latent space and save it to the adata object
adata.obsm["X_scvi"] = model.get_latent_representation()

# Perform dimensionality reduction and visualization
sc.pp.neighbors(adata, use_rep="X_scvi")
sc.tl.umap(adata)

# Visualize the individual datasets on UMAP
sc.pl.umap(adata, color="batch", title="UMAP after scVI batch correction", frameon=False)

# Perform clustering and visualization with adjusted resolution
sc.tl.leiden(adata, key_added="leiden_scvi", resolution=0.5)  # Adjust the resolution value as needed
sc.pl.umap(adata, color="leiden_scvi", title="UMAP after scVI batch correction, Leiden clusters", frameon=False)
```


Thanks in advance
- Babu

Hi, does the model training output indicate that it is using the GPU? If not, you can pass use_gpu=True to model.train. How many observations would there be in total with 100 datasets?
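As a rough way to think about the 100-dataset question: the dominant in-memory cost of the concatenated object is the count matrix, roughly n_obs × n_vars × bytes per entry if stored densely (sparse storage is much smaller). A back-of-envelope sketch with made-up numbers (20,000 cells per dataset, counts subset to 2,000 highly variable genes):

```python
def dense_matrix_gib(n_obs: int, n_vars: int, bytes_per_entry: int = 4) -> float:
    """Approximate size of a dense float32 count matrix, in GiB."""
    return n_obs * n_vars * bytes_per_entry / 2**30

# Hypothetical scale: 100 datasets x ~20,000 cells each, 2,000 HVGs.
print(f"{dense_matrix_gib(100 * 20_000, 2_000):.1f} GiB")  # ~14.9 GiB
```

Numbers like these are only illustrative, but they show why subsetting to highly variable features and keeping counts sparse matters well before training becomes the bottleneck.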