scvi-tools performance on NVIDIA H100 vs A100

We recently ran a small internal benchmark of scvi-tools training speed on two GPUs, and the H100 was considerably slower than the A100 on the same task with the same code.
Have you seen something similar?
Do you have any ideas about what might cause this or how to investigate it?

I used scvi-tools==1.0.4 with torch==2.1.1

Thank you

Hi, would you be able to share your benchmark code?

Sure, here it is. I used our internal dataset, but I'm pretty sure it's reproducible with PBMC 3k.

import time

import scanpy as sc
import scvi
import torch

# Load the dataset (internal; path elided)
adata = sc.read_h5ad('…')

# Select the top 1,000 highly variable genes, batch-aware per patient
sc.pp.highly_variable_genes(
    adata,
    flavor="seurat_v3",
    n_top_genes=1000,
    subset=True,
    batch_key="Patient",
)

# Register raw counts, the patient batch, and the chemistry covariate
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="Patient",
    categorical_covariate_keys=["Chemistry"],
)

model = scvi.model.SCVI(adata, n_layers=2, dropout_rate=0.2, n_latent=10)

# Time a fixed 100-epoch run with early stopping disabled
train_start = time.time()
model.train(
    max_epochs=100,
    use_gpu=True,
    check_val_every_n_epoch=2,
    early_stopping=False,
)
print(f'Training on {torch.cuda.get_device_name()} took {time.time() - train_start:.1f}s')
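
To run it on PBMC 3k instead of our internal data, something like the following stand-in should work ("Patient" and "Chemistry" don't exist in that dataset, so I'd add dummy single-level columns):

import scanpy as sc

adata = sc.datasets.pbmc3k()                # public 10x PBMC 3k raw counts
adata.layers["counts"] = adata.X.copy()     # setup_anndata above expects a "counts" layer
adata.obs["Patient"] = "patient_0"          # dummy batch column
adata.obs["Chemistry"] = "v2"               # dummy covariate column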

Thanks. If you don’t mind, could you try one of the built-in Lightning profilers to see where the bottleneck is? You can pass it directly to the train method. Feel free to report the results here and I can take a look.
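
I believe extra keyword arguments to train are forwarded to the underlying Lightning Trainer, so the string shortcut for the simple profiler should work. A sketch:

model.train(
    max_epochs=10,            # a few epochs are enough for profiling
    use_gpu=True,
    check_val_every_n_epoch=2,
    early_stopping=False,
    profiler="simple",        # forwarded to the Lightning Trainer; prints per-hook timings at the end of the run
)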

Also, what is your setup like? Are both GPUs connected to the same motherboard, or are they on different nodes? If they’re on different nodes, do they have different CPUs/data interconnects?
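
A quick way to capture the relevant host info on each machine (just a diagnostic sketch, adjust as needed):

import os
import platform

import torch

print("host:", platform.node())
print("cpu:", platform.processor(), "| cores:", os.cpu_count())
print("gpu:", torch.cuda.get_device_name())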

Also, which CUDA version was your torch build compiled against?
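
The H100 is compute capability 9.0 (sm_90), which needs a CUDA 11.8+ build; if the installed wheel predates sm_90 support, kernels may be missing or JIT-compiled from PTX, which can hurt performance. You can check what your install was built with:

import torch

print(torch.__version__)                    # e.g. "2.1.1+cu118" or "2.1.1+cu121"
print(torch.version.cuda)                   # CUDA toolkit the build targets
print(torch.cuda.get_device_capability())   # (9, 0) on H100, (8, 0) on A100
print(torch.cuda.get_arch_list())           # architectures compiled into the build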