Different Latent Time between scvi and harmony integrated datasets

Rinkon · June 15, 2021, 10:38pm

Hi all,
Thank you very much for such a great package. I am testing scvi results to analyze a set of brain tumor samples with scVelo.

The problem I find is that the results of latent time generated over the harmony bach correction look interpretable in terms of degree of differentiation of the tumor cells (cells with strong markers for differentiated cells (oligodendrocytes, Astrocytes, Neuron,…) present a higher value of latent time). However, when I bach correct and integrate the same datasets with scvi, while the general clusterings remain more or less the same (less defined cycling populations but in general quite well), the scVelo analysis get hard to understand and clearly doesn’t match with the observed results in the Harmony integrated analysis.

I am trying to figure out if this could be related someway with how scvi manage the spliced and unspliced matrices or if maybe I would need to bach correct those matrices before moving to scVelo.

Thank you very much

adamgayoso · June 16, 2021, 5:26am

Can you describe the workflow in greater detail? What input are you giving to scVI and what to harmony? Both methods return integrated latent representations, and scVelo will be the same either way, correct? The only difference is the umap will be slightly different?

Rinkon · June 16, 2021, 5:43pm

Hi, Thank you for your reply. the code I use to do the processing is as follow:

I setup the dataset for both harmony and scvi as follow:

import scanpy as sc
   #Create adata and basic filtering
adata1 = sc.read('path/adata1.loom', cache=True, var_names='var_names', obs_names='obs_names', validate=False)
adata2 = sc.read('path/adata2.loom', cache=True, var_names='var_names', obs_names='obs_names', validate=False)
adata3 = sc.read('path/adata3.loom', cache=True, var_names='var_names', obs_names='obs_names', validate=False)
adata4 = sc.read('path/adata4.loom', cache=True, var_names='var_names', obs_names='obs_names', validate=False)
adata5 = sc.read('path/adata5.loom', cache=True, var_names='var_names', obs_names='obs_names', validate=False)
#the same with the others 5 samples..

adata1.obs["sample"]='glioma 1'
adata2.obs["sample"]='glioma 2'
adata3.obs["sample"]='glioma 3'
adata4.obs["sample"]='glioma 4'
adata5.obs["sample"]='glioma 5'
#the same with the others 5 samples..

adata = adata1.concatenate(adata2, adata3, adata4, adata5, adata6, 
                                                    adata7,adata8,adata9,adata10,  join='outer')
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pl.highest_expr_genes(adata, n_top=20, )
mito_genes = adata.var.index.str.startswith('mt-')
adata.obs['percent_mito'] = np.sum(adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
adata.obs['n_counts'] = adata.X.sum(axis=1).A1

   #Filter cells by QC
adata = adata[adata.obs.n_genes < 6000, :]
adata = adata[adata.obs.n_counts < 25000, :]
adata = adata[adata.obs.n_counts > 1000, :]
adata = adata[adata.obs.percent_mito < 0.06, :]
sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'], jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='n_counts', y='percent_mito')
sc.pl.scatter(adata, x='n_counts', y='n_genes')

   #Normalize data
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

Then my code for scanpy + harmony is:

   #Find variable genes, regress non relevant variants, calculate PCA
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=6, min_disp=0.10)
sc.pl.highly_variable_genes(adata)
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'], n_jobs=56)
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack', n_comps=100)
sc.pl.pca_variance_ratio(adata, log=True, n_pcs = 100)
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=75)

   #First view of all data set without integration
sc.tl.umap(adata)
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['leiden'])

   #run harmony and generate new UMAP
import harmonypy as hm
adata.obsm['X_pca_old'] = adata.obsm['X_pca']
adata.obsm['X_umap_old'] = adata.obsm['X_umap']
data_mat = adata.obsm['X_pca']
meta_data = adata.obs
vars_use = ['sample']
ho = hm.run_harmony(data_mat, meta_data, vars_use)     
adjusted_pcs = pd.DataFrame(ho.Z_corr).T
adata.obsm['X_pca'] = adjusted_pcs.values
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=75)
sc.tl.umap(adata) maxiter=300)
sc.tl.leiden(adata)

sc.pl.umap(adata, color=['leiden'], legend_loc='on data' )

Alternatively my code for scvi to reach this point is:

import scvi
sc.pp.highly_variable_genes(adata,flavor="seurat_v3",n_top_genes=2000,batch_key="sample", subset=True)
scvi.data.setup_anndata(adata batch_key="sample")

vae = scvi.model.SCVI(adata, n_layers=2, n_latent=30)
vae.train()
adata.obsm["X_scVI"] = vae.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.leiden(adata, resolution=1)
sc.tl.umap(adata, maxiter=300)
sc.pl.umap(adata, color=["leiden"], legend_loc = 'on data')

**Finally I run scVelo as follow:**
import scvelo as scv
scv.set_figure_params()
cv.tl.recover_dynamics(adata, n_jobs=56)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
scv.pl.velocity_embedding(adata, basis='umap', color='leiden')
scv.pl.velocity_embedding_grid(adata, basis='umap', color='leiden')
scv.pl.velocity_embedding_stream(adata, basis='umap', color='leiden')

scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time', color_map='gnuplot', size=80)

#Latent Time with harmony

Regions close to 1 are by gene markers differentiated states of the tumors

Rinkon · June 16, 2021, 8:00pm

#Latent Time with scvi

Can’t determine any clue of the latent time meaning

However, the UMAP clustering generated by scvi presents similar leiden clusters than with harmony in term of gene markers

I was expecting that even if the UMAP look different, since the clusters generated and the general shape is similar (cycling and differentiated cells generating cluster in the borders of the tumor UMAP) the RNA Velocity should be the same or quite similar in fact the velocity embedding presents the arrows with a similar behaviour but then the latent time looks like that…

adamgayoso · June 16, 2021, 8:13pm

You are using these special parameters for the umap with harmony, which alone could have a large impact on what you’re observing.

It looks like it’s also possible you’re not giving SCVI the count data, as adata.X in your code will be log normalized.

Finally, for the results to be comparable you’d have to use the same exact genes in both scVI and harmony.

Rinkon · June 17, 2021, 11:48pm

Thank you very much for your help, the highly variable gene definition was the key, The UMAP are still a little different and the latent time graphs are in a different scale (a lot more cells closer to 1 than in the harmony derived graphs) but the highlighted regions in both latent times are now a lot more similar and represent the same clusters by differential gene expression.

I just need to figure out how to improve the velocity embedding graphs, since the arrows are hard to define (probably related with an attempt to draw in 2D something that is clearly of a higher dimensionality)

Topic		Replies	Views
scVI batch correction clusters all cells from sample in a circle (potential artifact) scvi-tools integration , scvi	14	104	June 3, 2025
Tuning/setting scvi.model.SCVI parameters scvi-tools scvi	9	1865	March 25, 2023
Shared cell types not mixing when integrating datasets from different species scvi-tools integration , scvi	4	75	June 19, 2025
Help with harmony and scanpy scanpy	3	493	February 14, 2025
Using scVI-integrated data for scVelo and CellRank analysis - what embeddings to supply for velocity? Help	3	629	May 11, 2023

Different Latent Time between scvi and harmony integrated datasets

Related topics