Different Latent Time between scvi and harmony integrated datasets

Hi all,
Thank you very much for such a great package. I am testing scvi results to analyze a set of brain tumor samples with scVelo.

The problem I find is that the results of latent time generated over the harmony bach correction look interpretable in terms of degree of differentiation of the tumor cells (cells with strong markers for differentiated cells (oligodendrocytes, Astrocytes, Neuron,…) present a higher value of latent time). However, when I bach correct and integrate the same datasets with scvi, while the general clusterings remain more or less the same (less defined cycling populations but in general quite well), the scVelo analysis get hard to understand and clearly doesn’t match with the observed results in the Harmony integrated analysis.

I am trying to figure out if this could be related someway with how scvi manage the spliced and unspliced matrices or if maybe I would need to bach correct those matrices before moving to scVelo.

Thank you very much

Can you describe the workflow in greater detail? What input are you giving to scVI and what to harmony? Both methods return integrated latent representations, and scVelo will be the same either way, correct? The only difference is the umap will be slightly different?

Hi, Thank you for your reply. the code I use to do the processing is as follow:

I setup the dataset for both harmony and scvi as follow:

import scanpy as sc
   #Create adata and basic filtering
adata1 = sc.read('path/adata1.loom', cache=True, var_names='var_names', obs_names='obs_names', validate=False)
adata2 = sc.read('path/adata2.loom', cache=True, var_names='var_names', obs_names='obs_names', validate=False)
adata3 = sc.read('path/adata3.loom', cache=True, var_names='var_names', obs_names='obs_names', validate=False)
adata4 = sc.read('path/adata4.loom', cache=True, var_names='var_names', obs_names='obs_names', validate=False)
adata5 = sc.read('path/adata5.loom', cache=True, var_names='var_names', obs_names='obs_names', validate=False)
#the same with the others 5 samples..

adata1.obs["sample"]='glioma 1'
adata2.obs["sample"]='glioma 2'
adata3.obs["sample"]='glioma 3'
adata4.obs["sample"]='glioma 4'
adata5.obs["sample"]='glioma 5'
#the same with the others 5 samples..

adata = adata1.concatenate(adata2, adata3, adata4, adata5, adata6, 
                                                    adata7,adata8,adata9,adata10,  join='outer')
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pl.highest_expr_genes(adata, n_top=20, )
mito_genes = adata.var.index.str.startswith('mt-')
adata.obs['percent_mito'] = np.sum(adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
adata.obs['n_counts'] = adata.X.sum(axis=1).A1

   #Filter cells by QC
adata = adata[adata.obs.n_genes < 6000, :]
adata = adata[adata.obs.n_counts < 25000, :]
adata = adata[adata.obs.n_counts > 1000, :]
adata = adata[adata.obs.percent_mito < 0.06, :]
sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'], jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='n_counts', y='percent_mito')
sc.pl.scatter(adata, x='n_counts', y='n_genes')

   #Normalize data
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

Then my code for scanpy + harmony is:

   #Find variable genes, regress non relevant variants, calculate PCA
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=6, min_disp=0.10)
sc.pl.highly_variable_genes(adata)
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'], n_jobs=56)
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack', n_comps=100)
sc.pl.pca_variance_ratio(adata, log=True, n_pcs = 100)
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=75)

   #First view of all data set without integration
sc.tl.umap(adata)
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['leiden'])

   #run harmony and generate new UMAP
import harmonypy as hm
adata.obsm['X_pca_old'] = adata.obsm['X_pca']
adata.obsm['X_umap_old'] = adata.obsm['X_umap']
data_mat = adata.obsm['X_pca']
meta_data = adata.obs
vars_use = ['sample']
ho = hm.run_harmony(data_mat, meta_data, vars_use)     
adjusted_pcs = pd.DataFrame(ho.Z_corr).T
adata.obsm['X_pca'] = adjusted_pcs.values
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=75)
sc.tl.umap(adata) maxiter=300)
sc.tl.leiden(adata)

sc.pl.umap(adata, color=['leiden'], legend_loc='on data' )

Alternatively my code for scvi to reach this point is:

import scvi
sc.pp.highly_variable_genes(adata,flavor="seurat_v3",n_top_genes=2000,batch_key="sample", subset=True)
scvi.data.setup_anndata(adata batch_key="sample")

vae = scvi.model.SCVI(adata, n_layers=2, n_latent=30)
vae.train()
adata.obsm["X_scVI"] = vae.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.leiden(adata, resolution=1)
sc.tl.umap(adata, maxiter=300)
sc.pl.umap(adata, color=["leiden"], legend_loc = 'on data')

**Finally I run scVelo as follow:**
import scvelo as scv
scv.set_figure_params()
cv.tl.recover_dynamics(adata, n_jobs=56)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
scv.pl.velocity_embedding(adata, basis='umap', color='leiden')
scv.pl.velocity_embedding_grid(adata, basis='umap', color='leiden')
scv.pl.velocity_embedding_stream(adata, basis='umap', color='leiden')

scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time', color_map='gnuplot', size=80)

#Latent Time with harmony


Regions close to 1 are by gene markers differentiated states of the tumors

#Latent Time with scvi


Can’t determine any clue of the latent time meaning

However, the UMAP clustering generated by scvi presents similar leiden clusters than with harmony in term of gene markers

I was expecting that even if the UMAP look different, since the clusters generated and the general shape is similar (cycling and differentiated cells generating cluster in the borders of the tumor UMAP) the RNA Velocity should be the same or quite similar in fact the velocity embedding presents the arrows with a similar behaviour but then the latent time looks like that…

You are using these special parameters for the umap with harmony, which alone could have a large impact on what you’re observing.

It looks like it’s also possible you’re not giving SCVI the count data, as adata.X in your code will be log normalized.

Finally, for the results to be comparable you’d have to use the same exact genes in both scVI and harmony.

Thank you very much for your help, the highly variable gene definition was the key, The UMAP are still a little different and the latent time graphs are in a different scale (a lot more cells closer to 1 than in the harmony derived graphs) but the highlighted regions in both latent times are now a lot more similar and represent the same clusters by differential gene expression.

I just need to figure out how to improve the velocity embedding graphs, since the arrows are hard to define (probably related with an attempt to draw in 2D something that is clearly of a higher dimensionality)