Identifying pseudotime-associated genes with GAM

natalia.nutella · March 4, 2023, 12:32am

I am trying to find genes whose expression changes along pseudotime as determined using sc.tl.dpt, but I’m not having much success. All genes seem to be highly correlated to pseudotime values, which isn’t the truth. Hoping to get some insight/advice.

1. Calculate pseudotime with DPT

# Make sure X has log-normalized counts.
adata.X = adata.layers['log_norm']

sc.pp.neighbors(adata, n_neighbors=20, use_rep='X', method='gauss')
sc.tl.diffmap(adata)

# Set the root cell to be the first cell in cluster 8
adata.uns['iroot'] = np.flatnonzero(adata.obs['leiden_new_pg_1'] == '8')[0]

sc.tl.dpt(adata, n_branchings=0, n_dcs=15)

sc.pl.diffmap(adata, color=['dpt_pseudotime', 'leiden_new_pg_1'])


# Project the pseudotime onto the UMAP embedding:

sc.pl.umap(rgc_cells, color=['dpt_pseudotime', 'leiden_new_pg_1'])

2. GAM to find pseudotime-associated genes (inspired by this [tutorial], section 11.3 (11 Trajectory inference | Analysis of single cell RNA-seq data))

# Get pseudotime values
pseudotime = adata.obs['dpt_pseudotime']

# Get log1p transformed expression values
log_expr = adata.X.toarray()

# Get the 1000 most variable genes
gene_vars = np.var(log_expr, axis=0) # Calculate variance of each gene across rows (cells).
top_genes_idxs = np.argsort(gene_vars)[-1000:]
top_genes_names = adata.var_names.take(top_genes_idxs)

# Subset expression matrix to top genes only
expr_top_genes = log_expr[:, top_genes_idxs]

# Initialize empty DataFrame to store p-values
gam_pvals = pd.DataFrame(columns=top_genes_names)

# Fit GAM to each gene using pseudotime as predictor
for i, gene_idx in enumerate(top_genes_idxs):
    gam = pygam.LinearGAM().fit(pseudotime, expr_top_genes[:, i])
    gam_pvals.iloc[:, i] = gam.statistics_['p_values']

3. Get genes with significant association to pseudotime

PROBLEM: All p-values are super low, and sig_genes is almost the same length as the input genes (1000).

I’ve tried decreasing the number of splines, e.g. pygam.LinearGAM(splines=5) but get this result either way.


sig_genes = gam_pvals.columns[gam_pvals.max(axis=0) < 0.01]

sig_genes

Index(['SEMA5A', 'TPST1', 'KCNMB4', 'KLHL24', 'ADGRL2', 'ZNF280D', 'MYL12A',
       'MAGED1', 'DENND5A', 'ZNF730',
       ...
       'FSTL4', 'HES1', 'GAP43', 'TOP2A', 'HMGB2', 'ERBB4', 'VIM', 'HIST1H4C',
       'CCND1', 'STMN2'],
      dtype='object', name='featurekey', **length=956**)

# Plot heatmap of 100 significant genes

# Sort cells by pseudotime
adata = adata[adata.obs['dpt_pseudotime'].argsort()]

# Subset the data to the significant genes
adata_sub = adata[:, sig_genes[:100].copy()

# Plot the heatmap
sc.pl.heatmap(adata_sub, standard_scale='var',  groupby='leiden_new_pg_1',
              var_names=sig_genes, show_gene_labels=True,
              use_raw=False, swap_axes=True, dendrogram=False, 
              figsize=(10, 20))

Thank you so much for your help!

yotamcons · March 16, 2023, 8:41am

Hey Natalia,
It seems you are getting exactly the expected results!
The signal most influential on your pseudotime is the cluster association, so it makes perfect sense that the cluster-defining genes are also the pseudotime significant genes, as the heatmap shows.
What were you expecting to get?

Topic		Replies	Views
Two root for the pseudotime? cellrank	3	731	June 20, 2022
Extract high likelihood genes identified in latent time scVelo	0	451	December 6, 2022
Differential Gene Expression scvi-tools diff-exp	2	799	March 20, 2021
Subsetting Anndata based on Multiple Marker Gene Expression Thresholds anndata	0	976	July 12, 2022
Scvi-tools: different DEGs when comparing same data scvi-tools integration , diff-exp , scvi	2	76	November 19, 2024

Identifying pseudotime-associated genes with GAM

1. Calculate pseudotime with DPT

2. GAM to find pseudotime-associated genes (inspired by this [tutorial], section 11.3 (11 Trajectory inference | Analysis of single cell RNA-seq data))

3. Get genes with significant association to pseudotime

Related topics