Identifying pseudotime-associated genes with GAM

I am trying to find genes whose expression changes along pseudotime as determined using sc.tl.dpt, but I’m not having much success. All genes seem to be highly correlated to pseudotime values, which isn’t the truth. Hoping to get some insight/advice.

1. Calculate pseudotime with DPT

# Make sure X has log-normalized counts.
adata.X = adata.layers['log_norm']

sc.pp.neighbors(adata, n_neighbors=20, use_rep='X', method='gauss')
sc.tl.diffmap(adata)

# Set the root cell to be the first cell in cluster 8
adata.uns['iroot'] = np.flatnonzero(adata.obs['leiden_new_pg_1'] == '8')[0]

sc.tl.dpt(adata, n_branchings=0, n_dcs=15)

sc.pl.diffmap(adata, color=['dpt_pseudotime', 'leiden_new_pg_1'])


# Project the pseudotime onto the UMAP embedding:

sc.pl.umap(rgc_cells, color=['dpt_pseudotime', 'leiden_new_pg_1'])


2. GAM to find pseudotime-associated genes (inspired by this [tutorial], section 11.3 (11 Trajectory inference | Analysis of single cell RNA-seq data))

# Get pseudotime values
pseudotime = adata.obs['dpt_pseudotime']

# Get log1p transformed expression values
log_expr = adata.X.toarray()

# Get the 1000 most variable genes
gene_vars = np.var(log_expr, axis=0) # Calculate variance of each gene across rows (cells).
top_genes_idxs = np.argsort(gene_vars)[-1000:]
top_genes_names = adata.var_names.take(top_genes_idxs)

# Subset expression matrix to top genes only
expr_top_genes = log_expr[:, top_genes_idxs]

# Initialize empty DataFrame to store p-values
gam_pvals = pd.DataFrame(columns=top_genes_names)

# Fit GAM to each gene using pseudotime as predictor
for i, gene_idx in enumerate(top_genes_idxs):
    gam = pygam.LinearGAM().fit(pseudotime, expr_top_genes[:, i])
    gam_pvals.iloc[:, i] = gam.statistics_['p_values']

3. Get genes with significant association to pseudotime

PROBLEM: All p-values are super low, and sig_genes is almost the same length as the input genes (1000).

I’ve tried decreasing the number of splines, e.g. pygam.LinearGAM(splines=5) but get this result either way.


sig_genes = gam_pvals.columns[gam_pvals.max(axis=0) < 0.01]

sig_genes

Index(['SEMA5A', 'TPST1', 'KCNMB4', 'KLHL24', 'ADGRL2', 'ZNF280D', 'MYL12A',
       'MAGED1', 'DENND5A', 'ZNF730',
       ...
       'FSTL4', 'HES1', 'GAP43', 'TOP2A', 'HMGB2', 'ERBB4', 'VIM', 'HIST1H4C',
       'CCND1', 'STMN2'],
      dtype='object', name='featurekey', **length=956**)

# Plot heatmap of 100 significant genes

# Sort cells by pseudotime
adata = adata[adata.obs['dpt_pseudotime'].argsort()]

# Subset the data to the significant genes
adata_sub = adata[:, sig_genes[:100].copy()

# Plot the heatmap
sc.pl.heatmap(adata_sub, standard_scale='var',  groupby='leiden_new_pg_1',
              var_names=sig_genes, show_gene_labels=True,
              use_raw=False, swap_axes=True, dendrogram=False, 
              figsize=(10, 20))

Thank you so much for your help!

Hey Natalia,
It seems you are getting exactly the expected results!
The signal most influential on your pseudotime is the cluster association, so it makes perfect sense that the cluster-defining genes are also the pseudotime significant genes, as the heatmap shows.
What were you expecting to get?