I am trying to find genes whose expression changes along pseudotime as determined using sc.tl.dpt
, but I’m not having much success. All genes seem to be highly correlated to pseudotime values, which isn’t the truth. Hoping to get some insight/advice.
1. Calculate pseudotime with DPT
# Make sure X has log-normalized counts.
adata.X = adata.layers['log_norm']
sc.pp.neighbors(adata, n_neighbors=20, use_rep='X', method='gauss')
sc.tl.diffmap(adata)
# Set the root cell to be the first cell in cluster 8
adata.uns['iroot'] = np.flatnonzero(adata.obs['leiden_new_pg_1'] == '8')[0]
sc.tl.dpt(adata, n_branchings=0, n_dcs=15)
sc.pl.diffmap(adata, color=['dpt_pseudotime', 'leiden_new_pg_1'])
# Project the pseudotime onto the UMAP embedding:
sc.pl.umap(rgc_cells, color=['dpt_pseudotime', 'leiden_new_pg_1'])
2. GAM to find pseudotime-associated genes (inspired by this [tutorial], section 11.3 (11 Trajectory inference | Analysis of single cell RNA-seq data))
# Get pseudotime values
pseudotime = adata.obs['dpt_pseudotime']
# Get log1p transformed expression values
log_expr = adata.X.toarray()
# Get the 1000 most variable genes
gene_vars = np.var(log_expr, axis=0) # Calculate variance of each gene across rows (cells).
top_genes_idxs = np.argsort(gene_vars)[-1000:]
top_genes_names = adata.var_names.take(top_genes_idxs)
# Subset expression matrix to top genes only
expr_top_genes = log_expr[:, top_genes_idxs]
# Initialize empty DataFrame to store p-values
gam_pvals = pd.DataFrame(columns=top_genes_names)
# Fit GAM to each gene using pseudotime as predictor
for i, gene_idx in enumerate(top_genes_idxs):
gam = pygam.LinearGAM().fit(pseudotime, expr_top_genes[:, i])
gam_pvals.iloc[:, i] = gam.statistics_['p_values']
3. Get genes with significant association to pseudotime
PROBLEM: All p-values are super low, and sig_genes
is almost the same length as the input genes (1000).
I’ve tried decreasing the number of splines, e.g. pygam.LinearGAM(splines=5)
but get this result either way.
sig_genes = gam_pvals.columns[gam_pvals.max(axis=0) < 0.01]
sig_genes
Index(['SEMA5A', 'TPST1', 'KCNMB4', 'KLHL24', 'ADGRL2', 'ZNF280D', 'MYL12A',
'MAGED1', 'DENND5A', 'ZNF730',
...
'FSTL4', 'HES1', 'GAP43', 'TOP2A', 'HMGB2', 'ERBB4', 'VIM', 'HIST1H4C',
'CCND1', 'STMN2'],
dtype='object', name='featurekey', **length=956**)
# Plot heatmap of 100 significant genes
# Sort cells by pseudotime
adata = adata[adata.obs['dpt_pseudotime'].argsort()]
# Subset the data to the significant genes
adata_sub = adata[:, sig_genes[:100].copy()
# Plot the heatmap
sc.pl.heatmap(adata_sub, standard_scale='var', groupby='leiden_new_pg_1',
var_names=sig_genes, show_gene_labels=True,
use_raw=False, swap_axes=True, dendrogram=False,
figsize=(10, 20))
Thank you so much for your help!