Hello!
I have a publicly available dataset from Smart Seq2 scRNA seq run that i would like to cluster in ScanPy. The file contains already CPM normalized and log(CPM+1) transformed data, not raw counts.
How could i tell ScanPy that the data is already normalize and log transformed?! Skipping over sc.pp.normalize_total(adata, target_sum=1e4) and sc.pp.log1p(adata) steps produces errors down the line, while normalizing/log-transforming data twice seems like a wrong way to go about it.
Any insight would be amazing, thank you!
can you provide a full script? It should be the case that you can jump straight to the PCA step
Absolutely, thank you for your response
import numpy as np
import pandas as pd
import scanpy as sc
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=120, facecolor='white')
!mkdir write_RenFiltered
results_file = 'write_RenFiltered/Ren2020_5HT_filtered.h5ad'
adata = sc.read_text('GSE135132_Adult_DR_GEO_counts_cpm.txt', delimiter=None).transpose()
adata.var_names_make_unique()
adata
sc.pp.normalize_total(adata, target_sum=1e4) #data already CPM normalized, error without this step
sc.pp.log1p(adata) #data already log transformed, error without this step
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.35) #ValueError: cannot specify integer `bins` when input data contains infinity when normalization/log steps are omitted
sc.pl.highly_variable_genes(adata)
adata.raw = adata
adata = adata[:, adata.var.highly_variable]
sc.pp.regress_out(adata, ['total_counts'])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=50)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.9)
sc.pl.umap(adata, color='leiden')
Is it clear that adata.X
contains what you want? What happens if you print(adata)
? The delimiter could be important here
Can you also display output to np.all(np.isfinite(adata.X))
and change your data with data.X[~np.isfinite(adata.X)]=0
and check the pipeline with this modified data?
Is it resolved?
np.all(np.isfinite(adata.X))
True
So the data is finite