How to ScanPy cluster already CPM normalized/log transformed data?

NeuronNewton · May 18, 2022, 11:30pm

Hello!
I have a publicly available dataset from Smart Seq2 scRNA seq run that i would like to cluster in ScanPy. The file contains already CPM normalized and log(CPM+1) transformed data, not raw counts.
How could i tell ScanPy that the data is already normalize and log transformed?! Skipping over sc.pp.normalize_total(adata, target_sum=1e4) and sc.pp.log1p(adata) steps produces errors down the line, while normalizing/log-transforming data twice seems like a wrong way to go about it.
Any insight would be amazing, thank you!

adamgayoso · May 19, 2022, 5:07pm

can you provide a full script? It should be the case that you can jump straight to the PCA step

NeuronNewton · May 19, 2022, 5:28pm

Absolutely, thank you for your response

import numpy as np
import pandas as pd
import scanpy as sc

sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=120, facecolor='white')

!mkdir write_RenFiltered
results_file = 'write_RenFiltered/Ren2020_5HT_filtered.h5ad'

adata = sc.read_text('GSE135132_Adult_DR_GEO_counts_cpm.txt', delimiter=None).transpose()

adata.var_names_make_unique() 

adata

sc.pp.normalize_total(adata, target_sum=1e4)  #data already CPM normalized, error without this step

sc.pp.log1p(adata)   #data already log transformed, error without this step 

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.35)  #ValueError: cannot specify integer `bins` when input data contains infinity when normalization/log steps are omitted

sc.pl.highly_variable_genes(adata)

adata.raw = adata

adata = adata[:, adata.var.highly_variable]

sc.pp.regress_out(adata, ['total_counts'])

sc.pp.scale(adata, max_value=10)

sc.tl.pca(adata, svd_solver='arpack')

sc.pl.pca_variance_ratio(adata, log=True)

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=50)

sc.tl.umap(adata)

sc.tl.leiden(adata, resolution=0.9)

sc.pl.umap(adata, color='leiden')

adamgayoso · May 19, 2022, 8:55pm

Is it clear that adata.X contains what you want? What happens if you print(adata)? The delimiter could be important here

NeuronNewton · May 19, 2022, 9:52pm

Absolutely!
I know that delimiter in the txt file is a space [’ '].

cane11 · May 20, 2022, 7:08pm

Can you also display output to np.all(np.isfinite(adata.X)) and change your data with data.X[~np.isfinite(adata.X)]=0 and check the pipeline with this modified data?

hiraksarkar · October 1, 2023, 4:01pm

Is it resolved?

np.all(np.isfinite(adata.X))

True

So the data is finite

Topic		Replies	Views
Normalization on rpkm data scanpy	0	234	September 11, 2023
Apply sc.pp.highly_variable_genes to SCT-normalized residuals or counts scanpy	0	264	July 27, 2023
Scanpy.rank_genes_groups after pp.regress_out Help	0	39	February 27, 2025
Guidelines to use scanpy layers scanpy	3	4062	February 2, 2023
Regress out cell cycle scanpy	0	1664	April 20, 2023

How to ScanPy cluster already CPM normalized/log transformed data?

Related topics