Does processing the data always overwrite the raw counts?

I’m trying to understand the expected behavior in Scanpy regarding what happens to the different versions of the data during processing.

My understanding is that operations on the data always overwrite .X, and that you have to explicitly copy the unmodified data into a layer if you want to keep it. Is this correct?

I.e., if I have an AnnData object that only contains a raw counts matrix:

> adata = sc.read_h5ad('test.h5ad')

> adata

AnnData object with n_obs × n_vars = 4977 × 30721
    obs: 'n_genes'
    var: 'featureid', 'n_cells'
    uns: 'genome', 'modality', 'uid'

> adata.X

<4977x30721 sparse matrix of type '<class 'numpy.uint32'>'
	with 16718362 stored elements in Compressed Sparse Row format>

Then running normalization on it overwrites the counts and does not store them anywhere else (it neither creates a layer holding the raw counts nor stores them in adata.raw):

> sc.pp.normalize_total(adata)

> adata

AnnData object with n_obs × n_vars = 4977 × 30721
    obs: 'n_genes'
    var: 'featureid', 'n_cells'
    uns: 'genome', 'modality', 'uid'
 # no layers created

# .X is now float, so presumably normalized
> adata.X

<4977x30721 sparse matrix of type '<class 'numpy.float32'>'
	with 16718362 stored elements in Compressed Sparse Row format>

> adata.raw.X
AttributeError: 'NoneType' object has no attribute 'X'
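
If my reading is right, I assume the intended pattern is to snapshot .X yourself before normalizing, e.g. something like the lines below (my guess at the idiom, not something I've found spelled out in the docs):

# keep a copy of the raw counts before they get overwritten
adata.layers['counts'] = adata.X.copy()
# and/or freeze the current state into .raw
adata.raw = adata
sc.pp.normalize_total(adata)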

This is different from Pegasus, where sequential operations create new matrices in the object without the user having to request them:

For example, an object that was log-normalized with Pegasus:

adata_pg

MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 3538 x 36484
    UID: GRCh38-rna; Genome: GRCh38; Modality: rna
    It contains 2 matrices: 'counts', 'counts.log_norm'
    It currently binds to matrix 'counts.log_norm' as X
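
If I understand the Pegasus model correctly, you then switch between these stored matrices explicitly. From memory it is something like the calls below, though I'm not certain of the exact API:

# re-bind the raw counts matrix as X, or switch back to the log-normalized one
adata_pg.select_matrix('counts')
adata_pg.select_matrix('counts.log_norm')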

And it is different from Seurat, where the default behavior is to store the different versions of the data in the @counts, @data, and @scale.data slots.

Am I correct in my interpretation? And why is this not the case in Scanpy?


This is not completely clear to me either!

I think you’re right.
I also discovered that using adata = adata.raw.to_adata() not only restores adata.X but also erases the layers on it, which makes it a less preferred way to store the raw count matrix.
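
If the raw counts are kept in a layer instead (as in the workflow below), you can put them back into .X later without touching the other layers, e.g.:

# restore raw counts from the stored layer; other layers stay intact
adata.X = adata.layers['counts'].copy()
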
Now I use:

import scanpy as sc

adata = sc.read_h5ad('xxx.h5ad')
adata.layers['counts'] = adata.X.copy()   # raw counts
sc.pp.normalize_total(adata, exclude_highly_expressed=False)
adata.layers['cpm'] = adata.X.copy()      # depth-normalized counts
sc.pp.log1p(adata)
adata.layers['data'] = adata.X.copy()     # log-normalized data
sc.pp.highly_variable_genes(adata, batch_key='sample', flavor='seurat_v3', layer='counts', n_top_genes=2500)
sc.pp.regress_out(adata, ['pct_counts_mt'])  # assumes pct_counts_mt is already in .obs
sc.pp.scale(adata, max_value=10)
adata.layers['scaled'] = adata.X.copy()   # scaled data
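
One nice side effect, as far as I can tell, is that the layers are written out together with the object, so the different versions travel with the file and can be pulled back out later (the file name below is just an example):

# layers are stored in the .h5ad alongside .X
adata.write_h5ad('xxx_processed.h5ad')
# e.g. get the raw counts back as a cells x genes DataFrame
counts_df = adata.to_df(layer='counts')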

A bit clumsy, but it makes me feel safe.
