Does processing the data always overwrite the raw counts?

I’m trying to understand the expected behavior in Scanpy regarding what happens to the different versions of the data during processing.

My understanding is that operations on the data always overwrite .X, and that you have to explicitly copy the unmodified data into a layer if you want to keep it. Is this correct?

I.e., if I have an AnnData object that only contains a raw counts matrix:

> adata = sc.read_h5ad('test.h5ad')

> adata

AnnData object with n_obs × n_vars = 4977 × 30721
    obs: 'n_genes'
    var: 'featureid', 'n_cells'
    uns: 'genome', 'modality', 'uid'

> adata.X

<4977x30721 sparse matrix of type '<class 'numpy.uint32'>'
	with 16718362 stored elements in Compressed Sparse Row format>

Then running normalization on it overwrites the counts and does not store them anywhere else (it neither creates a layer holding the raw counts nor stores them in adata.raw):

> sc.pp.normalize_total(adata)

> adata

AnnData object with n_obs × n_vars = 4977 × 30721
    obs: 'n_genes'
    var: 'featureid', 'n_cells'
    uns: 'genome', 'modality', 'uid'
 # no layers created

# .X is now float, so presumably normalized
> adata.X

<4977x30721 sparse matrix of type '<class 'numpy.float32'>'
	with 16718362 stored elements in Compressed Sparse Row format>

> adata.raw.X
AttributeError: 'NoneType' object has no attribute 'X'
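
If my reading is right, I assume the intended pattern is to snapshot .X yourself before normalizing, e.g. something like the lines below (my guess at the idiom, not something I've found spelled out in the docs):

# keep a copy of the raw counts before they get overwritten
adata.layers['counts'] = adata.X.copy()
# and/or freeze the current state into .raw
adata.raw = adata
sc.pp.normalize_total(adata)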

This is different from Pegasus, where sequential operations create new matrices in the object without the user having to request them:

For example, an object that was log-normalized with Pegasus:

adata_pg

MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 3538 x 36484
    UID: GRCh38-rna; Genome: GRCh38; Modality: rna
    It contains 2 matrices: 'counts', 'counts.log_norm'
    It currently binds to matrix 'counts.log_norm' as X
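
If I understand the Pegasus model correctly, you then switch between these stored matrices explicitly. From memory it is something like the calls below, though I'm not certain of the exact API:

# re-bind the raw counts matrix as X, or switch back to the log-normalized one
adata_pg.select_matrix('counts')
adata_pg.select_matrix('counts.log_norm')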

And it is different from Seurat, where the default behavior is to store the different versions of the data in the @counts, @data, and @scale.data slots.

Am I correct in my interpretation? And why is this not the case in Scanpy?


This is not completely clear to me either!

I think you’re right.
I also discovered that using adata = adata.raw.to_adata() not only restores adata.X but also erases the layers on it, which makes it a less preferred way to store the raw count matrix.
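
If the raw counts are kept in a layer instead (as in the workflow below), you can put them back into .X later without touching the other layers, e.g.:

# restore raw counts from the stored layer; other layers stay intact
adata.X = adata.layers['counts'].copy()
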
Now I use:

import scanpy as sc

adata = sc.read_h5ad('xxx.h5ad')
adata.layers['counts'] = adata.X.copy()   # raw counts
sc.pp.normalize_total(adata, exclude_highly_expressed=False)
adata.layers['cpm'] = adata.X.copy()      # depth-normalized counts
sc.pp.log1p(adata)
adata.layers['data'] = adata.X.copy()     # log-normalized data
sc.pp.highly_variable_genes(adata, batch_key='sample', flavor='seurat_v3', layer='counts', n_top_genes=2500)
sc.pp.regress_out(adata, ['pct_counts_mt'])  # assumes pct_counts_mt is already in .obs
sc.pp.scale(adata, max_value=10)
adata.layers['scaled'] = adata.X.copy()   # scaled data
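
One nice side effect, as far as I can tell, is that the layers are written out together with the object, so the different versions travel with the file and can be pulled back out later (the file name below is just an example):

# layers are stored in the .h5ad alongside .X
adata.write_h5ad('xxx_processed.h5ad')
# e.g. get the raw counts back as a cells x genes DataFrame
counts_df = adata.to_df(layer='counts')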

A bit clumsy, but it makes me feel safe.
