I’m trying to understand the expected behavior in Scanpy re: what happens to different versions of the data during processing.
It’s my understanding that operations on the data always overwrite .X,
and that you have to explicitly copy the unmodified data into a layer if you want to keep it. Is this correct?
i.e. if I have an AnnData object that only has a raw counts matrix:
> adata = sc.read_h5ad('test.h5ad')
> adata
AnnData object with n_obs × n_vars = 4977 × 30721
obs: 'n_genes'
var: 'featureid', 'n_cells'
uns: 'genome', 'modality', 'uid'
> adata.X
<4977x30721 sparse matrix of type '<class 'numpy.uint32'>'
with 16718362 stored elements in Compressed Sparse Row format>
Then running normalization overwrites the counts without storing them anywhere else
(it does not create a layer for the raw counts, nor store them in adata.raw):
> sc.pp.normalize_total(adata)
> adata
AnnData object with n_obs × n_vars = 4977 × 30721
obs: 'n_genes'
var: 'featureid', 'n_cells'
uns: 'genome', 'modality', 'uid'
# no layers created
# .X is now float, so presumably normalized
> adata.X
<4977x30721 sparse matrix of type '<class 'numpy.float32'>'
with 16718362 stored elements in Compressed Sparse Row format>
> adata.raw.X
AttributeError: 'NoneType' object has no attribute 'X'
This is different from pegasus,
where sequential operations create new matrices in the object without the user having to specify anything.
For example, an object that was log-normalized with pegasus:
> adata_pg
MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
It currently binds to UnimodalData object GRCh38-rna
UnimodalData object with n_obs x n_vars = 3538 x 36484
UID: GRCh38-rna; Genome: GRCh38; Modality: rna
It contains 2 matrices: 'counts', 'counts.log_norm'
It currently binds to matrix 'counts.log_norm' as X
And it is different from Seurat, where the default behavior is to store the different versions of the data in @counts, @data, and @scale.data.
Am I correct in my interpretation? And why is this not the case in Scanpy?