I’m trying to understand the expected behavior in Scanpy re: what happens to different versions of the data during processing.
It’s my understanding that operations on the data always overwrite .X, and that you have to explicitly copy the pre-modified data into a layer if you want to keep it. Is this correct?
For example, if I have an AnnData object that only has a raw counts matrix:
> adata = sc.read_h5ad('test.h5ad')
> adata
AnnData object with n_obs × n_vars = 4977 × 30721
    obs: 'n_genes'
    var: 'featureid', 'n_cells'
    uns: 'genome', 'modality', 'uid'
> adata.X
<4977x30721 sparse matrix of type '<class 'numpy.uint32'>'
    with 16718362 stored elements in Compressed Sparse Row format>
Then running normalization on it overwrites the counts and does not store them anywhere else (it does not create a layer for the raw counts and does not store them in adata.raw):
> scanpy.pp.normalize_total(adata)
> adata
AnnData object with n_obs × n_vars = 4977 × 30721
    obs: 'n_genes'
    var: 'featureid', 'n_cells'
    uns: 'genome', 'modality', 'uid'
# no layers created
# .X is now float, so presumably normalized
> adata.X
<4977x30721 sparse matrix of type '<class 'numpy.float32'>'
    with 16718362 stored elements in Compressed Sparse Row format>
> adata.raw.X
AttributeError: 'NoneType' object has no attribute 'X'
This is different from Pegasus, where sequential operations create new matrices in the object without the user having to specify anything.
For example, an object that was log-normalized with Pegasus:
> adata_pg
MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 3538 x 36484
    UID: GRCh38-rna; Genome: GRCh38; Modality: rna
    It contains 2 matrices: 'counts', 'counts.log_norm'
    It currently binds to matrix 'counts.log_norm' as X
And different from Seurat, where default behavior is to store the different versions of the data in
Am I correct in my interpretation? And why is this not the case in Scanpy?