How could `adata.raw.X` contain non-integer values?

mschilli · July 7, 2025, 1:39pm

Hello scverse community,

I need some help troubleshooting an issue I do not understand.

I have a (non-public) H5ad file with a dataset that was analysed (QC, cell/gene/sample annotation, normalisation, dimensionality reduction, clustering, cluster annotation) using scanpy (but I do not have access to the actual code used for these steps).

I want to run (per-cluster) differential gene expression analyses on that dataset, so I need the raw (integer) counts.
However, the AnnData does not have any layers and adata.X contains normalised (and scaled) values.

From the AnnData documentation, I understood that adata.raw.X should contain the raw data, i.e. integer counts.

However,

import numpy as np

vals = np.unique(adata.raw.X.data)  # np.unique already returns sorted values
print(vals)

returns

array([0.09674773, 0.10480512, 0.10735171, ..., 4.8433833 , 4.852986  ,
       4.8590107 ], dtype=float32)

Does anyone have a clue what those values could be (they are not natural logs of integers, for example) and how I could end up with these values in adata.raw.X?
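One way to narrow down what such values are is to test the usual suspect directly: library-size normalisation followed by log1p. The sketch below uses made-up toy data and hypothetical helper names (`looks_like_counts`, `per_cell_totals_after_expm1`), and the target sum of 1e4 is an assumption. If undoing log1p gives (nearly) identical totals for every cell, the matrix was probably normalised per cell and then log-transformed. It is a heuristic, not a proof.

```python
import numpy as np


def looks_like_counts(values, tol=1e-6):
    """Return True if all values are non-negative and (near-)integer."""
    values = np.asarray(values, dtype=np.float64)
    is_integer = np.abs(values - np.round(values)) <= tol
    return bool(np.all(values >= 0) and np.all(is_integer))


def per_cell_totals_after_expm1(matrix):
    """Undo a presumed log1p and return the per-cell (row) sums."""
    return np.expm1(np.asarray(matrix, dtype=np.float64)).sum(axis=1)


# Toy data (made up): 3 cells x 4 genes of integer counts.
counts = np.array(
    [[0, 3, 1, 6], [2, 0, 0, 8], [5, 5, 10, 0]], dtype=np.float64
)

# Assumed pipeline: scale each cell to the same total, then log1p.
target_sum = 1e4  # assumption; the real pipeline may have used another value
normalised = np.log1p(counts / counts.sum(axis=1, keepdims=True) * target_sum)

print(looks_like_counts(counts))      # integer counts pass the check
print(looks_like_counts(normalised))  # log-normalised values do not
print(per_cell_totals_after_expm1(normalised))  # one shared total per cell
```

For a sparse `adata.raw.X`, the same checks could be run on `adata.raw.X.data` and on `adata.raw.X.toarray()` (or a subset of rows, if the matrix is large).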

I was under the impression that

adata = adata.raw

should restore the raw data to adata.X and drop any layers that may exist (in my case: none), but preserve the metadata associated with the cells (so I could use the existing clusters to generate pseudobulk samples by summing up the raw counts).

Is this a wrong assumption of mine or a bug somewhere?

Thank you in advance for your help!

Cheers,

Marcel

mschilli · July 9, 2025, 11:06am

I managed to get the code that produced the H5ad file and I found a (post-normalisation) adata.raw = adata line in there. So that explains it.

I must have misunderstood the API: I thought that adata.raw was automatically initialised as adata.raw = adata.copy() and that the only supported way to manually edit it was to remove it entirely via adata.raw = None (supposedly to generate smaller objects/files for sharing if the raw data is not required anymore downstream).
However, in my case adata.raw was apparently used to store the full (but already normalized!) gene/cell matrix before subsetting adata.X to the set of highly variable genes only for further analyses.

Could anyone point me towards up-to-date documentation on conventions regarding the use of adata.raw and/or the preservation of unnormalized count data (e.g. for differential expression analyses)?

abs51295 · July 13, 2025, 7:56pm

Hey Marcel,

I have faced a similar issue in the past, where adata.raw.X would also get normalized when I normalized the counts in adata.X using sc.pp.normalize_total and sc.pp.log1p. What I have found is that using layers to store raw counts works as expected. Before normalizing, I usually do adata.layers['counts'] = adata.X.copy()
