How could `adata.raw.X` contain non-integer values?

Hello scverse community,

I need some help troubleshooting an issue I do not understand.

I have a (non-public) H5ad file with a dataset that was analysed (QC, cell/gene/sample annotation, normalisation, dimensionality reduction, clustering, cluster annotation) using scanpy (but I do not have access to the actual code used for these steps).

I want to run (per-cluster) differential gene expression analyses on that dataset, so I need the raw (integer) counts.
However, the AnnData does not have any layers and adata.X contains normalised (and scaled) values.

From the AnnData documentation, I understood that adata.raw.X should contain the raw data, i.e. integer counts.

However,

import numpy as np

# np.unique returns the distinct values already sorted in ascending order
vals = np.unique(adata.raw.X.data)
print(vals)

returns

array([0.09674773, 0.10480512, 0.10735171, ..., 4.8433833 , 4.852986  ,
       4.8590107 ], dtype=float32)

.

Does anyone have a clue

  1. what those values could be? (They are not natural logs of integers, for example.) And
  2. how I could end up with these values in adata.raw.X?
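One way to narrow down question 1 is to test a concrete hypothesis, for example that the values are log1p-transformed, size-normalised counts. That is only an assumption here, not something the file itself confirms. The sketch below uses a small toy matrix in place of adata.raw.X, and the helper names are made up for illustration.

```python
import numpy as np


def looks_like_counts(values):
    """Return True if every value is a non-negative whole number."""
    values = np.asarray(values, dtype=np.float64)
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


def totals_after_expm1(matrix):
    """Undo a natural-log log1p and return the per-cell (row) totals."""
    dense = np.asarray(matrix, dtype=np.float64)
    return np.expm1(dense).sum(axis=1)


# Toy stand-in for adata.raw.X: 3 cells x 4 genes of integer counts.
counts = np.array([[0, 3, 1, 6], [2, 0, 0, 8], [5, 5, 10, 0]], dtype=np.float64)

# Simulate "normalise every cell to a total of 1e4, then log1p".
normalised = counts / counts.sum(axis=1, keepdims=True) * 1e4
logged = np.log1p(normalised)

totals = totals_after_expm1(logged)
print(looks_like_counts(logged))       # False: these are no longer integers
print(np.allclose(totals, totals[0]))  # True: every cell has the same total
```

If the per-cell totals after np.expm1 all come out (nearly) equal, that would be consistent with per-cell normalisation to a fixed total followed by a natural-log log1p. If they do not, a different log base or an extra scaling step may be involved. For a sparse adata.raw.X, converting a block of rows with .toarray() first should work.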

I was under the impression that

adata = adata.raw

should

  1. restore the raw data to adata.X and
  2. drop any layers that may exist (in my case: none) but
  3. preserve the metadata associated with the cells (so I could use the existing clusters to generate pseudobulk samples by summing up the raw counts).
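For reference, the pseudobulk step mentioned in point 3 could be sketched roughly as follows, assuming genuine integer counts are available. The toy matrix, the gene names, and the "sample" and "cluster" columns are hypothetical stand-ins for the count matrix and adata.obs.

```python
import numpy as np
import pandas as pd

# Toy integer counts: 4 cells x 3 genes (stand-in for a raw count matrix).
counts = np.array([[1, 0, 2], [0, 3, 1], [4, 1, 0], [2, 2, 2]])
genes = ["g1", "g2", "g3"]

# Hypothetical cell metadata, playing the role of adata.obs.
obs = pd.DataFrame(
    {"sample": ["s1", "s1", "s2", "s2"], "cluster": ["T", "B", "T", "T"]},
    index=["cell1", "cell2", "cell3", "cell4"],
)

# Sum the counts of all cells sharing the same (sample, cluster) pair.
cell_by_gene = pd.DataFrame(counts, index=obs.index, columns=genes)
pseudobulk = cell_by_gene.groupby([obs["sample"], obs["cluster"]]).sum()
print(pseudobulk)
```

Summing only makes sense on unnormalised counts, which is why the content of adata.raw.X matters here.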

Is this a wrong assumption of mine or a bug somewhere?

Thank you in advance for your help!

Cheers,

Marcel

I managed to get the code that produced the H5ad file and I found a (post-normalisation) adata.raw = adata line in there. So that explains it.

I must have misunderstood the API: I thought that adata.raw is automatically initialised as adata.raw = adata.copy() and that the only supported way to edit it manually is to remove it entirely via adata.raw = None (presumably to produce smaller objects/files for sharing when the raw data is no longer needed downstream).
However, in my case adata.raw was apparently used to store the full (but already normalised!) gene/cell matrix before subsetting adata.X to the highly variable genes for further analyses.

Could anyone point me towards up-to-date documentation on conventions regarding the use of adata.raw and/or the preservation of unnormalized count data (e.g. for differential expression analyses)?

Hey Marcel,

I have faced a similar issue in the past, where adata.raw.X was also normalized when I normalized the counts in adata.X using sc.pp.normalize_total and sc.pp.log1p. What I have found is that storing the raw counts in a layer works as expected. I usually do adata.layers['counts'] = adata.X.copy() before normalizing.
