Guidelines for using scanpy layers

Hello everyone,

When using scanpy, I frequently run into the question of which exact data (raw counts, CPM, log, z-score …) I should use when applying tool or plotting functions. By default, these functions operate on adata.X (or on adata.raw if it has been stored beforehand and use_raw=True is selected).

Ideally, I would like to choose exactly which data a function is applied to. I am starting to use layers for this purpose, but I cannot find documentation on how exactly they should be used.

Let’s say I have raw counts stored in adata.X. Does it make sense to store different layers of my data, each with a different preprocessing strategy?

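import scanpy as sc

# store CPM counts in their own layer (adata.X keeps the raw counts)
adata.layers["CPM"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e6, layer="CPM")

# store log1p-transformed CPM counts in a second layer
adata.layers["log(CPM)"] = adata.layers["CPM"].copy()
sc.pp.log1p(adata, layer="log(CPM)")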

Is this the right way to do it? Do you have any other suggestions or guidelines?

Thank you very much!
Best,
Paul

Yes, this makes sense and is the current intended way to do it.
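
For reference, here is a rough NumPy-only sketch of what the two layers from the question should end up containing, assuming normalize_total is run with default settings apart from target_sum=1e6. The toy count matrix and the plain dict standing in for adata.layers are made up for illustration and are not part of scanpy or AnnData.

```python
import numpy as np

# Toy raw count matrix: 3 cells (rows) x 4 genes (columns); values are made up.
counts = np.array([
    [10, 0, 30, 60],
    [5, 5, 0, 10],
    [0, 2, 2, 4],
], dtype=float)

# Plain dict used as a hypothetical stand-in for adata.layers.
layers = {}

# "CPM": scale each cell (row) so that its counts sum to one million.
totals = counts.sum(axis=1, keepdims=True)
layers["CPM"] = counts / totals * 1e6

# "log(CPM)": natural log of (1 + CPM), stored as a separate array
# so that the CPM layer and the raw counts are left untouched.
layers["log(CPM)"] = np.log1p(layers["CPM"])
```

Because each layer is its own array, the raw counts stay available next to the transformed versions, which is the point of copying before transforming in the question's code.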

To answer the question “when to use which version of the data”, I’d encourage you to read this paper: https://www.embopress.org/doi/full/10.15252/msb.20188746

Thank you very much @Zethson