Guidelines to use scanpy layers

Hello everyone,

When using scanpy, I am frequently facing issues about what exact data should I use (raw counts, CPM, log, z-score …) to apply tools / plots function. By default, these functions will apply on adata.X (or on adata.raw if is has been stored beforehand, and we select use_raw=True).

Ideally I would like to have the choice on which exact data I want to apply a function. I am starting to use layers for this purpose but I do not find documentation on how exactly they should be used ?

Let’s say I have raw counts stored in adata.X. Does it make sense to store different layers of my data, with different preprocessing strategies ?

# store CPM counts 
xd.layers["CPM"] = xd.X.copy()
sc.pp.normalize_total(xd, target_sum=1e6, layer="CPM")

# store log on CPM counts
xd.layers["log(CPM)"] = xd.layers["CPM"].copy()
sc.pp.log1p(xd, layer="log(CPM)")

Is it the right way to do ? Do you have any other suggestions or guidelines ?

Thanks you very much !

Yes, this makes sense and is the current intended way to do it.

To answer the question “when to use which version of the data”, I’d encourage you to read this paper:

1 Like

Thank you very much @Zethson

Is this really the intended way?

What is in adata.X before running the code below? Is it the raw counts or normalized counts (CPM?)

Quoting OP’s code with some comments:

# Store CPM counts  

# My comment: (assuming `adata.X` already contains CPMs, but then why normalize again?)
adata.layers["CPM"] = adata.X.copy() 

# Normalize adata.layers['CPM'] 
# My comment: (adata.layers['CPM']  is currently exquivalent to `adata.X.`
# This will update `adata.X` to the result of `normalize_total`.
sc.pp.normalize_total(adata, target_sum=1e6, layer="CPM")

# store log on CPM counts

# My comment: This will create a new layer ["log(CPM)"] which has the exact contents of ['CPM']
adata.layers["log(CPM)"] = adata.layers["CPM"].copy()
# T# My comment: This will log transform the contents of ["log(CPM)"], which is equal to ['CPM'].
sc.pp.log1p(adata, layer="log(CPM)")

I don’t know, but this seems very redundant to and unclear to me. My guess is OP is interpreting the creation of layers incorrectly.

I think what OP is thinking is more along the lines of:

# Save untransformed counts in new layer 'counts' so we have a backup once we start processing the data below:

adata.layers['counts'] = adata.X.copy()  # Assuming adata.X is unprocessed counts.

# Normalize adata.X (counts) and save result in layer['CPM']:

anndata.layers['CPM'] = sc.pp.normalize_total(adata, target_sum=1e6, copy=True) # copy=True: do not update adata.X

# Log-transformed normalized data in `.layers['CPM']` and save result in layer['log(CPM)']:
adata.layers['log(CPM)'] = sc.pp.log1p(adata, layer='CPM', copy=True) # copy=True: do not update adata.X