When using scanpy, I am frequently facing issues about what exact data should I use (raw counts, CPM, log, z-score …) to apply tools / plots function. By default, these functions will apply on adata.X (or on adata.raw if is has been stored beforehand, and we select use_raw=True).
Ideally I would like to have the choice on which exact data I want to apply a function. I am starting to use layers for this purpose but I do not find documentation on how exactly they should be used ?
Let’s say I have raw counts stored in adata.X. Does it make sense to store different layers of my data, with different preprocessing strategies ?
# store CPM counts
xd.layers["CPM"] = xd.X.copy()
sc.pp.normalize_total(xd, target_sum=1e6, layer="CPM")
# store log on CPM counts
xd.layers["log(CPM)"] = xd.layers["CPM"].copy()
sc.pp.log1p(xd, layer="log(CPM)")
Is it the right way to do ? Do you have any other suggestions or guidelines ?
What is in adata.X before running the code below? Is it the raw counts or normalized counts (CPM?)
Quoting OP’s code with some comments:
# Store CPM counts
# My comment: (assuming `adata.X` already contains CPMs, but then why normalize again?)
adata.layers["CPM"] = adata.X.copy()
# Normalize adata.layers['CPM']
# My comment: (adata.layers['CPM'] is currently exquivalent to `adata.X.`
# This will update `adata.X` to the result of `normalize_total`.
sc.pp.normalize_total(adata, target_sum=1e6, layer="CPM")
# store log on CPM counts
# My comment: This will create a new layer ["log(CPM)"] which has the exact contents of ['CPM']
adata.layers["log(CPM)"] = adata.layers["CPM"].copy()
# T# My comment: This will log transform the contents of ["log(CPM)"], which is equal to ['CPM'].
sc.pp.log1p(adata, layer="log(CPM)")
I don’t know, but this seems very redundant to and unclear to me. My guess is OP is interpreting the creation of layers incorrectly.
I think what OP is thinking is more along the lines of:
# Save untransformed counts in new layer 'counts' so we have a backup once we start processing the data below:
adata.layers['counts'] = adata.X.copy() # Assuming adata.X is unprocessed counts.
# Normalize adata.X (counts) and save result in layer['CPM']:
anndata.layers['CPM'] = sc.pp.normalize_total(adata, target_sum=1e6, copy=True) # copy=True: do not update adata.X
# Log-transformed normalized data in `.layers['CPM']` and save result in layer['log(CPM)']:
adata.layers['log(CPM)'] = sc.pp.log1p(adata, layer='CPM', copy=True) # copy=True: do not update adata.X