Differences between .X, .raw.X, and .raw in anndata object

Hello!!!
I have a question about the different layers of an anndata object (==adata).
What is the difference between .X, .raw.X, and .raw? .raw.X and .raw are the same?

After the QC step, I create the layer of raw raw counts in the adata: without any further step, this layer is equal to all the three above (raw.X, .X, and .raw), right?

After the layer creation, I run these three lines
sc.pp.normalize_total(adata_reduced, target_sum = 1e4)
sc.pp.log1p(adata_reduced)
adata_reduced.raw = adata_reduced

Now, in adata.raw.X (or .raw) I have the normalized and log-norm counts, correct?
While in .X, do I still have the raw raw counts?

Thanks!!

Hi @pmarzano97,

.raw is essentially it’s own anndata object whose obs_names should be the same as it’s parent, but whose var_names can be different.

For what you’re doing, I would strongly recommend using .layers instead of .raw. Then you can do something like:

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

Now you’d have your original counts in a layer, and log-normalized values in X.

1 Like

Ok, perfect @ivirshup. I got it!
Thanks a lot!

I get the chance to ask you one more thing :sweat_smile:.
My anndata object is a huge dataset of PBMCs, and I would like to make a subset of clusters to perform a reclustering. The method that I’m currently using to integrate my dataset is scVI.

To make the reclustering, should I just create the subset like this
[1] subset = adata[adata.obs[‘leiden’].isin([‘1’, ‘6’])].copy()
OR it’s necessary to also run this
[2] subset = sc.AnnData(X = subset.X, obs = subset.obs, var = subset.var, layers = subset.layers)?
(where in subset.X there are the norm and log-norm counts, while in subset.layers there are the raw raw counts)

And also, after the subsetting, is it necessary to re-run these lines to re-norm and re-log the .X since I’m doing a subsetting, or do I have to skip these?:
[3] sc.pp.normalize_total(subset, target_sum = 1e4)
[4] sc.pp.log1p(subset)

After the subsetting (line [1]/[2], or after the norm and log [3] and [4]), I’ll run the HVG line (with putting the layer of raw_counts and the flavor ‘seurat_v3’) and the training of the model for the scVI integration step and re-do all the integration part (re-calculate the latent space, and so on), and finally I’ll re-calculate the new neighbors, PAGA, and the new UMAP, and then the leiden clustering. Correct?

Thanks a lot!

Yes, that would be correct.

However, there is already support for reclustering a subset in scanpy. You can do: sc.tl.leiden(adata, restrict_to=["1", "6"]). I would suggest using key_added along with this.

No, the normalized data would still be there.

After the subsetting (line [1]/[2], or after the norm and log [3] and [4]), I’ll run the HVG line (with putting the layer of raw_counts and the flavor ‘seurat_v3’) and the training of the model for the scVI integration step and re-do all the integration part (re-calculate the latent space, and so on), and finally I’ll re-calculate the new neighbors, PAGA, and the new UMAP, and then the leiden clustering. Correct?

Whether you do this would really depend on what you’re seeing right now. If you just want to see subclusters, I would just the the resolution and restrict_to leiden parameters.

Broadly I would just try reclustering a subset with the embedding space you’ve already computed unless you have reason to believe the current embedding does not capture some cells well (due to feature selection, not enough dimensions, or something else). However, even then I would probably try to see if I could capture these cell types better in an embedding of the whole dataset.

When I run

sc.tl.leiden(adata, resolution = 1, restrict_to = ['16', '10'], neighbors_key= "neighbors_30")```

I get error

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()
...
-> 3363         raise KeyError(key) from err
   3365 if is_scalar(key) and isna(key) and not self.hasnans:
   3366     raise KeyError(key)

@Lockhaa, could you open an issue on the scanpy issue tracker for this?

Most likely, the question is no longer relevant, but perhaps it will help someone :slightly_smiling_face:

restrict_to takes a tuple of two elements: a key and its corresponding values. In your case, I can assume that it would be correct to pass restrict_to=('leiden', ['16', '10']).
Likewise, you can use another keys (e.g. obs key with cell type annotation) to clustering specific groups

1 Like