Differences between .X, .raw.X, and .raw in anndata object

pmarzano97 · March 4, 2023, 12:16pm

Hello!!!
I have a question about the different layers of an anndata object (==adata).
What is the difference between .X, .raw.X, and .raw? .raw.X and .raw are the same?

After the QC step, I create the layer of raw raw counts in the adata: without any further step, this layer is equal to all the three above (raw.X, .X, and .raw), right?

After the layer creation, I run these three lines
sc.pp.normalize_total(adata_reduced, target_sum = 1e4)
sc.pp.log1p(adata_reduced)
adata_reduced.raw = adata_reduced

Now, in adata.raw.X (or .raw) I have the normalized and log-norm counts, correct?
While in .X, do I still have the raw raw counts?

Thanks!!

ivirshup · March 6, 2023, 10:44am

Hi @pmarzano97,

.raw is essentially it’s own anndata object whose obs_names should be the same as it’s parent, but whose var_names can be different.

For what you’re doing, I would strongly recommend using .layers instead of .raw. Then you can do something like:

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

Now you’d have your original counts in a layer, and log-normalized values in X.

pmarzano97 · March 7, 2023, 8:34am

Ok, perfect @ivirshup. I got it!
Thanks a lot!

I get the chance to ask you one more thing .
My anndata object is a huge dataset of PBMCs, and I would like to make a subset of clusters to perform a reclustering. The method that I’m currently using to integrate my dataset is scVI.

To make the reclustering, should I just create the subset like this
[1] subset = adata[adata.obs[‘leiden’].isin([‘1’, ‘6’])].copy()
OR it’s necessary to also run this
[2] subset = sc.AnnData(X = subset.X, obs = subset.obs, var = subset.var, layers = subset.layers)?
(where in subset.X there are the norm and log-norm counts, while in subset.layers there are the raw raw counts)

And also, after the subsetting, is it necessary to re-run these lines to re-norm and re-log the .X since I’m doing a subsetting, or do I have to skip these?:
[3] sc.pp.normalize_total(subset, target_sum = 1e4)
[4] sc.pp.log1p(subset)

After the subsetting (line [1]/[2], or after the norm and log [3] and [4]), I’ll run the HVG line (with putting the layer of raw_counts and the flavor ‘seurat_v3’) and the training of the model for the scVI integration step and re-do all the integration part (re-calculate the latent space, and so on), and finally I’ll re-calculate the new neighbors, PAGA, and the new UMAP, and then the leiden clustering. Correct?

Thanks a lot!

ivirshup · March 7, 2023, 11:06am

Yes, that would be correct.

However, there is already support for reclustering a subset in scanpy. You can do: sc.tl.leiden(adata, restrict_to=["1", "6"]). I would suggest using key_added along with this.

No, the normalized data would still be there.

After the subsetting (line [1]/[2], or after the norm and log [3] and [4]), I’ll run the HVG line (with putting the layer of raw_counts and the flavor ‘seurat_v3’) and the training of the model for the scVI integration step and re-do all the integration part (re-calculate the latent space, and so on), and finally I’ll re-calculate the new neighbors, PAGA, and the new UMAP, and then the leiden clustering. Correct?

Whether you do this would really depend on what you’re seeing right now. If you just want to see subclusters, I would just the the resolution and restrict_to leiden parameters.

Broadly I would just try reclustering a subset with the embedding space you’ve already computed unless you have reason to believe the current embedding does not capture some cells well (due to feature selection, not enough dimensions, or something else). However, even then I would probably try to see if I could capture these cell types better in an embedding of the whole dataset.

Lockhaa · February 6, 2024, 7:41pm

When I run

sc.tl.leiden(adata, resolution = 1, restrict_to = ['16', '10'], neighbors_key= "neighbors_30")```

I get error

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()
...
-> 3363         raise KeyError(key) from err
   3365 if is_scalar(key) and isna(key) and not self.hasnans:
   3366     raise KeyError(key)

ivirshup · February 13, 2024, 5:50pm

@Lockhaa, could you open an issue on the scanpy issue tracker for this?

estakathersun · May 10, 2024, 5:19pm

Most likely, the question is no longer relevant, but perhaps it will help someone

restrict_to takes a tuple of two elements: a key and its corresponding values. In your case, I can assume that it would be correct to pass restrict_to=('leiden', ['16', '10']).
Likewise, you can use another keys (e.g. obs key with cell type annotation) to clustering specific groups

Topic		Replies	Views
Does processing the data always overwrite the raw counts? scanpy	3	2226	April 22, 2024
Normalized data found instead of raw counts scanpy scvi , anndata	2	772	March 18, 2024
Guidelines to use scanpy layers scanpy	3	3970	February 2, 2023
Run scanpy.pp.neighbors and UMAP on a different layer other than X? scanpy	2	885	October 11, 2023
Can’t change anndata dimensions anndata	6	2036	March 9, 2023

Differences between .X, .raw.X, and .raw in anndata object

Related topics