Hello!!!
I have a question about the different layers of an anndata object (==adata).
What is the difference between .X, .raw.X, and .raw? .raw.X and .raw are the same?

After the QC step, I create the layer of raw raw counts in the adata: without any further step, this layer is equal to all the three above (raw.X, .X, and .raw), right?

After the layer creation, I run these three lines
sc.pp.normalize_total(adata_reduced, target_sum = 1e4)
sc.pp.log1p(adata_reduced)
adata_reduced.raw = adata_reduced

Now, in adata.raw.X (or .raw) I have the normalized and log-norm counts, correct?
While in .X, do I still have the raw raw counts?

I get the chance to ask you one more thing .
My anndata object is a huge dataset of PBMCs, and I would like to make a subset of clusters to perform a reclustering. The method that I’m currently using to integrate my dataset is scVI.

To make the reclustering, should I just create the subset like this
[1] subset = adata[adata.obs[‘leiden’].isin([‘1’, ‘6’])].copy()
OR it’s necessary to also run this
[2] subset = sc.AnnData(X = subset.X, obs = subset.obs, var = subset.var, layers = subset.layers)?
(where in subset.X there are the norm and log-norm counts, while in subset.layers there are the raw raw counts)

And also, after the subsetting, is it necessary to re-run these lines to re-norm and re-log the .X since I’m doing a subsetting, or do I have to skip these?:
[3] sc.pp.normalize_total(subset, target_sum = 1e4)
[4] sc.pp.log1p(subset)

After the subsetting (line [1]/[2], or after the norm and log [3] and [4]), I’ll run the HVG line (with putting the layer of raw_counts and the flavor ‘seurat_v3’) and the training of the model for the scVI integration step and re-do all the integration part (re-calculate the latent space, and so on), and finally I’ll re-calculate the new neighbors, PAGA, and the new UMAP, and then the leiden clustering. Correct?

However, there is already support for reclustering a subset in scanpy. You can do: sc.tl.leiden(adata, restrict_to=["1", "6"]). I would suggest using key_added along with this.

No, the normalized data would still be there.

After the subsetting (line [1]/[2], or after the norm and log [3] and [4]), I’ll run the HVG line (with putting the layer of raw_counts and the flavor ‘seurat_v3’) and the training of the model for the scVI integration step and re-do all the integration part (re-calculate the latent space, and so on), and finally I’ll re-calculate the new neighbors, PAGA, and the new UMAP, and then the leiden clustering. Correct?

Whether you do this would really depend on what you’re seeing right now. If you just want to see subclusters, I would just the the resolution and restrict_toleiden parameters.

Broadly I would just try reclustering a subset with the embedding space you’ve already computed unless you have reason to believe the current embedding does not capture some cells well (due to feature selection, not enough dimensions, or something else). However, even then I would probably try to see if I could capture these cell types better in an embedding of the whole dataset.