Can’t change anndata dimensions

crhodes · February 19, 2023, 9:42pm

From https://scanpy.discourse.group/t/cant-change-anndata-dimensions/519/1:

Hello

I clustered some cells using the top 3000 genes with the highest variance, removed some cell types I am not interested in and now I want to cluster them again using a new set of top variable genes.

My object has 78016 obs and 3000 vars. However, all genes are still there in adata.raw.X. Because I want to select these new genes, I have tried using adata2.X=adata2.raw.X. This returns the message ValueError: Data matrix has wrong shape (78016, 22633), need to be (78016, 3000).

Anyway to overcome this?

I have same question as the unanswered question above: after normalizing, filtering and scaling genes, how can I convert back to the original count matrix (assuming no cells have been removed). This is trivial in Seurat switching between “counts” and “scaled” or other arbitrary slots.

Mayank05121 · February 20, 2023, 5:59am

Try updating your var first I think as that still has 22633 genes. Make the first var to store 3000 genes and then try this

grst · February 20, 2023, 11:46am

adata.raw was specifically designed to keep around all genes, even when selecting highly variable genes.

If you want to subset different representations of the count matrix together with .X, use adata.layers instead. See this example:

import scanpy as sc

adata = sc.datasets.pbmc3k()
adata.layers["raw_counts"] = adata.X.copy()
sc.pp.normalize_total(adata)
adata.layers["normalized"] = adata.X.copy()
sc.pp.log1p(adata)
adata.layers["log_norm"] = adata.X.copy()

sc.pp.highly_variable_genes(adata, subset=True)

adata.X.shape
# (2700, 1870)

adata.layers["raw_counts"].shape
# (2700, 1870)

Scanpy functions should accept a layer argument to specify which representation to work on.

crhodes · February 20, 2023, 9:00pm

Hi All,

Thanks for such quick responses! Apologies, I don’t think I was clear in my original post. I want to go in the “opposite” direction, to get gene info when the gene is no longer in the var/var_names.

Use case: After standard processing, I want to explore expression using multiple visualizations (e.g. sc.pl.umap, sc.pl.dotplot, etc) for genes that may have been removed during typical processing steps (e.g. sc.pl.umap(adata, color=[‘FAM138A’]) in example below). I want to keep the reduced dimensional cell embeddings and other metadata from the original adata workflow, while also viewing a gene that was originally in the count matrix but no longer exists.

Since the original cell x counts matrix is stored in adata.raw, it seems like there would be functionality to pull a gene(s) from the original matrix back into a new layer for future use.

Using the pbmc3k dataset an an example:

import scanpy as sc

adata = sc.datasets.pbmc3k()

adata.X.shape
#  (2700, 32738)

'FAM138A' in adata.var_names
#  True

adata.layers["raw_counts"] = adata.X.copy()
sc.pp.normalize_total(adata)
adata.layers["normalized"] = adata.X.copy()
sc.pp.log1p(adata)
adata.layers["log_norm"] = adata.X.copy()
sc.pp.highly_variable_genes(adata, subset=True)
sc.pp.scale(adata)
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.louvain(adata, flavor = 'igraph')

adata.X.shape
# (2700, 1870)

'FAM138A' in adata.var_names
#  False

sc.pl.umap(adata, color=['NKG7'], layer="raw_counts")    # Works fine
sc.pl.umap(adata, color=['NKG7'], layer="log_norm")    # Works fine
sc.pl.umap(adata, color=['NKG7'], layer=None)    # Works fine

sc.pl.umap(adata, color=['FAM138A'], layer="raw_counts")    # KeyError: 'Could not find key FAM138A in .var_names or .obs.columns.'
sc.pl.umap(adata, color=['FAM138A'], layer="log_norm")    # KeyError: 'Could not find key FAM138A in .var_names or .obs.columns.'
sc.pl.umap(adata, color=['FAM138A'], layer=None)    # KeyError: 'Could not find key FAM138A in .var_names or .obs.columns.'

So, I guess the question is something like: “given the adata above with adata.X.shape of (2700, 1870) lacking ‘FAM138A’ but present in adata.raw, what is the easiest/most efficient way to view ‘FAM138A’ in sc.pl.umap?”

grst · March 3, 2023, 7:07pm

Sorry for the late response, I somehow didn’t get a notification. Layers are subset together with the main AnnData, therefore, this doesn’t work in your case. There are multiple possible solutions.

Don’t use subset=True with highly_variable_genes. Is there any specific reason you need to subset? tl.pca by default uses only highly variable genes also if all genes are present.
Use adata.raw. You can store one representation for visualization (usually log-norm transformed) in adata.raw. So in your example, you could add adata.raw = adata after the log1p transformation. You cannot reassign it to adata, but this should not be necessary in the first place. Visualization functions (e.g. sc.pl.umap or sc.pl.dotplot) by default automatically use the values in adata.raw (determined by the use_raw parameter).
Use MuData. The MuData container is designed to work with multiple modalities (e.g. RNA + protein expression), but it can also be used to store different subsets of the same dataset.

crhodes · March 9, 2023, 5:11pm

Thanks for the explanation! These options should work well.

For 1, I was following the scvi-tools recommendation:

Finally, we perform feature selection, to reduce the number of features (genes in this case) used as input to the scvi-tools model.

sc.pp.highly_variable_genes(
    adata,
    n_top_genes=1200,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
    batch_key="cell_source",
)

grst · March 9, 2023, 6:06pm

I see! You may also try running scvi-tools with all genes. It will increase the runtime a bit, but shouldn’t have a big influence on the integration, see also the discussion here: Differential expression and highly variable genes

Topic		Replies	Views
Differences between .X, .raw.X, and .raw in anndata object anndata	6	6604	May 10, 2024
How could `adata.raw.X` contain non-integer values? scanpy anndata	2	31	July 13, 2025
Layer (counts) loss after adata.raw.to_adata() scanpy	0	939	April 13, 2023
Subset/subcluster and reprocess scRNA-seq	0	1058	August 3, 2022
How to subset anndata variables, but still store the removed variables elsewhere for downstream analysis? anndata	1	128	July 23, 2024

Can’t change anndata dimensions

Related topics