Can’t change anndata dimensions

From https://scanpy.discourse.group/t/cant-change-anndata-dimensions/519/1:

Hello

I clustered some cells using the top 3000 genes with the highest variance, removed some cell types I am not interested in and now I want to cluster them again using a new set of top variable genes.

My object has 78016 obs and 3000 vars. However, all genes are still there in adata.raw.X. Because I want to select these new genes, I have tried using adata2.X=adata2.raw.X. This returns the message ValueError: Data matrix has wrong shape (78016, 22633), need to be (78016, 3000).

Anyway to overcome this?

I have same question as the unanswered question above: after normalizing, filtering and scaling genes, how can I convert back to the original count matrix (assuming no cells have been removed). This is trivial in Seurat switching between “counts” and “scaled” or other arbitrary slots.

Try updating your var first I think as that still has 22633 genes. Make the first var to store 3000 genes and then try this

adata.raw was specifically designed to keep around all genes, even when selecting highly variable genes.

If you want to subset different representations of the count matrix together with .X, use adata.layers instead. See this example:

import scanpy as sc

adata = sc.datasets.pbmc3k()
adata.layers["raw_counts"] = adata.X.copy()
sc.pp.normalize_total(adata)
adata.layers["normalized"] = adata.X.copy()
sc.pp.log1p(adata)
adata.layers["log_norm"] = adata.X.copy()

sc.pp.highly_variable_genes(adata, subset=True)

adata.X.shape
# (2700, 1870)

adata.layers["raw_counts"].shape
# (2700, 1870)

Scanpy functions should accept a layer argument to specify which representation to work on.

Hi All,

Thanks for such quick responses! Apologies, I don’t think I was clear in my original post. I want to go in the “opposite” direction, to get gene info when the gene is no longer in the var/var_names.

Use case: After standard processing, I want to explore expression using multiple visualizations (e.g. sc.pl.umap, sc.pl.dotplot, etc) for genes that may have been removed during typical processing steps (e.g. sc.pl.umap(adata, color=[‘FAM138A’]) in example below). I want to keep the reduced dimensional cell embeddings and other metadata from the original adata workflow, while also viewing a gene that was originally in the count matrix but no longer exists.

Since the original cell x counts matrix is stored in adata.raw, it seems like there would be functionality to pull a gene(s) from the original matrix back into a new layer for future use.

Using the pbmc3k dataset an an example:

import scanpy as sc

adata = sc.datasets.pbmc3k()

adata.X.shape
#  (2700, 32738)

'FAM138A' in adata.var_names
#  True

adata.layers["raw_counts"] = adata.X.copy()
sc.pp.normalize_total(adata)
adata.layers["normalized"] = adata.X.copy()
sc.pp.log1p(adata)
adata.layers["log_norm"] = adata.X.copy()
sc.pp.highly_variable_genes(adata, subset=True)
sc.pp.scale(adata)
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.louvain(adata, flavor = 'igraph')

adata.X.shape
# (2700, 1870)

'FAM138A' in adata.var_names
#  False

sc.pl.umap(adata, color=['NKG7'], layer="raw_counts")    # Works fine
sc.pl.umap(adata, color=['NKG7'], layer="log_norm")    # Works fine
sc.pl.umap(adata, color=['NKG7'], layer=None)    # Works fine

sc.pl.umap(adata, color=['FAM138A'], layer="raw_counts")    # KeyError: 'Could not find key FAM138A in .var_names or .obs.columns.'
sc.pl.umap(adata, color=['FAM138A'], layer="log_norm")    # KeyError: 'Could not find key FAM138A in .var_names or .obs.columns.'
sc.pl.umap(adata, color=['FAM138A'], layer=None)    # KeyError: 'Could not find key FAM138A in .var_names or .obs.columns.'

So, I guess the question is something like: “given the adata above with adata.X.shape of (2700, 1870) lacking ‘FAM138A’ but present in adata.raw, what is the easiest/most efficient way to view ‘FAM138A’ in sc.pl.umap?”

Sorry for the late response, I somehow didn’t get a notification. Layers are subset together with the main AnnData, therefore, this doesn’t work in your case. There are multiple possible solutions.

  1. Don’t use subset=True with highly_variable_genes. Is there any specific reason you need to subset? tl.pca by default uses only highly variable genes also if all genes are present.

  2. Use adata.raw. You can store one representation for visualization (usually log-norm transformed) in adata.raw. So in your example, you could add adata.raw = adata after the log1p transformation. You cannot reassign it to adata, but this should not be necessary in the first place. Visualization functions (e.g. sc.pl.umap or sc.pl.dotplot) by default automatically use the values in adata.raw (determined by the use_raw parameter).

  3. Use MuData. The MuData container is designed to work with multiple modalities (e.g. RNA + protein expression), but it can also be used to store different subsets of the same dataset.

1 Like

Thanks for the explanation! These options should work well.

For 1, I was following the scvi-tools recommendation:

Finally, we perform feature selection, to reduce the number of features (genes in this case) used as input to the scvi-tools model.

sc.pp.highly_variable_genes(
    adata,
    n_top_genes=1200,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
    batch_key="cell_source",
)

I see! You may also try running scvi-tools with all genes. It will increase the runtime a bit, but shouldn’t have a big influence on the integration, see also the discussion here: Differential expression and highly variable genes