Filtering out a subset of genes in adata

AlejandraRodelaRo · July 30, 2024, 8:18pm

Hello,

I am trying to filter out a particular set of genes from a Anndata object.
So far I tried:

# Assuming 'adata' is your AnnData object
# List of genes to remove
genes_to_remove

# Convert genes_to_remove to a set for efficient lookup
genes_to_remove_set = set(genes_to_remove)

# Check which genes are in the AnnData object
genes_in_adata = set(adata.var_names)

# Find the intersection of genes_to_remove and genes_in_adata
common_genes_to_remove = genes_to_remove_set.intersection(genes_in_adata)

# Print the genes that will be removed
print("Genes to be removed:", common_genes_to_remove)

# Create a boolean mask to keep only the genes not in 'common_genes_to_remove'
mask = [gene not in common_genes_to_remove for gene in adata.var_names]

# Subset the AnnData object to exclude the genes in 'common_genes_to_remove'
adata_filtered = adata[:, mask].copy()

# Verify that the genes have been removed
print("Remaining genes:", adata_filtered .var_names)

This seems to initially work, so if I run :

for gene in genes_to_remove:
    print(gene)
    # Ensure the gene exists in the data
    if gene not in adata_filtered .var_names:
        print(f"Gene {gene} not found in the dataset.")

it does indeed state that my list of genes is gone. However, if I run a umap for one of the genes in particular or if I run a wilcoxon analysis these genes seem to still be there…
Any thoughts?

gtca · July 30, 2024, 9:57pm

Welcome @AlejandraRodelaRo!

Do you have more details on what these genes seem to still be there means in the context of UMAPs?
I am also not sure what run a umap for one of the genes in particular means, do you have code examples for those steps?

AlejandraRodelaRo · July 31, 2024, 3:13pm

Thank you @gtca for your quick reply!

Here’s the code I use to check the gene in the context of umaps:
sc.pl.umap(adata_filtered , color="PAX3") #where PAX3 is a gene on my list to be filtered out

Similarly, when I compute DEG with wilcoxon I can still see some of the genes (e.g. PAX3) differentially expressed even after filtering them out:

sc.tl.rank_genes_groups(adata_filtered , 'sample', reference="WT",method='wilcoxon', key_added = "wilcoxon")
sc.pl.rank_genes_groups(adata_filtered , n_genes=20, sharey=False, key="wilcoxon")

AlejandraRodelaRo · July 31, 2024, 10:54pm

Hello again,

Just realized what was the issue and might be of use to somebody.
When you exclude some genes either by the methods I mentioned earlier or by using:

adata_filtered = adata[:, ~adata.var_names.isin(list_tobeexcluded)].copy()

This will erase them on adata, but not on adata.raw.
sc.pl.umap uses use_raw= True by default, so that’s why I was still able to detect them. Similar thing would happen when doing wilcoxon analysis…

I guess this is as simple as to run:

adata_filtered.raw = adata_filtered.copy()

after the exclusion, or use use_raw=False, when running umap.

Topic		Replies	Views
Filter genes in a subset of cells Help	0	347	August 8, 2022
Subsetting anndata using genelist anndata	4	3972	May 5, 2024
Subsetting Anndata based on Multiple Marker Gene Expression Thresholds anndata	0	978	July 12, 2022
Adata gene intersection problem anndata	0	339	January 2, 2023
Remove certain cells that look like doublets by gene expression levels scanpy	0	434	July 4, 2023

Filtering out a subset of genes in adata

Related topics