How to subset anndata variables, but still store the removed variables elsewhere for downstream analysis?

bhavyasingh · July 19, 2024, 5:20pm

Hi there! I’m analyzing single-cell and single-nucleus data. My matrix has both genes and repeat elements (TEs, such as LINEs, HERVs, etc.). I want to do the QC, normalization, and scaling on everything together, but for HVG selection, PCA, Leiden, and UMAP I’d like to use only genes. How can I remove the TEs so that they are excluded from the HVGs and from the matrices used in those steps, while still being able to calculate their expression in the resulting Leiden clusters?

gtca · July 23, 2024, 11:25pm

Welcome, @bhavyasingh!

I think there are several options, and which one is the most convenient is a good question.
For example, some functions such as sc.tl.pca use the .var['highly_variable'] column, so you can set it to False for the TEs to exclude them from that step.
I can imagine that getting convoluted and error-prone pretty quickly, though.

Another option would be to make an explicit copy of the gene-only subset when you slice your matrices. You can then store both AnnData objects as modalities in a MuData object (see “Axes in MuData” in the mudata documentation) to keep everything in one place. From there you can pull your cluster labels up to the multimodal level and push them to the modality that also contains the TEs if needed (see “Managing annotations” in the mudata documentation).
