Hi there! I’m analyzing single-cell and single-nucleus data. My matrix has both genes and repeat elements (TEs, such as LINEs, HERVs, etc.). I want to do the QC, normalization, and scaling on everything together, but for HVG selection, PCA, Leiden clustering, and UMAP I’d like to use only genes. How can I exclude the TEs from the HVGs and from the matrices used for those steps, while still being able to calculate their expression in the resulting Leiden clusters?
Welcome, @bhavyasingh!
I think there are multiple options, and which one is the most convenient is a good question.
E.g. for some functions like sc.tl.pca you can modify the .var['highly_variable'] column to exclude TEs.
I can imagine that this might get convoluted and error-prone pretty quickly, though.
Another option would be to make an explicit copy of the gene-only subset when you slice your matrices. You can store both AnnData objects as modalities in a MuData object (see "Axes in MuData" in the mudata documentation) to keep it all in one place. You will then be able to pull your cluster labels up to the multimodal level and push them to the modality that also contains TEs if needed (see "Managing annotations" in the mudata documentation).