I have a large dataset downloaded from GEO with inherent structural complexities. I am integrating the data with the scVI model for batch effect correction and then running DEG analysis on Cancer vs. Normal samples, and I need guidance on how to properly set up AnnData and configure batch_key and categorical_covariate_keys.
Currently, I have set batch_key as ‘sample’ and categorical_covariate_keys as ‘study_origin’ and ‘sequencing_method’. The hierarchy of my data is as follows: 12 samples (including Control and Cancer) < 3 sequencing methods < 4 studies from GEO.
Despite these current settings, my UMAP clustering shows poor integration for one cancer sample, with barely any other samples mapping to the same area. Could you help me troubleshoot this issue?
Hi, it would be very helpful if you provided more information, especially the plots and the code you used. Some ideas on what to look for: Are the other samples/studies well integrated? Is this specific study very different from the others (e.g. a different sequencing technology or similarly drastic differences)? It can help to define a batch_key (here study_id) when identifying highly variable genes in Scanpy, and to reduce the number of genes. Have you tried harmony in particular, and did it give the correct integration? You can then use scANVI and provide a labels_key that corresponds to the good integration.
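Since the question describes samples nested within sequencing methods and studies, one quick check before choosing batch_key is whether each sample maps to exactly one sequencing method and one study. If it does, the sample label already determines both covariates, so passing them as categorical_covariate_keys on top of batch_key='sample' may add little. Below is a minimal sketch of that check in plain Python; the label values are made up for illustration, and with AnnData you would pass columns such as adata.obs['sample'] instead.

```python
from collections import defaultdict


def is_nested(inner, outer):
    """Return True if every label in `inner` co-occurs with exactly one label in `outer`."""
    seen = defaultdict(set)
    for inner_label, outer_label in zip(inner, outer):
        seen[inner_label].add(outer_label)
    return all(len(outer_labels) == 1 for outer_labels in seen.values())


# Hypothetical per-cell annotations (one entry per cell), standing in for
# adata.obs["sample"], adata.obs["sequencing_method"] and adata.obs["study_origin"].
sample = ["s1", "s1", "s2", "s3", "s3", "s4"]
sequencing_method = ["10x_v2", "10x_v2", "10x_v2", "10x_v3", "10x_v3", "smartseq2"]
study_origin = ["A", "A", "A", "B", "B", "C"]

# Is each sample tied to a single sequencing method / a single study?
print(is_nested(sample, sequencing_method))
print(is_nested(sample, study_origin))
# The reverse direction: does each study contain only one sample?
print(is_nested(study_origin, sample))
```

Because the function only needs two aligned iterables, pandas Series from adata.obs can be passed in directly.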
It looks good in my opinion. I would perform clustering and look at DE genes to see whether it lines up with any biological signal, e.g. proliferation or cell stress, or with the assay. I would recommend subsetting to highly variable genes (not included in your code).