Hey!
I am trying to understand Harmony a bit better. From what I understand, it adjusts the PCA embeddings so that cells of the same type cluster together across datasets, right?
Yes, from their paper: “Clusters disproportionately containing cells from a small subset of datasets are penalized by an information theoretic metric.” With the help of that metric they get a linear correction, and iteratively apply that correction until the clusters are stable.
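For intuition, one round of that linear correction amounts to: within each cluster, shift each dataset's cells so that the dataset's centroid lands on the cluster's overall centroid. Here is a toy sketch of that single step in pure Python; note this is NOT the actual Harmony algorithm (which uses soft k-means plus the diversity penalty from the quote above), just an illustration of a cluster-wise linear batch shift:

```python
# Toy illustration of one Harmony-style correction step (NOT the real
# algorithm: Harmony uses soft clustering and a diversity penalty).
# Within each cluster, move every batch's centroid onto the cluster centroid.

def centroid(points):
    """Component-wise mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def correct_step(embedding, batches, clusters):
    """embedding: list of coordinate tuples (e.g. PCA scores per cell);
    batches/clusters: parallel lists of labels per cell.
    Returns a new embedding with per-cluster, per-batch centroid shifts."""
    corrected = list(embedding)
    for c in set(clusters):
        in_c = [i for i, cl in enumerate(clusters) if cl == c]
        target = centroid([embedding[i] for i in in_c])
        for b in set(batches[i] for i in in_c):
            idx = [i for i in in_c if batches[i] == b]
            mu = centroid([embedding[i] for i in idx])
            shift = tuple(t - m for t, m in zip(target, mu))
            for i in idx:
                corrected[i] = tuple(x + s for x, s in zip(embedding[i], shift))
    return corrected
```

After one step the per-batch centroids coincide within each cluster; the real method alternates a (soft) correction like this with re-clustering until the clusters stop changing.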
For more info on different approaches to batch effect correction you can check out section 3.3.4 (Batch effects and data integration) of this paper.
Once I have the different clusters/cell types annotated then I can use the log normalized data within my object to do DE analysis between the control and treated for each cell type?
No, the problem is that your normalized and scaled data would still suffer from batch effects: normalization controls for sequencing depth within each cell, and log scaling for per-cell variance, whereas the technical effects act at the sample level.
So the naive way to calculate DE won’t work. I don’t know how you are planning to calculate DE, but I imagine it should be something like this:
import scanpy as sc
sc.tl.rank_genes_groups(adata, 'louvain_0.6', groups=['1'], reference='2', method='wilcoxon')
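For reference, `method='wilcoxon'` boils down to a per-gene Wilcoxon rank-sum test between the two groups of cells. A minimal sketch of that test in pure Python (normal approximation, no tie or continuity correction; scanpy's implementation additionally handles ties and multiple-testing correction):

```python
import math

def wilcoxon_rank_sum(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.
    x, y: expression values of one gene in two groups of cells.
    Returns (z, p). Simplified: no tie correction, no continuity correction."""
    # sort all values, remembering which group each came from (0 = x, 1 = y)
    combined = sorted((v, 0 if i < len(x) else 1)
                      for i, v in enumerate(list(x) + list(y)))
    # rank-sum of group x (ranks start at 1)
    w = sum(rank + 1 for rank, (_, g) in enumerate(combined) if g == 0)
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p
```

The point of the discussion above is that this test treats every cell as an independent observation, which is exactly what goes wrong when the real technical variation sits at the sample level.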
The most direct approach, which avoids installing any new packages, would be to subsample cells in each cluster based on manual observations, like they do here. That would control for some samples contributing many more cells than others to the differential analysis, but it still leaves an error in the per-gene variance estimate.
The approaches I mention above would work. Though, I just remembered: one of the tools with the best benchmark results is DESeq2 (an R package). It does DE with generalized linear models (GLMs), which lets you include the batch as a covariate in the comparison. I believe diffxpy does roughly the same thing in Python, so you could try any of those choices.
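A common way to feed single-cell data into a DESeq2-style sample-level GLM is to pseudobulk first: sum raw counts per gene over all cells of a given annotated type within each sample, then fit something like ~ batch + condition on those sample-level count matrices. A sketch of the aggregation step in pure Python (the record layout and field names here are hypothetical, not any library's API):

```python
from collections import defaultdict

def pseudobulk(cells, cell_type):
    """Sum raw counts per (sample, gene) over cells of one annotated type.
    cells: iterable of dicts with hypothetical keys 'sample', 'cell_type',
    and 'counts' (a gene -> raw count mapping). Returns a nested dict
    {sample: {gene: summed_count}}. The resulting sample-level table is
    what you would hand to a GLM with batch and condition as covariates."""
    table = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        if cell['cell_type'] != cell_type:
            continue
        for gene, count in cell['counts'].items():
            table[cell['sample']][gene] += count
    return {s: dict(genes) for s, genes in table.items()}
```

Summing raw counts per sample sidesteps the per-cell pseudo-replication problem discussed above, because each sample then contributes exactly one observation per gene.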