Help with harmony and scanpy

Hello,

I am interested in using Harmony to integrate some samples, but I am not sure if I am doing it correctly. I have 6 samples across 2 conditions (3 control / 3 treated).

I do the following steps:
Merge samples
QC
Normalization + log transform
Regress out effects of total counts per cell
Scale the data to unit variance
PCA
UMAP
Harmony using sample as a batch key
UMAP
Leiden Clustering
Identify marker genes

Then I use that information to assign cell types and then do DE analysis for each cluster between the control and treatment.

There are a few steps I am concerned about. Does Harmony use the log-normalized or the regressed data? A lot of scanpy commands default to adata.raw, which holds the log-normalized data.
Also, it does not seem like Harmony changes adata.X or adata.raw.X, which would mean that I can still do DE using the object I get after Harmony, right?

Thank you

Hello!

It sounds like what you want is to do differential expression on integrated data. Sadly, Harmony won't help you with that, because it works on the reduced dimensions from the PCA. That is why you don't see any changes in the actual adata layers.

One look at the docs will tell you that harmony returns an object with modified PCA embeddings:

Returns:
Updates adata with the field adata.obsm[obsm_out_field], containing principal components adjusted by Harmony such that different experiments are integrated.

Most integration methods work on reduced dimensions or graphs. But there are some alternatives.

The easiest one I know of is SCTransform from Seurat; the v2 flavor returns corrected UMI counts. From there you can extract marker genes or do differential expression analysis. Though you may have to check for pseudoreplication bias.

Another that is integrated into this ecosystem is lvm-DE, but for that one you will have to have trained an scVI model on your data. If your expression matrix is big you will need a GPU so it doesn't take forever, and it uses up a lot of RAM. There is a flavor that corrects for pseudoreplication bias, but it is the most resource-expensive.

Hello,

Thank you for replying!
I am trying to understand harmony a bit better. From what I understand it changes the PCA embeddings to improve the clustering of the different cell types right?
Once I have the different clusters/cell types annotated then I can use the log normalized data within my object to do DE analysis between the control and treated for each cell type?

Thank you for your help!

Hey!

I am trying to understand harmony a bit better. From what I understand it changes the PCA embeddings to improve the clustering of the different cell types right?

Yes, from their paper: “Clusters disproportionately containing cells from a small subset of datasets are penalized by an information theoretic metric.” With the help of that metric they get a linear correction, and iteratively apply that correction until the clusters are stable.
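As a toy illustration of what a "linear correction" of the embedding means, here is per-batch mean-centering in a fake PCA space. This is deliberately NOT Harmony's actual algorithm (Harmony uses soft clustering and iterates), just the general idea of shifting the embedding rather than the expression matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# Fake 2-D "PCA" embedding for two batches with a technical offset
batch = np.repeat([0, 1], 100)
emb = rng.normal(size=(200, 2))
emb[batch == 1] += np.array([3.0, -2.0])  # batch effect as a constant shift

# Crude linear correction: move each batch's centroid onto the
# global centroid (the expression values are never touched)
global_center = emb.mean(axis=0)
corrected = emb.copy()
for b in np.unique(batch):
    corrected[batch == b] += global_center - emb[batch == b].mean(axis=0)
# After this, the two batch centroids coincide exactly
```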

For more info on different approaches to batch-effect correction, you can check out section 3.3.4 (Batch effects and data integration) of this paper.

Once I have the different clusters/cell types annotated then I can use the log normalized data within my object to do DE analysis between the control and treated for each cell type?

No, the problem is that your normalized and scaled data would still suffer from batch effects: normalization controls for sequencing depth within each cell, and the log transform and scaling for variance, whereas the technical effects act at the sample level.

So the naive way to calculate DE won’t work. I don’t know how you are planning to calculate DE, but I imagine it should be something like this:

sc.tl.rank_genes_groups(adata, 'louvain_0.6', groups=['1'], reference='2', method='wilcoxon')

The most direct approach, which avoids installing any new packages, would be to subsample cells in each cluster based on manual observations, like they do here. That would control how many cells each sample contributes to the differential analysis, but it still leaves an error in the variance of each gene.
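A closely related workaround is pseudobulk aggregation: collapse all cells of one sample within one cluster into a single profile, so each biological replicate contributes one observation per cell type. A minimal sketch (the sample/cluster labels are made up; summing raw counts is what count-based tools like DESeq2 expect):

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, n_genes = 300, 5
counts = rng.poisson(2.0, size=(n_cells, n_genes))
sample = rng.choice(["ctrl1", "ctrl2", "ctrl3", "trt1", "trt2", "trt3"], n_cells)
cluster = rng.choice(["T", "B"], n_cells)

# Pseudobulk: sum raw counts over all cells of one sample in one cluster,
# so each (sample, cluster) pair becomes a single observation
profiles = {}
for s in np.unique(sample):
    for c in np.unique(cluster):
        mask = (sample == s) & (cluster == c)
        if mask.any():
            profiles[(s, c)] = counts[mask].sum(axis=0)
# 6 samples x 2 clusters -> up to 12 pseudobulk profiles
```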

The approaches I mention above would work. Though, I just remembered that one of the tools with the best benchmarks available is DESeq2 (an R package): it does DE with generalized linear models (GLMs), which lets you include the batch as a covariate in the comparison. I believe diffxpy does roughly the same thing in Python. So you could try any of those options.
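The "batch as a covariate" idea can be illustrated with a plain least-squares fit. This is a stand-in for the negative-binomial GLMs DESeq2 actually fits, not its method: you add indicator columns for batch to the design matrix, so the condition coefficient is estimated after accounting for per-batch offsets.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
condition = np.repeat([0, 1], n // 2)   # control vs treated
batch = np.tile([0, 1, 2], n // 3)      # three batches
# One gene: true treatment effect 1.5, plus per-batch technical offsets
y = 1.5 * condition + np.array([0.0, 2.0, -1.0])[batch] + rng.normal(0, 0.1, n)

# Design matrix: intercept, condition, and batch indicators
# (one batch level dropped as reference). OLS stands in for a proper NB GLM.
X = np.column_stack([
    np.ones(n),
    condition,
    (batch == 1).astype(float),
    (batch == 2).astype(float),
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[1] is the condition effect, recovered close to the true 1.5
```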