How to handle data lognormalization when using highly_variable_genes() with flavor seurat_v3?

Hi,

The documentation of highly_variable_genes() says: “Expects logarithmized data, except when flavor=‘seurat_v3’, in which count data is expected.”

Does it mean that instead of coding in this order (1):

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")

we should code in this order (2)?

sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

What is the consequence in the final results of coding (1) instead of (2)?

Is there a tutorial available where flavor=‘seurat_v3’ is used?

Besides, has someone already checked that using code (2) in Scanpy provides exactly the same results, for the same input data, as using the following code in Seurat v3+?

pbmc_object <- NormalizeData(object = pbmc_object, normalization.method = 'LogNormalize', scale.factor = 10000)

pbmc_object <- FindVariableFeatures(object = pbmc_object, selection.method = 'vst', nfeatures = 2000)

Thanks in advance for your precious help.

The order does not matter if you use the layer argument.

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3", layer="counts")

We use it a lot in scvi-tools. Tutorial here.

The code is tested. https://github.com/scverse/scanpy/blob/d26be443373549f26226de367f0213f153556915/scanpy/tests/test_highly_variable_genes.py#L336

There may be some inconsistency when finding HVG with the batch_key argument in which the way genes are ranked is slightly different. The core computation of normalized variances is approximately equal though.

Thanks a lot for your detailed answers!

Regarding the equivalence between “Seurat v3” and “Scanpy with flavor seurat_v3”, I ran a test on a given count matrix and I measured 98.65% of common genes detected as HVG among 2000 genes, which means that 27 genes were not detected as HVG by both methods.

By using the layer argument as you recommended, the agreement rate is now much better (98.65% instead of 52.14%) but it is not 100%. Note that I did not use the “batch_key” argument.

  • Is there an approximation somewhere is Scanpy code compared to Seurat v3 code which could explain that we don’t always reach an agreement rate of 100%?

  • Did you also get non exact matching when running your equivalence tests?

  • If yes, what maximum mismatch rate did you allow?

  • If not, could you please detail all the parameter values we should use in Scanpy and in Seurat v3 to get exactly the same results?

Many thanks for your precious help.

There is no approximation, but there can be heuristic differences. For example, with this approach many genes can get the same normalized variance score, so which genes get included comes down to how they are sorted, which is what I expect to be happening in this case.

After running you should see adata.var. variances_norm this is the score that is used for ranking genes. We would very much appreciate if you compared this to Seurat