How to handle data lognormalization when using highly_variable_genes() with flavor seurat_v3?

Hi,

The documentation of highly_variable_genes() says: “Expects logarithmized data, except when flavor=‘seurat_v3’, in which count data is expected.”

Does it mean that instead of coding in this order (1):

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")

we should code in this order (2)?

sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

What is the consequence in the final results of coding (1) instead of (2)?

Is there a tutorial available where flavor=‘seurat_v3’ is used?

Besides, has someone already checked that using code (2) in Scanpy provides exactly the same results, for the same input data, as using the following code in Seurat v3+?

pbmc_object <- NormalizeData(object = pbmc_object, normalization.method = 'LogNormalize', scale.factor = 10000)

pbmc_object <- FindVariableFeatures(object = pbmc_object, selection.method = 'vst', nfeatures = 2000)

Thanks in advance for your precious help.

The order does not matter if you use the layer argument.

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3", layer="counts")

We use it a lot in scvi-tools. Tutorial here.

The code is tested. scanpy/test_highly_variable_genes.py at d26be443373549f26226de367f0213f153556915 · scverse/scanpy · GitHub

There may be some inconsistency when finding HVG with the batch_key argument in which the way genes are ranked is slightly different. The core computation of normalized variances is approximately equal though.