How to handle data lognormalization when using highly_variable_genes() with flavor seurat_v3?

barbara-andre-pro · January 23, 2023, 3:05am

Hi,

The documentation of highly_variable_genes() says: “Expects logarithmized data, except when flavor=‘seurat_v3’, in which count data is expected.”

Does it mean that instead of coding in this order (1):

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")

we should code in this order (2)?

sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

What is the consequence in the final results of coding (1) instead of (2)?

Is there a tutorial available where flavor=‘seurat_v3’ is used?

Besides, has someone already checked that using code (2) in Scanpy provides exactly the same results, for the same input data, as using the following code in Seurat v3+?

pbmc_object <- NormalizeData(object = pbmc_object, normalization.method = 'LogNormalize', scale.factor = 10000)

pbmc_object <- FindVariableFeatures(object = pbmc_object, selection.method = 'vst', nfeatures = 2000)

Thanks in advance for your precious help.

adamgayoso · January 23, 2023, 3:58am

The order does not matter if you use the layer argument.

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3", layer="counts")

We use it a lot in scvi-tools. Tutorial here.

The code is tested. https://github.com/scverse/scanpy/blob/d26be443373549f26226de367f0213f153556915/scanpy/tests/test_highly_variable_genes.py#L336

There may be some inconsistency when finding HVG with the batch_key argument in which the way genes are ranked is slightly different. The core computation of normalized variances is approximately equal though.

github.com/scverse/scanpy

`flavor="seurat_v3"` HVG aggregation over batches not intended

opened 06:04PM - 18 Dec 21 UTC

closed 02:17PM - 31 Oct 23 UTC

adamgayoso

Bug 🐛

- [x] I have checked that this issue has not already been reported. - [x] I hav…e confirmed this bug exists on the latest version of scanpy. - [x] (optional) I have confirmed this bug exists on the master branch of scanpy. In the test of the flavor, the sorting over batches is correct with respect to how Seurat actually does it: https://github.com/theislab/scanpy/blob/83f90141fd18943a1795772d3d39f4e9eefd65c3/scanpy/tests/test_highly_variable_genes.py#L138-L142 First sort by number of batches it's called a HVG and then second break ties by the median rank. In the actual function, it's the reverse: https://github.com/theislab/scanpy/blob/83f90141fd18943a1795772d3d39f4e9eefd65c3/scanpy/preprocessing/_highly_variable_genes.py#L136-L143 I propose we put in the fix (which is currently being tested with high accuracy). The simplest solution would be to call it a new flavor. Ideally this would get in as soon as possible as this is a high volume function from what I understand.

barbara-andre-pro · February 3, 2023, 11:14am

Thanks a lot for your detailed answers!

Regarding the equivalence between “Seurat v3” and “Scanpy with flavor seurat_v3”, I ran a test on a given count matrix and I measured 98.65% of common genes detected as HVG among 2000 genes, which means that 27 genes were not detected as HVG by both methods.

By using the layer argument as you recommended, the agreement rate is now much better (98.65% instead of 52.14%) but it is not 100%. Note that I did not use the “batch_key” argument.

Is there an approximation somewhere is Scanpy code compared to Seurat v3 code which could explain that we don’t always reach an agreement rate of 100%?
Did you also get non exact matching when running your equivalence tests?
If yes, what maximum mismatch rate did you allow?
If not, could you please detail all the parameter values we should use in Scanpy and in Seurat v3 to get exactly the same results?

Many thanks for your precious help.

adamgayoso · February 3, 2023, 8:02pm

There is no approximation, but there can be heuristic differences. For example, with this approach many genes can get the same normalized variance score, so which genes get included comes down to how they are sorted, which is what I expect to be happening in this case.

After running you should see adata.var. variances_norm this is the score that is used for ranking genes. We would very much appreciate if you compared this to Seurat

Topic		Replies	Views
How to find HVG in Seurat-integrated data? scanpy integration	1	560	July 27, 2023
Apply sc.pp.highly_variable_genes to SCT-normalized residuals or counts scanpy	0	260	July 27, 2023
Error in highly variable gene selection scanpy scrna-seq , gene-selection	8	3867	March 21, 2022
Understanding the behaviour of sc.pp.highly_variable_genes with a batch_key and different values of n_top_genes scanpy	1	147	August 15, 2024
Why scanpy computes HVGs based on the data after log1p? scanpy	0	298	August 28, 2023

How to handle data lognormalization when using highly_variable_genes() with flavor seurat_v3?

Related topics