Hi,
The documentation of highly_variable_genes() says: “Expects logarithmized data, except when flavor=‘seurat_v3’, in which count data is expected.”
Does it mean that instead of coding in this order (1):
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")
we should code in this order (2)?
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
What is the consequence in the final results of coding (1) instead of (2)?
Is there a tutorial available where flavor=‘seurat_v3’ is used?
Besides, has someone already checked that using code (2) in Scanpy provides exactly the same results, for the same input data, as using the following code in Seurat v3+?
pbmc_object <- NormalizeData(object = pbmc_object, normalization.method = 'LogNormalize', scale.factor = 10000)
pbmc_object <- FindVariableFeatures(object = pbmc_object, selection.method = 'vst', nfeatures = 2000)
Thanks in advance for your precious help.
The order does not matter if you use the layer argument.
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3", layer="counts")
We use it a lot in scvi-tools. Tutorial here .
barbara-andre-pro:
Besides, has someone already checked that using code (2) in Scanpy provides exactly the same results, for the same input data, as using the following code in Seurat v3+?
The code is tested. scanpy/test_highly_variable_genes.py at d26be443373549f26226de367f0213f153556915 · scverse/scanpy · GitHub
There may be some inconsistency when finding HVG with the batch_key
argument in which the way genes are ranked is slightly different. The core computation of normalized variances is approximately equal though.
opened 06:04PM - 18 Dec 21 UTC
Bug 🐛
- [x] I have checked that this issue has not already been reported.
- [x] I hav… e confirmed this bug exists on the latest version of scanpy.
- [x] (optional) I have confirmed this bug exists on the master branch of scanpy.
In the test of the flavor, the sorting over batches is correct with respect to how Seurat actually does it:
https://github.com/theislab/scanpy/blob/83f90141fd18943a1795772d3d39f4e9eefd65c3/scanpy/tests/test_highly_variable_genes.py#L138-L142
First sort by number of batches it's called a HVG and then second break ties by the median rank.
In the actual function, it's the reverse:
https://github.com/theislab/scanpy/blob/83f90141fd18943a1795772d3d39f4e9eefd65c3/scanpy/preprocessing/_highly_variable_genes.py#L136-L143
I propose we put in the fix (which is currently being tested with high accuracy). The simplest solution would be to call it a new flavor. Ideally this would get in as soon as possible as this is a high volume function from what I understand.