Error in highly variable gene selection


I am trying to use SCVI - tools for batch correction. I followed as in tutorial. I got the error in the step below.


The error I got is

If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
ValueError                                Traceback (most recent call last)
<ipython-input-11-e00c5f508a6a> in <module>
----> 1 sc.pp.highly_variable_genes(
      2     adata,
      3     n_top_genes=2000,
      4     subset=True,
      5     layer="counts",

/opt/conda/lib/python3.8/site-packages/scanpy/preprocessing/ in highly_variable_genes(adata, layer, n_top_genes, min_disp, max_disp, min_mean, max_mean, span, n_bins, flavor, subset, inplace, batch_key)

    414     if flavor == 'seurat_v3':
--> 415         return _highly_variable_genes_seurat_v3(
    416             adata,
    417             layer=layer,

/opt/conda/lib/python3.8/site-packages/scanpy/preprocessing/ in _highly_variable_genes_seurat_v3(adata, layer, n_top_genes, batch_key, span, subset, inplace)
     82         x = np.log10(mean[not_const])
     83         model = loess(x, y, span=span, degree=2)
---> 84
     85         estimat_var[not_const] = model.outputs.fitted_values
     86         reg_std = np.sqrt(10 ** estimat_var)

_loess.pyx in

ValueError: b'reciprocal condition number  1.0638e-15\n'

I am not sure what the issue and what cause this error. Could you please advise?

The exception stems from the scanpy module. Could you open an issue in scanpy?

ok, I have posted on scanpy discussion. Hope someone can help me there. Thanks

Also this is usually due to having many many lowly expressed (almost 0 genes). It will likely work if you run filter_genes first

Hi @adamgayoso thanks so much for advice. I tried sc.pp.filter_genes(adata, min_counts=3) but it still shows the error. The strange thing is when I remove the batch_key='patient_id' argument in sc.pp.highly_variable_genes, it works. I am a bit confused. Any suggestion?


It could be that within one batch category you have many all zero genes, if I had to guess.

Hi @adamgayoso , thanks for advice. May I ask an alternative. Does SCVI really need to run exactly this line


Will it cause any effect on scvi downstream analysis if I remove batch_key = 'cell_source

Or can I just run the routine scanpy highvar



You can select highly variably genes with any procedure. If you are selecting a small number of genes, it is of course important that you are obtaining genes that vary due to the processes you are interested in within your data.

More succinctly: it will all work to do what you propose. Alternative HVG selection methods might give ‘cleaner’ clusters, but you can try other methods if you are disappointed with the results.


Exactly! Though important to check what the expected input layer is… e.g., seurat v3 wants counts, the others want log normalized