Error in highly variable gene selection

Hi

I am trying to use SCVI - tools for batch correction. I followed as in tutorial. I got the error in the step below.

sc.pp.highly_variable_genes(
    adata,
    n_top_genes=2000,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
    batch_key='patient_id'
)

The error I got is

If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-e00c5f508a6a> in <module>
----> 1 sc.pp.highly_variable_genes(
      2     adata,
      3     n_top_genes=2000,
      4     subset=True,
      5     layer="counts",

/opt/conda/lib/python3.8/site-packages/scanpy/preprocessing/_highly_variable_genes.py in highly_variable_genes(adata, layer, n_top_genes, min_disp, max_disp, min_mean, max_mean, span, n_bins, flavor, subset, inplace, batch_key)
    413

 
    414     if flavor == 'seurat_v3':
--> 415         return _highly_variable_genes_seurat_v3(
    416             adata,
    417             layer=layer,

/opt/conda/lib/python3.8/site-packages/scanpy/preprocessing/_highly_variable_genes.py in _highly_variable_genes_seurat_v3(adata, layer, n_top_genes, batch_key, span, subset, inplace)
     82         x = np.log10(mean[not_const])
     83         model = loess(x, y, span=span, degree=2)
---> 84         model.fit()
     85         estimat_var[not_const] = model.outputs.fitted_values
     86         reg_std = np.sqrt(10 ** estimat_var)

_loess.pyx in _loess.loess.fit()

ValueError: b'reciprocal condition number  1.0638e-15\n'

I am not sure what the issue and what cause this error. Could you please advise?
Thanks

The exception stems from the scanpy module. Could you open an issue in scanpy?

ok, I have posted on scanpy discussion. Hope someone can help me there. Thanks

Also this is usually due to having many many lowly expressed (almost 0 genes). It will likely work if you run filter_genes first

Hi @adamgayoso thanks so much for advice. I tried sc.pp.filter_genes(adata, min_counts=3) but it still shows the error. The strange thing is when I remove the batch_key='patient_id' argument in sc.pp.highly_variable_genes, it works. I am a bit confused. Any suggestion?

thanks

It could be that within one batch category you have many all zero genes, if I had to guess.

Hi @adamgayoso , thanks for advice. May I ask an alternative. Does SCVI really need to run exactly this line

    adata,
    n_top_genes=1200,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
    batch_key="cell_source"
)

Will it cause any effect on scvi downstream analysis if I remove batch_key = 'cell_source

Or can I just run the routine scanpy highvar sc.pl.highly_variable_genes(adata)

Thanks

Hi,

You can select highly variably genes with any procedure. If you are selecting a small number of genes, it is of course important that you are obtaining genes that vary due to the processes you are interested in within your data.

More succinctly: it will all work to do what you propose. Alternative HVG selection methods might give ‘cleaner’ clusters, but you can try other methods if you are disappointed with the results.

/Valentine

Exactly! Though important to check what the expected input layer is… e.g., seurat v3 wants counts, the others want log normalized