Usage of HVG in scVI

adamgayoso · August 23, 2021, 4:42pm

Question copied from GitHub

github.com/YosefLab/scvi-tools

Usage of HVG in scVI

opened 04:01PM - 23 Aug 21 UTC

closed 04:43PM - 23 Aug 21 UTC

ling-lyang

question

Please raise usage questions at [discourse.scvi-tools.org](https://discourse.scv…i-tools.org/) (I have noticed exactly the same question was raised in #850 but no clear answer was given and it shows closed.) The question is, does scVI use highly variable genes selected by scanpy? I ran scvi 1) with no highly variable genes selected, 2) with highly variable genes selected, 3) with highly variable genes selected but manually removed some genes from HVG. The results are different for 1) and 2) but the same for 2) and 3). This is confusing. Anyone has any idea? Thanks! Ling

The question is, does scVI use highly variable genes selected by scanpy?

I ran scvi 1) with no highly variable genes selected, 2) with highly variable genes selected, 3) with highly variable genes selected but manually removed some genes from HVG.

The results are different for 1) and 2) but the same for 2) and 3).

This is confusing. Anyone has any idea?

Thanks!
Ling

adamgayoso · August 23, 2021, 4:45pm

Hi Ling,

scVI uses whatever genes are present in the adata object. Empirically, we tend to see better dimensionality reduction results when adata only contains highly variable genes, which is what we do in our tutorials.

This is also the case for most if not all integration/batch-correction methods. E.g., this paper shows using HVG leads to better integration

You would likely see the same phenomenon with a simple PCA.

pseudonym2 · August 24, 2021, 2:05pm

Dear Adam,

I wonder whether there is a possibility to choose which genes will be used for the model. Let’s assume I compute HVGs using scanpy but don’t subset my anndata object (that is, HVGs will be tagged instead of the removal of non-variable genes) the model will use all genes instead of the HVGs. Therefore, it is necessary to subset the anndata before model setup if one wishes to use the HVGs for training. However, it would be desirable not to subset the data and keep the original anndata object intact.

I think if there was a possiblity to make scvi only use the tagged HVGs, that would increase compatibility with scanpy when using multiple layers for raw counts, normalized counts and simplify later sub-clustering approaches which would require to re-compute the HVGs (which is not possible when only keeping the HVGs from level 1 clustering).

Valentine_Svensson · August 24, 2021, 2:17pm

Hi,

The strategy I use to maintain the full dataset after selecting the HVGs is to store all genes in to the adata.raw field: anndata.AnnData.raw — anndata 0.7.7.dev7+g9620645 documentation

If you put all genes the .raw, and do analysis using a subset of genes that results in information stored in .obs or .obsm, then you can recover all genes using adata.raw.to_adata(). This will leave all the new obs and obsm in the anndata.

pseudonym2 · August 24, 2021, 2:36pm

Dear Valentine,

this is useful information, thank you. I never fully grasped the concept of the .raw in scanpy (I must say the description is not very intuitive in my eyes though).

Will this also recover the layer of raw counts (not only log-normalized)?

Kind regards

adamgayoso · August 24, 2021, 3:31pm

Following this pattern that is in our tutorials and suggested by @Valentine_Svensson

adata.layers["counts"] = adata.X.copy() # preserve counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata # freeze the state in `.raw`

leads to adata.raw.X being log normalized. The counts can’t be retrieved from .raw. raw is invariant to adata being subset at the gene level.

We considered this pattern, but it proved to be difficult to implement as we then needed to keep a gene-level mask around. It also was slower as every minibatch of data that goes through the model needed to be subset. While there is some inconvenience on the user-end for managing data, we felt that a what you see is what you get approach is also more straightforward to use in a pipeline.

maartenslagter · November 20, 2021, 9:16am

Hi Adam
I’m fine with ‘losing’ the lowly variable genes for follow up analyses. What I do worry about is that scVI’s library size estimation will next take place on only the remaining variable genes. Is this worry justified or is library size estimation done on adata.raw.X?
Thanks!

Florian_Deckert · January 21, 2022, 10:03am

I second this question. As I understand the sequencing depth variation is modeled as unobserved random variable l_n. And in Fig. Supp. 13a Lopez et al. 2018 you show that for homogenous cell population the inferred size factor correlates with the sequencing depth. But I am not sure if all genes were used for the analysis.

Would it make a difference (using HVG vs all genes) for size factor estimates in heterogeneous cell population and maybe for zero estimates too?

If we can get some clarification that would be awesome!

adamgayoso · January 21, 2022, 4:33pm

Indeed the library size estimation will only consider the genes included in the model. In previous versions of the older scvi package, this was a latent variable with an informative prior. In newer versions we take the observed library size (again, of the genes used in the model).

In a future release, we can include a size_factor_key for manual setting of the size factor if that’s a desired feature.

Valentine_Svensson · February 1, 2022, 3:43am

That sounds like a great feature! I’ve developed a pattern where I manually calculate total counts and store in e.g. .obs['total_counts_all']. Then after I do gene selection I make a new .obs['total_counts_sel']. This way I can use these ratios when comparing predicted gene expressions from scVI with raw data, to see if the model is missing some variation pattern in the data that I can see using unused metadata. If I could pass size_factor_key = 'total_counts_all' to the model I think it would be easier to relate predicted expression to the raw data.

adamgayoso · February 1, 2022, 8:46pm

kanefos · March 1, 2022, 5:41pm

Hi authors,

Follow-up question. Normally, when I sub-cluster a single cluster from a whole dataset, I get a better finer resolution of sub-clusters than these same cells in the initial clustering of the whole dataset.

My question is does this logic apply to scVI? So, for instance, could I run scVI on HVGs calculated on whole tissue, then in the latent representation identify cell A, then take these barcodes back to the HVG selection step and re-run scVI on a group of more cell A-specific HVGs?

adamgayoso · March 1, 2022, 5:56pm

Sounds reasonable to me

Topic		Replies	Views
Selection of HVG in scVI scvi-tools scvi	3	1115	December 20, 2022
All genes or highly variable genes? scvi-tools gene-selection , scvi , totalvi	10	3566	March 31, 2022
Scanpy HVG batch_key scanpy	1	1030	March 16, 2023
HVG selection with multiple batches scRNA-seq	3	970	July 4, 2022
Gene filtering prior to batch correction scRNA-seq scrna-seq , integration	2	738	July 9, 2021

Usage of HVG in scVI

Related topics