Gene filtering prior to batch correction


I was wondering what your thoughts are on whether gene filtering should be performed prior to highly variable gene selection and scVI training?

Publicly available datasets can differ vastly in the total number of genes captured, which can influence HVG selection. With other, non-VAE integration methods I have noticed that datasets can fail to integrate even when the dataset-level batch IDs are passed as a parameter. If you remove the dataset-specific genes they do integrate, but this is not ideal if cellular composition changes from dataset to dataset.

For scVI, would the best approach be to specify the dataset IDs as the batch and the within-dataset sample IDs as categorical covariates?


I have experienced similar things to what you describe, with gene selection affecting the ability to integrate.

A couple of strategies I use to work with this:

  1. Either with or without batch integration, use the differential expression modules to do DE between the batches that fail to integrate. This will point you to gene families to dig into and help you figure out why integration is difficult. Is it mostly RPS genes? That tends to happen when combining scRNA and snRNA data. Is it a lot of markers of a particular cell type? That tends to happen in the case you describe of uneven cell type contribution, which affects the ‘soup’ of free RNA that gets into the droplets. In particular, is the difference due to red blood cell related genes? That tends to indicate dissociation differences.

  2. Try using randomly selected genes rather than highly variable genes. Or if you have enough cells, try using the entire genome.

  3. Try selecting genes that are highly variable in just one batch and use those. It then helps to think of that batch as a kind of reference for cell type definitions. Not optimal, but it can be helpful in a pinch.

  4. Try changing the number of layers to 2. If, for example, the soups of the datasets are very different, there will be more interactions between soup-genes and cell-genes, and deeper networks can model more complex interactions.
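As a rough illustration of strategy 2, here is a minimal sketch of drawing a reproducible random gene panel. The helper name `sample_random_genes` and its `seed` argument are made up for this example, and it assumes gene names are unique.

```python
import random


def sample_random_genes(gene_names, n_genes, seed=0):
    """Return a reproducible random subset of gene names, kept in their original order.

    Hypothetical helper for illustration; assumes gene names are unique.
    """
    gene_names = list(gene_names)
    if n_genes >= len(gene_names):
        # Asking for at least as many genes as exist: keep everything.
        return gene_names
    rng = random.Random(seed)
    # Sample positions rather than names so the original ordering can be preserved.
    chosen = set(rng.sample(range(len(gene_names)), n_genes))
    return [g for i, g in enumerate(gene_names) if i in chosen]
```

With an AnnData object this could be used as `adata[:, sample_random_genes(adata.var_names, 2000)].copy()`, where 2000 is only a placeholder panel size.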

In general, to remove genes you have found to be dataset-specific, I would make an AnnData object that has the .var rows of those genes removed.
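A minimal sketch of that gene removal, assuming you already have a list of the genes you flagged as dataset-specific. The helper name `genes_to_keep` is hypothetical; it only builds the list of gene names to retain.

```python
def genes_to_keep(var_names, dataset_specific_genes):
    """Return the gene names to retain, preserving their original order.

    Hypothetical helper: `dataset_specific_genes` is whatever list you flagged,
    for example from DE between the batches that fail to integrate.
    """
    drop = set(dataset_specific_genes)
    # Flagged names that are not present in var_names are simply ignored.
    return [g for g in var_names if g not in drop]
```

Subsetting would then look like `adata_filtered = adata[:, genes_to_keep(adata.var_names, flagged)].copy()`, where `flagged` is your own list. The `.copy()` gives a standalone object rather than a view of the original.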


Thanks, this was all really useful. Increasing the model complexity did the trick.

What also helped was ranking the highly variable genes from each dataset by the number of datasets they were variable in, and then selecting the top HVGs by rank. This appeared to stop datasets separating out due to dataset-specific genes.
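For anyone wanting to try the same thing, the ranking could be sketched roughly like this. It assumes the HVGs have already been computed separately for each dataset; the function name and the alphabetical tie-break are choices made for this example, not necessarily what was used above.

```python
from collections import Counter


def rank_hvgs_across_datasets(hvgs_by_dataset, n_top):
    """Pick the genes that are highly variable in the most datasets.

    `hvgs_by_dataset` maps a dataset ID to the HVG names found in that dataset.
    Ties are broken alphabetically (an assumption made for reproducibility).
    """
    counts = Counter()
    for genes in hvgs_by_dataset.values():
        # set() ensures a gene is counted at most once per dataset.
        counts.update(set(genes))
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:n_top]


hvgs = {"d1": ["A", "B", "C"], "d2": ["B", "C", "D"], "d3": ["C", "E"]}
top = rank_hvgs_across_datasets(hvgs, n_top=3)  # ["C", "B", "A"]
```

If you use scanpy, `sc.pp.highly_variable_genes` with a `batch_key` reports a similar per-gene count in `.var["highly_variable_nbatches"]`, which may save computing it by hand.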
