What to do when the query dataset has different genes from the reference?

Hello everyone! I’m new here (and relatively new to the world of scRNA data analysis). I recently generated a scRNA-seq dataset for my PhD project and have been trying my best to use scanpy and scANVI to analyze it. I was wondering if I could get some help with this. Here’s the relevant background:

My query dataset was generated from transgenic mouse embryos (E11.5) that were FACS-sorted for mCherry expression. It’s a mix of cell types whose composition I only vaguely know. Additionally, there are 30 artificial genes that I’ve inserted into the genome (they don’t exist in the WT transcriptome). ~15,000 cells.

The reference dataset is from the Ontogeny of Mouse, Graphed (https://omg.gs.washington.edu/). I’ve filtered for cells from E11.25, E11.5 and E11.75 embryos. ~500,000 cells.

My questions are as follows:

  1. How should I best handle the 30 artificial genes? Is it better to append them to the reference object and fill them with 0’s (after highly variable gene selection; a sketch of this option is below the list)? Or should I copy my query object, remove the 30 genes, do the cell type predictions, and then copy the predicted cell types back to the original object?
  2. I notice a significant difference in the predicted cell type composition when I concatenate the two datasets and train & predict on the joint object, versus training solely on the reference dataset and then predicting only on the query. I unfortunately don’t know enough about machine learning to understand this difference; in my mind, adding only ~15,000 cells to such a huge reference shouldn’t affect the training that much.
  3. I noticed that the cell types in my reference dataset have very different population sizes; there are 147 cell types in total. Here is a plot of the counts:

    There are several cell types with a count of 1.
    Should I use n_samples_per_label to limit the number of cells seen per label? How could I determine a sensible value for it? I tried setting it to 100 and noticed that the predictions were strongly biased toward some of the less frequent cell types.
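To illustrate the padding option (a rough sketch; `artificial_genes` is a placeholder for my 30 transgene names, and note that layers not shared by both objects get dropped by the concatenation):

```python
import anndata as ad
import pandas as pd
from scipy import sparse

# Build an all-zero block of counts for the transgenes, indexed like the
# reference cells, then glue it onto the reference along the gene axis.
zeros = sparse.csr_matrix((ref.n_obs, len(artificial_genes)))
pad = ad.AnnData(X=zeros, obs=ref.obs[[]], var=pd.DataFrame(index=artificial_genes))
ref_padded = ad.concat([ref, pad], axis=1, merge="first")
```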

I’m happy to go into more detail if anyone needs. Thanks a lot in advance!

Hi,

  1. My preference would be to remove the genes if you don’t need them in a joint model (e.g., to predict their expression across the full atlas or similar, which I wouldn’t really recommend anyway) and to add them back afterward alongside the annotated cells/latent spaces. If the expression of these genes is low, padding the reference with zeros is likely fine as well.
  2. I assume you do highly variable gene selection on the concatenated object in the joint setup. This can have drastic effects, since the selected gene set (and with it the model input) changes. Otherwise I would expect both embeddings to be comparable (mind, though, that UMAP tends to exaggerate differences between two runs).
  3. That is a lot of cell types, and scANVI is not tested with this many. I would doubt that label transfer is accurate for all cell types and would expect major manual clean-up to be necessary (could you maybe split it by organ or by cell type lineage?). I don’t think other tools will perform well on this task either (maybe scTab?).
  4. n_samples_per_label will be pretty important for getting reasonable results (again, with that many cell types, expect a lot of wrong annotations). 100 sounds reasonable. I would check the errors when re-predicting the labels of the labelled cells (if those are wrong and biased towards the frequent cell types, reduce it slightly); see the sketch below.
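For that last check, something along these lines should work (a rough sketch, assuming a trained scANVI model `lvae` on the joint object `adata`, with the reference labels in `adata.obs["cell_type"]` and the query cells marked "Unknown"):

```python
import pandas as pd

# Re-predict the already-labelled reference cells and compare to their labels.
ref_cells = adata[adata.obs["cell_type"] != "Unknown"].copy()
df = pd.DataFrame({
    "true": ref_cells.obs["cell_type"].values,
    "pred": lvae.predict(ref_cells),
})

# Per-cell-type accuracy: if the errors pile up in a biased way, adjust
# n_samples_per_label and re-check.
per_type_acc = (df["true"] == df["pred"]).groupby(df["true"]).mean()
print(per_type_acc.sort_values().head(20))
```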

Hi Can, thanks for the detailed response!

You’re correct, I don’t need to predict the expression of those genes or anything like that, so I’ll remove them from the joint dataset and add them back in after the predictions, roughly as sketched below.
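(A sketch of the round trip; `artificial_genes` is a placeholder for the transgene names, and `adata` is the joint object from the workflow below with predictions stored in adata.obs["predicted"].)

```python
# Drop the transgenes before building the joint object for training.
query_for_model = query[:, ~query.var_names.isin(artificial_genes)].copy()

# ... integration / scVI / scANVI / prediction happen on the joint object ...

# Copy the transferred labels back to the full query object by barcode.
# Mind that concatenation may suffix obs_names; strip or map those first if so.
query.obs["predicted"] = adata.obs["predicted"].reindex(query.obs_names).values
```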

Yes to the HVG selection after concatenation. My current “workflow” is as follows:

  1. Concatenate the datasets,
  2. normalize/log-transform,
  3. HVG selection,
  4. train SCVI on the unnormalized data (saved in layer “counts”),
  5. train SCANVI on top of the SCVI model (I’ve tried varying values of n_samples_per_label here),
  6. predict on the “Unknown” cells (see the code sketch below).
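In code, the steps look roughly like this (a simplified sketch rather than my exact script; the obs column names and parameter values are placeholders, and it assumes query.obs["cell_type"] was set to "Unknown" beforehand):

```python
import anndata as ad
import scanpy as sc
import scvi

# 1. concatenate reference and query (inner join keeps the shared genes)
adata = ad.concat([ref, query_for_model], label="batch", keys=["ref", "query"])
adata.layers["counts"] = adata.X.copy()  # keep raw counts for scVI/scANVI

# 2. normalize + log-transform (used for HVG selection, not for training)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# 3. highly variable gene selection on the joint object
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="batch", subset=True)

# 4. train scVI on the raw counts
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
vae = scvi.model.SCVI(adata)
vae.train()

# 5. train scANVI on top of the scVI model
lvae = scvi.model.SCANVI.from_scvi_model(
    vae, labels_key="cell_type", unlabeled_category="Unknown"
)
lvae.train(n_samples_per_label=100)

# 6. predict labels for all cells, including the "Unknown" query cells
adata.obs["predicted"] = lvae.predict()
```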

Additionally, I’ve tried removing cells belonging to certain cell types before step 4. Specifically, I removed all cell types with fewer than 100 cells, which was probably a third of the total number of cell types (see the filter sketch below). Do you think this is a reasonable filtering step? Or should I lower the threshold further? A few cell types in the reference dataset consist of fewer than 10 cells.
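The filter itself is simple (a sketch, again assuming the labels sit in adata.obs["cell_type"] with query cells marked "Unknown"):

```python
# Drop reference cell types with fewer than min_cells cells, but always
# keep the unlabelled query cells.
min_cells = 100
counts = adata.obs["cell_type"].value_counts()
keep = counts[counts >= min_cells].index
mask = adata.obs["cell_type"].isin(keep) | (adata.obs["cell_type"] == "Unknown")
adata = adata[mask].copy()
```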

The other thing I’m uncertain about is that my dataset is obviously biased toward certain cell types as a result of the FACS sorting. On top of that, I have only a vague idea of which cell types should be present, so it’s hard for me to tell whether the resulting predictions are accurate, or whether certain cell types are over- or underestimated.

Some updates in the last few days:

I’ve tested the resulting model (trained with n_samples_per_label = 1000) on a dataset of mouse embryonic limbs and it seems to perform quite well.

However, I’m uncertain how to perform some downstream analyses with my artificial genes of interest. Since they were removed prior to training, I can’t use SCANVI’s built-in differential_expression function on them. Is there a way around this? Or should I just fall back to scanpy’s rank_genes_groups function on the log-normalized counts?

I would use rank_genes_groups or pseudobulk DE (better practice if you have several biological replicates) for the downstream analysis, for example:
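(A rough sketch of the rank_genes_groups route; the obs column with the transferred labels and the transgene list are placeholders.)

```python
import scanpy as sc

# Work on the original query object, which still contains the 30 transgenes.
query.layers["counts"] = query.X.copy()
sc.pp.normalize_total(query, target_sum=1e4)
sc.pp.log1p(query)

# Wilcoxon rank-sum test across the transferred cell type labels.
query.obs["predicted"] = query.obs["predicted"].astype("category")
sc.tl.rank_genes_groups(query, groupby="predicted", method="wilcoxon")
de = sc.get.rank_genes_groups_df(query, group=None)

# Pull out the artificial genes of interest.
de_transgenes = de[de["names"].isin(artificial_genes)]
```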