Understanding batch-corrected counts in scVI

Hi,
Thank you for such a great tool!

I’m analyzing transcriptional heterogeneity in a single cell type associated with a condition of interest. I have multiple scRNA-seq datasets, each containing samples from both conditions, but some datasets are quite small (~300 cells or fewer).

I’m using scVI for dataset integration and generating batch-corrected counts with get_normalized_expression(). When sampling counts, I pass a list of samples from the three largest datasets as transform_batch (so expression is averaged across them).
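For reference, the call looks roughly like this (batch names are placeholders for my actual sample labels):

```python
# Sketch of the call I'm using; batch names here are hypothetical.
corrected = model.get_normalized_expression(
    adata,
    transform_batch=["dataset1", "dataset2", "dataset3"],  # three largest datasets
    library_size="latent",
    n_samples=25,
    return_mean=True,  # average over posterior samples and the listed batches
)
```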

My main concerns:

  1. Could this procedure introduce artificial signal, particularly affecting the smaller datasets when using larger datasets as the batch?
  2. What would be a reasonable correlation between original and batch-corrected counts? Is there a threshold below which I should suspect issues?
  3. How can I explore potential issues with the corrected counts? Are there recommended diagnostic checks or visualizations?

The results I obtain from the corrected counts look nice and meaningful; however, I want to make sure I haven’t artificially imposed any signal through this process. I couldn’t find much documentation on batch-corrected get_normalized_expression(), so I’d really appreciate any insights on best practices for validating these corrections. Thanks!

Hey @noraneko

I refer you here for additional info on get_normalized_expression.

And here for how to assess its quality in terms of imputed genes.

Besides that, if you want to compare the quality of your batch integration, and perhaps the bio-conservation of the original counts versus the batch-corrected counts from your method, you can use scib-metrics (see Atlas-level integration of lung data — scvi-tools).
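A minimal sketch of how that could look (the .obs and .obsm key names are placeholders for whatever you use):

```python
from scib_metrics.benchmark import Benchmarker

# Assumed keys: "batch" and "cell_type" in adata.obs, "X_scVI" in adata.obsm.
bm = Benchmarker(
    adata,
    batch_key="batch",
    label_key="cell_type",
    embedding_obsm_keys=["X_pca", "X_scVI"],
    n_jobs=-1,
)
bm.benchmark()
bm.plot_results_table(min_max_scale=False)
```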

Hi @ori-kron-wis,

Thanks for your response! I’ve checked out the documentation, tutorials, and scib-metrics. However, these approaches don’t apply to my case: I’m working with a single cell type, so I don’t have distinct clusters that could serve as labels in the Benchmarker metrics, and I don’t expect clear spatial separation between my conditions.

My goal is to identify subpopulations of cells within this cell type and compare them between conditions, so I would like to ensure that my batch correction hasn’t artificially imposed structure. Do you have any suggestions for evaluating batch effects or unintended signal introduction in this context?
Thank you!

I would apply a DE method orthogonal to scvi-tools and compare your identified cell populations. In the end, it will come down to either prior knowledge or validation experiments.
To check whether the trained model makes sense, you can also use scVI criticism, which tests whether the generated and raw data are similar.
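A rough sketch of how that check might look, assuming a trained model and the raw counts in adata (the exact metric names may vary by scvi-tools version):

```python
from scvi.criticism import PosteriorPredictiveCheck

# Posterior predictive check: sample counts from the trained model and
# compare the generated data to the raw data.
ppc = PosteriorPredictiveCheck(adata, models_dict={"scvi": model}, n_samples=10)

# Gene-wise coefficient of variation, generated vs. raw counts.
ppc.coefficient_of_variation(cell_wise=False)
print(ppc.metrics["cv_gene"].head())
```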

@cane11, thanks! Could you clarify what you mean by ‘cell populations’ - are you referring to e.g. Leiden clusters derived from scVI’s latent variables, or the clusters I identified independently using a different algorithm on scVI-corrected counts?

scVI-criticism sounds really cool - I hadn’t come across it before. Do you know if it can be applied to batch-corrected counts sampled with get_normalized_expression()? I’ve started developing similar metrics to compare raw and corrected counts, but using PPC from the package would be a more robust approach.

Yes, your Leiden clusters are what I referred to as “cell populations”.
No, it can’t be applied to batch-corrected counts directly, as it requires the corresponding raw data. However, if the generated data without batch projection is a poor approximation of the real data, the same likely holds for the batch-projected counts, and I wouldn’t trust them.

@cane11, thanks again for your suggestions!

I’m still exploring the criticism module; the tutorial seems a bit outdated, and I’m not sure which metrics are most informative for my case. Proper DE analysis (e.g., pseudobulked DESeq2) isn’t feasible for the small clusters, but comparing the top markers per cluster per dataset before and after correction looks promising (rough sketch below). As I would expect, larger datasets show higher agreement, likely due to lower noise.
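Concretely, here’s roughly what I’m doing, assuming raw and corrected counts are stored as layers (the layer names "counts" and "corrected" are placeholders):

```python
import scanpy as sc

def top_markers(adata, layer, cluster_key, normalize=True, n_top=50):
    """Rank genes per cluster on the given layer and return the top-marker sets."""
    ad = adata.copy()
    ad.X = ad.layers[layer].copy()
    if normalize:  # raw counts need depth normalization; corrected values may not
        sc.pp.normalize_total(ad, target_sum=1e4)
    sc.pp.log1p(ad)
    sc.tl.rank_genes_groups(ad, groupby=cluster_key, method="wilcoxon")
    return {
        grp: set(ad.uns["rank_genes_groups"]["names"][grp][:n_top])
        for grp in ad.obs[cluster_key].cat.categories
    }

raw = top_markers(adata, "counts", "leiden")
corr = top_markers(adata, "corrected", "leiden", normalize=False)
for grp in raw:
    jaccard = len(raw[grp] & corr[grp]) / len(raw[grp] | corr[grp])
    print(f"cluster {grp}: top-marker Jaccard = {jaccard:.2f}")
```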

I realized that my original concern about the validity of analyses involving batch-corrected counts can also be asked in the context of DE analysis performed by scVI. From other forum discussions (e.g., here), I learned that batch-corrected counts can be used for DE. Additionally, from this documentation, I saw that in Scenario 2 we can specify identical sets of batches for group1 and group2 for DE.

If I compare young vs. aged samples across multiple datasets, would it be theoretically valid to set:
batchid1 = batchid2 = ['aged1_dataset1', 'young1_dataset1', 'aged1_dataset2', 'young1_dataset2']
That is, including representative samples from both cohorts across different datasets? If I understand correctly, the sampled counts would be conditioned on the average of the specified samples.
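Concretely, I imagine the call would look something like this (the sample names are hypothetical, and "condition" is an assumed .obs column holding "aged"/"young"):

```python
# Both groups conditioned on the same set of batches; names are hypothetical.
shared_batches = ["aged1_dataset1", "young1_dataset1",
                  "aged1_dataset2", "young1_dataset2"]

de = model.differential_expression(
    adata,
    groupby="condition",  # assumed .obs column with "aged" / "young"
    group1="aged",
    group2="young",
    batch_correction=True,
    batchid1=shared_batches,
    batchid2=shared_batches,
)
```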

  1. Would this be a valid approach?
  2. Would it be valid to include young and aged samples from the smaller datasets in the analysis, given that they are small and unrepresentative and cannot be included in the batch list?

Would appreciate your thoughts on this!