Understanding batch-corrected counts in scVI

@cane11, thanks again for your suggestions!

I’m still exploring the criticism module; the tutorial seems a bit outdated, and I’m not sure which metrics are the most informative for my case. Proper DE analysis (e.g., pseudobulked DESeq2) isn’t feasible for the small clusters, but comparing the top markers per cluster per dataset before and after correction shows good agreement. As I would expect, larger datasets show higher agreement, likely due to lower noise.
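For context, the marker comparison I'm describing is roughly the following. This is only a minimal sketch in plain Python: the gene names and scores are made up for illustration, and `top_markers` / `marker_agreement` are my own helper names, not part of scvi-tools. It scores one cluster by the Jaccard overlap of its top-k markers before and after correction.

```python
def top_markers(scores, k=3):
    """Return the k gene names with the highest marker score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def marker_agreement(before, after, k=3):
    """Jaccard overlap of the top-k markers before and after correction."""
    a = top_markers(before, k)
    b = top_markers(after, k)
    union = a | b
    # Two empty marker sets are treated as being in full agreement.
    return len(a & b) / len(union) if union else 1.0


# Toy marker scores for one cluster in one dataset (made-up values).
raw = {"gene_a": 2.5, "gene_b": 2.0, "gene_c": 1.5, "gene_d": 0.3}
corrected = {"gene_a": 2.4, "gene_b": 1.9, "gene_d": 1.6, "gene_c": 0.4}

agreement = marker_agreement(raw, corrected, k=3)
```

In practice I compute this per cluster and per dataset and then look at how the agreement changes with dataset size.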

I realized that my original concern about the validity of analyses involving batch-corrected counts can also be raised in the context of the DE analysis performed by scVI. From other forum discussions (e.g., here), I learned that batch-corrected counts can be used for DE. Additionally, I saw in this documentation that, in Scenario 2, we can specify identical sets of batches for group1 and group2 for DE.

If I compare young vs. aged samples across multiple datasets, would it be theoretically valid to set:
batchid1 = batchid2 = ['aged1_dataset1', 'young1_dataset1', 'aged1_dataset2', 'young1_dataset2']
That is, the list would include representative samples from both cohorts across the different datasets. If I understand correctly, the sampled counts would then be conditioned on the average of the specified samples.

  1. Would this be a valid approach?
  2. Would it be valid to include young and aged samples from smaller datasets in the comparison itself, even though they are too small and unrepresentative to be included in the batch list?
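
To make the question concrete, here is a sketch of the call I have in mind. It assumes the `differential_expression` keyword arguments as I read them in the scvi-tools docs (`groupby`, `group1`, `group2`, `batch_correction`, `batchid1`, `batchid2`). The `age_group` column name and the `build_de_kwargs` helper are hypothetical, and the model call itself is left as a comment because it needs a trained model.

```python
def build_de_kwargs(shared_batches, groupby="age_group",
                    group1="aged", group2="young"):
    """Assemble keyword arguments for a batch-corrected DE call.

    Both groups get the same list of batches, so the batch term is
    identical on the two sides of the comparison.
    """
    batches = list(shared_batches)
    if not batches:
        raise ValueError("need at least one batch to condition on")
    return {
        "groupby": groupby,          # hypothetical obs column with age labels
        "group1": group1,
        "group2": group2,
        "batch_correction": True,
        "batchid1": batches,
        "batchid2": list(batches),   # same batches for both groups
    }


shared = ["aged1_dataset1", "young1_dataset1",
          "aged1_dataset2", "young1_dataset2"]
de_kwargs = build_de_kwargs(shared)

# With a trained model (not constructed here), the call would be:
# de_results = model.differential_expression(**de_kwargs)
```

For question 2, the cells from the smaller datasets would still belong to `group1` / `group2` through the `age_group` labels; they would just not appear in `shared`.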

Would appreciate your thoughts on this!