I am working on longitudinal 10x scRNA-seq data from the differentiation of human pluripotent stem cells into neural progenitor cells.
In total, there are 8 timepoints sampled along 10 days of the differentiation protocol (0h, 8h, 24h, 48h, 72h, 96h, 168h, 240h).
When integrating the data using the scVI model with batch_key='timepoints', it seems to overcorrect and not preserve biological/temporal information (plot attached).
I am not really an expert, so I was wondering if you have any suggestions on how to best proceed with the integration using scVI. I thought of assigning additional covariates or changing the
Looking forward to your suggestions!
Thanks and best,
When you provide ‘batches’ to scVI for integration, you are explicitly telling the model ‘Create a low-dimensional representation of the data that does not include variation due to the batches’.
If you did not want to remove variation explained by the time points, what did you actually want to do?
There are a number of potential ways to analyze these data, but what strategy you go for depends greatly on what you are aiming to learn from the data. If you describe briefly what you want to infer from the data, we might be able to point towards strategies that can help with this.
Thanks for reaching out!
In this case, I am interested in the developmental process and transcriptomic changes underlying the transition from pluripotent cells into differentiated neural progenitors. Since our differentiation protocol is relatively quick, I would for example expect that most transcriptomic changes occur over the first 5 to 6 timepoints while cells from timepoints 7 and 8 should not be very different from each other.
So my goal here is not to completely remove the timepoint-attributed variation, but rather to embed the cells so that their biological heterogeneity over time is preserved for follow-up analyses using e.g. RNA velocity or Waddington OT.
Hope this helps, happy to answer any questions!
Thank you for the details. It is not so clear what you are planning to infer using the embedding, but it definitely sounds like you wish to create an embedding that includes the variation due to the time points. In the setup you describe in your first post, you are explicitly removing any variation in gene expression due to the time points when creating the embedding. Including the time points T as ‘batch’ in scVI gives you a Z such that Z \perp T. To explore which time points have cells that are similar to cells from other time points, you would want a Z that includes this variation. That is, it seems that what you want is to not provide ‘timepoint’ as batch in scVI.
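As a rough illustration of that last point, here is a minimal sketch of fitting scVI without a batch key, so that the latent space Z is free to carry time-related variation. The function name `embed_keeping_time_variation`, the `counts_layer` argument and the `"X_scVI"` key are my own choices for illustration, and the sketch assumes `adata` holds raw counts (in `adata.X` or in the named layer) and that scvi-tools is installed.

```python
def embed_keeping_time_variation(adata, counts_layer=None, n_latent=10):
    """Fit scVI WITHOUT batch_key, so Z may include variation due to time.

    Hypothetical helper for illustration; assumes `adata` contains raw counts.
    """
    # Imported here so the sketch can be read without scvi-tools installed.
    import scvi

    # No batch_key is passed: nothing is integrated out of the embedding.
    scvi.model.SCVI.setup_anndata(adata, layer=counts_layer)

    model = scvi.model.SCVI(adata, n_latent=n_latent)
    model.train()

    # Store the latent representation Z for downstream analyses.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    return model
```

If there are genuine technical batches in the experiment (for example separate sequencing runs), those could still be passed as `batch_key` in `setup_anndata`; the point of the sketch is only that the time point column is left out.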
Just for the sake of completeness, if you were interested in gene expression, rather than embedding, it wouldn’t matter whether you include ‘timepoint’ as batch or not. If you do include it, gene expression is given by X = f(Z, T) where Z \perp T, while if you don’t include time point as batch, gene expression is given by X = f(Z), where Z may include variation due to T.
On the other hand, would you ever want to integrate out variation due to time points? Yes, say there are two distinct cell types with very different expression profiles. There may be temporal variation in the expression of genes in those cell types, but largely they are very distinct. To define the cell types without contribution of the genes that change in expression over time, you can integrate out that variation and use the embedding to label the cells. Now the labels are consistent across time points, and you can compare the fractions of cells in either of the cell types over the time points.
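The final comparison step in that scenario can be sketched in plain Python. This assumes you already have one time point and one cell type label per cell (for example from clustering a time-integrated embedding); the function name `cell_type_fractions` and the toy values below are made up for illustration and are not taken from the data discussed here.

```python
from collections import Counter, defaultdict


def cell_type_fractions(timepoints, labels):
    """Return {timepoint: {label: fraction of that timepoint's cells}}."""
    counts = defaultdict(Counter)
    for tp, lab in zip(timepoints, labels):
        counts[tp][lab] += 1
    return {
        tp: {lab: n / sum(c.values()) for lab, n in c.items()}
        for tp, c in counts.items()
    }


# Toy, made-up input: one entry per cell.
timepoints = ["0h", "0h", "0h", "0h", "240h", "240h", "240h", "240h"]
labels = ["A", "A", "A", "B", "A", "B", "B", "B"]

fractions = cell_type_fractions(timepoints, labels)
for tp, per_label in fractions.items():
    print(tp, per_label)
```

Because the labels were defined on an embedding with the time variation integrated out, a label such as "A" refers to the same cell type at every time point, which is what makes the fractions comparable.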
Hopefully this gives some examples of how analysis with time points differs depending on the focus of the question.