I am very new to scvi-tools and have mostly been working with scRNA-seq data in the R environment. I came across a specific problem that scvi-tools could potentially be well suited for, so I would love to learn more and would appreciate any guidance.
I would like to adapt scVI to learn a transformation between two batches, where the batches correspond to two RNA sequencing protocols. I have data where both protocols were applied to cells from the same samples, so I have “ideal” comparison pairs between batches.
Ideally, I’d like to pre-train a model on these “ideal” matched samples that could take a sample from one batch and map it into the “space” of the other batch. I would then save this model for later use on non-ideal/regular data, mapping it from one protocol space to the other. If this works, the “corrected/mapped” gene expression from one batch/protocol would be reconstructed to resemble that of the other batch/protocol.
Do you have recommendations or existing tooling within scvi-tools that supports this kind of use case? And if not directly supported, would it be best to subclass SCVI and adjust the loss, or is there a better entry point for this kind of modelling?
I am aware that scvi-tools is known for its developer tools, and I would love to know whether they would be useful/applicable in this case and how best to use them.
There’s no out-of-the-box function to do this, but I suggest running the scArches scheme with a tweak at the end:
For the reference: train an scVI model on the “matched” data from both protocols, with tech/protocol as the batch key. You can also use an scANVI model if you have cell type labels for each sample; it will be even more accurate, but it’s not mandatory.
For the query: take the new data that came from just one of the protocols and run the scArches scheme on it to create a new, fine-tuned model (it will still “remember” the missing protocol batch).
Finally, apply the fine-tuned query model’s get_normalized_expression function with transform_batch set to the missing protocol and library_size="latent". This is a sort of imputation (or reconstruction) of the other protocol’s gene expression.
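A minimal sketch of this reference / query / projection flow, assuming raw counts in adata_ref and adata_query and an obs column "protocol" with categories "protocol_A" and "protocol_B" (all of these names are placeholders for your own keys):

```python
import scvi

# --- Reference: train on the matched data with protocol as the batch key ---
scvi.model.SCVI.setup_anndata(adata_ref, batch_key="protocol")
ref_model = scvi.model.SCVI(adata_ref)
ref_model.train()
ref_model.save("ref_model/", overwrite=True)

# --- Query: new data measured with only one protocol, fine-tuned via scArches ---
scvi.model.SCVI.prepare_query_anndata(adata_query, "ref_model/")
query_model = scvi.model.SCVI.load_query_data(adata_query, "ref_model/")
query_model.train(max_epochs=200, plan_kwargs={"weight_decay": 0.0})

# --- Project the query cells into the missing protocol ---
reconstructed = query_model.get_normalized_expression(
    transform_batch="protocol_B",   # the protocol absent from the query data
    library_size="latent",
)
```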
You can validate that it works by taking your original data, splitting it into train and test sets (splitting at the level of matched couples), and following this process. Any preprocessing done on the train set before training should also be applied to the test set in that case.
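For that split, one option (illustrative only; "pair_id" is an assumed obs column identifying each matched couple) is to group by pair so both protocol measurements of a couple land on the same side of the split:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hold out whole matched couples, not individual samples.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(
    splitter.split(np.arange(adata.n_obs), groups=adata.obs["pair_id"])
)
adata_train, adata_test = adata[train_idx].copy(), adata[test_idx].copy()
```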
Thank you so much for the detailed explanation — this approach makes a lot of sense.
I had a follow-up question regarding the use of paired samples: as I understand, the model will learn from having both protocols represented in training via the batch_key, but is there a way to more directly leverage the fact that many of these samples are “paired” — they come from the same individual but were processed using two different protocols?
In my dataset, most of these matched pairs come from different tissues, so while there’s substantial biological variation across sample_ids, within a given pair (same sample_id) the main difference should only be the protocol.
So is there a way to integrate sample_id into training, perhaps via additional conditioning or contrastive learning, to help the model better disentangle protocol effects from biological signal? Or is this outside the scope of current scvi-tools functionality?
I see — I’ll definitely look into those suggestions!
I think what I’m aiming for might be slightly different: I’d like to treat the paired samples as a kind of ideal supervision — where a sample from one protocol (e.g., batch 1) can be directly compared to its counterpart from the other protocol (batch 2) because they come from the same individual. Rather than treating sample_id as a batch effect, I was thinking of using it to guide the model — so that batch 1 and batch 2 representations from the same sample are explicitly aligned or used in a loss.
The unmatched samples from different individuals wouldn’t really need to “see” each other at all — the focus would just be on learning a mapping via those matched pairs.
Does that sound reasonable, or would that require deeper customization?
Is sample_id the same as donor_id (the individual)? Do the same donor’s samples come from the same tissue? Otherwise you can use donor_id as the batch key together with protocol (see the sketch below).
I’m not so sure I understand what “paired” means here (like in the typical sense where we want to isolate the effect of a specific gene), except that each donor’s samples were run with the 2 protocols.
Is it a case where the counts of one protocol are always greater than the counts of the other protocol? If so, would it also make sense to model the difference signal between their expression profiles?
I think the method I suggested before, perhaps with sysVI (rather than scVI) as the model, would be interesting to try.
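One way to wire the “donor_id together with protocol” suggestion into scVI, purely as an illustration ("donor_id" and "protocol" are assumed obs columns in your AnnData):

```python
import scvi

# Option A: keep protocol as the batch key (so transform_batch still works later)
# and register donor_id as an extra categorical covariate.
scvi.model.SCVI.setup_anndata(
    adata,
    batch_key="protocol",
    categorical_covariate_keys=["donor_id"],
)

# Option B: fold both into a single combined batch label.
adata.obs["protocol_donor"] = (
    adata.obs["protocol"].astype(str) + "_" + adata.obs["donor_id"].astype(str)
)
scvi.model.SCVI.setup_anndata(adata, batch_key="protocol_donor")
# Note: with Option B, transform_batch later expects a specific protocol_donor category.
```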
Sorry, yes, the sample_id I was referring to is the donor_id (same individual). The same donor’s samples for batch 1 and batch 2 are from the same tissue.
In terms of “paired”, I meant that for each donor_id, this dataset has two samples - one sequenced with protocol A (batch 1) and the other with protocol B (batch 2). The whole dataset is a combination of such matched/paired samples from many donors. Thus, I thought that if I could pre-train the model on matched donor samples (since the only difference between them should technically be the protocol), it would be more accurate.
It is not necessarily the case that the counts of one protocol are always greater than the other; it depends on the gene.
My apologies for the confusion; I am very new to this but very interested in the possibilities of this project.
I wanted to thank you for your help, as I did end up using your approach to get the “reconstructed” gene expression of the other protocol. Now I want to compare that expression to the actual expression, which I have in a “raw” format. I was wondering whether I should also preprocess it with a model or run a standard log transformation so that it is comparable to the “reconstructed” gene expression?
Here we’ll need to think a bit:
You basically did transfer learning + batch projection. The get_normalized_expression function returns the normalized gene expression, so in order to get back to the original “raw counts” (for that transform_batch) you need to reverse the same normalization and transformation. Usually you normalized the counts of the original adata (e.g. to 1e4) and may have applied a log1p transform, so you would use library_size=1e4 and potentially exp(x) - 1. But this will still be an approximation.
I would also check what library_size="latent" gives you.
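For example, comparing the two scalings side by side (illustrative only, reusing the hypothetical query_model and "protocol_B" label from the sketch above):

```python
# Projected expression on a fixed per-cell scale (matching a 1e4 normalize_total convention)
expr_1e4 = query_model.get_normalized_expression(
    transform_batch="protocol_B", library_size=1e4
)
# The same projection scaled by the model's estimated (latent) library size per cell
expr_latent = query_model.get_normalized_expression(
    transform_batch="protocol_B", library_size="latent"
)
```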
Theoretically, to generate raw gene counts from a trained model we should use the posterior_predictive_sample() function, but in this case it doesn’t offer the batch-projection option; the workaround would be to run this function with the query model on input data whose batch has been manually forced to the other batch. I wonder how that will turn out.
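A hedged sketch of that “force the batch” idea (column and category names are assumptions, and the return format of posterior_predictive_sample varies across scvi-tools versions):

```python
# Copy the query data and overwrite its protocol label with the missing protocol,
# then sample count-scale data from the fine-tuned query model.
adata_forced = adata_query.copy()
adata_forced.obs["protocol"] = "protocol_B"

counts_sample = query_model.posterior_predictive_sample(adata=adata_forced, n_samples=1)
```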
Those are my thoughts right now; perhaps there is another way. You can use the train and validation sets to assess the quality of this sort of reconstruction.
@ori-kron-wis can you create a PR to do batch projection for posterior predictive samples? Pretty much in line with another private request from last week. Thanks!
Thank you, Ori! I will try both approaches, starting from the simpler one first.
So, to reiterate, you are saying that the output of get_normalized_expression is approximately equivalent to scaling by library size (typically 1e4, but as I understand it, I could set the library_size parameter to the desired scale) and then performing log1p?
I have checked the summary statistics (row sums and column means) of the output, and it doesn’t look like it has been log1p-transformed; however, the row sums range around the specified library_size. Is it possible that get_normalized_expression returns expression where each cell is divided by the total counts for that cell and multiplied by the scale factor (the chosen library_size in this case)?
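For concreteness, the rough check I ran looked something like this (illustrative only; expr is the get_normalized_expression output converted to a dense array):

```python
import numpy as np

vals = np.asarray(expr)            # output of get_normalized_expression
print(vals.sum(axis=1)[:5])        # per-cell totals: these land around the chosen library_size
print(vals.max(), vals.mean(axis=0)[:5])  # values look scale/count-like, not log1p-like
```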
I appreciate all your help! Apologies for so many questions.
We usually normalize the total cell counts before running the model (sc.pp.normalize_total) and also apply log1p to the raw data (sc.pp.log1p); see our quick start tutorial.
So the scaling you do now in get_normalized_expression (the library size) depends on that. It is possible you haven’t done any normalization before modelling, in which case it stays 1.
Also, the total counts might not be exactly the same size (as you said, they are “around the range”), since you probably ran HVG selection before modelling.
So, to summarize: it’s a reverse transformation, and it will eventually give you the estimated raw counts.
No worries about the many questions; we are always happy to help.