I have several datasets at my disposal: one is from a cancer patient, while the others are from the same organ in a healthy condition.
The idea is to use the cell types of the healthy samples as reference cells for infercnvpy, but I first want to integrate them using the basic scVI workflow.
I have some questions that come from being new to the scverse ecosystem (I’ve always used R/Bioc tools), but I think the answers could be useful to many users.
The processing history of the samples is as follows:
1. Read alignment and quantification with Cell Ranger
2. QC, normalization, HVF selection, dimensionality reduction, clustering and annotation in R/Bioc (scater + Seurat)
3a. Merge all healthy samples into a single Seurat object
3b. Convert both Seurat objects (tumor + healthy) to SingleCellExperiment objects, retaining only the raw counts and the metadata (including cell types)
3c. Convert the SCE objects to H5AD files with zellkonverter, so that adata.X contains raw counts
4. Import the AnnData objects in a notebook and merge them with anndata.concat
scVI is not going to alter the adata.X matrix (unless I indicate to do so, I guess)
infercnvpy works on adata.X by default, but it has the layer argument (in infercnvpy.tl.infercnv) to take different layers as input
infercnvpy expects a “gene expression matrix, appropriately preprocessed” in the adata layer it is going to work on
I am still open to comments on the steps I listed in the original post.
The open question that sums it all up is the following: does it make sense to run infercnvpy on the normalized (decoded) gene expression from scVI (basically the output of scvi.model.SCVI.get_normalized_expression)?
Again, thanks a lot to anyone that can provide comments, help or answers!
Sorry for the late reply, I must have missed the notification. I don’t necessarily want to smooth gene expression.
My question is: which results from scVI model training should I use for further analysis, and how?
I have seen that the scVI normalized expression values are used for visualization across all batches.
But can the same values be used as input for other tools (in this example, infercnv)?
I am encountering a similar issue and wanted to follow up on this question. In a typical tumor atlas project, you would want to verify your tumor cell annotations by CNV. But the tumor samples are typically run in multiple batches, and infercnvpy (as well as inferCNV and other CNV inference packages) is susceptible to batch effects. I’d like to be able to use scvi.model.SCVI.get_normalized_expression() with the transform_batch parameter to obtain a batch-corrected gene expression matrix for direct input into infercnvpy. I’ve done this, but the results look much worse (overly smoothed, without clear CNV clusters) than the original non-batch-corrected results.
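For reference, the approach described above might look roughly like this. It is a sketch of the idea, not a recommendation. `model` is assumed to be an already trained scvi.model.SCVI. The batch name, the `scvi_corrected` layer name, the library size of 1e4, the log1p step and the helper function itself are illustrative choices.

```python
import numpy as np


def add_scvi_corrected_layer(model, adata, reference_batch, layer="scvi_corrected"):
    """Decode every cell as if it came from `reference_batch` and store the result.

    The decoded values are log1p-transformed so that they are on a scale
    comparable to log-normalized counts.
    """
    expr = model.get_normalized_expression(
        adata,
        transform_batch=reference_batch,
        library_size=1e4,
        return_numpy=True,
    )
    adata.layers[layer] = np.log1p(expr)
    return adata
```

The new layer can then be passed to infercnvpy.tl.infercnv through its layer argument. These values are the decoder's expected expression rather than observed counts, so some smoothing relative to log-normalized counts is to be expected.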
Has anyone used scVI output as input for infercnvpy? How did you do it?
Does the strategy above make sense?
Any alternatives that people have run into?
@grst I’d greatly appreciate your insight if you have the time.
Sorry, I haven’t yet tried to use infercnv with batch correction. The best approach would probably be a dedicated model that learns the copy number variations while taking batch effects into account, but that would be a research project in itself.