Hi all,
I have a dataset with two modalities (RNA, protein) which I want to analyze using TOTALIV. I am more familiar with R so ideally I would like to run the model in Python and then transfer the latent repsresentations and denoised counts to continue my analysis there. I went through the user guide and I have a couple of questions that I needed some clarifications. I attach them below for anyone who may be able to assist:
- how can one store the protein raw counts in the adata file? It is obvious for the RNA counts
adata.layers[“counts”] = adata.X.copy()
but not for the protein ones.
-
Is it possible to extract the denoised counts for RNA, proteins to add in a Seurat object?
-
Is it possible to calculate DE genes/proteins for each of the clusters AND one additional condition? For example, cluster 1 stim vs ctrl? I can see that this information can be filled in the
groupby
argument, but I understand this can only apply for a one vs all comparison.
Thanks very much in advance,
Theo
Something like adata.obsm["protein_counts"] = protein_data_frame
would work. But it all depends how you’re loading the protein data in python. Is it cell ranger output? If so we have this function to help organize.
Yes, if you follow the tutorial command rna, protein = vae.get_normalized_expression
, the rna
and protein
objects should be dataframes that you can write to csvs with Pandas.
Yes, you can put the category of group 1 and group 2 (in a group 1 versus group 2 test) where group 1 and group 2 are string categories present in the groupby
cell metadata vector.
See more about these options here. As an example, groupby="cell_types"
and group1="cd4"
and group2="cd8"
Adam, thank you for the replies.
Something like adata.obsm["protein_counts"] = protein_data_frame
would work. But it all depends how you’re loading the protein data in python. Is it cell ranger output? If so we have this function to help organize.
I was planning to load the protein data from the respective Seurat object slot. So, it would be:
In R:
adata<-sc$AnnData(X=t(as.matrix(pbmc@assays$RNA@counts)),
obs=pbmc[],
var=GetAssay(pbmc)[])
adata$obsm[‘protein_counts’]=data.frame(pbmc@assays$ADT@counts)
Then save as .h5ad
and read in Python where I would set the up the anndata and run the model as shown in the user guide.
Yes, you can put the category of group 1 and group 2 (in a group 1 versus group 2 test) where group 1 and group 2 are string categories present in the groupby
cell metadata vector. See more about these options here. As an example, groupby="cell_types"
and group1="cd4"
and group2="cd8"
Here I maybe did not explain myself well. Let’s assume I have a ‘disease’ factor with two levels: ‘control’ and ‘disease’. I also have three cell types, ‘cd4’, ‘cd8’, b cell’. I want to compare control vs disease for each of the three cell types. From the TOTALVI.
differential_expression
documentation, I understand that the group1
and group2
arguments must be levels of the same parameter (defined by groupby
). Thus, I could only think of concatenating the two factors into one string for such a comparison. Would there be an alternative to this?
I’ve had trouble with this line in reticulate, you can make protein into a separate anndata and then combine in Python like
adata.obsm["protein"] = protein_adata.to_df()
What you suggested would work, but you could also use the indices argument to manually select the cells of group1 and group2.
Ok got it and all working fine now so far!
Just a couple more (hopefully last) clarifications as I get more familiar with the tool.
-
In the DGE analysis of a dual modality experiment, the model.differential_expression
compares features from both the e.g. RNA and protein and uses the corrected/denoised data for this comparison, am I correct? The reason why I am asking is that the results do not include any protein markers, but a few integers which I am not sure where they come from.
[1] “S100P” “14” “HLA-DRA” “S100A8” “CCL3” “CXCL8” “RPS18” “SRGN” “HLA-DRB1” “IL1B” “RPS8” “CD52” “RPS12” “LEF1” “TCF7” “34” “18” “LDHB” “MAL” “IL7R” “RPS6” “AQP3” “TRABD2A” “RPS13”
-
Reading other members’ posts, I wanted to ask what your recommendation for the n_number
argument should be. Default is 25, but it feels to me like a small posterior sample for a DE comparison. On the other hand, higher values would be computationally more costly. Do you have a feeling for this?
I think what is happening is that when you put the protein data inside the anndata, it’s a numpy array and not a pandas dataframe. If a dataframe is not provided scvi-tools will give numerical names to each protein. So “14” is the 15th column of the protein matrix (0 indexed in python)
You need to quadruple the number of samples to halve the standard deviation of the estimate. Too many samples becomes too computationally expensive and from our experience the variance is already fairly low. You should feel free to experiment with it though. At the end of the day the entire method is stochastic, so variance from run to run might even be higher than the monte carlo estimator variance.