I have an annotated reference dataset of 400k cells and a query dataset of 100k cells. I would like to transfer labels from the reference to the query using scvi-tools.
During label transfer with scANVI, I noticed that some classical clusters were not accurately predicted. For example, the reference data has 51 clusters, but only 30 of them appear in the predictions for the query data. Is there a way to improve the classification of these clusters by adjusting other parameters of the model? Any suggestions would be appreciated.
Hi, thank you for your question. Some parameters you may consider changing:
n_samples_per_label in SCANVI.train when training on the reference data. This enables label subsampling: the model samples n_samples_per_label observations from each cell type label at the start of each epoch. As a consequence, rare cell types are sampled more frequently, which can significantly affect your model's performance depending on the distribution of cell types in your dataset. In my experience testing this out, it leads to more stable classifier performance.
linear_classifier=True when initializing SCANVI. The default classifier includes multiple layers and could be overfitting on the training data, so a simpler linear classifier might help. I would try plotting the validation accuracy and/or classification loss during training to compare both options.
Both of these options are available in the latest version of scvi-tools (1.0.0).
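As a rough illustration of how the two options fit together, here is a minimal sketch. It assumes scvi-tools 1.0.0 or later, raw counts in adata.X, and an obs column holding the reference labels. The function name train_reference_scanvi, the column name "cell_type", the placeholder "Unknown", and the values 100 and 50 are hypothetical choices for this example, not recommendations.

```python
def train_reference_scanvi(
    adata,
    labels_key="cell_type",        # hypothetical obs column with reference labels
    unlabeled_category="Unknown",  # hypothetical placeholder for unlabeled cells
    n_samples_per_label=100,       # illustrative value, tune for your data
    linear_classifier=True,
    max_epochs=50,                 # illustrative value
):
    """Train SCANVI on reference data with label subsampling and a linear classifier."""
    import scvi  # imported lazily; requires scvi-tools >= 1.0.0

    # Register the AnnData object (expects raw counts in adata.X).
    scvi.model.SCANVI.setup_anndata(
        adata,
        labels_key=labels_key,
        unlabeled_category=unlabeled_category,
    )

    # linear_classifier=True swaps the default multi-layer classifier
    # for a single linear layer.
    model = scvi.model.SCANVI(adata, linear_classifier=linear_classifier)

    # n_samples_per_label turns on per-epoch label subsampling;
    # check_val_every_n_epoch=1 logs validation metrics every epoch.
    model.train(
        max_epochs=max_epochs,
        n_samples_per_label=n_samples_per_label,
        check_val_every_n_epoch=1,
    )

    # Collect whatever validation metrics were logged so that runs with
    # different settings can be plotted side by side.
    val_history = {
        key: frame
        for key, frame in model.history.items()
        if key.startswith("validation")
    }
    return model, val_history
```

Calling this once with linear_classifier=True and once with linear_classifier=False, then plotting the returned validation curves, is one way to compare the two classifiers as suggested above. The exact metric names in model.history may differ between versions, which is why the sketch filters by prefix instead of hard-coding keys.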
Thank you for your reply. I tried both the linear classifier and the n_samples_per_label parameter. These adjustments improved the accuracy: about 90% of the cell labels in the reference data are now predicted correctly.
I have a naive question regarding another dataset. This dataset consists of 100k cells, of which 90% were used for training and 10% for testing the model. I already know the class labels for the test set. The model predicted the labels of both the test set and the reference set with high confidence (soft=True).
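For a held-out test set with known labels, overall accuracy can hide labels that are never predicted at all. The sketch below is one possible per-label check in plain Python. The function name per_label_report and the toy labels are made up for illustration, and it assumes the soft predictions have been converted to one dict of label probabilities per cell (for a pandas DataFrame, to_dict("records") produces that shape).

```python
from collections import Counter, defaultdict


def per_label_report(true_labels, soft_predictions):
    """Summarize recall and mean top-label confidence for each true label.

    soft_predictions: one dict per cell mapping label -> probability.
    Returns (report, never_predicted).
    """
    totals = Counter(true_labels)
    correct = Counter()
    confidence_sum = defaultdict(float)
    predicted_labels = set()

    for truth, probs in zip(true_labels, soft_predictions):
        best = max(probs, key=probs.get)  # hard call = most probable label
        predicted_labels.add(best)
        confidence_sum[truth] += probs[best]
        if best == truth:
            correct[truth] += 1

    report = {
        label: {
            "n_cells": n,
            "recall": correct[label] / n,
            "mean_confidence": confidence_sum[label] / n,
        }
        for label, n in totals.items()
    }
    # Labels present in the truth but never chosen as the top prediction.
    never_predicted = sorted(set(totals) - predicted_labels)
    return report, never_predicted


# Toy data (made up): the rare label "pDC" is absorbed into "T".
true_labels = ["B", "B", "T", "T", "pDC"]
soft_predictions = [
    {"B": 0.9, "T": 0.1, "pDC": 0.0},
    {"B": 0.8, "T": 0.1, "pDC": 0.1},
    {"B": 0.2, "T": 0.7, "pDC": 0.1},
    {"B": 0.1, "T": 0.9, "pDC": 0.0},
    {"B": 0.1, "T": 0.6, "pDC": 0.3},
]
report, never_predicted = per_label_report(true_labels, soft_predictions)
print(never_predicted)
print(report["pDC"])
```

In the toy data four of five cells are correct overall, yet the rare label has zero recall, which is the pattern to look for when fewer clusters show up in the predictions than in the reference.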
Now, I would like to use this trained model to predict the same class labels for another dataset of approximately 200k cells, which is twice the size of the reference dataset. Is it feasible to apply the same method? How should I tune the parameters for this new dataset?