Hi everyone,
I am trying to run SOLO for doublet detection, but I am not sure what the correct workflow is.
At first, I thought I had to run Solo after sc.pp.highly_variable_genes(adata), but later I realized it makes more sense to run Solo on the unfiltered raw dataset. I was also not sure whether I have to remove the empty droplets first, and whether this affects Solo training.
Because I have limited resources, I also trained Solo on 30 percent of the data and then transferred the trained weights to predict doublets in the whole dataset. Is this a reliable option?
Could someone help me clarify this?
Remove empty droplets.
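As a rough illustration of that step: empty droplets are often screened by the total count per barcode. The sketch below is a minimal pure-Python version of that idea. The function name `keep_non_empty` and the cutoff of 100 counts are hypothetical, and dedicated tools such as emptyDrops or CellBender use more principled models than a fixed threshold.

```python
def keep_non_empty(counts, min_total=100):
    """Return indices of barcodes whose total count is at least min_total.

    counts: list of per-barcode count vectors (one row per barcode).
    min_total: hypothetical cutoff; pick it from your own data, not from here.
    """
    totals = [sum(row) for row in counts]
    return [i for i, total in enumerate(totals) if total >= min_total]


# Toy data: barcodes 0 and 2 look like cells, barcode 1 looks empty.
toy_counts = [[120, 30, 50], [1, 0, 2], [80, 40, 10]]
kept = keep_non_empty(toy_counts, min_total=100)
```

On an AnnData object, the same kind of threshold could be applied with `sc.pp.filter_cells(adata, min_counts=...)`.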
Try to use as much of the data as possible. In what way are your resources limited?
If you mean the adata is too large to fit in memory, you can run with adata backed from disk or use a custom dataloader. Otherwise, if GPU memory is the limit, decrease the batch size.
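For the GPU-memory case, here is a minimal sketch of passing a smaller batch size through the usual scVI then Solo workflow in scvi-tools. It assumes `adata.X` holds raw counts with empty droplets already removed. The wrapper name `detect_doublets` and the value 64 are illustrative choices, not recommendations.

```python
def detect_doublets(adata, batch_size=64, max_epochs=None):
    """Train scVI, then Solo, on raw counts and return Solo's predictions.

    batch_size: smaller values lower peak GPU memory (scvi-tools defaults to 128).
    max_epochs: passed to the scVI model only; None keeps the library default.
    """
    # Imported lazily so the function can be defined without scvi-tools installed.
    import scvi

    # Register the raw count matrix stored in adata.X.
    scvi.model.SCVI.setup_anndata(adata)
    vae = scvi.model.SCVI(adata)
    vae.train(max_epochs=max_epochs, batch_size=batch_size)

    # Solo builds its doublet classifier on top of the trained scVI model.
    solo = scvi.external.SOLO.from_scvi_model(vae)
    solo.train(batch_size=batch_size)

    # Per-cell doublet and singlet scores for the cells in adata.
    return solo.predict()
```

If memory is still tight, halving `batch_size` again is the simplest next step. Training takes longer, but the result should not depend strongly on it.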
Also, just use the raw counts; no extra metadata is needed in adata (it only takes up more space).
Of course, you can split your data into train and test sets and predict doublets on the test set. No need to transfer weights. Check your training and validation curves to see whether the model fits without overfitting. But even then I would use at least 60% for training (20% validation, 20% test).
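The 60/20/20 split can be sketched with the standard library alone. The helper name `split_indices` and the fixed seed are assumptions made for this example; the returned index lists could then be used to subset adata.

```python
import random


def split_indices(n_cells, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle cell indices and split them into train, validation, and test."""
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split

    n_train = round(n_cells * train_frac)
    n_val = round(n_cells * val_frac)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains, here about 20%
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
```

The `train` methods in scvi-tools also accept `train_size` and `validation_size` arguments, which may cover the train/validation part without manual indexing.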