Training an ML model to classify cell types from several annotated datasets

I am new to scRNA-seq research and am trying to train a model that classifies cell types, using published annotated datasets as training data. Could someone guide me on how to preprocess several different datasets so they can be combined into a single dataset and used as input for supervised learning? Each dataset has a different set of genes as its var, and usually only some of those genes are shared across datasets. I would like to know how others deal with this. Thank you!
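
To make the question concrete, here is a rough sketch of the naive approach I have in mind: keep only the genes that appear in every dataset, reorder the columns so they line up, and stack the cells with their labels. Everything here is an assumption for illustration, not a known-good pipeline: the tuple layout (gene names, expression rows, labels), the helper names shared_genes and combine, and the toy values are all made up, and I am using plain Python lists instead of real AnnData objects to keep it self-contained.

```python
from typing import List, Tuple

# Hypothetical toy layout for one dataset: (gene names, cells x genes rows, cell type labels).
Dataset = Tuple[List[str], List[List[float]], List[str]]


def shared_genes(datasets: List[Dataset]) -> List[str]:
    """Return the genes present in every dataset, sorted for a stable column order."""
    common = set(datasets[0][0])
    for gene_names, _, _ in datasets[1:]:
        common &= set(gene_names)
    return sorted(common)


def combine(datasets: List[Dataset]) -> Tuple[List[str], List[List[float]], List[str]]:
    """Restrict every dataset to the shared genes and stack cells and labels."""
    genes = shared_genes(datasets)
    X: List[List[float]] = []
    y: List[str] = []
    for gene_names, rows, labels in datasets:
        # Map gene name -> column index in this dataset, then pick shared columns in order.
        column_of = {gene: i for i, gene in enumerate(gene_names)}
        keep = [column_of[gene] for gene in genes]
        for row, label in zip(rows, labels):
            X.append([row[i] for i in keep])
            y.append(label)
    return genes, X, y


# Made-up toy data: two datasets with partly overlapping genes in different orders.
dataset_a: Dataset = (
    ["CD3E", "CD19", "MS4A1"],
    [[5.0, 0.0, 0.0], [0.0, 3.0, 4.0]],
    ["T cell", "B cell"],
)
dataset_b: Dataset = (
    ["MS4A1", "CD3E", "NKG7"],
    [[0.0, 6.0, 1.0]],
    ["T cell"],
)

genes, X, y = combine([dataset_a, dataset_b])
print(genes)  # shared genes, used as feature columns
print(X)      # one row per cell, columns aligned to `genes`
print(y)      # cell type label per row
```

The resulting X and y could then be passed to any supervised classifier. With real AnnData objects I believe anndata.concat with join="inner" performs the same intersection on var, but I do not know whether intersecting genes like this is what people actually do, or whether extra steps are needed before training.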