I was reading about pseudo-labelling as a way to improve my model's performance. I found some posts on the internet, but I'm still not sure I fully understand it.
At first, I thought it would mean doing something like this: train a model, get the predicted probabilities for the submission set, and append the datapoints whose probability exceeds some threshold - above 95%, for instance - to the training set, using the predictions as labels. So the training set would then contain some datapoints from the submission set alongside the original training data. Is this the correct definition?
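Here is a minimal sketch of what I have in mind, with synthetic data standing in for my real training and submission sets (so X_train, X_test, and the 95% cut-off are just placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in practice X_test would be the unlabelled submission set.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)
confidence = proba.max(axis=1)                 # highest class probability per row
pseudo = model.classes_[proba.argmax(axis=1)]  # predicted class for each row

mask = confidence >= 0.95                      # keep only the confident predictions
X_aug = np.vstack([X_train, X_test[mask]])
y_aug = np.concatenate([y_train, pseudo[mask]])

model.fit(X_aug, y_aug)                        # retrain on the augmented training set
```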
I also found that scikit-learn has semi-supervised algorithms that could be used here, such as SelfTrainingClassifier. But I couldn't find a good tutorial on it. Is SelfTrainingClassifier related to pseudo-labels? Does anyone have any material on this topic?
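For concreteness, here is a minimal usage sketch I pieced together from the scikit-learn docs, again on synthetic stand-in data; the API marks unlabelled samples with -1 in the target vector:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic stand-in data: pretend the second half is the unlabelled submission set.
X, y = make_classification(n_samples=1000, random_state=0)
y_semi = y.copy()
y_semi[500:] = -1  # -1 marks a sample as unlabelled

# Wrap any classifier that implements predict_proba; threshold plays the same
# role as the 95% cut-off described above.
self_training = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.95)
self_training.fit(X, y_semi)

print(self_training.predict(X[500:])[:10])  # predictions for the unlabelled rows
```

As far as I can tell, this is essentially the thresholded pseudo-labelling idea, just repeated for several iterations inside fit.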
Best regards!
Your first thought is just about right, @yukioandre!
Nice, thank you so much, Professor!