Hello everyone!
Since not everyone may be following the discussion on my previous post, and I'm concerned the content might be overlooked by the Zindi team, I've decided to create a dedicated post about the ensemble rule for this competition.
The term "ensemble" is ambiguous. In many ML competitions, "ensemble" refers to combining predictions from several models to derive a final decision. However, certain algorithms, such as Random Forest and Gradient Boosting, are inherently "ensemble models" by design. Given this, the concerns raised by fellow competitors are quite valid.
Here's a proposed, clearer set of guidelines regarding ensemble models for this competition:
1. Definition: In the context of this competition, an "ensemble" is defined as the integration of predictions from several distinct models. This definition does not include the inherent ensemble mechanisms that are part of standard algorithms like Random Forest or Gradient Boosting Machines (LightGBM/XGBoost).
2. Restriction on Ensembles: Competitors can merge predictions from up to three unique models to create an ensemble. For instance, if one wishes to form an ensemble, predictions from models like Linear Regression, Random Forest, and a Neural Network can be combined. This combination will be recognized as a single ensemble.
3. Cross-Validation Clarification: Using cross-validation and averaging the predictions from multiple folds does NOT count as multiple models. It's considered a part of the training and evaluation process for a single model. For instance, if you train a Neural Network using 5-fold cross-validation and then average the predictions of those 5 folds to create a submission, it's still considered a single Neural Network model.
4. Inherent Ensemble Models: Algorithms like Random Forest, Gradient Boosting Machines, etc., which inherently use ensemble mechanisms, are treated as a single model regardless of the number of trees/estimators they use. For instance, a Random Forest with 100 trees is considered one model, not 100.
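To make the definitions in points 1 and 2 concrete, here's a minimal sketch. The three plain functions stand in for three distinct fitted models (say, Linear Regression, Random Forest, and a Neural Network); the simple averaging is just one illustrative way to combine them, not a prescribed method.

```python
# Three toy stand-ins for distinct trained models (illustrative only).
def model_a(x):
    return 2.0 * x          # e.g. a linear model's prediction

def model_b(x):
    return x + 1.0          # e.g. a tree ensemble's prediction

def model_c(x):
    return 0.5 * x + 2.0    # e.g. a neural network's prediction

def ensemble(x):
    """One ensemble built from up to three distinct models (rule 2):
    combining their predictions into a single final prediction."""
    preds = [model_a(x), model_b(x), model_c(x)]
    return sum(preds) / len(preds)

# Under rule 4, each stand-in counts as ONE model even if it is internally
# an ensemble (e.g. a Random Forest with 100 trees).
final_prediction = ensemble(4.0)  # mean of 8.0, 5.0, 4.0
```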
Of course, this is merely a suggestion. I'm eager to hear opinions from both the competitors and the Zindi team. Regardless of the eventual decision, I believe there's consensus that the text in the Data Section needs revision.
Here's what I think they mean when they say each participant is allowed to submit a maximum of 3 ensembled models: each person can combine predictions from at most 3 models, e.g. (XGBoost, Random Forest, LightGBM). So, for example, using cross-validation with 10 folds and generating a submission by averaging the predictions from those 10 folds counts as a single model, not an ensemble of 10 models. That's just my reading of the rule: we can't use more than 3 models in our ensembles (computational cost, I guess).
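That fold-averaging reading can be sketched as follows. The "training" here is a toy mean-predictor purely for illustration; the point is that 10 fold-level fits of the same algorithm, averaged into one submission, still count as ONE model under this interpretation.

```python
def train_mean_model(train_targets):
    """Toy 'training': the fitted model just predicts the training mean."""
    mean = sum(train_targets) / len(train_targets)
    return lambda _x: mean

targets = list(range(20))   # toy training targets (assumed data)
n_folds = 10
fold_size = len(targets) // n_folds

# Fit one copy of the SAME algorithm per fold, holding out that fold.
fold_models = []
for k in range(n_folds):
    held_out = set(range(k * fold_size, (k + 1) * fold_size))
    train = [t for i, t in enumerate(targets) if i not in held_out]
    fold_models.append(train_mean_model(train))

# One submission = average of the 10 fold predictions; under this reading
# of the rule, this is still a single model, not an ensemble of 10.
submission = sum(m(None) for m in fold_models) / n_folds
```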
If you decide to use a LOOCV strategy, you will have more than 3,800 models. And according to you, we have to consider it a single model if we average the predictions of all these models?
Thanks for consolidating that discussion into a single post. I agree on the clarifications that you've made.
Point 3 could be a discussion point for the hosts, though, since it bears on the hosts' commitment to keeping the resulting predictions lightweight. If they want to encourage a small footprint, they could count each fold's trained model towards the limit. That change is the difference between allowing 3 models and allowing 15, even with just the 5-fold setup the host suggested in the previous discussion.
OK, but for example XGBoost, Random Forest, and LightGBM are 3 models.
That would imply that you wouldn't be allowed any sort of fusion layer to mix/stack/bag/boost the results of the 3. You'd only be allowed to do some sort of averaging.
With a regression model ensembling your models, that's already 1 model, so you'd only be allowed 2 models below it, not 3 as mentioned.
Is this correct?
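The stacking concern above can be sketched like this. If a fusion/regression layer combines the base predictions, that layer is itself a model, so with a cap of 3 you could have at most 2 base models plus 1 meta-model. The fixed weights below are made-up numbers standing in for a fitted regression, purely for illustration.

```python
# Two base models (stand-ins for e.g. XGBoost and Random Forest).
def base_1(x):
    return x * x

def base_2(x):
    return 3.0 * x

def meta(p1, p2):
    """A learned fusion layer (fixed weights stand in for a fitted
    regression). Under this interpretation, it counts as model #3."""
    return 0.7 * p1 + 0.3 * p2 + 0.1

def stacked_prediction(x):
    return meta(base_1(x), base_2(x))

models_used = 3  # 2 base models + 1 meta-model, hitting the cap
```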
So:
Maximum number of single models allowed: 3
Maximum number of ensembles using predictions from the above models: 1
Your submission should consist of up to 3 single models used to create ONE ensemble.
Note: if you are not satisfied with the predictions of one of the models and wish to replace it, you can swap it for a different single model, as long as the total stays at 3. If you prefer to use only 2 models for the ensemble, that's fine as well. And should you wish to use just 1 model, that's allowed too; in such cases, the ensemble doesn't count.
OK, thanks. What I understand from that is: I'm allowed to have 3 models, plus a model on top that ensembles the 3.
In this competition context, the ensemble is a combination of the predictions of the three models into one final prediction.