Well, it seems like this competition will be won by post-processing. I just need clarification from @Zindi @Amy_Bray @ZINDI
based on this:
"Zindi is committed to providing solutions of value to our clients and partners. To this end, we reserve the right to disqualify your submission on the grounds of usability or value. This includes but is not limited to the use of data leaks or any other practices that we deem to compromise the inherent value of your solution."
Do post-processing solutions adhere to this rule? This is the same situation as the Africa Credit Transaction Challenge, where the top solutions were propelled by post-processing. Clarification will help us choose the right submissions, because we might choose post-processing solutions and get deranked later on. Thank you
Hmmm, setting thresholds is discouraged, as we as data scientists can't have the final say on something a doctor might know better than us. It is best to leave your predictions as they are, and perhaps write a paragraph on what you would do further if you were to round.
This means medical professionals in communication with the researchers who created the dataset can set their own thresholds and add what is important to them.
"It is best to leave your predictions as is"
okay thanks for the clarifications
Just for confirmation: is post-processing with values not produced by the model discouraged or prohibited? I mean, is it merely bad practice, or is it not accepted and grounds for disqualification from a prize?
I don't agree. For imbalanced problems it is good to set a threshold, because the natural 0.5 threshold for binary problems is often not appropriate. Still, a threshold should not be chosen without data evidence. There are proper ways to fine-tune thresholds, well known to Kagglers and others. The real issue here is the use of data leakage, not thresholds as such.
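As a sketch of what data-driven threshold tuning looks like: the labels and scores below are toy placeholders, and F1 is just an assumed metric; real tuning would sweep thresholds over genuine out-of-fold predictions rather than synthetic ones.

```python
import numpy as np

# Toy "OOF" setup: an imbalanced binary target (~10% positives) and
# hypothetical model scores that are higher, on average, for positives.
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.1).astype(int)
proba = np.clip(0.4 * y_true + rng.normal(0.2, 0.15, 1000), 0, 1)

def f1(y, y_hat):
    tp = np.sum((y_hat == 1) & (y == 1))
    fp = np.sum((y_hat == 1) & (y == 0))
    fn = np.sum((y_hat == 0) & (y == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Sweep candidate thresholds on the OOF predictions instead of assuming 0.5.
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1(y_true, (proba >= t).astype(int)) for t in thresholds]
best_t = float(thresholds[int(np.argmax(scores))])
print(f"best threshold on OOF data: {best_t:.2f}")
```

The point is that the choice is backed by the out-of-fold data, not by probing the leaderboard.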
Use of Data leakage. hmmm. interesting :)
Post-processing is not a kind of black box. It is generally guided by good EDA or systematic debugging of model errors. Maybe other people have another view on it.
Using African Credit Scoring as an example, how does setting all Ghanaian records to 1 benefit the organizer or the client, as one of the discussions suggested? Does this imply that when expanding into a new market, they would simply assume all records to be 1?
While I agree that in cases of unbalanced problems, we can adjust the default threshold of 0.5, I tend to disagree with applying this approach in this particular scenario.
These are my thoughts.
I see the Ghana example of setting all clients to 1 is not a good example. I saw that discussion, and I know the author shared misleading information. That's what I said: this was a choice not backed by data. Again, this choice has no actionable value.
So, if I understand you correctly, your argument is that as long as post-processing is guided by data, can be clearly explained, and is generalizable to real-world scenarios, it should be allowed. If that is the case, I agree with you.
However, shady post-processing techniques should not be allowed, as they often rely on luck rather than sound methodology.
@Koleshjr, are you saying we should not set threshold in the competition because it is a form of postprocessing?
Yes @Koleshjr. You summarised my thoughts
@CodeJoe
That's not a simple yes-or-no question. The answer depends on how the threshold is determined.
✅ Allowed if: the threshold is data-driven (e.g. tuned on cross-validation) and generalizes to real-world scenarios.
❌ Disallowed if: it is picked through leaderboard probing or luck, with no supporting evidence from the data.
Anyone using thresholds to win should provide a sound explanation for their choice, ensuring it is data-driven and generalizable. Additionally, the client must agree with the approach. This ensures that only well-justified post-processing techniques are allowed, maintaining fairness and real-world applicability.
Alright. Thanks for the heads up.
But remember, that's just my personal opinion, not Zindi's stance. Maybe they will discourage all threshold-based solutions, so the best thing is to wait for the final say from @Zindi @ZINDI @Amy_Bray.
I believe you did something which we can't see, I am really impressed!
You also did something we can not see😂😂. We are really impressed.
I am from earth but I am not sure about someone.
🤣🤣🤣
It is almost over Yisak 😂
Let me pray 🙏 for upcoming disaster.
May the shake up be with us 🙏
🤣🤣🤣🤣
The current approach tends to deter others, as it relies heavily on intuition rather than data-driven methodologies. As @koleshjr pointed out, we should prioritize developing a generalized model pipeline that avoids post-processing based on subjective assumptions. Instead, it should be grounded in empirical data, ensuring it is adaptable and not constrained by rigid, predefined standards.
My two cents!
This is a good question, but unfortunately it is something all platforms struggle with, and that is why competitions are slightly different from the real world, especially if the organizer has not split the data well.
Now let's take a few scenarios. Tell me which ones you would consider illegal post-processing:
- set the minimum prediction to 0 if the model is giving you negative values.
- round the predictions (using Python's built-in round, which rounds halves to the nearest even integer) since case counts should always be integers.
- round the predictions in whatever way gives you the best local CV, since case counts should always be integers.
- since we have almost no training data for cholera, and all the cholera data we do have is 0, assume you have non-zero predictions from your model (expected, because for the test cholera cases the tree splits will be based mostly on location, so you will get roughly that location's mean). Now you have two options for the final submission: either submit the cholera predictions the model provided, or overwrite the cholera predictions with zero (which is logical, but is post-processing).
- for diseases and locations that appear in both train and test, you select a multiplier or transformation that gives you the best OOF score and apply the same multipliers to the test set.
Now in all the above examples you can say you used OOF to find the best post-processing number, when all you actually did was leaderboard probing. How will you catch someone? :)
Basically, either be smart and split the data well, or stay happy with overfitted models, give people the confidence to label the winning teams as "overfitting" and the like, and make them proud.
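To make the first, second, and cholera scenarios above concrete, here is a minimal numpy sketch; the disease names and prediction values are toy placeholders, not actual model outputs:

```python
import numpy as np

# Hypothetical raw outputs for four test rows (toy values only).
preds = np.array([-0.4, 2.6, 17.3, 0.8])
disease = np.array(["Malaria", "Typhoid", "Intestinal Worms", "Cholera"])

# Scenario 1: floor at zero -- case counts cannot be negative.
preds = np.clip(preds, 0, None)

# Scenario 2: round to integers, since reported cases are whole numbers.
# (np.rint rounds halves to even, like Python 3's built-in round.)
preds = np.rint(preds)

# Cholera scenario: overwrite with 0 because the training data for this
# class contains only zeros.
preds[disease == "Cholera"] = 0
print(preds)  # [ 0.  3. 17.  0.]
```

Each line is mechanically trivial; the whole debate is about which of these steps is defensible from the data alone.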
and now, since we are on this topic, I just spent 10 minutes of my life on post-processing, so let me share the results.
There is a disease called Intestinal Worms. The mean number of cases per year has stayed around 18 in the training data. On the test set, my model's mean prediction for Intestinal Worms cases is 12. I simply multiplied my predictions by 1.2, based on the training data, and my leaderboard score dropped from 5.89 to 5.87.
The question is: should the above post-processing be allowed? (It is based on EDA, and I am using it to probe the LB; it could go either way on the private LB.)
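A minimal sketch of the mean-matching adjustment described above; only the means of 18 and 12 and the 1.2 factor come from the post, and the individual test predictions are toy numbers:

```python
import numpy as np

train_mean = 18.0                                  # historical mean from the post
test_preds = np.array([10.0, 14.0, 12.0, 12.0])    # toy predictions, mean 12

# A fully mean-matching factor would be train_mean / test_preds.mean() = 1.5;
# the post instead uses a gentler hand-picked 1.2.
scale = 1.2
adjusted = test_preds * scale
print(adjusted.mean())  # ~14.4, closer to the historical 18 but still below it
```

Note that nothing in this snippet tells you whether 1.2 generalizes; that is exactly why it doubles as leaderboard probing.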
To be honest, these are the sorts of post-processing techniques I tend to disagree with.
My question is: if such adjustments were valid, shouldn't they be learned by the model itself rather than applied post hoc?
My preference would be to discourage all coefficient-based post-processing, but that is just my take.
Exactly, agreed. However, a counter-argument could be that we do not have homogeneous data in the training set (meaning the locations for Intestinal Worms in training are very different from those in test, so the model is destined to be biased), and, being smart, I know this should not happen, so I correct for it. Honestly, this is a problem-formulation problem, not a DS problem. The issue could very well be that I want a generalizable model, but in that case the training split had to be better (which I could go into the details of, but it would be an essay :) )
Well, I hope the organizers will give us their final word before the end of the competition :)
One more scenario; I spent 5 more minutes on a Google search rather than ML.
https://en.wikipedia.org/wiki/2023%E2%80%942024_cholera_outbreak_in_South_Africa
based on the article above there was a cholera outbreak in 2023, using this info, someone could take a bet for private leaderboard by increasing the cholera predictions on test set. Should this be legal?
There is no justification for this even in the EDA, to be fair. How would you post-process cholera cases upward with no (provided) data to support it? We only have very few samples. On the other hand, what about someone who sets all cholera cases to zero, with the justification from the data that there are not enough data points to predict this class?
exactly :)
I think they shouldn't have added cholera cases to the test set, since this is a massive domain shift that ML models cannot catch. I also saw a WHO report about cholera outbreaks in 2023. I think it is going to be a lottery for people who guess luckily on cholera 😂