
SUA Outsmarting Outbreaks Challenge

Helping the United Republic of Tanzania
$12,500 USD + AWS credits
Completed (~1 year ago)
Prediction
815 joined
395 active
Start: Dec 06, 24
Close: Jan 31, 25
Reveal: Feb 01, 25
Koleshjr
Multimedia University of Kenya
Post Processing Tricks
Platform · 29 Jan 2025, 06:47 · 32

Well, it seems like this competition will be won by post-processing. I just need clarification from @Zindi @Amy_Bray @ZINDI

Based on this:

"Zindi is committed to providing solutions of value to our clients and partners. To this end, we reserve the right to disqualify your submission on the grounds of usability or value. This includes but is not limited to the use of data leaks or any other practices that we deem to compromise the inherent value of your solution."

Do post-processing solutions adhere to this rule? This is the same situation as the Africa Credit Transaction Challenge, where the top solutions were propelled by post-processing. Clarification will help us choose the right submissions, because we might choose post-processing solutions and get deranked later on. Thank you.

Discussion 32 answers
Amy_Bray
Zindi

Hmmm, setting thresholds is discouraged, as we as data scientists can't have the final say on something a doctor might know better than us. It is best to leave your predictions as they are, and perhaps write a paragraph on what you would do further if you were to round.

This means that medical professionals, in communication with the researchers who created the dataset, can set their own thresholds and add what is important to them.

29 Jan 2025, 06:55
Upvotes 1
Koleshjr
Multimedia University of Kenya

"It is best to leave your predictions as is"

Okay, thanks for the clarification.

Just for confirmation: is post-processing with values not produced by the model discouraged or prohibited? I mean, is it merely bad practice, or is it not accepted and grounds for disqualification from the prize?

marching_learning
Nostalgic Mathematics

I don't agree. For unbalanced problems, it is good to set a threshold, because the natural 0.5 threshold for binary problems is not well suited. That said, a threshold should not be chosen without data evidence. There are proper ways to fine-tune thresholds, well known to Kagglers and others. The real issue here is the use of data leakage, not thresholds as such.
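A data-driven way to pick such a threshold (rather than defaulting to 0.5) can be sketched as below; the model, the synthetic data, and the choice of F1 as the metric are illustrative assumptions, not anything specified in this competition:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (roughly 90% negatives, 10% positives).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_val)[:, 1]

# Sweep candidate thresholds on held-out data instead of assuming 0.5.
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_val, probs >= t) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(f"best threshold: {best_t:.2f}, F1: {max(scores):.3f}")
```

The point is that the cut-off is justified by held-out evidence rather than chosen by "vibes"; the same sweep can be repeated per fold to check that the chosen threshold is stable.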

Krishna_Priya

Use of Data leakage. hmmm. interesting :)

marching_learning
Nostalgic Mathematics

Post-processing is not a kind of black box. It is generally guided by good EDA or systematic debugging of model errors. Maybe other people have another view on it.

29 Jan 2025, 08:16
Upvotes 1
Koleshjr
Multimedia University of Kenya

Using African Credit Scoring as an example, how does setting all Ghanaian records to 1 benefit the organizer or the client, as one of the discussions suggested? Does this imply that when expanding into a new market, they would simply assume all records to be 1?

While I agree that in cases of unbalanced problems, we can adjust the default threshold of 0.5, I tend to disagree with applying this approach in this particular scenario.

These are my thoughts.

marching_learning
Nostalgic Mathematics

I agree the Ghana example of setting all clients to 1 is not a good example. I saw that discussion, and I know the author shared misleading information. That's what I said: it was a choice not backed by data. Moreover, the choice has no actionable value.

Koleshjr
Multimedia University of Kenya

So, if I understand you correctly, your argument is that as long as post-processing is guided by data, can be clearly explained, and is generalizable to real-world scenarios, it should be allowed. If that is the case, I agree with you.

However, shady post-processing techniques should not be allowed, as they often rely on luck rather than sound methodology.

CodeJoe

@Koleshjr, are you saying we should not set a threshold in this competition because it is a form of post-processing?

marching_learning
Nostalgic Mathematics

Yes @Koleshjr. You summarised my thoughts

Koleshjr
Multimedia University of Kenya

@CodeJoe

That's not a simple yes-or-no question; the answer depends on how the threshold is determined.

Allowed if:

  • The threshold is derived from the data through exploratory data analysis (EDA).
  • It is generalizable to unseen real-world scenarios.

Disallowed if:

  • The threshold is chosen arbitrarily or based on intuition ('vibes') rather than data-driven insights.

Anyone using thresholds to win should provide a sound explanation for their choice, ensuring it is data-driven and generalizable. Additionally, the client must agree with the approach. This ensures that only well-justified post-processing techniques are allowed, maintaining fairness and real-world applicability.

CodeJoe

Alright. Thanks for the heads up.

Koleshjr
Multimedia University of Kenya

But remember, that's just my personal opinion, not Zindi's stance. Maybe they will discourage all threshold-based solutions, so the best thing is to wait for the final say from @Zindi @ZINDI @Amy_Bray.

Yisakberhanu
Wachemo University

I believe you did something which we can't see, I am really impressed!

CodeJoe

You also did something we can not see😂😂. We are really impressed.

Yisakberhanu
Wachemo University

I am from earth but I am not sure about someone.

CodeJoe

🤣🤣🤣

marching_learning
Nostalgic Mathematics

It is almost over Yisak 😂

Yisakberhanu
Wachemo University

Let me pray 🙏 for upcoming disaster.

marching_learning
Nostalgic Mathematics

May the shake up be with us 🙏

CodeJoe

🤣🤣🤣🤣

The current approach tends to deter others, as it relies heavily on intuition rather than data-driven methodologies. As @koleshjr pointed out, we should prioritize developing a generalized model pipeline that avoids post-processing based on subjective assumptions. Instead, it should be grounded in empirical data, ensuring it is adaptable and not constrained by rigid, predefined standards.

29 Jan 2025, 09:59
Upvotes 1
Krishna_Priya

My two cents!

This is a good question, but unfortunately it is something all platforms struggle with, and it is why competitions differ slightly from the real world, especially if the organizer has not split the data well.

Now let's take a few scenarios; tell me which ones you would consider illegal post-processing:

- Set the minimum prediction to 0 if the model gives you negative values.

- Round the predictions (using Python's round function, which uses a 0.5 threshold), since the cases should always be an integer.

- Round the predictions in whatever way gives you the best local CV, since the cases should always be an integer.

- Since we have almost no training data for cholera, and all the cholera cases we do have are 0, assume your model gives non-zero predictions (for test cholera cases the tree splits will be based mostly on location, so you will get roughly that location's mean). You now have two options for the final submission: either submit the cholera predictions the model produced, or overwrite the cholera predictions with zero (which is logical, but is post-processing).

- For diseases and locations that intersect between train and test, select a multiplier or transformation that gives you the best OOF score and apply the same multipliers to the test set.

Now, in all the above examples you can say that you used OOF to find the best post-processing number, when all you really did is leaderboard probing. How will you catch someone? :)

Basically, either be smart and split the data well, or stay happy with overfitted models, give people the confidence to name their teams "overfitting" and so on, and make them proud.
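The last scenario (tuning a multiplier on OOF predictions and reusing it on test) could look like the sketch below; the arrays are synthetic stand-ins with a deliberately biased "model", not the competition data:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.poisson(18, size=500).astype(float)       # true case counts
oof_pred = 0.7 * y_true + rng.normal(0, 2, size=500)   # biased OOF predictions

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Grid-search a single multiplicative coefficient on the OOF predictions.
multipliers = np.linspace(0.5, 2.0, 31)
errors = [rmse(y_true, m * oof_pred) for m in multipliers]
best_m = multipliers[int(np.argmin(errors))]
print(f"best multiplier: {best_m:.2f}")

# The same coefficient would then be applied to the test predictions,
# which only generalizes if train and test share the same bias.
```

On OOF data this always looks principled; whether it is genuine bias correction or leaderboard probing depends entirely on whether the bias persists in the test split, which is exactly the point being debated.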

29 Jan 2025, 19:53
Upvotes 2
Krishna_Priya

And since we are on this topic, I just spent 10 minutes of my life on post-processing; let me share the results.

There is a disease called Intestinal Worms. The mean cases per year has stayed around 18 in the training data, while my model's mean prediction for Intestinal Worms on the test set is 12. I simply multiplied my predictions by 1.2 based on the training data, and my leaderboard score improved from 5.89 to 5.87.

The question is: should the above post-processing be allowed? (It is based on EDA, I am using it to probe the LB, and it could go either way on the private LB.)
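For concreteness, the mean-matching adjustment described above could be sketched like this (synthetic numbers mirroring the post: training mean around 18, test-prediction mean around 12; none of this is the actual competition data):

```python
import numpy as np

rng = np.random.default_rng(1)
train_mean = 18.0                                        # mean cases/year in training
test_pred = rng.normal(12.0, 3.0, size=200).clip(min=0)  # model's test predictions

# Scale test predictions so their mean matches the training mean.
scale = train_mean / test_pred.mean()
adjusted = test_pred * scale
print(f"scale: {scale:.2f}, adjusted mean: {adjusted.mean():.2f}")
```

Note the post used a more conservative factor of 1.2 rather than the full ratio; either way, the gain (or loss) on the private leaderboard depends entirely on whether the train and test distributions actually match.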

29 Jan 2025, 21:03
Upvotes 0
Koleshjr
Multimedia University of Kenya

To be honest, these are the sort of post-processing techniques I tend to disagree with.

My question is: if such adjustments were valid, shouldn't they be learned by the model itself rather than applied post hoc?

My preference would be to discourage all coefficient-based post-processing, but that is just my take.

Krishna_Priya

Exactly, agreed. However, a counter-argument could be that we do not have homogeneous data in the training set (meaning the locations of Intestinal Worms in training are very different from those in test, so the model is destined to be biased), and being smart, I know this should not happen, so I correct for it. Honestly, this is a "problem formulation" problem and not a data-science problem. The goal could very well be a generalizable model, but in that case the training split had to be better (I could go into the details, but it would be an essay :) ).

Koleshjr
Multimedia University of Kenya

Well, I hope the organizers will give us their final say before the end of the competition :)

Krishna_Priya

One more scenario; I spent 5 more minutes on a Google search rather than ML.

https://en.wikipedia.org/wiki/2023%E2%80%942024_cholera_outbreak_in_South_Africa

Based on the article above, there was a cholera outbreak in 2023. Using this info, someone could take a bet on the private leaderboard by increasing the cholera predictions on the test set. Should this be legal?

29 Jan 2025, 21:10
Upvotes 0
Koleshjr
Multimedia University of Kenya

There is no justification for this even in EDA, to be fair. How would you post-process cholera cases upward with no DATA (i.e., the provided data) to support it? (We only have very few samples.) On the other hand, what about someone who sets all cholera cases to zero, with the justification from the DATA that there are not enough data points to predict for this class?

Krishna_Priya

exactly :)

marching_learning
Nostalgic Mathematics

I think they shouldn't have included cholera cases in the test set, since this is a massive domain shift that ML models cannot catch. I also saw a WHO report about cholera outbreaks in 2023. I think it is going to be a lottery for people who guess luckily on cholera 😂