Well, it seems like this competition will be won by post-processing. I just need clarification from @Zindi @Amy_Bray @ZINDI
based on this:
"Zindi is committed to providing solutions of value to our clients and partners. To this end, we reserve the right to disqualify your submission on the grounds of usability or value. This includes but is not limited to the use of data leaks or any other practices that we deem to compromise the inherent value of your solution."
Do post-processing solutions adhere to this rule? This is the same situation as the Africa Credit Transaction Challenge, where the top solutions were propelled by post-processing. Clarification will help us choose the right submissions, because we might choose post-processing solutions and get deranked later on. Thank you
Hmmm, setting thresholds is discouraged, as we as data scientists can't have the final say on something a doctor might know better than us. It is best to leave your predictions as they are, and perhaps write a paragraph on what you would do further if you were to round.
This means medical professionals in communication with the researchers who created the dataset can set their own thresholds and add what is important to them.
"It is best to leave your predictions as is"
okay thanks for the clarifications
Just for confirmation: is post-processing with values not produced by the model discouraged or prohibited? I mean, is it merely bad practice, or is it not accepted and grounds for disqualification from a prize?
I don't agree. For imbalanced problems it is good to set a threshold, because the natural 0.5 threshold for binary problems is often not appropriate. Still, a threshold should not be chosen without data evidence. There are proper ways to fine-tune thresholds, well known to Kagglers and others. The real issue here is the use of data leakage, not thresholds as such.
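As a sketch of what data-driven threshold tuning looks like: the labels and scores below are toy placeholders, and F1 is just an assumed metric; real tuning would sweep thresholds over genuine out-of-fold predictions rather than synthetic ones.

```python
import numpy as np

# Toy "OOF" setup: an imbalanced binary target (~10% positives) and
# hypothetical model scores that are higher, on average, for positives.
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.1).astype(int)
proba = np.clip(0.4 * y_true + rng.normal(0.2, 0.15, 1000), 0, 1)

def f1(y, y_hat):
    tp = np.sum((y_hat == 1) & (y == 1))
    fp = np.sum((y_hat == 1) & (y == 0))
    fn = np.sum((y_hat == 0) & (y == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Sweep candidate thresholds on the OOF predictions instead of assuming 0.5.
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1(y_true, (proba >= t).astype(int)) for t in thresholds]
best_t = float(thresholds[int(np.argmax(scores))])
print(f"best threshold on OOF data: {best_t:.2f}")
```

The point is that the choice is backed by the out-of-fold data, not by probing the leaderboard.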
Use of Data leakage. hmmm. interesting :)
Post-processing is not a kind of black box. It is generally guided by good EDA or systematic debugging of model errors. Maybe other people have another view on it.
Using African Credit Scoring as an example, how does setting all Ghanaian records to 1 benefit the organizer or the client, as one of the discussions suggested? Does this imply that when expanding into a new market, they would simply assume all records to be 1?
While I agree that in cases of unbalanced problems, we can adjust the default threshold of 0.5, I tend to disagree with applying this approach in this particular scenario.
These are my thoughts.
I see the Ghana example of setting all clients to 1 is not a good example. I saw that discussion, and I know the author shared misleading information. That's what I said: this was a choice not backed by data. Again, this choice has no actionable value.
So, if I understand you correctly, your argument is that as long as post-processing is guided by data, can be clearly explained, and is generalizable to real-world scenarios, it should be allowed. If that is the case, I agree with you.
However, shady post-processing techniques should not be allowed, as they often rely on luck rather than sound methodology.
@Koleshjr, are you saying we should not set threshold in the competition because it is a form of postprocessing?
Yes @Koleshjr. You summarised my thoughts
@CodeJoe
That's not a simple yes-or-no question. The answer depends on how the threshold is determined.
✅ Allowed if: the threshold is data-driven (e.g. tuned on cross-validation) and generalizes to real-world scenarios.
❌ Disallowed if: it is picked through leaderboard probing or luck, with no supporting evidence from the data.
Anyone using thresholds to win should provide a sound explanation for their choice, ensuring it is data-driven and generalizable. Additionally, the client must agree with the approach. This ensures that only well-justified post-processing techniques are allowed, maintaining fairness and real-world applicability.
Alright. Thanks for the heads up.
But remember, that's just my personal opinion, not Zindi's stance. Maybe they will discourage all threshold-based solutions, so the best thing is to wait for the final say from @Zindi @ZINDI @Amy_Bray.
I believe you did something which we can't see, I am really impressed!
You also did something we can not see😂😂. We are really impressed.
I am from earth but I am not sure about someone.
🤣🤣🤣
It is almost over Yisak 😂
Let me pray 🙏 for upcoming disaster.
May the shake up be with us 🙏
🤣🤣🤣🤣
The current approach tends to deter others, as it relies heavily on intuition rather than data-driven methodologies. As @koleshjr pointed out, we should prioritize developing a generalized model pipeline that avoids post-processing based on subjective assumptions. Instead, it should be grounded in empirical data, ensuring it is adaptable and not constrained by rigid, predefined standards.
My two cents!
This is a good question, but unfortunately it is something all platforms struggle with, and that is why competitions are slightly different from the real world, especially if the organizer has not split the data well.
Now let's take a few scenarios. Tell me which ones you would consider illegal post-processing:
- set the minimum prediction to 0 if the model is giving you negative values.
- round the predictions (using Python's built-in round, which rounds halves to the nearest even integer) since case counts should always be integers.
- round the predictions in whatever way gives you the best local CV, since case counts should always be integers.
- since we have almost no training data for cholera, and all the cholera data we do have is 0, assume you have non-zero predictions from your model (expected, because for the test cholera cases the tree splits will be based mostly on location, so you will get roughly that location's mean). Now you have two options for the final submission: either submit the cholera predictions the model provided, or overwrite the cholera predictions with zero (which is logical, but is post-processing).
- for diseases and locations that appear in both train and test, you select a multiplier or transformation that gives you the best OOF score and apply the same multipliers to the test set.
Now in all the above examples you can say you used OOF to find the best post-processing number, when all you actually did was leaderboard probing. How will you catch someone? :)
Basically, either be smart and split the data well, or stay happy with overfitted models, give people the confidence to label the winning teams as "overfitting" and the like, and make them proud.
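To make the first, second, and cholera scenarios above concrete, here is a minimal numpy sketch; the disease names and prediction values are toy placeholders, not actual model outputs:

```python
import numpy as np

# Hypothetical raw outputs for four test rows (toy values only).
preds = np.array([-0.4, 2.6, 17.3, 0.8])
disease = np.array(["Malaria", "Typhoid", "Intestinal Worms", "Cholera"])

# Scenario 1: floor at zero -- case counts cannot be negative.
preds = np.clip(preds, 0, None)

# Scenario 2: round to integers, since reported cases are whole numbers.
# (np.rint rounds halves to even, like Python 3's built-in round.)
preds = np.rint(preds)

# Cholera scenario: overwrite with 0 because the training data for this
# class contains only zeros.
preds[disease == "Cholera"] = 0
print(preds)  # [ 0.  3. 17.  0.]
```

Each line is mechanically trivial; the whole debate is about which of these steps is defensible from the data alone.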
and now, since we are on this topic, I just spent 10 minutes of my life on post-processing, so let me share the results.
There is a disease called Intestinal Worms. The mean number of cases per year has stayed around 18 in the training data. On the test set, my model's mean prediction for Intestinal Worms cases is 12. I simply multiplied my predictions by 1.2, based on the training data, and my leaderboard score dropped from 5.89 to 5.87.
The question is: should the above post-processing be allowed? (It is based on EDA, and I am using it to probe the LB; it could go either way on the private LB.)
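A minimal sketch of the mean-matching adjustment described above; only the means of 18 and 12 and the 1.2 factor come from the post, and the individual test predictions are toy numbers:

```python
import numpy as np

train_mean = 18.0                                  # historical mean from the post
test_preds = np.array([10.0, 14.0, 12.0, 12.0])    # toy predictions, mean 12

# A fully mean-matching factor would be train_mean / test_preds.mean() = 1.5;
# the post instead uses a gentler hand-picked 1.2.
scale = 1.2
adjusted = test_preds * scale
print(adjusted.mean())  # ~14.4, closer to the historical 18 but still below it
```

Note that nothing in this snippet tells you whether 1.2 generalizes; that is exactly why it doubles as leaderboard probing.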
To be honest, these are the sorts of post-processing techniques I tend to disagree with.
My question is: if such adjustments were valid, shouldn't they be learned by the model itself rather than applied post hoc?
My preference would be to discourage all coefficient-based post-processing, but that is just my take.
Exactly, agreed. However, a counter-argument could be that we do not have homogeneous data in the training set (meaning the locations for Intestinal Worms in training are very different from those in test, so the model is destined to be biased), and, being smart, I know this should not happen, so I correct for it. Honestly, this is a problem-formulation problem, not a DS problem. The issue could very well be that I want a generalizable model, but in that case the training split had to be better (which I could go into the details of, but it would be an essay :) )
Well, I hope the organizers will give us their final word before the end of the competition :)
One more scenario; I spent 5 more minutes on a Google search rather than ML.
https://en.wikipedia.org/wiki/2023%E2%80%942024_cholera_outbreak_in_South_Africa
based on the article above there was a cholera outbreak in 2023, using this info, someone could take a bet for private leaderboard by increasing the cholera predictions on test set. Should this be legal?
There is no justification for this even in the EDA, to be fair. How would you post-process cholera cases upward with no (provided) data to support it? We only have very few samples. On the other hand, what about someone who sets all cholera cases to zero, with the justification from the data that there are not enough data points to predict this class?
exactly :)
I think they shouldn't have added cholera cases to the test set, since this is a massive domain shift that ML models cannot catch. I also saw a WHO report about cholera outbreaks in 2023. I think it is going to be a lottery for people who guess luckily on cholera 😂