Guys, we should share our learnings, if not via exact code then via discussion at least, which will help the new people learn.
For me, right from the start, since this was a skewed target, the Tweedie objective performed the best.
There was a grey area in the creation of the target, as its definition was not clear. For the whole competition, I took the sum of all the cases grouped by ID.
However, on the last day, when I experimented with MAX instead of SUM, it gave me a boost from public 5.89 to 5.7, keeping everything else the same.
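A minimal sketch of what that aggregation choice looks like (toy data; the column names `ID` and `Total` are assumptions, the real schema may differ):

```python
import pandas as pd

# Toy frame with duplicate rows per ID -- column names are assumed,
# the real competition schema may differ.
df = pd.DataFrame({
    "ID": ["loc_1", "loc_1", "loc_2", "loc_2", "loc_2"],
    "Total": [1, 4, 2, 2, 5],
})

# SUM vs MAX over the duplicates produce very different targets:
target_sum = df.groupby("ID")["Total"].sum()  # loc_1 -> 5, loc_2 -> 9
target_max = df.groupby("ID")["Total"].max()  # loc_1 -> 4, loc_2 -> 5
```

Same rows, same grouping key, but the resulting target (and therefore the model) changes completely.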
None of the feature engineering techniques worked for me.
My final set of features were just the ones provided in the base data. The additional data did not help either.
What was interesting was that I thought cholera would lead to a shakeup, but it did not.
At the end I had selected two models:
1. Single model (10-fold stratified LGBM, stratified on month_year): Public 5.71, Private 6.78
2. The second one was a GroupKFold on disease, with disease removed as a feature, used only to replace the cholera predictions of the model above. Interestingly, the mean cholera prediction from the first model was around ~1, while from this model it was around ~9. However, the scores of this model were Public 5.71 and Private 6.79.
One more interesting case: I had gotten a better CV and public score by removing the latitude and longitude features, but since the test set had that extra location, I did not select it as a final submission; I thought removing them could lead to worse predictions for that new (categorical) location in the private set. Surprisingly, that did not happen (maybe those samples were only in public, not in private).
That submission has Public 5.70 and Private 6.70 (which would have placed me 10th).
Besides max, on the last day I tried all the other aggregations (mean, median, etc.), and none of them benefited me.
No other model apart from LGBM came even close for me, so I did not bother to ensemble.
"My final set of features were just the ones provided in the base data. The additional data did not help either."
You mean just disease, lat, lon, facility, and location?
Code:
https://colab.research.google.com/gist/krishnapriya-18/d6b9eb6cc952884baa64b4afff44e7f7/welcome-to-colaboratory.ipynb?authuser=1
The above code is for 16th place; if you just remove 'Transformed_Latitude' and 'Transformed_Longitude', you get 10th place.
Thanks @Krishna_Priya. It's okay to share your code here on the discussion.
Use Google Colab and save your notebook as a GitHub gist. Then in the discussions do this:
::GIST::/[your username]/[gist id] where the text after ::GIST:: comes from the gist's unique URL :)
Other than that, it surprises me that such a simple implementation could get top 20. Did you try a time series approach? That is what I did. I grouped by mean too (max also worked fairly well for me). Since in the test set each lat-lon pair has 84 rows (12 months * 7 diseases), I decided to transform my training dataset to reflect that too, so 12 months * 4 years * 7 diseases = 336 rows for each combination; the missing diseases were filled with zero, since I assumed they are non-existent. With this you get a good temporal dataset and you can do time series feature engineering. It seemed like a good idea in theory but never worked magic for me, though maybe someone else made it work.
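The reshaping described above can be sketched with a pandas MultiIndex: build the full (location x disease x year x month) grid and fill the missing combinations with zero. Toy data and column names are assumptions:

```python
import pandas as pd

# Hypothetical toy frame; the real data has lat/lon, disease, year, month.
df = pd.DataFrame({
    "location": ["A", "A"],
    "disease":  ["Malaria", "Cholera"],
    "year":     [2019, 2019],
    "month":    [1, 2],
    "cases":    [3.0, 5.0],
})

# Build the full grid so every combination has 12 rows per disease per year;
# gaps are filled with 0 (assuming missing = no reported cases).
full_index = pd.MultiIndex.from_product(
    [df["location"].unique(), df["disease"].unique(), [2019], range(1, 13)],
    names=["location", "disease", "year", "month"],
)
dense = (
    df.set_index(["location", "disease", "year", "month"])["cases"]
      .reindex(full_index, fill_value=0.0)
      .reset_index()
)
# dense now has 1 location * 2 diseases * 1 year * 12 months = 24 rows
```

With the grid dense and regular, lag and rolling features become well-defined per series.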
Thanks again for sharing your work
Thanks. Created a link.
I tried exactly what you did, with classical time series models as well as deep-learning-based time series models, but LGBM always dominated everything.
Also, in LGBM, things which did not work were:
- resampling and target imputation (by 0, mean, median, forward fill)
- resampling and any time series based feature engineering (lag etc)
- any target encoding
- other loss functions, classic or custom
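For reference, the lag features mentioned above would look something like this; the key detail is shifting within each (location, disease) group so one series never leaks into the next. Column names are assumptions:

```python
import pandas as pd

# Toy monthly series per (location, disease); the real data has more columns.
df = pd.DataFrame({
    "location": ["A"] * 4 + ["B"] * 4,
    "disease": ["Malaria"] * 8,
    "cases": [1, 2, 3, 4, 10, 20, 30, 40],
})

# Shift within each (location, disease) group -- a plain df["cases"].shift()
# would leak the end of series A into the start of series B.
for lag in (1, 2):
    df[f"cases_lag_{lag}"] = df.groupby(["location", "disease"])["cases"].shift(lag)
```

(In this competition these features reportedly did not help, but the grouped shift is still the standard way to build them.)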
Oh great, and do you have any theories or assumptions as to why the time series approach did not work? Intuitively it should work no?
Yes. When lags did not work, I looked at the autocorrelation and partial autocorrelation, and they were very poor. Basically there is no yearly seasonality, and I did not want to experiment with lags below 12, as that could go either way on the private LB. That is why the boosting algos were just trying to predict the median of the splits to reduce MAE, nothing else. Year-to-year cases are very different for each disease; I did not see a trend as such. But I believe @Yisakberhanu must have something, because there is a big delta from his model.
But the bigger question is: why were there duplicates in the training data? What was the actual aggregation to be used? We modelled on a target whose definition is still not clear :)
True. And what aggregation technique did they then use for the test data?
Yeah waiting for Yisak to spill the beans :)
I also did that. Correlating CV but no magic😅
This is so amazing,
I didn't even realize that the Category_Health_Facility_UUID column was actually a category with 4 classes. I was too focused on the add-on data and distance.
For the final model I combined SVR and TabPFN.
Thank you for sharing your solution. It was much simpler than I would have imagined. TIL about the tweedie objective function. Thank you!
Thanks for sharing. I just learned about the Tweedie objective function. Tabular is still mostly a mystery for me.
Thank you for sharing, this is something to
@Krishna_Priya, that's a good implementation. I think it will also be fair to share my solution. My best private score is 6.6190, so let me try and share what I did. I used CatBoost and LightGBM. Mine wasn't any fancy trick at all: no StratifiedKFold, no GroupKFold, just a train/test split and training on all features. That is the surprising part. And by the way, post-processing will make you lose the competition big time. For the score of 5.375 public and 6.74 private, I used an ensemble of both CatBoost and LightGBM, but in different ways. Later, I multiplied the typhoid cases from the ensemble model by 0.4. You get a serious boost on the public LB but a drastic drop on the private board, from 6.6190 to 6.74.
Generally, this was my trick:
I trained CatBoost models and LightGBM models on different features. Let me explain this better.
So I trained a CatBoost model on some features (I dropped Month, Category_Health_UUID, location, and a few categorical columns from the additional datasets). This gave me 5.82. I did the same for LightGBM (5.84), and combining them gave me 5.75. I used the default train/test split.
The second set of models I trained also had different features (only from the main dataset, with a cap on outliers). Only LightGBM excelled here (5.85), so I only used LightGBM. Before that, here I used the mean aggregation, which worked great. I trained on all the data.
You have to convert that new location in the test set to a closer known location. That gave a boost too.
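Mapping the unseen test location to its nearest known location might look like this sketch; the coordinates and location names are made up, and plain Euclidean distance on lat/lon is used where haversine would be more rigorous:

```python
import math

# Hypothetical coordinates; the real competition data has its own locations.
train_locs = {"loc_a": (-6.8, 39.3), "loc_b": (-6.2, 35.7)}
unseen = (-6.7, 39.1)  # the new location that only appears in the test set

def nearest_location(point, known):
    """Map an unseen location to the closest known one
    (Euclidean on lat/lon; fine at this scale)."""
    return min(known, key=lambda k: math.hypot(point[0] - known[k][0],
                                               point[1] - known[k][1]))

nearest_location(unseen, train_locs)  # -> 'loc_a'
```

The model then sees a categorical value it was trained on, instead of an unknown level.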
Now this is the funny part XD,
I ensembled these until I got a public score of 5.57 and a private score of 6.6190. So generally, post-processing wasn't friendly in this competition at all. And I'm sorry for @marching_learning.
Anyway, the best aggregation is max. But for the first set of models I decided to take the first instance and drop the rest, which worked better than any aggregation.
Another lesson I have learnt from this competition is not to probe the public LB. You will lose it all 😭.
lol. Nice post. Thanks for sharing @CodeJoe :)
@CodeJoe, in the training data there were many duplicate rows (about 10k); did you remove the duplicate rows?
Yes. I only took the first instance for the first set of models. But for the second set, I aggregated with the mean.
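The two de-duplication strategies mentioned here are one-liners in pandas (toy data, assumed column names):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["x", "x", "y"],
    "Total": [2, 6, 3],
})

# Strategy 1: keep only the first duplicate row per ID
first = df.drop_duplicates(subset="ID", keep="first")

# Strategy 2: aggregate duplicates with the mean
mean_agg = df.groupby("ID", as_index=False)["Total"].mean()
```

Note how the two strategies disagree on ID "x" (2 vs 4.0), so the choice is a real modelling decision, not a cleanup detail.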
😂😭
@machine_learning I am very curious as to how you got the 3.97, if you are willing to share.
Thanks for sharing your solution @Krishna_Priya, and congratulations to everyone.
I still wonder how people approached the aggregation problem and what's the right way to decide it. Since we have a consistent 3 duplicates for "Diarrhea" and a consistent 4 duplicates for "Malaria" in almost every (location, year, month) combination, this should be expected in the test set too, right?
And this is the most confusing part: how the data is handled on the test side, whether it's aggregated or not, and if aggregated, which type is used. We can't decide that from just the CV score, and the public LB score was unreliable (for me) in this decision. So, any tricks for approaching this problem?
Tweedie loss also gave me a better CV but a worse LB. I wondered why, then realized that it didn't work well for time-series CV; MAE was better.
yeah. And thank you for sharing your solution :)