This was a great competition, thanks @Zindi and @PredictiveInsights. It has reminded everyone (some more than others) of the importance of cross-validation and overfitting. Personally, I used CV for certain models while training, but nearing the end I decided to choose predictions where I did not have any CV. It was a gamble... that gamble did not pay off :-/ . Nevertheless it was a great learning experience (as always). My solution (best scoring on the private LB) was an ensemble of two models, LightGBM and CatBoost, with a 0.65 weight on LightGBM and 0.35 on CatBoost, which gave me a decent score on the private LB (0.860637559), but alas, my good people, I did not select it as one of my scores.
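For anyone curious what that kind of blend looks like in code, here is a minimal sketch. The probability arrays are made-up placeholders; in the real pipeline they would come from the fitted LightGBM and CatBoost models' `predict_proba` outputs.

```python
import numpy as np

# Hypothetical predicted probabilities from the two models
# (stand-ins for lgb_model.predict_proba(...)[:, 1] and
# cat_model.predict_proba(...)[:, 1]).
lgbm_preds = np.array([0.10, 0.80, 0.55, 0.30])
catboost_preds = np.array([0.20, 0.70, 0.60, 0.40])

# The 0.65 / 0.35 weighted blend described above.
blend = 0.65 * lgbm_preds + 0.35 * catboost_preds
print(blend)
```

The weights are just a convex combination, so the blended values stay valid probabilities.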
My FE (feature engineering) was basic (little to none)... I focused primarily on the model, and yes, I know great features > great model (every time).
Thank you for the insight
Please can you share your solution... I want to learn.
This was a great contest (good for learning, as always). I personally did the same: few features, more emphasis on the models. I had a mismatch between my CV and LB and I knew it would haunt me at the end, but I took the risk.
Hi - thanks for sharing. Sorry about not selecting the best one, but don't worry, the fact that you implemented a good model is a much weightier matter.
Yes, CV is so tricky and the answer is not entirely linear... Here my best public LB would have given me the best on the private LB as well (I could have walked away with 0.863!), but I really struggled here and later on relied only on my own local CV. The results were too unstable and in the end I only trusted my own CV. I attributed the instability to AUC, so for local CV I used RMSE instead. Tbh I don't care too much about the result, the learning is more important, so I'm satisfied with the way this turned out.
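To illustrate swapping AUC for RMSE as a local CV metric: a sketch of scoring a validation fold by the RMSE between labels and predicted probabilities (essentially the square root of the Brier score). The `rmse` helper and the toy fold values are mine, not from the original solution.

```python
import numpy as np

def rmse(y_true, y_prob):
    """RMSE between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return np.sqrt(np.mean((y_true - y_prob) ** 2))

# Toy validation fold: labels and hypothetical predicted probabilities.
y_val = [0, 1, 1, 0]
p_val = [0.2, 0.9, 0.6, 0.1]
print(rmse(y_val, p_val))
```

Unlike AUC, RMSE is sensitive to the calibration of the probabilities rather than only to their ranking, which can make fold-to-fold scores steadier.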
I sort of gravitate towards this approach nowadays: ignore the public LB and just use local CV. That, or the other extreme, which is no local CV and just use the public LB. In a few competitions, after some kind of breakthrough, the public LB and local CV would start to align, so I take that as a great indicator that I am on the right track and making progress with the model, but it is also a sparse enough event that I don't make it a requirement every time.
There is a famous pianist - Lang Lang, like #2 in the world. I once saw him in an interview, responding to the question "What is the best way to become a good pianist" with "Anything goes". His argument was that he travelled the world, and saw so many really good pianists, all with different backgrounds and training methods, that it does not really make such a big difference. Sure, you need to master some disciplines, such as CV I suppose, but in the end you can do well with almost any approach.
Absolutely the learning is the most rewarding part. The knowledge you gain is yours and yours alone.
But always do share said knowledge
Yes ... fwiw this one, I think, was meant for GBMs. I tried a few other models and also some imputations, but a GBM that can handle missing values always did better, so in the end I tried FE and also feature interactions, but always using a NaN-capable GBM. I'd be interested if somebody was able to dent this using either imputations or something other than a GBM.
I was actually experimenting with TFDF (TensorFlow Decision Forests), which is basically decision trees running on TensorFlow, to see if it might dethrone LightGBM and XGBoost. It did fairly decently, getting a score of 0.84 on my private LB.
Nice - saw those but never used them. How did you treat NaNs? Impute? Or do the TFDF models allow them? I tried some variable selection networks and a few other NN approaches and some showed promise, but getting a good score seemed like a lot of work, so I stayed with the GBMs.
The problem here with trees, I think, is that there is a nice interaction between some vars and the trees won't pick it up easily...
While I didn't make significant progress on the leaderboard (LB), I did make some intriguing observations during my exploration of various algorithms such as LightGBM, DecisionTrees, XGBoost, CATBoost, and more. Progressing on the LB seemed elusive, almost as if there were unseen challenges lurking within the dataset.
My approach involved both Cross-Validation (CV) and Stratified CV, revealing a consistent pattern: one fold consistently performed well, another consistently performed poorly, while the remaining three hovered somewhere in between. Delving deeper, I conducted correlation and heatmap analyses on both the training and test sets. These analyses highlighted a strong correlation between the 'Round' column and the target variable. Surprisingly, nearly 90% of the other columns had minimal to negligible influence on the target.
Motivated by this insight, I decided to focus my attention on the 'Round' column. A quick examination revealed that it consisted of just four unique values: 0, 1, 2, and 3. To harness this information, I conducted a group-by analysis on the 'Round' column, applying a RandomForestClassifier to each group. Remarkably, I achieved an impressive 99% accuracy on Round 2, possibly indicative of overfitting. Rounds 0 and 3 presented more challenges, with Round 0 having only 27 rows. However, by employing the Catboost Classifier, I managed to improve its performance from 0.62 to 0.78.
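The group-by-and-fit-per-group idea can be sketched as follows. The DataFrame here is synthetic (a 'Round' column with values 0-3 plus one feature), and the accuracies are in-sample, which is exactly where the overfitting caveat above applies.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Round": rng.integers(0, 4, size=400),  # four unique values: 0, 1, 2, 3
    "feat": rng.normal(size=400),
})
df["target"] = ((df["feat"] + df["Round"] * 0.5) > 1).astype(int)

# Fit one RandomForestClassifier per Round group, as described above.
per_round_acc = {}
for rnd, grp in df.groupby("Round"):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(grp[["feat"]], grp["target"])
    per_round_acc[rnd] = clf.score(grp[["feat"]], grp["target"])

print(per_round_acc)  # in-sample accuracy per Round; near-perfect scores
                      # here are a warning sign, not a result
```

For an honest per-group estimate you would cross-validate within each group instead of scoring on the training rows, which is harder for tiny groups like the 27-row Round 0.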
In parallel, I delved into research on enhancing Round 3 and explored various aspects of feature engineering, optimization, building hyperparameter wrappers, PCA, and dimensionality reduction. My endeavors became more intricate as I juggled these tasks alongside participation in the CPI/inflation competition. Although I may not have achieved remarkable results, the learning experience was invaluable, and I thoroughly enjoyed the competition.
Throughout this journey, I grappled with several key questions:
1. How can strong predictors be constructed from initially weak ones?
2. I also wrestled with the challenge of generating meaningful columns from data sources like Labour Force Participation Rate (LFPR), Absorption Rate (AR), and Unemployment Rate (UR) obtained from the Quarterly Labour Force Survey.
3. I encountered valuable advice from data science experts - the importance of developing DOMAIN KNOWLEDGE beyond the dataset to elevate LB performance. This insight was particularly enlightening.
My research and participation in this competition provided me with a wealth of knowledge and insights. I eagerly look forward to reviewing the strategies and notebooks of the competition winners, as I believe this will further enrich my learning experience.
@Jaw22!!!!
Wow, impressive. No, revelatory!!!! I initially had Round as categorical but it did not make a dent, so later on I treated it as numerical. I think I got better results taking logs, but I'm not sure if I retained that.
Wow! So your strata in Stratified CV were the rounds? Anyhow, I want to sit next to you on the bus... in between you and @wuuthraad, of course. I'll just listen... wow, that was like a flash of lightning on a clear day. I wonder if the top guys were able to exploit that a bit in their models.
@Jaw22 Great stuff... I used groupby on Round to see how long the average tenure was from province to province, job to job, geography to geography... I did not really use it outside of that; I just OHE'd after gaining some "meaningful" insights. One important thing I've noticed about "Domain Knowledge" is that you do not need to be an expert in the field to gain meaningful insights from the data. Robust EDA can give you pretty solid information.
Domain experts seem to ask the right questions but overall tend to have little to poor performance across the lifecycle of ML models. It's what I personally have remarked; do take it with a pinch of salt. Obviously different people have different experiences, but that has been mine.
@skaak as for the bus, you my good sir are the one I would not mind listening to. I'd just be a wallflower, soaking in your Rays of Knowledge.
Hmmm ... better be careful, I'm a mere mortal ... those rays emanating from me may not be knowledge ...
Thanks for all the discussion and the feedback on your modelling approaches. This is super-interesting and we're glad that you found the competition useful.
It was our first time hosting a Zindi competition and we learnt a lot too.
Thanks for all your time and efforts.
Thanks @neilr - you know, I've done quite a few competitions. You learn a lot, and it always feels like more dialogue with the host can be mutually beneficial. Some comments here contain really useful stuff. Here is another one: I tried very hard to create nice feature interactions or segmentations for some of the categories, e.g. male and female. That one in particular, since it is binary, I tried very hard to use to gain some edge. In the end, it yielded very little benefit, so, fwiw, it seems employment is mostly gender-blind.
Also, in terms of the status, it seems those who are 'other' or 'self-employed' have a real edge, while those who are 'studying' or on an 'employment programme' don't have any.
This is particularly interesting, since gender is (in the general population) such an important correlate of employment outcomes.
I was surprised too... I think what we see as gender I may have captured in other features, e.g. in the academics. I even tried to split that along gender lines, using e.g. verbal and non-verbal features and interacting them with gender and others. Oh well... the outcome is not entirely orthogonal to gender, but it is far, far less aligned than I expected and has little interaction with other features.
@Jaw22 - something tells me you may have put this under the microscope. Do you have any comments on this one?
Hi @Skaak, on my correlation heatmap, the female col had a 0.10 (10%) correlation with the target. I did not explore it further for re-engineering, etc. I had my eye on the tenure col, which had a much stronger correlation at 0.29. With hindsight, I figure that was the col that would bring home the bacon, but time was not on my side. Apologies for the late reply, I am running around a bit, pivoting for sustainability and livelihood... lol.
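For readers following along, the heatmap-style check described in this thread is just each column's Pearson correlation with the target. A sketch on toy data (the columns and values here are illustrative stand-ins, not the competition figures):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 300
# Toy stand-ins for the columns discussed above.
tenure = rng.normal(size=n)
female = rng.integers(0, 2, size=n)
target = (0.3 * tenure + rng.normal(size=n) > 0).astype(int)

df = pd.DataFrame({"female": female, "tenure": tenure, "target": target})

# Pearson correlation of each feature with the target.
corr = df.corr()["target"].drop("target")
print(corr.round(2))
```

Sorting that series by absolute value is a quick first pass at ranking candidate features, though low linear correlation does not rule out a feature mattering through interactions.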