Thanks for the platform, you guys are really doing a great job.
I have a few issues I would like the Zindi team to look into:
1. Data: As far as I know, fraud is a behavioural issue and has nothing to do with the transaction itself, so detecting these behaviours should be the goal. That raises the question of why the amount/value of the transaction are included in the data. I'm sure this was the reason behind choosing F1 score as the metric for this challenge, because with little or no feature engineering one can easily get 98+ accuracy and AUC score. If possible, Xente should provide an entirely different set of data without these two features. Disclaimer: I don't have any problem with it should you choose to continue as it is.
2. Account ID: I don't know if your team did this intentionally to cause a big change between the public/private test data, but from what I've seen, the account ID leaks into the test data. If it is intentional, OK; if not, it's better to look into this.
3. Timeline: The competition timeline is too long and makes the last few weeks uninteresting, which results in overfitting the leaderboard. Submissions dated as far back as 3-4 months before the end of the competition (sea turtle) still score well, so why wait? I'd advise that 2 months is enough to reach an optimal solution to the challenge.
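To make the metric point in item 1 concrete, here is a minimal sketch of why accuracy can look great on imbalanced fraud data while F1 does not. The labels are made up for illustration (1000 rows, 20 of them fraud); they are not the Xente data, and the helper names `accuracy` and `f1` are my own.

```python
def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction matches the label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1(y_true, y_pred):
    """F1 for the positive (fraud = 1) class; 0.0 when nothing is predicted or present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels: 980 legitimate transactions, 20 fraudulent ones.
y_true = [0] * 980 + [1] * 20

# A "model" that never flags fraud at all.
all_negative = [0] * 1000

print("accuracy:", accuracy(y_true, all_negative))  # 0.98
print("f1:", f1(y_true, all_negative))              # 0.0
```

A model that never flags anything still reaches 98% accuracy on these toy labels, yet its F1 is zero, which is presumably why F1 is the more honest metric here.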
1. Hello Holar, thanks for your insight on the competition. It's true that a baseline submission without any feature engineering or optimization can lead to a 0.6666+ F1 score. The magic and the leak are not only in AccountID but also in CustomerId. But if you look at the test data closely, there is a trap, and a shake-up due to overfitting is coming in this competition. Take a look at these customers: CustomerId_909, CustomerId_48781, CustomerId_19881, CustomerId_44531, CustomerId_23031, CustomerId_50541, CustomerId_22661, CustomerId_30751, CustomerId_51551, CustomerId_37681, CustomerId_15351, CustomerId_23531, CustomerId_856, CustomerId_42751, CustomerId_13021, CustomerId_865, CustomerId_74141, CustomerId_43911, CustomerId_18581, CustomerId_22161, CustomerId_73391, CustomerId_49251, CustomerId_24451, CustomerId_15671, CustomerId_11751, CustomerId_44541, CustomerId_41341, CustomerId_22141, CustomerId_11221, CustomerId_51051, CustomerId_698, CustomerId_21431, CustomerId_806, CustomerId_34671, CustomerId_51231, CustomerId_41281, CustomerId_25281, CustomerId_18911, CustomerId_22921, CustomerId_40751, CustomerId_16801, CustomerId_27031, CustomerId_39561, CustomerId_28771, CustomerId_19841, CustomerId_74011, CustomerId_74291, CustomerId_16531, CustomerId_16981, CustomerId_18501, CustomerId_26561, CustomerId_16021, CustomerId_74161, CustomerId_682.
2. I think part of the solution to the problem is good visualisation to develop an initial flag even before the deployed model, and that is probably the reason they included those details, but it should not have been that straightforward.
3. I agree that the timeline is always very long and makes the competitions uninteresting, but if you observe, what actually happens is that people submit baselines and keep quiet until the last few weeks of the competition.
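For anyone who wants to check the ID leak discussed in this thread, here is a rough sketch of how one might measure how many test rows reuse an ID already seen in train. The function name `id_overlap` and the toy IDs below are hypothetical; with the real files you would pass in the AccountId or CustomerId column from each split.

```python
def id_overlap(train_ids, test_ids):
    """Return (set of IDs shared by both splits, fraction of test rows using a train ID)."""
    seen = set(train_ids)
    hits = [i for i in test_ids if i in seen]
    fraction = len(hits) / len(test_ids) if test_ids else 0.0
    return set(hits), fraction


# Toy values for illustration only, not taken from the competition files.
train_ids = ["AccountId_1", "AccountId_2", "AccountId_3"]
test_ids = ["AccountId_2", "AccountId_2", "AccountId_9", "AccountId_3"]

shared, fraction = id_overlap(train_ids, test_ids)
print(sorted(shared))  # ['AccountId_2', 'AccountId_3']
print(fraction)        # 0.75
```

A high fraction means a model can lean on memorised per-ID behaviour, which may not hold for IDs that only appear in the private test split.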
Hello Holar, I believe you will do some feature engineering as you tweak your model parameters. It's your decision to remove amount or value or both as you see fit, but remember that it should not affect the prediction result whatsoever.
Thanks Mark. I will definitely take a look at them.
I also think the duration is always too long. The dataset is heavily imbalanced, so it is very delicate to generalize. Anyway, winning a competition sometimes requires a little bit of
I think this timeline is fine because it accommodates different schedules and availability patterns. It's likely that many people don't work on it throughout anyway.