🏦 Let's Talk About: (lots of) missing data

Ferra Solutions

(lots of) missing data

Data · 5 Nov 2022, 07:09 · 25

@zindi

Seems there is lots of missing data ... big problem.

If there are no target events for a given user in a given timeslot, then it would make sense to assume that no target events occurred and to model accordingly.

It seems though that this does not hold - no target events for a user in a timeslot is perhaps due to the way in which events are captured or perhaps the dataset is not complete but has a lot of holes in it.

Zindi (amy!) - can you perhaps check with ABSA and comment on how complete or how tatty the data is.

TIA

Discussion 25 answers

cobusburger

there appears to no transactions for a time period within 2021-03-16. is this what you are referring to? i agree, ignoring it would lead one to underestimation the target, but there are ways to control for it. my strategy would be to just treat that period of that day as NA. I dont think this is a big issue.

5 Nov 2022, 08:38

Upvotes 0

replied to cobusburger5 Nov 2022, 09:31

Ferra Solutions

Thanks!

If the data is good and there is no observation for userid x in time slot y then the reasonable assumption to make is that x did not transact during that slot. However, if this does not hold, the assumption that has to give is that the data is good.

I just want to hear ABSAs comment. If it is something like "The data is, in fact, good" then we have big trouble as it seems some disconnect between train and test. If, on the other hand, it is something like "Yip, the data is a bit ragged", then at least the disconnect makes a bit more sense.

Anyhow, early days, so perhaps the fault is in my analysis, perhaps, once I have a better grip on this, it turns out everything makes sense. But at the moment it seems like a real waste of time to assume the data is highly accurate, so I'd love some comment from ABSA, perhaps even just a description of the data collection process can be really helpful.

@cobusburger any comment on this?

(EDIT specifically, do you think the gap you mention can explain all this?)

(EDIT2 then ABSA's response would be something like "Data's good, there was this one time when due to a glitch we got a gap in the data ...")

Upvotes 0

You seem to be the most active @skaak.

IMO the reason the dataset is the way it is, is answered in the "Data" page of the competiton. Which clearly states

"This challenge is also unique in that the data is provided in the same way it would be encoded for machine learning - testing participants' data analysis and reasoning skills from the outset, having to discover the logical relationships before building models.".

When we start getting a better understanding of the dataset and how to engineer more features, scores on the LB might grow exponentially from the "coin flip" scores that seem to be reigning at the top... I feel like I am getting a better grasp of everything. I'll test my hypothesis by making a submission(Hopefully by sunday) and seeing if my hunch was correct and I'll let you know

5 Nov 2022, 10:19

Upvotes 0

replied to wuuthraad5 Nov 2022, 10:27

Ferra Solutions

Thanks @wuuthraad

Yes, I have a lot of subs. Some of them are sort of baseline, frequency based ones. Even they score terribly, so I am wondering what beast we are dealing with here. But hey, this is zindi, we get 10 subs a day and 300 before the comp ends, which we won't exhaust given the current timeline, so I am not apologising for lots of subs. Also, either you do this for fun and use the LB to guide you along, or you do this for blood and build this incredible CV strategy that sucks the life out of the whole comp.

That and bad data, that can suck the life out of this :-)

Waiting for your sub and comment expectantly! This is not really my cup of tea, but what else to do at the moment. Either zindi or gardening ...

Upvotes 1

replied to skaak5 Nov 2022, 11:42

Ferra Solutions

You know what, now that I've had time to think about this a bit, here's the correct response. Yes, I worked on this a bit early, but the data is tricky. You'll see. You get events and have to predict for a bucket that is a week ahead in time, or so I think. So I have basically been struggling with data manipulation thus far.

At some stage I fit a vanilla model and used the LB as a type of data check. By now I have collected quite a few 0 entries and have debugged my data reading code quite a bit. I think the code is ok by now, but I am starting to suspect the data. (EDIT heck, if I had more subs I'd know by now if it is the data or not).

He he, when this is over, I'll set up some session for us to discuss this comp. Just ~20 days and it is over. Would be nice to discuss how you approached some of these challenges.

Upvotes 0

replied to wuuthraad10 Nov 2022, 19:59

Any luck with beating the coinflip scores @wuuthraad?

Upvotes 0

replied to Yudheezus11 Nov 2022, 15:15

hahaha dude! way to put me on the spot. Trust me If I cracked it, it'll show.

Upvotes 0

replied to wuuthraad12 Nov 2022, 09:20

Ferra Solutions

Yip I remember from fossil ... personally, I'm more of a release early and release often kind of guy ...

Getting bored of this one, don't have the hardware to do it justice ...

Upvotes 0

replied to wuuthraad12 Nov 2022, 09:27

Ferra Solutions

dragon - you have a GPU?

Upvotes 0

replied to skaak12 Nov 2022, 14:39

Yeah I have a GPU on my Desktop.

Personally I think then overthink the features I can create hence why I mainly make little subs, I am trying to kick that habit

@skaak I remember you did well on the fossil challenge, why are you getting bored of this competition?

Upvotes 0

replied to wuuthraad12 Nov 2022, 15:43

Ferra Solutions

Ai tog - this one ...

You don't know it but dragon you opened my eyes for what technique to use. One I've always wanted to learn and apply, but with little experience in it, I need to experiment a lot. Technically it is a steep challenge, but nice also, but to fit a model takes ~ 24h. I'm just waiting for results all the time.

In fossil the models were easier to train, so I could fit a few simultaneously and also spread it to my laptop, but this one is too big for that. I think GPU would help a lot, if my hunch on the technique is correct, GPU is almost a requirement.

Upvotes 0

replied to skaak12 Nov 2022, 18:42

If you feel a GPU will be ther difference then why not use Google Colab? or Kaggle to train you model?

(EDIT:) Like I said in a previous competition, You're my Obi-wan @skaak you've helped me a great deal

Upvotes 0

replied to wuuthraad28 Nov 2022, 15:48

Ferra Solutions

Well, look at the LB ... anyhow, it does not reflect the whole story, what a comp ...

Your comment from earlier opened my eyes!

I subbed a few using defaults e.g. GBM and RF. Later I realised, these were just a comedy of errors. I predicted for the wrong date, I mixed up the actual target event and my (OrdinalTransformed) encoded one, and a few more ... perhaps you should be in the market for a new obi-wannabe.

I then tried a few frequencies and got best result just simulating number of times a given user performed target event in a given time slot. Score was ~35, best I ever got. But your comment was revelatory, to me it said, this comp is a seq2seq thingy. Thank you friend!!! I will take that advice!!!

seq2seq - something I always wanted to do, and now was my chance. But it is much easier said than done, especially if you have less than a month. Anyhow, you never subbed, but was this (seq2seq) your thinking?

Too little time, I never really had a good model, but now I feel like I have sort of mastered seq2seq. I even added attention layers at the end. My score never really out the blocks, hovering around 25 at the best of times, but more than worth it given learning seq2seq.

Have you any seq2seq experience? Seems quite complex to set up and understand ...

Upvotes 1

replied to skaak29 Nov 2022, 18:25

what comment?

man... Why I did not make a sub is factors beyond me. Comps like this really interest me because they require a ton of thinking,preprocessing and general craziness... but then again I was busy with some uni applications(Long story) so time was against me had a ton to do with little time. I have little to no experience with seq2seq. Like most things so long as it is a challenge, I shall attempt (well not in this case). I'll try dust off the code I was writting, complete it and see how it would have performed(*maybe)

@skaak Are you going to try the classification comp? I see it is live

Upvotes 0

replied to wuuthraad30 Nov 2022, 07:01

Ferra Solutions

uni!!!!!! that must be a long story ... hope it succeeds, but skill is skill, probably more so if you pick it up on the street ... uni can be quite academic, but at least employers understand it.

FIFA keeping me busy at the moment, I'm not competing and feeling guilty about it. There is this gravitational wave comp on ... uhm ... giggle. It is spacey stuff I luv, so I was thinking to do that one just for fun. It is holidays and all, so no huge appetite for a challenge atm ...

So you off to uni next year?

@cobusburger - you did well here, care to share how you approached this? Regression with latent variables?

Upvotes 0

replied to skaak30 Nov 2022, 07:51

world cup fever! I had faith in Die Mannschaft to win the cup this time but they are dissapointing.

There's a good chance I am going back to uni next year... Even though I already have the skills in ML. Reason being bureaucrats, they love degrees. Ecspecially in SA which is weird because a good chunk of companies abroad accept "relevant work expierice" in the field as an alternative to a degree. Maybe it's a sign that this dragon must fly off into the wide blue yonder

Well nothing is set in stone yet but I have been juggling a lot in this final stretch of the year.

Tell me how the competition on the blue site "giggle" goes, looks intersting.

Upvotes 0

replied to wuuthraad30 Nov 2022, 08:48

Ferra Solutions

Dragon! You must stay! SA needs you. Uni will bore you, people will try to structure stuff you already know to give it an appearance of higher knowlege. A nice project somewhere will be much better if you can find it. Otherwise zindi or giggle or ...

Anyhow ... space comp is 200+G download ... still on the fence to do it or not ...

@Yudheezus - you rock this LB man ... congrats ... wow #1 ... how?

Upvotes 0

replied to skaak30 Nov 2022, 09:14

hahahahahaha way to be a patriot @skaak... Like I said earlier, nothing is set in stone yet. A lot can happen in a couple of months.

For the gravity comp can't you load and manipulate the dataset with the notebooks on giggle? They give around 30 hrs of free GPU and 15 of TPU if I am not mistaken way better than colab IMO. way simpler than trying to do everything locally. Also read up on rapids if you haven't already for large dataset manipulation.

@Yudheezus seemed to have luck on his side... too bad @DanielBruintjies did not get #1 seemed he was doing the most.

Upvotes 0

replied to wuuthraad30 Nov 2022, 09:22

Lmao are you serious???🤣🤣🤣 participating in these forums is like watching clowns fumbling in a circus. I don't think I was lucky, I just submitted my optimal solution 19 days ago. Guess it was a good decision not to overfit the model. 🥱

Upvotes 0

replied to Yudheezus30 Nov 2022, 09:32

Hahahahaha... Whatever you did worked.

Upvotes 0

replied to Yudheezus30 Nov 2022, 09:38

Ferra Solutions

You see?! Yud's got a degree, now we are clowns ...

@Yudheezus - no you weren't lucky, so we can't help but wonder, what was that model you did not overfit? Forgive me if I stumble, really just wondering ... was it transformer? Did you model each time slot so had four models, or just one model?

Upvotes 0

@skaak it has got nothing to do with having a degree or not. I just read the information provided, assessed the problem and acted within the confines of the competition. And I used a rules based approach, nothing .

30 Nov 2022, 10:06

Upvotes 0

replied to Yudheezus30 Nov 2022, 11:10

Ferra Solutions

Yud - apologies, that comment of mine is a bit harsh ... but, you know what, I did the same as you, and still fumbling (so you are right anyhow!) ... not sure what I am missing (and you obviously getting) ... anyhow, again - congrats. You make it sound easy, believe me, it was not.

Rules based, so lots of if then statements? I'm just guessing, no real rule based system experience.

I was hoping seq2seq would pick up those rules, if they were there, but the time dimension made it difficult. Perhaps I am overthinking it (dragon normally does) but it seems you have simple no-nonsense approach that in the end is most accurate.

Upvotes 0

replied to Yudheezus30 Nov 2022, 12:37

Hey there @skaak apology accepted. I must also apologize, I didn't mean to be so reactive in my own comments and I certainly didn't meant to come off the wrong way.

My solution was a lean 20 liner that I will be happy to share once the auditing process is completed.

The competition was not easy I think I just stumbled into an optimal approach early on and that helped alot with validating where and how continue next.

I figured there would be some data completeness and structuring issues and thus I just did the most practical approach with all the information that was made available. It paid off in the end I guess. But I wish everyone of the best. The income prediction problem looks really cool, I think I am going to hit that off next when I get some time.

Upvotes 2