External Data
Data · 22 Feb 2023, 18:30 · 5

Did I understand correctly during the webinar (https://www.youtube.com/watch?v=cc42eKQXySw&t=2076s&ab_channel=ZindiAfrica at around 34:04) that one can take advantage of external data or was this statement made solely regarding feature engineering [...]?

Discussion 5 answers

Given the rules, I would say no:

> You may use only the datasets provided for this competition. Automated machine learning tools such as automl are not permitted.

> You may use pretrained models as long as they are openly available to everyone.

However, I wonder if pretrained models are useful here...

24 Feb 2023, 08:28

Generally, I agree. The competition seems to have been set up originally so as not to allow external data. However, given the statement ~"this information can get you more data in order to do better predictions", I wonder if they changed their mind in this regard and just didn't update the competition texts. I think external data (obviously open/free) would make for a more interesting competition and better models (benefiting both sides), but it should be clearly communicated. Mentioning it as a 'side note' in an (optional) webinar only adds to my personal confusion. Maybe there will be some official feedback. Let's see.

I want to add some examples to clarify whether using additional free data is allowed or not.

Cloud removal is a critical part of preprocessing. This quantity is very dissimilar between the train/val/test datasets and the test (competition) dataset: a mean of about 2 and a variance of about 120 for train/val/test, versus (9.184, 635) for the competition dataset.
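A comparison like this could be reproduced with a minimal sketch along the following lines. The function names and the toy arrays are hypothetical, and how the cloud values are actually read from the competition files is not shown here; the sketch only assumes they are available as numeric arrays.

```python
import numpy as np


def mean_and_variance(values):
    """Return (mean, population variance) of the flattened values, ignoring NaNs."""
    arr = np.asarray(values, dtype=float).ravel()
    return float(np.nanmean(arr)), float(np.nanvar(arr))


def compare_splits(train_values, competition_values):
    """Summarize how far apart two splits are in mean and variance.

    Both arguments are assumed to hold the same per-pixel quantity
    (for example a cloud band) for the two splits being compared.
    """
    train_mean, train_var = mean_and_variance(train_values)
    comp_mean, comp_var = mean_and_variance(competition_values)
    return {
        "train": (train_mean, train_var),
        "competition": (comp_mean, comp_var),
        # Ratios are only meaningful when the denominators are non-zero.
        "mean_ratio": comp_mean / train_mean if train_mean else float("nan"),
        "variance_ratio": comp_var / train_var if train_var else float("nan"),
    }


if __name__ == "__main__":
    # Toy placeholder arrays, NOT competition data.
    toy_train = np.array([[1.0, 2.0], [3.0, np.nan]])
    toy_competition = np.array([[2.0, 4.0], [6.0, np.nan]])
    print(compare_splits(toy_train, toy_competition))
```

A large mean or variance ratio would be one simple sign that the two splits were not drawn from the same distribution.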

Are we allowed to use data provided by GEDI and Sentinel-2, if the copyright holder makes it free to use?

20 Mar 2023, 09:24

Generally, I think the splits weren't selected optimally. This becomes apparent when plotting the locations of the training data and the test (submission) data on a map - and even more striking when the respective surroundings/vegetation are compared. Whether external data is allowed or not, this will result in a winning model that is most likely 'adjusted' to fit the out-of-distribution test data but potentially won't generalize very well.
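Besides plotting, the geographic gap could be quantified with a rough sketch like the one below, which computes for each test location the great-circle distance to the nearest training location. The function names and the example coordinates are hypothetical, and it assumes that latitude/longitude pairs in degrees are available for both sets.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_train_distance_km(test_points, train_points):
    """For each (lat, lon) test point, return the distance to the closest train point."""
    return [
        min(haversine_km(t_lat, t_lon, lat, lon) for lat, lon in train_points)
        for t_lat, t_lon in test_points
    ]


if __name__ == "__main__":
    # Made-up coordinates for illustration only, NOT the competition locations.
    train = [(0.0, 0.0), (0.0, 1.0)]
    test = [(0.0, 0.5), (5.0, 5.0)]
    print(nearest_train_distance_km(test, train))
```

If most test locations lie far from every training location, that would support the impression that the test data is out-of-distribution in a geographic sense.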

Here is a discussion created by the participant ranked 8th on the leaderboard.

https://zindi.africa/competitions/africa-biomass-challenge/discussions/16335

I haven't tried that on my model yet. But if that is the case, the AGBD has only a very loose relation to the data provided by Zindi.