Hello everyone. I'm quite new here on Zindi, and I really enjoyed taking part in the crop detection challenge. However, now that the competition is closed, I have one doubt concerning the input data and their distribution. With my numerical models, I have seen that results on internal validation varied a lot between random sampling and stratified sampling. This is consistent with the input dataset, where the classes are clearly not balanced. My doubt is the following: according to the participants or to the Zindi organizers, are the classes in the validation dataset distributed as in the training dataset? What I have seen is that using random sampling for my internal validation I got results around 1.25-1.35, whereas using stratified sampling I always go below 1 no matter what model I use, random forest or deep learning.
Can someone help me understand? Thank you, and congratulations to you all!
My local validation score is very close to the leaderboard score. I used stratified sampling on the fields, not on individual pixels, so as not to oversample fields with a lot of pixels.
Ok, I noticed the difference both when dealing with pixels and when dealing with fields... maybe it also depends on other settings of the model...
Hi KarimAmer, could you please clarify stratifying on pixels vs. fields? I am a bit lost. My view is that per field there are multiple rows showing readings for each location. It's clear if you group the location readings per field into one row and then stratify by the target variable, "label". Or do you mean you were stratifying by Field_id?
Hi DrFad, I group the pixels of each field into one row, so that I have only 3,286 training samples, then apply stratified sampling on that data.
Thanks for the clarification, KarimAmer. So what's the difference between pixel and field stratification?
Since the competition metric is cross entropy, the results are heavily affected by the class distribution. So if the training distribution is different from the test distribution, test results will be worse than validation results.
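To see why cross entropy is so sensitive to a distribution shift, here is a toy sketch with made-up numbers (plain Python, not real competition data): a binary model calibrated to a 90/10 training distribution is scored on a matching validation set and on a shifted 50/50 test set.

```python
import math

def log_loss(y_true, y_pred_probs):
    """Mean cross entropy for hard labels and predicted class probabilities."""
    return -sum(math.log(p[y]) for y, p in zip(y_true, y_pred_probs)) / len(y_true)

# A model calibrated to a training set that is 90% class 0
# always predicts these probabilities: [P(class 0), P(class 1)].
pred = [0.9, 0.1]

# Validation set drawn with the same 90/10 distribution:
val_labels = [0] * 9 + [1] * 1
val_loss = log_loss(val_labels, [pred] * len(val_labels))

# Test set drawn with a shifted 50/50 distribution:
test_labels = [0] * 5 + [1] * 5
test_loss = log_loss(test_labels, [pred] * len(test_labels))

print(round(val_loss, 3), round(test_loss, 3))  # test loss is far higher
```

Same model, same predictions; only the label distribution changed, and the cross entropy roughly quadruples.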
Applying that to our case, it appears that the class distributions in training and testing are very close, and that the split was done on fields, not on pixels. So applying pixel stratification oversamples some fields, which changes the class distribution in training and leads to bad test results.
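To make the field-level split concrete, here is a minimal sketch using only the standard library (the field IDs, labels, and split fraction are made up for illustration; this is not my actual competition code):

```python
import random
from collections import defaultdict

def stratified_field_split(field_labels, val_frac=0.2, seed=0):
    """Split field IDs into train/val, stratified by each field's label.

    field_labels: dict mapping field_id -> crop label (one label per field,
    i.e. after aggregating all of a field's pixels into a single row).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for fid, label in field_labels.items():
        by_label[label].append(fid)

    train_ids, val_ids = [], []
    for label, fids in by_label.items():
        rng.shuffle(fids)
        n_val = max(1, int(len(fids) * val_frac))  # keep every class in val
        val_ids.extend(fids[:n_val])
        train_ids.extend(fids[n_val:])
    return train_ids, val_ids

# Toy data: 10 fields with unbalanced labels.
fields = {f"f{i}": ("maize" if i < 7 else "wheat") for i in range(10)}
train_ids, val_ids = stratified_field_split(fields)
# All pixels of a field then go to the same side of the split,
# so no field is oversampled relative to the test distribution.
```

Since the split is decided per field ID, a field with thousands of pixels counts exactly once when stratifying, which is the whole point of field-level stratification.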
Here is another example of a competition where the training and test class distributions differ, and how fixing that can improve the results: https://www.kaggle.com/c/quora-question-pairs/discussion/31179
I think I understand what you are trying to say. Pixel stratification is stratifying labels before aggregation. Field stratification is stratifying labels after aggregation, i.e. after aggregating to 3,286 training samples. Is this correct?
Yes
Thank you Karim, you clarified the problem I was having. Now I see there is no longer a difference between my local results and the results on the leaderboard. Can I ask what approach you used to tackle the problem? I have only recently started studying this problem, and I'm trying to understand which techniques are the most promising for getting good final results.
You are welcome @ESA-Philab. I used deep learning only with various augmentations. I will be happy to share my approach after the code review stage.
I would be very glad to look at your approach! Thank you for the answers you gave me, I really appreciate it.
Happy to help
Karim, first of all, congrats on your first-place finish! I would like to ask if you can share some details of your implementation so I can study your approach, thank you :)
Thanks a lot for your kind words.
I used a 3-layer conv net (shared across time steps) followed by a 3-layer bi-directional GRU. The input to the network is a crop around the field's center pixel. Extensive augmentations were applied, including spatial augmentation, mix-up, and time augmentation (randomly dropping one time sample). I will let you know when the implementation is uploaded to GitHub.
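For anyone who wants a picture of that architecture before the code is released, here is a rough PyTorch sketch of the shape of such a model. All layer sizes, channel counts, the pooling, and the classification head are my own guesses for illustration, not the actual winning implementation:

```python
import torch
import torch.nn as nn

class CropClassifier(nn.Module):
    """Sketch: a small conv net applied to each time step's image crop
    (weights shared across time), then a 3-layer bi-directional GRU
    over the per-timestep features. Sizes are illustrative guesses."""

    def __init__(self, in_channels=13, n_classes=7, hidden=64):
        super().__init__()
        # 3-layer conv net, shared across time steps.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (B*T, hidden, 1, 1)
        )
        # 3-layer bi-directional GRU over the time dimension.
        self.gru = nn.GRU(hidden, hidden, num_layers=3,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, time, channels, height, width)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).flatten(1)  # (b*t, hidden)
        feats = feats.view(b, t, -1)                  # (b, t, hidden)
        out, _ = self.gru(feats)                      # (b, t, 2*hidden)
        return self.head(out[:, -1])                  # (b, n_classes)

# 2 fields, 6 time steps, 13 spectral bands, 7x7 crop around the center pixel.
x = torch.randn(2, 6, 13, 7, 7)
logits = CropClassifier()(x)
print(logits.shape)  # torch.Size([2, 7])
```

The key idea is that the same 2D conv weights extract features from every time step, and the recurrent layers then model how those features evolve over the growing season.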
Let me know if you have any further questions.