Congratulations to the winners and thanks to everybody for participating in this very interesting competition!
Below I explain my solution. According to the rules the winners can't share code, unfortunately. However, I gladly answer any questions.
The solution consists of four base models and one 2nd layer stacking model. I use identical 5-fold data split across all models.
To train the models we first need to prepare the data: crop it, create numpy arrays, normalize, and save. Normalization is a critical step for training CNN models. The imagery data was z-normalized per channel, i.e. for each channel the mean was subtracted and the result divided by the standard deviation.
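The per-channel z-normalization described above can be sketched in NumPy like this (array shapes and variable names are illustrative, not from the winning code):

```python
import numpy as np

# Hedged sketch: per-channel z-normalization of an imagery stack.
# `stack` has shape (C, T, H, W) -- channels, timestamps, height, width.
rng = np.random.default_rng(0)
stack = rng.uniform(0, 4000, size=(10, 11, 32, 32)).astype(np.float64)

# Compute mean and standard deviation over everything except the channel
# axis, then subtract and divide per channel.
mean = stack.mean(axis=(1, 2, 3), keepdims=True)  # shape (10, 1, 1, 1)
std = stack.std(axis=(1, 2, 3), keepdims=True)
normalized = (stack - mean) / std
```

After this step each channel has zero mean and unit variance, which keeps all bands on a comparable scale for the CNN.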
All base models predict crop probabilities per pixel. More specifically, I split the imagery data into samples, each representing a pixel and its neighbourhood, and use these as features. I then take the mean of the pixel predictions to compute field-level predictions and use those on the second level. All base models use only the 10 channels with 10 m and 20 m resolution: ['B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B11', 'B12']
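The pixel-to-field aggregation step can be sketched as follows (`pixel_probs`, `field_ids` and the class count are illustrative names and values, not from the winning code):

```python
import numpy as np

# Hedged sketch: averaging per-pixel class probabilities into field-level
# predictions. Each pixel carries one probability row; pixels belonging to
# the same field are averaged.
rng = np.random.default_rng(1)
n_pixels, n_classes = 100, 9
pixel_probs = rng.dirichlet(np.ones(n_classes), size=n_pixels)  # one row per pixel
field_ids = rng.integers(0, 7, size=n_pixels)                   # field each pixel belongs to

# Mean of pixel predictions per field -> field-level probabilities.
field_probs = np.stack(
    [pixel_probs[field_ids == f].mean(axis=0) for f in np.unique(field_ids)]
)
```

These field-level probability vectors are what the second-level model consumes.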
All base models use all 11 timestamps of the data.
Two of the base models are 3D-CNN models with slightly different architectures; these are the best performers. As input to these models I use per-pixel data, so I classify each pixel. The input is a 4D array: (CH, T, H, W).
I apply convolutions across both the spatial and temporal dimensions.
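A single valid 3D convolution over such a (CH, T, H, W) sample can be written out in plain NumPy to show how the filter slides jointly over time and space (the real models presumably used a deep-learning framework; shapes here are illustrative):

```python
import numpy as np

def conv3d_valid(x, w):
    """Valid 3D convolution of x (C, T, H, W) with one filter w (C, kt, kh, kw).

    Slides the filter jointly over the temporal and both spatial axes,
    summing over channels, and returns a (T-kt+1, H-kh+1, W-kw+1) map.
    """
    C, T, H, W = x.shape
    _, kt, kh, kw = w.shape
    out = np.empty((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(x[:, t:t+kt, i:i+kh, j:j+kw] * w)
    return out

# One pixel sample: 10 channels, 11 timestamps, 5x5 spatial neighbourhood
# (the 5x5 neighbourhood size is an assumption based on the thread below).
rng = np.random.default_rng(2)
sample = rng.normal(size=(10, 11, 5, 5))
filt = rng.normal(size=(10, 3, 2, 2))  # kernel: 3 in time, 2x2 in space
fmap = conv3d_valid(sample, filt)      # shape (9, 4, 4)
```

A framework layer such as a 3D convolution with kernel (3, 2, 2) performs exactly this computation, once per output filter.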
Two of the base models are Random Forest models. These have a different bias, so although they perform much worse than the CNNs, they improve the final ensemble result by ~1%. The input for these models is flattened imagery data from the pixel being classified and its 8 neighbouring pixels across all timestamps - 1188 features (11 timestamps * 12 channels * 9 pixels). The 12 channels are the 10 initial 10 m/20 m channels plus NDVI and NDWI. One of the models uses only 8 classes (I drop class 2 because it is underrepresented) and balances the training data as `1 / sqrt(number of samples)`.
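The feature layout for the Random Forest can be sketched like this (band indices for NDVI/NDWI follow the band order listed above; the class counts at the end are made-up numbers to illustrate the weighting):

```python
import numpy as np

# Hedged sketch: flatten a pixel plus its 8 neighbours across all
# timestamps and 12 channels (10 bands + NDVI + NDWI) into
# 11 * 12 * 9 = 1188 features.
rng = np.random.default_rng(3)
bands = rng.uniform(0.0, 1.0, size=(10, 11, 3, 3))  # (bands, T, 3x3 neighbourhood)

# Band order assumed: B02 B03 B04 B05 B06 B07 B08 B8A B11 B12,
# so index 1 = B03 (green), 2 = B04 (red), 6 = B08 (NIR).
nir, red, green = bands[6], bands[2], bands[1]
ndvi = (nir - red) / (nir + red + 1e-9)
ndwi = (green - nir) / (green + nir + 1e-9)

stacked = np.concatenate([bands, ndvi[None], ndwi[None]], axis=0)  # (12, 11, 3, 3)
features = stacked.reshape(-1)  # 1188-dimensional feature vector

# Class balancing as described: weight each class by 1 / sqrt(n_samples),
# so rarer classes get relatively larger weights.
class_counts = np.array([500, 120, 80, 300, 60, 40, 900, 30])
class_weights = 1.0 / np.sqrt(class_counts)
```

These feature vectors would then be fed to a standard Random Forest classifier with per-sample weights derived from `class_weights`.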
2nd layer model
A LightGBM model is used to classify fields based on the 1st-level models' predictions and other field-level features. Other field-level features include:
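Since the same 5-fold split is used across all base models, the second-level training matrix can be assembled from out-of-fold predictions. A hedged NumPy sketch (model count, class count, and the placeholder extra features are illustrative):

```python
import numpy as np

# Hedged sketch: assembling the 2nd-level feature matrix from the four
# base models' out-of-fold field-level predictions. With an identical
# 5-fold split, every field gets exactly one out-of-fold prediction per
# model; concatenating them gives n_models * n_classes stacking features,
# to which other field-level features are appended.
rng = np.random.default_rng(4)
n_fields, n_classes, n_models = 200, 9, 4

oof_preds = [rng.dirichlet(np.ones(n_classes), size=n_fields) for _ in range(n_models)]
extra = rng.normal(size=(n_fields, 3))     # placeholder field-level features

X_level2 = np.hstack(oof_preds + [extra])  # (200, 4*9 + 3) = (200, 39)
# A LightGBM classifier (lightgbm.LGBMClassifier) would then be fit on
# X_level2 against the field labels.
```

Using out-of-fold predictions keeps the second-level model from training on leaked first-level outputs.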
Training one CNN or RF model takes about 10 minutes per fold. Basically I used a tiny CNN which is very easy to train.
Please ask if you have any questions.
All I can say: Impressive!!
Great, wonderful work!
Congratulations and thanks for the write up!
If I understand correctly your 3D convolutional models mapped from a rank 4 input with size 10 x 11 x 5 x 5 to a rank 1 output with size 10, each number being the probability of one of the classes?
Can you give any more detail about the sizes of your convolutions?
> 3D convolutional models mapped from a rank 4 input with size 10 x 11 x 5 x 5 to a rank 1 output with size 10
Yes, almost like this, but the output size is 9.
> Can you give any more detail about the sizes of your convolutions?
I hope I don't violate any rules :). I use 3-4 layers (blocks) of convolutions with small filters: 2 or 3 in the spatial dimensions and 3 or 4 in the temporal dimension. The number of filters grows from 10 to 128.
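With kernels in those ranges, a little shape arithmetic shows how a (T, H, W) = (11, 5, 5) sample shrinks through the blocks. The specific per-block kernel choices and channel counts below are assumptions within the ranges stated above, not the author's exact architecture:

```python
# Hedged sketch: tracking how a (T, H, W) = (11, 5, 5) input shrinks
# through valid 3D convolutions with kernels of size 3-4 temporally
# and 2 spatially; exact per-block choices are assumptions.
def valid_shape(shape, kernel):
    return tuple(s - k + 1 for s, k in zip(shape, kernel))

shape = (11, 5, 5)                                       # (temporal, spatial, spatial)
kernels = [(4, 2, 2), (4, 2, 2), (3, 2, 2), (3, 2, 2)]   # 4 blocks
channels = [10, 32, 64, 128]                             # filters growing 10 -> 128

for k in kernels:
    shape = valid_shape(shape, k)
# shape is now (1, 1, 1): the receptive field covers the whole sample,
# so a final dense layer over the 128 filters can emit 9 class scores.
```

This illustrates why such small kernels suffice: four valid convolutions already consume the entire 11-timestamp, 5x5 neighbourhood.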
Thanks! Very interesting, I will definitely give this a go if I deal with temporal image data again.
Well done on the challenge! Could you recommend tools (preferably in Python) to extract the image patches? I am new to satellite images and the training farm images were very tiny for me, which is why I did not even try CNNs. I followed: https://zindi.africa/competitions/farm-pin-crop-detection-challenge/discussions/201
I also used rasterio, as in the discussion you mentioned, for this part of the task.
The same - rasterio
I have posted my code for extracting bands and cropping them to fields on GitHub, in case anyone finds it useful: https://gist.github.com/akatasonov/cb682ff5a064e7b3cbd4223c8fbcaeeb
Here's my GitHub repo for this project https://github.com/simongrest/farm-pin-crop-detection-challenge