Farm Pin Crop Detection Challenge
$11,000 USD
Classify fields in South Africa by crop type using Sentinel-2 satellite imagery
4 March–15 September 2019 23:59
747 data scientists enrolled, 42 on the leaderboard
2nd place solution
published 30 Sep 2019, 13:05
edited ~1 hour later

Congratulations to the winners and thanks to everybody for participating in this very interesting competition!

Below I explain my solution. Unfortunately, according to the rules the winners can't share code. However, I'm happy to answer any questions.

The solution consists of four base models and one second-layer stacking model. I use an identical 5-fold data split across all models.

Data preprocessing

To train the models we first need to prepare the data: crop it, create numpy arrays, normalize, and save. Normalization is a critical step for training CNN models. The imagery was z-normalized per channel, i.e. for each channel the mean was subtracted and the result divided by the standard deviation.
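A minimal sketch of the per-channel z-normalization described above, assuming the imagery is stacked as a (channels, timestamps, height, width) numpy array; the function name and epsilon guard are my own additions:

```python
import numpy as np

def z_normalize(stack):
    """stack: (C, T, H, W) imagery. Zero-mean, unit-std per channel."""
    mean = stack.mean(axis=(1, 2, 3), keepdims=True)
    std = stack.std(axis=(1, 2, 3), keepdims=True) + 1e-8  # guard against flat channels
    return (stack - mean) / std

x = np.random.rand(10, 11, 64, 64) * 1000.0  # e.g. raw reflectance values
xn = z_normalize(x)
```

Statistics would typically be computed on the training split only and reused at inference time.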

Base models

All base models predict crop probabilities per pixel. More specifically, I split the imagery into samples, each representing a pixel and its neighbourhood, and use these as features. I then take the mean of the pixel predictions to obtain field-level predictions and use those at the second level. All base models use only the 10 channels with 10 m and 20 m resolution: ['B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B11', 'B12']
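The two steps above — cutting a per-pixel neighbourhood sample and averaging pixel predictions into a field prediction — can be sketched as follows; the edge-replication padding and the 5×5 default are my assumptions:

```python
import numpy as np

def pixel_patch(img, i, j, size=5):
    """img: (C, T, H, W). Return the size x size neighbourhood centred on
    pixel (i, j); edges are padded by replication so border pixels still
    yield a full patch."""
    r = size // 2
    padded = np.pad(img, ((0, 0), (0, 0), (r, r), (r, r)), mode="edge")
    return padded[:, :, i:i + size, j:j + size]

def field_probs(pixel_probs):
    """Field-level prediction = mean of the per-pixel class probabilities."""
    return np.mean(pixel_probs, axis=0)

img = np.random.rand(10, 11, 32, 32)
patch = pixel_patch(img, 0, 31)       # a corner pixel still gives a full patch
probs = field_probs(np.random.rand(40, 9))  # 40 pixels in a field, 9 classes
```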

All base models use all 11 timestamps of the data.


Two of the base models are 3D-CNN models with slightly different architectures; these are the best performers. As input I use per-pixel data, so each pixel is classified independently. The input is a 4D array (CH, T, H, W):

  • CH - number of channels (10)
  • T - number of timestamps (11)
  • H, W - patch height and width in pixels (5 × 5)

I apply convolutions across both the spatial and temporal dimensions.


Two of the base models are random forests. These models have a different bias, so although they perform much worse than the CNNs, they improve the final ensemble result by ~1%. The input is the flattened imagery for the pixel being classified and its 8 neighbouring pixels across all timestamps: 1188 features (11 timestamps × 12 channels × 9 pixels). The 12 channels are the 10 original 10 m/20 m channels plus NDVI and NDWI. One of the models uses only 8 classes (I drop class 2 because it is underrepresented) and weights the training data by `1 / sqrt(number of samples)` per class.
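A sketch of building the 1188-dimensional feature vector and the per-class weights. The band indexing follows the channel list given earlier (B04 = red, B08 = NIR, B11 = SWIR); the NDWI variant used here is the NIR/SWIR form, which is my assumption — the post does not say which NDWI definition was used:

```python
import numpy as np

def rf_features(cube):
    """cube: (T=11, C=10, 3, 3) reflectance patch centred on the target pixel.
    Appends NDVI and NDWI as extra channels, then flattens to
    11 * 12 * 9 = 1188 features."""
    eps = 1e-6
    red, nir, swir = cube[:, 2], cube[:, 6], cube[:, 8]   # B04, B08, B11
    ndvi = (nir - red) / (nir + red + eps)
    ndwi = (nir - swir) / (nir + swir + eps)              # Gao (NIR/SWIR) variant
    full = np.concatenate([cube, ndvi[:, None], ndwi[:, None]], axis=1)
    return full.reshape(-1)

feat = rf_features(np.random.rand(11, 10, 3, 3))

# Class balancing as described: weight each sample by 1 / sqrt(class count).
y = np.arange(500) % 9                 # dummy integer labels
counts = np.bincount(y, minlength=9)
sample_w = 1.0 / np.sqrt(counts[y])
```

`sample_w` could be passed to scikit-learn's `RandomForestClassifier.fit` via its `sample_weight` argument.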

2nd layer model

A LightGBM model classifies fields based on the first-level models' predictions and additional field-level features. The other field-level features are:

  • Area (1 feature)
  • Subregion (1 feature)
  • Latitude and longitude of the field centroids (2 features)
  • Number of nearest neighbour fields (max 8) per class, weighted by a distance coefficient (100.0 / dist) ** k, where dist is the distance between the centroid of the field in question and the centroids of the neighbouring fields, and k is a manually chosen coefficient (9 features)
  • Area of intersection between the field, padded with a buffer buff, and neighbouring fields, per neighbour class (9 features); buff is a manually chosen coefficient
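The per-class neighbour feature can be sketched as below, assuming centroids are given as 2D coordinates; the function name and the example values of k and max_n are illustrative:

```python
import numpy as np

def neighbour_class_feats(centroid, neigh_xy, neigh_cls,
                          n_classes=9, k=2.0, max_n=8):
    """For each class, sum the distance coefficient (100.0 / dist) ** k over
    the up to max_n nearest neighbouring fields; k is tuned by hand."""
    d = np.linalg.norm(neigh_xy - centroid, axis=1)
    feats = np.zeros(n_classes)
    for i in np.argsort(d)[:max_n]:       # up to 8 closest neighbours
        feats[neigh_cls[i]] += (100.0 / d[i]) ** k
    return feats

rng = np.random.default_rng(0)
feats = neighbour_class_feats(np.zeros(2),
                              rng.uniform(10, 200, (20, 2)),  # neighbour centroids
                              rng.integers(0, 9, 20))         # neighbour classes
```

The buffered-intersection features would be computed analogously with a geometry library such as shapely (`field.buffer(buff).intersection(neighbour).area`), summed per neighbour class.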

Hardware used

  • Intel Core i7 - 6850K
  • 32GB RAM
  • 11GB GTX1080 Ti


Training one CNN or RF model takes about 10 minutes per fold. Basically, I used a tiny CNN that is very easy to train.

Please ask if you have any questions.

All I can say: Impressive!!

Great, wonderful work!

Congratulations and thanks for the write up!

If I understand correctly your 3D convolutional models mapped from a rank 4 input with size 10 x 11 x 5 x 5 to a rank 1 output with size 10, each number being the probability of each of the classes?

Can you give any more detail about the sizes of your convolutions?

Thank you!

> 3D convolutional models mapped from a rank 4 input with size 10 x 11 x 5 x 5 to a rank 1 output with size 10

Yes, almost like this, but the output size is 9.

> Can you give any more detail about the sizes of your convolutions?

I hope I don't violate any rules :) I use 3–4 layers (blocks) of convolutions with small filters: 2–3 in the spatial dimensions, 3–4 in the temporal dimension. The number of filters grows from 10 to 128.

Thanks! Very interesting, I will definitely give this a go if I deal with temporal image data again.

Here are the papers that inspired my approach. Each of them presents a different approach, and I combined them.

Well done on the challenge! Could you recommend tools (preferably in Python) for extracting the image patches? I am new to satellite imagery, and the training farm images were very tiny for me, which is why I didn't even try CNNs. I followed:

I also used rasterio as in the discussion you mentioned for this part of the task.

I have posted my code for extracting bands and cropping them to fields on GitHub, in case anyone finds it useful: