Apologies for the earlier mix-up
My roommate logged into Zindi on my computer and didn't know he hadn't logged out before making the post.
Models: Ensemble of three models: CatBoost, LightGBM, and XGBoost. I averaged the predicted probabilities of the three models (simple averaging).
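For anyone unsure what simple averaging means here, below is a minimal sketch. It assumes each model returns one probability per sample for the same class. The function name and the probability values are made up for illustration and are not taken from the actual solution.

```python
def average_probabilities(*model_probs):
    """Average per-sample probabilities across models with equal weights."""
    if not model_probs:
        raise ValueError("need predictions from at least one model")
    n_samples = len(model_probs[0])
    if any(len(p) != n_samples for p in model_probs):
        raise ValueError("all models must score the same samples")
    # zip(*...) groups the predictions of all models for each sample.
    return [sum(sample) / len(model_probs) for sample in zip(*model_probs)]


# Hypothetical predictions for three samples (illustrative values only).
catboost_p = [0.90, 0.20, 0.60]
lightgbm_p = [0.80, 0.30, 0.50]
xgboost_p = [0.70, 0.10, 0.70]

ensemble_p = average_probabilities(catboost_p, lightgbm_p, xgboost_p)
```

Equal weights keep the blend free of extra tuning; a weighted average would be a different technique and is not what the post describes.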
Datasets used (4-year span)
- Sentinel 1 and 2
- Climate data (temperature, evapotranspiration, land surface temperature, precipitation, etc.)
- Slope and elevation from digital elevation models
Data Preprocessing
I downloaded Sentinel 1 and 2 data for the locations in the shapefiles. I had noticed that the locations in the train and test data were very far apart.
Locations in Orenburg didn't have Sentinel-1 data from January 2023 to date, so I had to download 4 years of data (Jan 2018 to Dec 2022).
Feature Engineering
I created summary statistics of these datasets for each location ID. Features included:
- Aggregated Sentinel 1 and 2 data for test data into monthly data
- Generated vegetative, bare soil and soil moisture indices from Sentinel 2
- Polarisation ratios for the polarisation channels (VV and VH) in Sentinel 1.
- Water stress index from evapotranspiration datasets
Next, I aggregated these features for each ID. The aggregates include:
- Annual min, mean, max and standard deviation
- Because these datasets are cyclical in nature, I created harmonic terms, amplitude and phase, to represent the strength of their annual seasonality and the time of year when they reach their maximum
- Rate of change per month (how fast each feature changes from one month to the next) and acceleration (how fast that rate of change itself changes). I also included percentage changes
- Rolling statistics: rolling mean, sum and standard deviation over 6-month windows.
- Location-based information: distance to the nearest site, plus the average, standard deviation, and number of sites within a 10 km radius
- Grid-based aggregation: I clustered the locations in each region (separately) into four groups and assigned them to grid cells 50 km on a side. For each grid cell, I computed summary statistics (min, max, mean, standard deviation) of the S1, S2 and climate features over the IDs located in it.
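To make the harmonic and rate-of-change features concrete, here is a rough sketch. It assumes an evenly spaced monthly series covering whole years, and it estimates the first annual harmonic by projecting onto a cosine and a sine, which may differ from the exact fitting method used. The function names and the synthetic series are hypothetical.

```python
import math


def harmonic_terms(monthly, period=12):
    """Return (amplitude, peak_month) of the first annual harmonic.

    Assumes evenly spaced monthly values covering whole years; index 0 is
    the first month of the series.
    """
    n = len(monthly)
    omega = 2.0 * math.pi / period
    # Project the series onto cosine and sine at the annual frequency.
    a = 2.0 / n * sum(y * math.cos(omega * t) for t, y in enumerate(monthly))
    b = 2.0 / n * sum(y * math.sin(omega * t) for t, y in enumerate(monthly))
    amplitude = math.hypot(a, b)
    # Phase converted to the month index at which the harmonic peaks.
    peak_month = (math.atan2(b, a) / omega) % period
    return amplitude, peak_month


def rate_and_acceleration(monthly):
    """First differences (rate per month) and second differences (acceleration)."""
    rate = [nxt - cur for cur, nxt in zip(monthly, monthly[1:])]
    accel = [nxt - cur for cur, nxt in zip(rate, rate[1:])]
    return rate, accel


# Synthetic 4-year series: mean 5, amplitude 3, peaking at month index 7.
series = [5 + 3 * math.cos(2 * math.pi * (t - 7) / 12) for t in range(48)]
amplitude, peak_month = harmonic_terms(series)
rate, accel = rate_and_acceleration([1, 4, 9, 16])
```

On the synthetic series the function recovers an amplitude of about 3 and a peak at month index 7, which is how the values were constructed.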
Top features included: slope of the site, distance to closest site, annual summary (min, mean, max, std), peak time, amplitude and rate of change of VV, VH, bare soil (BSI) and moisture stress indices (MSI). These properties mostly relate to semi-arid regions.
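For reference, the indices named above are usually computed as in the sketch below. The post does not say which exact formulas were used, so these are the commonly cited definitions, assuming the usual Sentinel-2 band mapping (B2 blue, B4 red, B8 NIR, B11 SWIR1) and Sentinel-1 backscatter in decibels. The function names and sample values are illustrative.

```python
def ndvi(nir, red):
    """Normalised difference vegetation index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)


def bsi(blue, red, nir, swir1):
    """Bare soil index: ((SWIR1 + Red) - (NIR + Blue)) / ((SWIR1 + Red) + (NIR + Blue))."""
    return ((swir1 + red) - (nir + blue)) / ((swir1 + red) + (nir + blue))


def msi(nir, swir1):
    """Moisture stress index: SWIR1 / NIR (higher values suggest more stress)."""
    return swir1 / nir


def vv_vh_ratio_db(vv_db, vh_db):
    """VV/VH polarisation ratio when both bands are in dB.

    A ratio in linear units becomes a difference in decibels.
    """
    return vv_db - vh_db


# Illustrative reflectance and backscatter values, not real measurements.
example_ndvi = ndvi(nir=0.5, red=0.1)
example_bsi = bsi(blue=0.1, red=0.2, nir=0.3, swir1=0.4)
example_msi = msi(nir=0.4, swir1=0.2)
example_ratio = vv_vh_ratio_db(vv_db=-10.0, vh_db=-17.0)
```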
Public LB: 0.844xx, Private LB: 0.85xxx
Amazing, and what was the CV? I really struggled with this one, since the Sentinel 2 data I was downloading for train had a different distribution to test, leading me to have a very good CV (>0.85) but a very poor LB.
For the three models, I got CVs around 0.87+ on average
Nicee
Thanks!
Well, I can't say anything about the dissimilar distributions in the train and test data. I didn't check for that.
For the Sentinel data, I added a 100m buffer to include Sentinel data within a 100m radius, just in case there was no data at the site's exact coordinates.
Same here @Koleshjr, I was gambling😂.
Amazing, Thanks for the writeup @Gozie and congratulations Big man!
Thanks @CodeJoe
Was the 100 m buffer built from multiple available latitude/longitude pairs, or did you pick one lat/lon pair and take a 100 m buffer around it?
No, it's a polygon covering the points within a 100m radius of the given coordinate. On Earth Engine, you can specify a buffer and it automatically creates that polygon during data extraction.
I took a 100m buffer per lat/lon entry and it didn't actually work. If possible, please share a code snippet, just a few lines.
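A minimal sketch of the buffer idea follows; it is not the code used in the solution. `buffer_ring` is a hypothetical pure-Python helper that only illustrates what a 100 m buffer around one coordinate looks like. `mean_in_buffer` assumes the `earthengine-api` package is installed and `ee.Initialize()` has already been called, and the 10 m `scale` is an assumption. Note that Earth Engine points take `[lon, lat]`, in that order; swapping the two is one possible reason a per-point buffer returns nothing.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def buffer_ring(lat, lon, radius_m=100.0, n_vertices=16):
    """Approximate a circular buffer as a closed ring of (lon, lat) vertices."""
    ring = []
    for k in range(n_vertices):
        theta = 2.0 * math.pi * k / n_vertices
        north_m = radius_m * math.cos(theta)
        east_m = radius_m * math.sin(theta)
        # Convert metre offsets to degrees (small-distance approximation).
        dlat = math.degrees(north_m / EARTH_RADIUS_M)
        dlon = math.degrees(east_m / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
        ring.append((lon + dlon, lat + dlat))
    ring.append(ring[0])  # close the polygon
    return ring


def mean_in_buffer(image, lat, lon, radius_m=100, scale=10):
    """Mean of every band of an ee.Image inside a buffer around one site.

    Assumes ee.Initialize() has been called; returns an ee.Dictionary
    (call .getInfo() on it to fetch the values).
    """
    import ee  # imported lazily so the helper above works without Earth Engine

    region = ee.Geometry.Point([lon, lat]).buffer(radius_m)  # [lon, lat] order
    return image.reduceRegion(
        reducer=ee.Reducer.mean(), geometry=region, scale=scale
    )


# Illustrative coordinate only, not a site from the competition data.
ring = buffer_ring(lat=51.77, lon=55.10)
```

With the buffer as the `geometry`, the reducer averages all pixels that fall inside the 100 m circle instead of sampling only the pixel under the point.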