📊 Challenge Chat: It's all about the datasets

TAHMO Incoming Solar Radiation Prediction Challenge

$10 000 USD

Under code review

Start

Apr 01, 26

Enrolments close

May 24, 26

Reveal

May 24, 26

fgbfgb

It's all about the datasets

14 May 2026, 20:40 · 6

Full transparency: I doubt it is possible to get good scores without grabbing a ton of external datasets. So, to be transparent, here is a rundown of all the datasets I grabbed/used. I am not sure they are all useful or even necessary, and these are just the raw features; some light feature engineering comes on top.

Train/Test CSVs — 40 TAHMO stations, 15-min cadence.

Raw inputs contain: precipitation, relative humidity, temperature, station metadata (lat/lon/elevation/install height/country).

Solar geometry (computed, no external data):

Cosine of solar zenith angle, extraterrestrial radiation (ETR), solar elevation, hour angle, declination — all derivable from timestamp + lat/lon. One could also use pvlib, but not required really; closed-form formulas are accurate to <1%.
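For anyone who wants the closed-form route, here is a minimal sketch using the standard Spencer (1971) series for declination, equation of time and the Earth-Sun distance correction. The function name, the returned keys and the 1361 W/m² solar constant are my own assumptions, not necessarily what the original pipeline used.

```python
import math
from datetime import datetime, timezone

SOLAR_CONSTANT = 1361.0  # W/m^2, assumed value


def solar_geometry(ts_utc, lat_deg, lon_deg):
    """Closed-form solar geometry for a UTC timestamp and a location (hypothetical helper)."""
    doy = ts_utc.timetuple().tm_yday
    hour = ts_utc.hour + ts_utc.minute / 60 + ts_utc.second / 3600
    # Fractional year in radians
    g = 2 * math.pi * (doy - 1 + (hour - 12) / 24) / 365
    # Solar declination in radians (Spencer series)
    decl = (0.006918 - 0.399912 * math.cos(g) + 0.070257 * math.sin(g)
            - 0.006758 * math.cos(2 * g) + 0.000907 * math.sin(2 * g)
            - 0.002697 * math.cos(3 * g) + 0.00148 * math.sin(3 * g))
    # Equation of time in minutes
    eot = 229.18 * (0.000075 + 0.001868 * math.cos(g) - 0.032077 * math.sin(g)
                    - 0.014615 * math.cos(2 * g) - 0.040849 * math.sin(2 * g))
    # True solar time in minutes, then hour angle (0 at solar noon)
    solar_minutes = hour * 60 + eot + 4 * lon_deg
    hour_angle = math.radians(solar_minutes / 4 - 180)
    lat = math.radians(lat_deg)
    cos_z = (math.sin(lat) * math.sin(decl)
             + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    # Earth-Sun distance correction factor
    e0 = (1.000110 + 0.034221 * math.cos(g) + 0.001280 * math.sin(g)
          + 0.000719 * math.cos(2 * g) + 0.000077 * math.sin(2 * g))
    # Extraterrestrial radiation on a horizontal surface, zero at night
    etr = SOLAR_CONSTANT * e0 * max(cos_z, 0.0)
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, cos_z))))
    return {
        "cos_zenith": cos_z,
        "etr": etr,
        "elevation_deg": elevation,
        "hour_angle_deg": math.degrees(hour_angle),
        "declination_deg": math.degrees(decl),
    }


# Example: near-overhead sun at the equator around the March equinox
feats = solar_geometry(datetime(2021, 3, 20, 12, 0, tzinfo=timezone.utc), 0.0, 0.0)
```

No refraction correction is applied here, so values near sunrise and sunset will differ slightly from pvlib's apparent zenith.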

ERA5 reanalysis (via Open-Meteo)

Source: https://archive.open-meteo.com/v1/archive (pre-interpolated to station coordinates, hourly).

Variables used: shortwave/direct/diffuse radiation, cloud cover (total + low/mid/high), wind speed/direction at 10 m, CAPE, precipitation, dewpoint, surface pressure.
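A rough sketch of how a per-station request to that endpoint could be assembled. The query parameter names and the hourly variable names below are written from memory of the Open-Meteo documentation and should be treated as assumptions to verify. CAPE is left out because I am not sure of its exact name in the archive API. The coordinates and dates are made up.

```python
from urllib.parse import urlencode

ARCHIVE_URL = "https://archive.open-meteo.com/v1/archive"

# Assumed Open-Meteo variable names; check them against the API docs before use.
HOURLY_VARS = [
    "shortwave_radiation", "direct_radiation", "diffuse_radiation",
    "cloud_cover", "cloud_cover_low", "cloud_cover_mid", "cloud_cover_high",
    "wind_speed_10m", "wind_direction_10m",
    "precipitation", "dew_point_2m", "surface_pressure",
]


def build_archive_url(lat, lon, start_date, end_date, variables=HOURLY_VARS):
    """Build the request URL for one station (hypothetical helper, no network call)."""
    params = {
        "latitude": f"{lat:.4f}",
        "longitude": f"{lon:.4f}",
        "start_date": start_date,  # ISO date, e.g. "2016-01-01"
        "end_date": end_date,
        "hourly": ",".join(variables),
    }
    # Keep commas unescaped so the variable list stays readable
    return f"{ARCHIVE_URL}?{urlencode(params, safe=',')}"


# Example with made-up station coordinates
url = build_archive_url(-1.28, 36.82, "2016-01-01", "2016-12-31")
```

Fetching one URL per station and caching the JSON locally avoids hitting the API again on every training run.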

LSA-SAF satellite products (Meteosat / MSG geostationary, 3 km grid)

Three products from https://datalsasaf.lsasvcs.ipma.pt/PRODUCTS/MSG/, all covering the full African disk:

MDSSFTD: Downwelling shortwave flux + diffuse fraction
MDSLF: Downwelling longwave flux
MLST: Land surface temperature

To extract per-station data: use K nearest grid cells (matched once via KDTree). Bounding-box subsetting keeps per-file I/O cheap. These datasets are HUGE (~1TB total).
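The matching step can be illustrated roughly as below. This is a plain-Python brute-force stand-in and not the KDTree itself: it first cuts a bounding box around the station, then ranks the surviving cells by great-circle distance. At full-disk scale a real KDTree (for example from SciPy) would replace the ranking loop. The function names, the 0.5° padding and the toy 0.05° grid are assumptions for illustration.

```python
import heapq
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def k_nearest_cells(station, cells, k=4, pad_deg=0.5):
    """Return [(cell_index, distance_km), ...] for the k grid cells nearest a station.

    station: (lat, lon); cells: list of (lat, lon) grid-cell centres.
    """
    slat, slon = station
    # Bounding-box subset: only cells within pad_deg of the station are scored
    candidates = [
        (i, lat, lon) for i, (lat, lon) in enumerate(cells)
        if abs(lat - slat) <= pad_deg and abs(lon - slon) <= pad_deg
    ]
    scored = [(haversine_km(slat, slon, lat, lon), i) for i, lat, lon in candidates]
    return [(i, d) for d, i in heapq.nsmallest(k, scored)]


# Toy regular grid and a made-up station location
grid = [(-2 + 0.05 * i, 36 + 0.05 * j) for i in range(41) for j in range(41)]
matches = k_nearest_cells((-1.0, 37.0), grid, k=4)
```

Because the grid does not change between files, the returned indices can be computed once per station and reused for every file.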

NASA POWER (data/power/)

Source: NASA POWER API (MERRA-2 reanalysis with GEOS satellite correction, ~50 km grid, hourly).

Variables: all-sky/clear-sky GHI, direct normal & diffuse irradiance, clearness index, cloud fraction, AOD @ 550 nm, bias-corrected precipitation.

MERRA-2 aerosols

Source: NASA GES DISC, M2T1NXAER collection (0.5° × 0.625° grid, hourly)

Variables: speciated AODs at 550 nm — total, dust, organic carbon, black carbon, sulfate, sea-salt — plus Ångström exponent and PM2.5 surface concentrations for dust & organic carbon.

CAMS EAC4 aerosols

Source: Copernicus Atmosphere Monitoring Service EAC4 reanalysis (3-hourly)

Variables: same speciated AOD set as MERRA-2 (total/dust/OM/BC/sulfate/sea-salt) plus total column water vapor
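The hourly and 3-hourly products still have to be lined up with the 15-min station timestamps, and the post does not say how that was done. One simple option, shown here purely as an assumption, is linear interpolation in time. The helper name and the sample AOD values are invented.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def interp_to_grid(src_times, src_values, target_times):
    """Linearly interpolate a coarse series onto finer timestamps.

    src_times must be sorted. Targets outside the source range get None.
    """
    out = []
    for t in target_times:
        j = bisect_right(src_times, t)
        if j == 0 or (j == len(src_times) and t > src_times[-1]):
            out.append(None)  # outside the source range, no extrapolation
        elif j == len(src_times):
            out.append(src_values[-1])  # exactly on the last source timestamp
        else:
            t0, t1 = src_times[j - 1], src_times[j]
            w = (t - t0) / (t1 - t0)  # fraction of the way from t0 to t1
            out.append(src_values[j - 1] + w * (src_values[j] - src_values[j - 1]))
    return out


# Two made-up 3-hourly AOD values interpolated onto a 15-min grid
start = datetime(2018, 1, 1, 0, 0)
src_t = [start, start + timedelta(hours=3)]
src_v = [0.2, 0.5]
grid_t = [start + timedelta(minutes=15 * n) for n in range(13)]
aod_15min = interp_to_grid(src_t, src_v, grid_t)
```

This is reasonable for slowly varying fields such as AOD or column water vapor. For radiation it is cruder, since the diurnal cycle is not linear between samples.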

pvlib clear-sky reference

Pre-computed on the same 15-min station grid

Variables: apparent zenith/elevation, relative & absolute airmass, extraterrestrial radiation, Linke turbidity, and clear-sky GHI/DNI/DHI from both the Ineichen-Perez and Haurwitz models, plus the Ineichen clear-sky index
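Of the two clear-sky models, Haurwitz is simple enough to write out by hand, since it depends only on the zenith angle. The sketch below reimplements that formula together with a clear-sky index. It is not the pvlib code, the index here is built on Haurwitz rather than Ineichen, and the cap of 2.0 is an assumed choice.

```python
import math


def haurwitz_ghi(apparent_zenith_deg):
    """Haurwitz clear-sky GHI in W/m^2 from the apparent zenith angle in degrees."""
    cos_z = math.cos(math.radians(apparent_zenith_deg))
    if cos_z <= 0:
        return 0.0  # sun at or below the horizon
    return 1098.0 * cos_z * math.exp(-0.059 / cos_z)


def clear_sky_index(ghi, clearsky_ghi, max_index=2.0):
    """Ratio of measured to clear-sky GHI, clipped to [0, max_index]."""
    if clearsky_ghi <= 0:
        return 0.0  # undefined at night, set to zero by convention here
    return min(max(ghi / clearsky_ghi, 0.0), max_index)


# Example: a made-up 600 W/m^2 reading with the sun 30 degrees from zenith
cs = haurwitz_ghi(30.0)
kc = clear_sky_index(600.0, cs)
```

The index removes most of the diurnal and seasonal cycle, so what is left mainly reflects clouds and aerosols.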

Discussion 6 answers

Koleshjr

Multimedia University of Kenya

Can you confirm whether this score (Abs MBE - 0.160338816) is also from the above pipeline? If not, does it comply with the rule below?

The values in TargetMBE and TargetRMSE should be identical for each corresponding entry of the submission. This format is required for multi-metric evaluation.

15 May 2026, 08:14

Upvotes 2

fgbfgb

MBE is a whole different issue; see my other post on this, which unfortunately went unanswered by the organizers. I have very strong opinions on why the MBE scoring is calibrated incorrectly, but that's for another discussion. You can get ~0 MBE without using any training data at all; you just have to guess the per-station means :)
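To see why a constant per-station guess zeroes the MBE, here is a toy calculation. The station IDs and values are invented, and the sign convention (prediction minus observation) is an assumption about how the metric is defined.

```python
import math
from statistics import fmean


def mbe(y_true, y_pred):
    """Mean bias error: average of (prediction - observation)."""
    return fmean([p - t for t, p in zip(y_true, y_pred)])


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(fmean([(p - t) ** 2 for t, p in zip(y_true, y_pred)]))


def constant_mean_scores(station_obs):
    """Score a predictor that outputs each station's mean for every timestamp."""
    scores = {}
    for station, obs in station_obs.items():
        guess = fmean(obs)
        pred = [guess] * len(obs)
        scores[station] = (mbe(obs, pred), rmse(obs, pred))
    return scores


# Made-up irradiance values (W/m^2) for two hypothetical stations
toy = {"A": [0.0, 200.0, 600.0, 400.0], "B": [0.0, 100.0, 300.0, 200.0]}
scores = constant_mean_scores(toy)
```

The bias cancels exactly because the errors around the mean sum to zero, while the RMSE stays at the spread of the data (about 224 for station A here). That is why a near-zero MBE on its own says little about model quality.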

These datasets should get RMSE down to below 60. Throw in a few more ensembles and all the tricks you can think of, and it can go down to ~57-58.

replied to Koleshjr · 15 May 2026, 08:53

Upvotes 2

thisiskuhan

Great post!

One dataset that's worth flagging since it didn't make your list: CAMS Solar Radiation Time-Series from Copernicus ADS (cams-solar-radiation-timeseries). That's a different product from the CAMS EAC4 aerosols you mentioned. It gives you 15-min all-sky and clear-sky GHI/BHI/DHI/BNI plus a reliability flag, pre-computed at the exact station coordinates. It's available at native 15-min cadence, so no resampling is needed to match the target grid, and it covers the full 2016-2020 span of the competition. Worth a look if you're squeezing out the last RMSE points.

On MBE - I'll skip that one, the scoring side of it deserves its own thread and I don't want to muddy yours. Looking forward to the post-deadline write-ups.

15 May 2026, 14:06

Upvotes 4

fgbfgb

I'm late to reply, but yep, this is a really really useful dataset! Thanks for posting it!

replied to thisiskuhan · 19 May 2026, 06:54

Upvotes 0

manuzrp

But I have doubts. Do we get a higher score by predicting the "biased" station values in the test set, or do we need to predict the "true" values? I guess we need to predict the "biased" station values to score higher...

16 May 2026, 14:41

Upvotes 0

Idrissa42

I think the external datasets are necessary to get a good score, but good feature selection is needed as well. Given that, I think predicting the biased values is a good idea.

replied to manuzrp · 16 May 2026, 15:16

Upvotes 1
