TAHMO Incoming Solar Radiation Prediction Challenge

$10 000 USD
Under code review
Prediction
Geospatial Analysis
1525 joined
760 active
Start: Apr 01, 26
Enrolments close: May 24, 26
Close: May 24, 26
Reveal: May 24, 26
Congratulations & Here's what I tried in this challenge
25 May 2026, 00:42 · 4

Congratulations to everyone who made it to the end!

The deadline's done, so here's the long version of what worked and what didn't for me. Landed at #7 on the private leaderboard — not the top, but I'm happy with it, and hopefully something here is useful for whoever picks up the next iteration.

The setup

40 weather stations. On-station features: timestamp, lat / lon / elevation, installation height, temperature, RH, precipitation. Target: incoming shortwave radiation in W/m², range 0–~1400, about half zero (night or heavy cloud).

Metric: `0.5 · MBE_norm + 0.5 · RMSE_norm` (it was 0.7:0.3 at first, then updated), averaged per station. One bad station hurts a lot.
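A minimal sketch of a score with this shape, for local validation only. The post does not spell out how `MBE_norm` and `RMSE_norm` are normalized, so the normalization below (absolute MBE and RMSE, each divided by the station's mean observed radiation) is an assumption, and the function names are made up for illustration.

```python
import numpy as np

def station_score(y_true, y_pred, w_mbe=0.5, w_rmse=0.5):
    """Score for one station: weighted sum of normalized |MBE| and RMSE.

    ASSUMPTION: both terms are normalized by the mean observed value;
    the official normalization may differ.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    scale = y_true.mean()
    mbe_norm = abs(err.mean()) / scale
    rmse_norm = np.sqrt(np.mean(err ** 2)) / scale
    return w_mbe * mbe_norm + w_rmse * rmse_norm

def competition_score(y_true, y_pred, stations):
    """Unweighted mean of the per-station scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    stations = np.asarray(stations)
    scores = [
        station_score(y_true[stations == s], y_pred[stations == s])
        for s in np.unique(stations)
    ]
    return float(np.mean(scores))
```

Because the final number is an unweighted mean over stations, a station with a few thousand rows counts exactly as much as one with many more.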

Two things shaped the whole approach:

One: train/test is split at the month level for the same station and year — so a per-station model has real history from the exact sensor.

Two: the training data isn't clean. A few stations have sensor problems. Train naively on any of these and they tank your per-station score — and remember, the metric averages per-station, so one bad station hurts the same as a great one helps.

So: fit per-station where the sensor is trustworthy, throw out the bad months everywhere else, and lean on the global head + satellite features for the genuinely broken stations.

Validation

Leave-One-Month-Out per station: For each station, for each training month, train on the other 5 months and predict the held-out one. Average per-station, then across stations.
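A minimal sketch of that split for the per-station case, assuming the rows carry a station id and a calendar month. The function names are hypothetical, and how the folds were built for the global head is not described in the post, so it is not covered here.

```python
import numpy as np

def lomo_splits(stations, months):
    """Yield (station, held_out_month, train_idx, val_idx) for every
    station and every training month that station has."""
    stations = np.asarray(stations)
    months = np.asarray(months)
    for s in np.unique(stations):
        in_station = stations == s
        for m in np.unique(months[in_station]):
            val_mask = in_station & (months == m)
            train_mask = in_station & (months != m)
            yield s, int(m), np.flatnonzero(train_mask), np.flatnonzero(val_mask)

def average_lomo(fold_scores):
    """fold_scores: iterable of (station, score) pairs, one per fold.
    Average within each station first, then across stations."""
    by_station = {}
    for station, score in fold_scores:
        by_station.setdefault(station, []).append(score)
    return float(np.mean([np.mean(v) for v in by_station.values()]))
```

Averaging within a station before averaging across stations keeps a station with fewer usable months from being under-weighted, which mirrors how the competition metric treats stations.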

The model

LightGBM. Two heads, blended.

- Per-station head: one LGBM per station, fit only on that station's months. `num_leaves=63`, `learning_rate=0.05`, `n_estimators=2000`, `min_child_samples=20`, `subsample=0.8`, `colsample_bytree=0.8`, `reg_alpha=0.1`, `reg_lambda=1.0`, early stopping after 50 rounds. Bagged across 6 seeds — predictions averaged. Stations with too little clean data get skipped.

- Global head: one LGBM on all stations together, with lat / lon / elevation as extra features. Same params except `n_estimators=3000` (more data, can grow longer). This carries the geometric backbone.

- Blend: flat 85% global + 15% per-station, every row. Per-station overfits hard on a few thousand daytime rows; its job in the blend isn't to dominate, it's to add residual character. A row-count-weighted blend came in within noise at LOMO; flat was easier to reason about.
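A minimal sketch of how the pieces above fit together, using the hyperparameters as listed. The model fitting itself is left out: `fit_predict` is a hypothetical callable that trains with a given seed and returns predictions. The fallback to the global prediction for skipped stations is an assumption, since the post does not say what fills those rows.

```python
import numpy as np

# Hyperparameters as listed above; early stopping (50 rounds) would be
# configured at fit time rather than in this dict.
STATION_PARAMS = dict(
    num_leaves=63, learning_rate=0.05, n_estimators=2000,
    min_child_samples=20, subsample=0.8, colsample_bytree=0.8,
    reg_alpha=0.1, reg_lambda=1.0,
)
GLOBAL_PARAMS = {**STATION_PARAMS, "n_estimators": 3000}

def bag_predictions(fit_predict, seeds=range(6)):
    """Average predictions over seeds.

    `fit_predict(seed)` is a hypothetical callable that trains one model
    with that seed and returns an array of predictions.
    """
    preds = [np.asarray(fit_predict(seed), dtype=float) for seed in seeds]
    return np.mean(preds, axis=0)

def blend(global_pred, station_pred, w_global=0.85):
    """Flat blend applied to every row.

    ASSUMPTION: rows from stations whose per-station head was skipped
    carry NaN in `station_pred` and fall back to the global prediction.
    """
    global_pred = np.asarray(global_pred, dtype=float)
    station_pred = np.asarray(station_pred, dtype=float)
    mixed = w_global * global_pred + (1.0 - w_global) * station_pred
    return np.where(np.isnan(station_pred), global_pred, mixed)
```

One thing worth checking when wiring these into LightGBM: `subsample` only has an effect when `subsample_freq` is set to a positive value, and the post does not say which value was used.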

Label cleaning before training:

- Sun below horizon → target hard-zeroed.

- Drift months dropped by two rules: >50% daytime zeros in a month when the station overall is below 30% zero (stuck low), or night mean > 10 W/m² (stuck high). The station-overall gate is intentional — it cleanly removes the bad months from stations TA00122 and TA00123 (which are otherwise normal) without nuking stations that just live in cloudier climates.

- On TA00123 and TA00118, drop physically-impossible rows (rad > 1.5× clearsky) and daytime-stuck-low rows (clearsky > 500, rad < 10, sun > 10° up).
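The three rules above can be sketched in pandas roughly as follows. The column names (`station`, `month`, `rad`, `clearsky`, `solar_elevation`, `is_day`) are assumptions for illustration, as is measuring the station-overall zero share on daytime rows only; the thresholds are the ones quoted in the list.

```python
import pandas as pd

def zero_night(df):
    """Hard-zero the target wherever the sun is below the horizon."""
    out = df.copy()
    out.loc[out["solar_elevation"] < 0, "rad"] = 0.0
    return out

def drift_months(df, month_zero_frac=0.5, station_zero_frac=0.3,
                 night_mean_max=10.0):
    """Return the (station, month) pairs flagged as drifted.

    Expects columns: station, month, rad (W/m^2), is_day (bool).
    ASSUMPTION: the station-overall zero share uses daytime rows only.
    """
    work = df.assign(
        # 1.0 / 0.0 on daytime rows, NaN at night (ignored by mean)
        day_zero=(df["rad"] == 0).astype(float).where(df["is_day"]),
        # radiation on night rows only, NaN during the day
        night_rad=df["rad"].where(~df["is_day"]),
    )
    stats = work.groupby(["station", "month"], as_index=False).agg(
        month_zero=("day_zero", "mean"),
        night_mean=("night_rad", "mean"),
    )
    station_zero = work.groupby("station")["day_zero"].mean()
    stats["station_zero"] = stats["station"].map(station_zero)

    stuck_low = (stats["month_zero"] > month_zero_frac) & (
        stats["station_zero"] < station_zero_frac
    )
    stuck_high = stats["night_mean"] > night_mean_max
    flagged = stats.loc[stuck_low | stuck_high, ["station", "month"]]
    return flagged.reset_index(drop=True)

def implausible_rows(df, cap_ratio=1.5):
    """Boolean mask of rows to drop: above 1.5x clearsky, or near zero
    while the sun is high and clearsky is strong."""
    too_high = df["rad"] > cap_ratio * df["clearsky"]
    stuck_low = (
        (df["clearsky"] > 500) & (df["rad"] < 10) & (df["solar_elevation"] > 10)
    )
    return too_high | stuck_low
```

The station-overall gate shows up as the second condition in `stuck_low`: a month with many daytime zeros is only flagged when the station as a whole is not a high-zero station.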

Handling the critical stations

Four stations needed special handling. Worth walking through because the per-station metric punishes you hard if you skip this.

- TA00338 (broken across the board). The drift filter above intentionally doesn't fire here — the whole station is too noisy for any month to look obviously "good". So per-station training has nothing trustworthy to fit on. The strategy: let the global head (which sees lat/lon/elevation and the satellite signals from CAMS and LSA SAF) carry the prediction. The blend's 85% global weight is doing most of its work on stations like this one. Effectively: predict the climatology and the satellite-observed clouds, ignore the on-sensor history.

- TA00122 (two stuck months, otherwise normal). Drift filter drops months 3 and 5. The per-station head fits cleanly on the other four months. Standard case.

- TA00123 (two stuck months + scattered garbage labels). Drift filter drops months 7 and 9. The label outlier rule additionally removes single rows where radiation exceeds 1.5× clearsky or where it's near-zero at high sun. That cleanup matters because LightGBM on a few thousand rows will memorize even a handful of physically-impossible labels.

- TA00118 (missing January + scattered garbage). Nothing to drop at the month level (the data that exists looks fine), but the label outlier rule does the cleanup. The missing January is just lived with — 5 clean months is still enough to fit a reasonable per-station head.

The pattern: trust the per-station head only where the sensor is trustworthy, otherwise let geometry + satellites carry the prediction.

Features

~50 features in six families, pre-prune.

1. Solar geometry (9) — `solar_elevation`, `solar_azimuth`, `hour_angle`, `solar_declination`, `day_length`, `time_since_sunrise`, `time_until_sunset`, `clearsky_frac_daily_max` (current clearsky ÷ that day's max clearsky — basically "% of solar noon"), `air_mass`. Deterministic from (lat, lon, timestamp). Sets the upper envelope. I used the Haurwitz (1945) clear-sky GHI model — simple, no aerosol inputs needed, fine for this. Any standard clearsky model would work.

2. Temporal (10) — `hour`, `month`, `day_of_year`, `day_of_month` and their sin/cos cyclic encodings. The encodings help the global head interpolate smoothly across midnight and year boundaries.

3. Weather (11) — raw temp/RH/precip/precip-flag plus interactions: `humidity_x_clearsky`, `temp_x_solar_elevation`, `humidity_sq`, `temp_deviation` (current temp minus the station's mean temp for that calendar day), VPD, dew-point depression, precipitable water.

4. Intraday lags (14) — lagged temperature and humidity at 15 min / 30 min / 1 h plus their deltas, and lagged precipitation and clearsky at 1 h. Each row sees only its own station's recent past.

5. Daily aggregates (3) — `daily_temp_range`, `daily_rh_mean`, `daily_precip_any`.

6. Per-station kt context (9) — the strongest non-geometric per-station signal. `kt` here is the clearness index, `rad / clearsky`, clipped at 1.2. Features: `station_mean_kt_at_hour`, `station_std_kt_at_hour`, `station_mean_rad_at_hour` (station's own kt/rad history aggregated by hour of day, from training months only); `kt_prev_month_same_hour`, `kt_next_month_same_hour`, `kt_interpolated_at_hour` (for a test row in an even month at hour h, the surrounding odd months' kt at the same hour, plus a linear interpolation); `kt_q25_at_hour`, `kt_q75_at_hour`, `kt_iqr_at_hour`. The neighboring odd months are a strong prior for the held-out even month.
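A minimal sketch of a few of the building blocks named in the list: a Haurwitz-style clearsky, the clipped clearness index, a cyclic encoding, and the neighboring-month interpolation. The function names are hypothetical. The Haurwitz constants are the commonly quoted form of the model (references differ slightly on the exponent coefficient, roughly 0.057 to 0.059), so treat them as an assumption and compare against `pvlib` before relying on them. The NaN handling at night and for a missing neighbor month is also an assumption.

```python
import numpy as np

def haurwitz_ghi(solar_elevation_deg, coeff=0.059):
    """Haurwitz-style clear-sky GHI (W/m^2) from solar elevation in degrees.

    ASSUMPTION: 1098 and `coeff` are the commonly quoted constants.
    """
    elev = np.atleast_1d(np.asarray(solar_elevation_deg, dtype=float))
    cos_zenith = np.sin(np.radians(elev))  # cos(zenith) = sin(elevation)
    ghi = np.zeros_like(cos_zenith)
    up = cos_zenith > 0
    ghi[up] = 1098.0 * cos_zenith[up] * np.exp(-coeff / cos_zenith[up])
    return ghi

def clearness_index(rad, clearsky, cap=1.2):
    """kt = rad / clearsky, clipped at `cap`.

    ASSUMPTION: kt is left as NaN where clearsky is zero (night).
    """
    rad = np.atleast_1d(np.asarray(rad, dtype=float))
    clearsky = np.atleast_1d(np.asarray(clearsky, dtype=float))
    kt = np.full_like(rad, np.nan)
    ok = clearsky > 0
    kt[ok] = rad[ok] / clearsky[ok]
    return np.minimum(kt, cap)

def cyclic(values, period):
    """sin/cos encoding so the last value of a cycle sits next to the first."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

def interpolate_kt(kt_prev_month, kt_next_month):
    """Midpoint between the surrounding months' kt at the same hour.

    ASSUMPTION: if one neighbor is missing (NaN), use the other one.
    """
    prev = np.asarray(kt_prev_month, dtype=float)
    nxt = np.asarray(kt_next_month, dtype=float)
    mid = 0.5 * (prev + nxt)
    mid = np.where(np.isnan(prev), nxt, mid)
    mid = np.where(np.isnan(nxt), prev, mid)
    return mid
```

The history-based kt features have to be computed from training months only, as noted in item 6; otherwise the held-out month leaks into its own prior.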

External datasets

The challenge encourages external data. I tried a lot.

Kept:

- NASA POWER — daily aerosol + cloud optical depth (`np_total_od`). Adds signal that on-station weather doesn't capture.

- CAMS Solar Radiation Service (Copernicus ADS) — 15-min satellite-modelled GHI/BHI/DHI/BNI plus clear-sky equivalents at exact station coords. The single most useful external dataset.

- LSA SAF MDSSFTD / MDSLF — 15-min EUMETSAT satellite shortwave from MSG. Conceptually similar to CAMS but a different physical retrieval, so each adds incremental signal.

Downloaded, integrated, then turned off:

- ERA5 SSRD — correlated with CAMS, lower resolution. Net zero on top of CAMS+LSA SAF.

- Open-Meteo — cloud cover low/mid/high, surface pressure, dew point, wind direction. Neutral.

- GPM IMERG — half-hourly precip. Nothing meaningful over the on-station gauge.

- Copernicus GLO-30 / SRTM terrain — horizon angle and sky-view factor. Tiny effect; TAHMO stations are mostly open ground.

- LSA SAF spatial aggregation (2×2 pixel window). Mixed.

- CAMS × LSA SAF cross-features: Marginal, dropped to keep things lean.

If one of these worked for you, I'd genuinely like to hear it.

Feature pruning and tuning

After the first training pass, drop any feature with normalized gain below 0.5% of the per-fold total, then retrain. This usually prunes ~10 features. I also tried top-K-per-family, but it kept weak features around just because their family was diverse, so the threshold won. Hyperparameters were tuned with Optuna on a 10-station subset (25 trials), then frozen.
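A minimal sketch of the gain-threshold prune for a single fold. The function name is hypothetical, and how the per-fold results were combined is not described in the post, so that part is left out.

```python
import numpy as np

def prune_by_gain(feature_names, gains, min_share=0.005):
    """Split features into (kept, dropped) by their share of total gain.

    `gains` is one fold's raw gain importance per feature; a feature is
    dropped when its share of the fold total is below `min_share`.
    """
    gains = np.asarray(gains, dtype=float)
    share = gains / gains.sum()
    keep = share >= min_share
    kept = [name for name, k in zip(feature_names, keep) if k]
    dropped = [name for name, k in zip(feature_names, keep) if not k]
    return kept, dropped
```

With LightGBM, the raw gains would typically come from `booster.feature_importance(importance_type="gain")`, paired with `booster.feature_name()`.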

Things that did not work

- Neural network heads (MLP, small LSTM, tiny transformer) — could not beat boosting on engineered features.

- XGBoost + CatBoost cross-family ensemble — NNLS-weighted with LGB, came in within ±0.001 at LOMO with ~3× the cost.

- Ridge baseline + LGB residual — slightly worse than pure LGB.

- Per-hour global models — fragments the training set unhelpfully.

- kt-prior shrinkage of `station_mean_kt_at_hour` toward the global mean — net neutral.

- Isotonic recalibration per station — overfits the few-thousand-row test set.

- Neighbor-kt features (k=3 within 300 km) — slightly negative. Nearest neighbors are often 400–800 km away here.

- Bird (1981) aerosol-aware clearsky as a hard cap — hurts RMSE on cloudy days when the cap is a bit generous.

How to replicate this from scratch

If you want to rebuild this without my code, here's the minimal path:

1. Stack: Python 3.12, `lightgbm`, `pandas`, `numpy`. For solar geometry / clearsky, either roll your own (Haurwitz is a one-liner) or use `pvlib`. Optuna only if you want to retune.

2. Data downloads: CAMS Solar Radiation Service, LSA SAF MDSSFTD / MDSLF, NASA POWER

3. First pass (gets you near the benchmark in 4–6 hours of focused work): one LGBM per station with the hyperparameters above, solar geometry + on-station weather only, LOMO validation.

4. Iterate in this order: add CAMS + LSA SAF (big jump) → add kt context features (next biggest) → add the global head and the 85/15 (global/per-station) blend → 6-seed bag → drift filtering and label outlier rules → 0.5% gain prune.

Everything in this post is in that loop. The didn't-work list is just the dead ends to skip.

On the metric, while we're being honest: half the score is MBE, which means a handful of probe submissions (submit a constant, read back the leaderboard's MBE, adjust your overall mean accordingly) measurably move the score for very little effort.
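Why a mean shift moves both halves of the score: MBE is the mean of the errors, and RMSE² = MBE² + Var(error), so removing a constant bias sets MBE to zero and lowers RMSE too. A small numerical check on made-up numbers (not competition data), with hypothetical helper names:

```python
import numpy as np

def mbe(y_true, y_pred):
    """Mean bias error: average of (prediction - observation)."""
    return float(np.mean(np.asarray(y_pred, float) - np.asarray(y_true, float)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    err = np.asarray(y_pred, float) - np.asarray(y_true, float)
    return float(np.sqrt(np.mean(err ** 2)))

def remove_bias(y_pred, estimated_mbe):
    """Shift every prediction down by the estimated bias."""
    return np.asarray(y_pred, float) - estimated_mbe

# Synthetic example: predictions that run about 30 W/m^2 too high.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = y_true + np.array([30.0, 10.0, 50.0, 30.0])

bias = mbe(y_true, y_pred)
y_fixed = remove_bias(y_pred, bias)
error_variance = float(np.var(y_pred - y_true))
```

Here `rmse(y_true, y_pred) ** 2` equals `bias ** 2 + error_variance`, and after the shift the RMSE is just the square root of the error variance.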

------------------

If something looks wrong, or one of the "didn't work" rows actually worked for you, please reply. I'd rather fix this post than leave bad advice up — and honestly, I'll learn more from your corrections than from my own writeup.

Also, it would be great to hear from people who reached an RMSE of ~55-58 about how they achieved it.

Happy learning and once again congrats to everyone who shipped!
Discussion 4 answers

Thank you

25 May 2026, 00:45
Upvotes 1

thanks

25 May 2026, 03:17
Upvotes 1
RareGem

Thank you for sharing. What was your roadmap for this challenge? How did you come up with this idea? Could you please share the GitHub link to the code?

25 May 2026, 09:33
Upvotes 1

Sure! It was an iterative approach...

1. Analyzed the given data to understand what it contains, how it is structured, and how it is distributed. (Odd months train, even months test, same 40 stations on both sides → test = same sensors under seasonal drift. Also spotted a few dead/stuck stations here.)

2. Built new features! The given data is too sparse to build a generalised model, so I had to include a lot of external data. Some things worked out for me and some didn't. What stuck: solar geometry, clearsky (Haurwitz), intraday deltas, per-station kt context, external satellite data (CAMS, LSA SAF, NASA POWER).

3. Trained gradient-boosted models (NNets didn't work out): LightGBM, eventually two heads (one global, one per-station), blended 85/15 after some sweep experiments.

4. Settled on LOMO for cross-validation (Leave-One-Month-Out: hold out one odd month per station, train on the rest, score per station, repeat across months).

5. Evaluated what's still missing — which stations, which months, which regimes are off — and got back to step 1.

That's it. Each iteration closed the gap a bit. Late in the comp the focus shifted from "build features" to "measure and correct per-station bias directly on the LB" (same iteration, different lever).

(And ofc, AI in the loop helped a lot)

Regarding the Git link: I am cleaning up the repo, pulling out confidential bits and files related to the competition data. Will share soon.