AirQo Low-Cost Air Quality Monitor Calibration Challenge
$1 000 USD
Powering up low cost air quality monitors in Kampala, Uganda
276 data scientists enrolled, 161 on the leaderboard
Safety, Prediction, Structured
Uganda
30 April—6 June
38 days

The data was collected hourly at three locations across Kampala (US Embassy, Makerere and Nakawa) over differing time periods in the last two years. The data is heavily unbalanced across the three sites, in a ratio of approximately 4:2:1.
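One way to keep the largest site from dominating a model is to weight rows inversely to their site's frequency. The following is a minimal sketch of that idea, not part of the challenge materials: the helper name `inverse_frequency_weights` is hypothetical, and the site labels in the toy list are illustrative and may not match the exact strings in the `site` column.

```python
from collections import Counter


def inverse_frequency_weights(sites):
    """Return one weight per row so that every site has the same total weight."""
    counts = Counter(sites)
    n_sites = len(counts)
    total = len(sites)
    # weight = total / (n_sites * rows_at_that_site); the weights sum to `total`
    return [total / (n_sites * counts[s]) for s in sites]


# Toy rows mirroring the approximate 4:2:1 ratio (labels are illustrative).
sites = ["US Embassy"] * 4 + ["Makerere"] * 2 + ["Nakawa"] * 1
weights = inverse_frequency_weights(sites)
```

Many regression libraries accept such per-row weights at fit time; whether weighting helps here is something to check on a validation split.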

There should be no NaNs in the PM data, although the temperature and humidity columns may have some minor gaps.
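Small gaps in the temperature and humidity columns can be filled in several ways. The sketch below shows one simple option, a forward fill, and assumes the readings are already sorted by `created_at` and that missing values have been read in as `None`. The function name `forward_fill` and the sample readings are made up for illustration.

```python
def forward_fill(values):
    """Replace None gaps with the most recent observed value.

    Gaps at the very start have nothing to copy from and stay None.
    """
    filled = []
    last = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical hourly temperature readings with two short gaps.
temp = [22.5, None, None, 24.0, None]
filled_temp = forward_fill(temp)
```

Interpolating between neighbouring hours is an equally reasonable alternative for gaps this short.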

The target is the reference value. A reference monitor is a very heavy (30kg) static machine that requires mains electricity and secure mounting. It is not portable, so once it is in place it stays there. It measures PM2.5 (the mass of particulate matter smaller than 2.5 microns in diameter, or 1/30th the thickness of a human hair, found in a cubic metre of air) with very high accuracy, to internationally accepted standards.

We collocated one of our low cost devices next to each of these reference monitors at the same height, less than 1m apart. A low cost device measures PM2.5 as above but also PM10 (particulate matter smaller than 10 microns in diameter, roughly 1/7 the thickness of a human hair, so it includes much bigger particles but also everything counted in PM2.5). Low cost monitors contain two identical sensors, so you will see two values for PM2.5 and two values for PM10. In an ideal world these sensors would record identical values, but some variation is expected. The second sensor serves as a backup and as a check.
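Because each device carries two sensors, the pair can be turned into derived features. The sketch below is one possible approach and an assumption rather than a prescribed step: it averages the two readings and records how far they disagree. The function name `combine_sensors` and the sample readings are hypothetical.

```python
def combine_sensors(s1, s2):
    """Return (mean, absolute difference, relative difference) for a sensor pair."""
    mean = (s1 + s2) / 2
    diff = abs(s1 - s2)
    # Guard against division by zero when both sensors read 0.
    rel = diff / mean if mean > 0 else 0.0
    return mean, diff, rel


# Hypothetical pm2_5 and s2_pm2_5 readings for one hour, in ug/m3.
mean_pm25, diff_pm25, rel_pm25 = combine_sensors(40.0, 44.0)
```

A large relative difference may flag hours where one sensor is misbehaving, which a model could learn to treat with less confidence.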

We also include temperature and humidity values. There are several weather stations located across Kampala, and analysis has shown that variation between their readings is minimal, so the same value will be found at the same time across each of the locations.

We also include metadata about the latitude and longitude, altitude, terrain features, greenness and distance from a major road, which you may find useful. These values do not change for the same location over time but are included in de-normalised format for ease of use.

The objective of this challenge is to develop a model that will take low cost device data and other supplementary data and transform it as accurately as possible to the reference value.
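As a point of comparison for more elaborate models, a single-feature linear calibration can serve as a baseline. The sketch below fits an ordinary least squares line from a low cost reading to the reference value. It is an illustrative assumption, not the organisers' method. The toy numbers are invented, and RMSE is used only as one convenient error measure since this description does not name the scoring metric.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Invented toy data: low cost pm2_5 readings and matching ref_pm2_5 values.
low_cost = [10.0, 20.0, 30.0, 40.0]
reference = [8.0, 14.0, 20.0, 26.0]

slope, intercept = fit_line(low_cost, reference)
calibrated = [slope * x + intercept for x in low_cost]
error = rmse(reference, calibrated)
```

In practice the relationship is unlikely to be this clean, and humidity, temperature and site metadata are the natural candidates for additional inputs.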

Files available for download:

  • Train.csv - contains the target 'ref_pm2_5' column. This is the dataset that you will use to train your model.
  • Test.csv - resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
  • SampleSubmission.csv - shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv and the ‘ref_pm2_5’ column containing your predictions. The order of the rows does not matter, but the names of the IDs must be correct.
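The submission layout described above can be produced with the standard `csv` module. This is a minimal sketch: the column names `ID` and `ref_pm2_5` come from the description, while the helper name `write_submission`, the example IDs and the predicted values are placeholders.

```python
import csv
import io


def write_submission(ids, predictions, out):
    """Write an 'ID,ref_pm2_5' CSV to the file-like object `out`."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ID", "ref_pm2_5"])
    for row_id, pred in zip(ids, predictions):
        writer.writerow([row_id, pred])


# Placeholder IDs and predictions; real IDs must be taken from Test.csv.
buffer = io.StringIO()
write_submission(["ID_A", "ID_B"], [35.2, 41.0], buffer)
submission_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with `open("submission.csv", "w", newline="")` in place of the in-memory buffer.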

Variable definitions

  • created_at: Hourly timestamp for the date and time the values were recorded. The value is in local East Africa Time, i.e. UTC+3
  • site: the Reference monitor site, one of three, at which the data was recorded
  • pm2_5: The PM2.5 value recorded on low cost sensor 1 of the low cost device. Unit is ug/m3 or micrograms of particulate matter smaller than 2.5 microns recorded in a cubic metre of air
  • pm10: The PM10 value recorded on low cost sensor 1 of the low cost device. Unit is ug/m3 or micrograms of particulate matter smaller than 10 microns recorded in a cubic metre of air
  • s2_pm2_5: The PM2.5 value recorded on sensor 2 of the low cost device. Unit is ug/m3 or micrograms of particulate matter smaller than 2.5 microns recorded in a cubic metre of air
  • s2_pm10: The PM10 value recorded on low cost sensor 2 of the low cost device. Unit is ug/m3 or micrograms of particulate matter smaller than 10 microns recorded in a cubic metre of air
  • humidity: given as a decimal proportion of full saturation so cannot be higher than 1
  • temp: prevailing temperature in degrees Celsius
  • lat: latitude of the site location in degrees, this is constant for each location
  • long: longitude of the site location in degrees, this is constant for each location
  • altitude: height above sea level in metres, this is constant for each location
  • greenness: an index of the greenness of a location using the normalized difference vegetation index (NDVI). The higher the number, the greener the location, i.e. more trees, grass etc.
  • Landform_90m: an index showing the type of terrain over a 90m area. A low value (11) represents a peak or ridge and a high value (42) a valley. See this link.
  • landform_270m: an index showing the type of terrain over a 270m area. A low value represents a peak or ridge and a high value a valley. See this link.
  • population: population density, given as the number of people per square km at the site
  • dist_major_road: distance in metres from the closest road which has continuous traffic flow
  • ref_pm2_5: the target variable. The PM2.5 value recorded by the Reference monitor.
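Since `created_at` is in local time at UTC+3, attaching that offset explicitly makes hour-of-day features unambiguous. The sketch below assumes timestamps are formatted like `2021-03-15 02:00:00`. That format, the sample value and the helper names are assumptions, so adjust `fmt` to match the actual files.

```python
from datetime import datetime, timedelta, timezone

# East Africa Time is a fixed UTC+3 offset with no daylight saving.
EAT = timezone(timedelta(hours=3), name="EAT")


def parse_created_at(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a local timestamp string and attach the UTC+3 offset."""
    return datetime.strptime(text, fmt).replace(tzinfo=EAT)


def time_features(ts):
    """Derive simple calendar features from a timezone-aware timestamp."""
    return {
        "hour": ts.hour,  # local hour of day, 0-23
        "weekday": ts.weekday(),  # Monday is 0
        "utc_hour": ts.astimezone(timezone.utc).hour,
    }


# Illustrative timestamp, not taken from the dataset.
ts = parse_created_at("2021-03-15 02:00:00")
features = time_features(ts)
```

Local hour and weekday are likely to be the more useful features, since traffic and cooking patterns follow local time.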