Given blood spectroscopy readings, can you predict which compounds are in the blood?
Traditionally, we do blood analysis by collecting blood samples from patients and performing different tests on the samples. In this competition, you will help scientists make progress towards non-invasive blood analysis.

For this purpose, you will build machine learning models that classify the level of specific chemical compounds in samples from their spectroscopic data. In simple terms, when we direct a beam of light towards a sample, the light is partially absorbed and/or reflected depending on the sample's molecular structure (the different chemical compounds present). The amount of light absorbed strongly depends on the wavelength of the light source used. Hence, if we use a beam of light containing a range of wavelengths, we can measure the amount of energy absorbed at each wavelength. Such a measurement over different wavelengths (or frequencies) is called a spectrum (or spectral data).

We will use spectral data from the Near Infra-Red (NIR) wavelength range (950 nm to 1350 nm) in this challenge. In contrast to other wavelengths, NIR has the highest penetration power and travels deep into tissue before it is attenuated. The difficulty with spectral data is that we generally have far too many features compared to the number of data points. Most chemists have yet to adopt machine learning techniques for spectral analysis because of the high risk of overfitting. One major issue you will face in this challenge is building a model that generalises well without overfitting.

If successful, you will lead the way in creating a new level of health awareness by making blood tests a commodity: a procedure that can be performed effortlessly many times a day, much like measuring our weight. Furthermore, we can incorporate your model into new devices that allow patients to do their blood analysis in less than a minute, even at home, and send the results to their doctor. NIR spectroscopy is a sleeping giant waiting to enter our everyday lives through our mobile phones, wearables and many other appliances. Imagine a kiosk in the supermarket that suggests a diet on the fly based on our cholesterol levels, measured just by shining a light; imagine a weighing machine that tells us not only our weight but also the vitamin levels in our blood.

Diving into the challenge:

Before jumping into this challenge, we recommend that you build a basic knowledge of spectroscopy and material analysis techniques, and familiarize yourself with the datasets presented in this challenge.

Introduction video from Oded Daniel, the CEO of

About (

We are a team of ambitious, innovative professionals who are passionate about offering a better way to get in touch with and care for our bodies - all with the simple scan of our skin.

Data Collection & Exporting Procedures

Data Overview

Each data set consists of scans made using an identical scanner model. The data is collected raw, as the result of the scanner pushing light into the target (in this case, a fingertip). It is recorded as a function of wavelength, measured in nanometers (nm), and registers the quantity of light reflecting off the target point. Our application registers the results and creates an array of values to account for all wavelength data. The light that reflects back is referred to as “intensity.”

Data Collection Specifications

Each scan cycle pushes light as a function of wavelength, and our wavelength data ranges from 900 nm to 1700 nm. All reflected light is distinguished and categorized by wavelength (i.e., light returning at 900 nm is distinguished from light returning at 905 nm, and so on). We record intensity at 900, 904.71, 909.41, 914.12 nm and so on, resulting in an array of 170 intensities per scan.
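
As a rough illustration of the sampling described above, the sketch below rebuilds the wavelength grid. It assumes the 170 points are evenly spaced with a step of 800/170 nm. That spacing reproduces the first values listed (900, 904.71, 909.41, 914.12), but it is an assumption and is not confirmed by the scanner documentation. The names `wavelength_grid`, `START_NM`, `N_POINTS` and `STEP_NM` are hypothetical.

```python
START_NM = 900.0   # first wavelength quoted in the text
N_POINTS = 170     # intensities per scan quoted in the text
# ASSUMPTION: uniform spacing of (1700 - 900) / 170 nm, about 4.706 nm.
STEP_NM = (1700.0 - 900.0) / N_POINTS


def wavelength_grid(start=START_NM, step=STEP_NM, n=N_POINTS):
    """Return n evenly spaced wavelengths (in nm) starting at `start`."""
    return [start + i * step for i in range(n)]


grid = wavelength_grid()
print([round(w, 2) for w in grid[:4]])
```

Under this assumption the last grid point falls slightly below 1700 nm, so please check the column headers of the actual data files before relying on it.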

We repeat this process 60 times in order to produce a reliable scan and see a holistic picture. We also account for humidity and temperature, keeping in mind that these factors may affect scan results. Therefore, each of the 60 scans comprises 170 intensities where temperature and humidity at the time of the scan are accounted for.

Data Quality Assurance

Scanner Monitoring & Calibration

  • Though every scanner used is the same brand, make and model, each scanner has its own sensitivities. As a result, we account for any small misalignments between scanners using a standard piece of equipment called a “white plate.”

The plate is scanned by each scanner every day before data collection begins, then the first scan is compared to the last. The numeric difference between the scan values is considered “scanner decay,” and will alert us as to whether or not the value gap is large enough to warrant replacing the scanner to ensure scan value accuracy. This process is called “calibration” and is also useful for calculating absorbance.

Data Validation Process

  • Each biodata donation consists of 60 rapid-fire scans, and ensuring the consistency and quality of each of the 60 scans is our top priority. Any movement in the scanner or scan target may result in invalid scan data that loses its value. For the most accurate scan results, we calculate the standard deviation of the 60 scans and, if we do not find consistent results, exclude the scans from the data set.
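
The consistency check described above could be sketched roughly as follows. The exact statistic and the rejection threshold used in the real pipeline are not stated, so this sketch assumes one plausible choice: the standard deviation across scans is computed at each wavelength and then averaged. The names `scan_spread`, `is_consistent` and `max_spread` are hypothetical.

```python
from statistics import mean, pstdev


def scan_spread(scans):
    """Average, over wavelengths, of the standard deviation across scans.

    `scans` is a list of scans; each scan is a list of values, one per wavelength.
    """
    per_wavelength = zip(*scans)  # transpose: one tuple of values per wavelength
    return mean(pstdev(values) for values in per_wavelength)


def is_consistent(scans, max_spread):
    """ASSUMPTION: a donation is kept when its spread is at most `max_spread`."""
    return scan_spread(scans) <= max_spread


steady = [[1.0, 2.0, 3.0]] * 4              # identical scans, spread is 0
shaky = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]  # scans differ by 2 at every wavelength
print(is_consistent(steady, max_spread=0.1), is_consistent(shaky, max_spread=0.1))
```

The threshold of 0.1 is only a placeholder for the demonstration.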

Data Organization

  • The scanner is thoroughly cleaned prior to each donation, and each scan donation is taken just moments before a blood collection. Data is stored in our system and referred to by a unique, donor-specific barcode. Once the laboratory finishes processing the blood test, the blood test results are paired to their corresponding biodata scans within our system. Our staff enter the data with the support of an alert system that checks that values fall within reasonable ranges. This process is double-checked manually by multiple people to avoid human error.

Data Extraction & Absorbance

Calculating absorbance is the ultimate goal of our data collection. The white plate’s main purpose is to test the scan on the “full light reflection,” which gives us the numeric values necessary to determine how much light is being absorbed at a certain wavelength. We use this data to calculate absorbance using the Beer-Lambert law.
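
The text does not give the exact formula, so the following is only a minimal sketch of the common form A = log10(I0 / I), treating the white plate intensity as the reference I0 at each wavelength. The function name `absorbance` and its arguments are hypothetical.

```python
import math


def absorbance(sample_intensity, white_intensity):
    """Per-wavelength absorbance A = log10(I_white / I_sample).

    ASSUMPTION: the white plate scan stands in for "full light reflection" (I0).
    Both inputs are equal-length lists of strictly positive intensities.
    """
    if len(sample_intensity) != len(white_intensity):
        raise ValueError("sample and white plate scans must have the same length")
    return [math.log10(w / s) for s, w in zip(sample_intensity, white_intensity)]


# A sample reflecting half of the reference light has absorbance log10(2).
print(absorbance([50.0, 100.0], [100.0, 100.0]))
```

A wavelength where the sample reflects as much light as the white plate has an absorbance of 0.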

The absorbance at each wavelength gives us otherwise hidden insight into the matter the light is reflecting from. Each compound the light interacts with has a unique absorbance signature and wavelength graph, which will be available to you for the compounds we’d like to track. Each scan absorbance can be thought of as the sum of all compound signatures (keeping in mind that there are more than 4,000 compounds found in human blood).
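
To make the "sum of signatures" idea concrete, here is a toy sketch with only two compounds. It assumes a purely linear mixing model and recovers the two weights by ordinary least squares (normal equations). Real blood contains far more compounds, so this is an illustration and not a recommended modelling approach. The names `mix` and `unmix_two` and the toy signatures are hypothetical.

```python
def mix(signatures, weights):
    """Linear mixing model: spectrum[i] = sum over k of weights[k] * signatures[k][i]."""
    n = len(signatures[0])
    return [sum(w * s[i] for w, s in zip(weights, signatures)) for i in range(n)]


def unmix_two(spectrum, sig_a, sig_b):
    """Least-squares weights (w_a, w_b) such that w_a*sig_a + w_b*sig_b fits spectrum."""
    aa = sum(x * x for x in sig_a)
    bb = sum(x * x for x in sig_b)
    ab = sum(x * y for x, y in zip(sig_a, sig_b))
    ay = sum(x * y for x, y in zip(sig_a, spectrum))
    by = sum(x * y for x, y in zip(sig_b, spectrum))
    det = aa * bb - ab * ab
    if det == 0:
        raise ValueError("signatures are collinear; weights are not identifiable")
    return (ay * bb - by * ab) / det, (by * aa - ay * ab) / det


# Toy signatures over four wavelengths (made-up numbers, not real compounds).
sig_a = [1.0, 0.0, 1.0, 2.0]
sig_b = [0.0, 1.0, 1.0, 0.0]
observed = mix([sig_a, sig_b], [0.5, 2.0])
print(unmix_two(observed, sig_a, sig_b))
```

With noise-free toy data the weights are recovered exactly up to floating-point error; with thousands of overlapping signatures the problem becomes badly conditioned, which is one reason the challenge is hard.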

Data Packaging and File Structure

The data sets we’ve created are comprised of high-quality, well-vetted bio donation scans and their paired blood results. We’ve created 6 different files for each compound.

The file structures include several columns. The first column, “donation_id”, identifies the donation. Each donation has 60 scans, and the same donation may appear in the data sets of several compounds. This is not an error; it means that the donation has been tested for each of those compounds.

The second column, “scan_id”, is the index of the scan within its donation and runs from 1 to 60. This column is not used in the data sets with the average absorbance.
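
One way to go from per-scan rows to one averaged row per donation is sketched below. The in-memory row layout `(donation_id, scan_id, absorbances)` is a simplifying assumption for illustration; the real files are tabular and their absorbance column names are not shown here. The function name `average_by_donation` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_by_donation(rows):
    """Average absorbances over all scans of each donation.

    `rows` is an iterable of (donation_id, scan_id, absorbances) tuples, where
    `absorbances` is a list with one value per wavelength.
    Returns {donation_id: [mean absorbance per wavelength]}; scan_id is dropped.
    """
    grouped = defaultdict(list)
    for donation_id, _scan_id, absorbances in rows:
        grouped[donation_id].append(absorbances)
    return {
        donation_id: [mean(column) for column in zip(*scans)]
        for donation_id, scans in grouped.items()
    }


rows = [
    ("A", 1, [1.0, 2.0]),
    ("A", 2, [3.0, 4.0]),
    ("B", 1, [5.0, 6.0]),
]
print(average_by_donation(rows))
```

Because all scans of a donation share one blood result, any validation split should probably keep a donation's scans together rather than scattering them across folds.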

The standard deviation, “std”, is calculated for each donation over the range 950 nm to 1350 nm. The smaller the standard deviation, the more consistent the absorbances are across all 60 scans of that donation.

Temperature and humidity readings at the time of the scan are also included, as well as the blood test results associated with the scan. These are given both as a human-readable label (“high,” “ok,” or “low,” depending on the acceptable human range for that compound) and as numeric values.
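
The relationship between the numeric value and the human reading can be sketched as a simple range check. The acceptable ranges themselves are not given in this description, so the limits below are placeholders and not clinical reference values; the function name `human_reading`, its parameters, and the choice to treat the limits as inclusive are assumptions.

```python
def human_reading(value, low_limit, high_limit):
    """Map a numeric blood result to "low", "ok" or "high".

    ASSUMPTION: values equal to either limit count as "ok".
    """
    if low_limit > high_limit:
        raise ValueError("low_limit must not exceed high_limit")
    if value < low_limit:
        return "low"
    if value > high_limit:
        return "high"
    return "ok"


# Placeholder limits for demonstration only; NOT real reference ranges.
LOW, HIGH = 40.0, 60.0
print([human_reading(v, LOW, HIGH) for v in (35.0, 40.0, 50.0, 61.5)])
```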

The biological window

The NIR spectrum covers the range of 750-2500 nm on the electromagnetic spectrum. It is considered the lower part of the infrared band, which extends to 5000 nm. Wavelengths between 650 nm and 1320 nm can penetrate human tissue to a depth of up to 4 mm. This range is known as the biological window, and it is the range we will work in for this challenge. In this way we can obtain signatures of the chemical compounds present in the blood that runs through vessels close to the skin surface. The following diagram describes the penetration depth in millimeters of each wavelength over the near-infrared band.

The Spectral Signature of the compounds

The Spectral Signature is like the “chemical fingerprint” of a compound and can be used to identify it in a sample. The compound in the blood behaves exactly like a pure compound, and the Spectral Signature graphs can help you build your model based on this information.

The Spectral Signatures of all relevant compounds: Zindi Conest Spectra - Google Sheet. For your convenience, the data was arranged to the same 170 wavelength values the spectrometer used to collect our data.

The Spectral Signatures graph:

The Patent

The patent document: Patent. It describes an approach to solving the same problem.

Please try to find time to read it! It will provide you with a lot of ideas.


There have been several previous attempts to develop noninvasive blood analysis techniques using spectroscopy, specifically to measure glucose and cholesterol levels in the human body. This challenge offers the largest amount of data ever collected in this domain.

We have added here some documents that describe previous attempts to solve a similar problem.

For more information, you can check the white paper and the references therein.


The evaluation metric for this challenge is Accuracy.
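
For reference, accuracy is the fraction of readings whose predicted label matches the true label. A minimal sketch for local validation follows; the function name `accuracy` is hypothetical and the platform's own scoring code may differ in details such as how missing rows are handled.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


truth = ["ok", "low", "ok", "high"]
preds = ["ok", "low", "high", "high"]
print(accuracy(truth, preds))  # 3 of 4 labels match
```

If one label dominates the data, a model that always predicts it can still score well on this metric, so it is worth comparing against that baseline.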

Your submission file should look like this:

Reading ID                             target
ID_00902R9H_hdl_cholesterol_human      ok
ID_00902R9H_hemoglobin(hgb)_human      low
ID_00902R9H_cholesterol_ldl_human      ok
ID_00902HGH_hdl_cholesterol_human      ok
ID_00902HGH_hemoglobin(hgb)_human      high
ID_00902HGH_cholesterol_ldl_human      low
ID_00902HGH_cholesterol_ldl_human      low
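
A file in this shape could be produced along the following lines. The sketch assumes a comma-separated file, a header spelled exactly as in the sample above, and IDs formed as `<donation id>_<compound>_human`; please check these against the official sample submission file. The names `TARGETS`, `submission_rows` and `write_submission` are hypothetical.

```python
import csv
import io

# Compound names as they appear in the sample IDs above.
TARGETS = ("hdl_cholesterol", "hemoglobin(hgb)", "cholesterol_ldl")


def submission_rows(predictions):
    """Yield (reading_id, label) pairs.

    `predictions` maps donation id -> {compound name: "low" | "ok" | "high"}.
    """
    for donation_id, by_compound in predictions.items():
        for compound in TARGETS:
            yield f"{donation_id}_{compound}_human", by_compound[compound]


def write_submission(predictions, handle):
    """Write the header and one row per (donation, compound) to a text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Reading ID", "target"])  # ASSUMPTION: header as displayed above
    writer.writerows(submission_rows(predictions))


buffer = io.StringIO()
write_submission(
    {"ID_00902R9H": {"hdl_cholesterol": "ok", "hemoglobin(hgb)": "low", "cholesterol_ldl": "ok"}},
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` as the handle.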

