Blood Spectroscopy Classification Challenge
Given blood spectroscopy readings can you predict which compounds are in the blood?
$7 500 USD
Ended ~1 year ago
265 active · 1042 enrolled

Traditionally, blood analysis is done by collecting blood samples from patients and performing different tests on the samples. In this competition, you will help scientists make progress towards non-invasive blood analysis.

For this purpose, you will build machine learning models that classify the level of specific chemical compounds in samples from their spectroscopic data. In simple terms, when we direct a beam of light at a sample, the light is partially absorbed and/or reflected depending on the sample's molecular structure (the different chemical compounds present). The amount of light absorbed depends strongly on the wavelength of the light source. Hence, if we use a beam of light containing a range of wavelengths, we can measure the amount of energy absorbed at each wavelength. Such a measurement over different wavelengths (or frequencies) is called a spectrum (or spectral data).

We will use spectral data from the Near-Infrared (NIR) wavelength range (950 nm to 1350 nm) in this challenge. Compared to other wavelengths, NIR light has the highest penetration power and travels deep into tissue before being fully attenuated. The difficulty with spectral data is that we generally have far more features than data points. Most chemists have yet to adopt machine learning techniques for spectral analysis because of the high risk of overfitting. One major issue you will face in this challenge is building a model that generalises well without overfitting.
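As a starting point for the many-features/few-samples problem described above, a minimal sketch using scikit-learn (an assumption; the challenge mandates no particular library) shows how cross-validation and regularisation give an honest estimate of generalisation on wide spectral data. The synthetic arrays stand in for real scans and are purely illustrative:

```python
# Sketch: guarding against overfitting on wide spectral data.
# Synthetic stand-in data; shapes and hyperparameters are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 170))    # 200 donations x 170 wavelength features
y = rng.integers(0, 3, size=200)   # "low" / "ok" / "high" encoded as 0/1/2

# Strong L2 regularisation (small C) tames a model with more
# features than there are samples per class.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(C=0.1, max_iter=1000))

# 5-fold cross-validation scores held-out folds, not the training data.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

The key habit is reporting the cross-validated score, never the training score, when judging whether a model generalises.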

If successful, you will lead the way in creating a new level of health awareness by making blood tests a commodity: a procedure that can be performed effortlessly many times a day, much like measuring our weight. Furthermore, your model could be incorporated into new devices that let patients do their blood analysis in less than a minute, even at home, and send the results to their doctor. NIR spectroscopy is a sleeping giant waiting to enter our everyday lives through our mobile phones, wearables and many other appliances. Imagine a kiosk in the supermarket that suggests a diet on the fly based on cholesterol levels measured with a simple flash of light, or a weighing machine that tells us not only our weight but also the vitamin levels in our blood.

Diving into the challenge:

Before jumping into this challenge, it is recommended that you build basic knowledge of spectroscopy and material-analysis techniques, and familiarise yourself with the datasets presented in this challenge.

Introduction video from Oded Daniel, the CEO of

About

We are a team of ambitious, innovative professionals who are passionate about offering a better way to get in touch with and care for our bodies - all with the simple scan of our skin.

Data Collection & Exporting Procedures

Data Overview

Each data set consists of scans made using an identical scanner model. Raw data is collected as a result of the scanner pushing light into the target (in this case, a fingertip). The data is expressed as a function of wavelength, measured in nanometers (nm), and registers the quantity of light reflected off the target point. Our application records the results and creates an array of values covering all measured wavelengths. The light that reflects back is referred to as “intensity.”

Data Collection Specifications

Each scan cycle pushes light across a range of wavelengths, and our wavelength data ranges from 900 nm to 1700 nm. All reflected light is distinguished and categorized by wavelength (i.e., light returning at 900 nm is distinguished from light returning at 904.71 nm, and so on). We record intensities at 900, 904.71, 909.41, 914.12 nm and so forth, resulting in an array of 170 intensities per scan.
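The 170-point grid implied by the numbers above can be reconstructed with a uniform step of 800/170 nm (about 4.71 nm); the exact spacing used by the scanner is an assumption inferred from the quoted values:

```python
import numpy as np

# Sketch: the wavelength grid implied by the specification above.
# A uniform step of (1700 - 900) / 170 nm reproduces the quoted
# values 900, 904.71, 909.41, 914.12; the true spacing is assumed.
step = (1700 - 900) / 170
wavelengths = 900 + step * np.arange(170)

# First values: 900.0, 904.71, 909.41, 914.12 (rounded to 2 decimals)
print(np.round(wavelengths[:4], 2))
```

This grid is handy for plotting spectra or selecting the 950–1350 nm sub-band used for the challenge labels.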

We repeat this process 60 times in order to produce a reliable scan and see a holistic picture. We also record humidity and temperature, keeping in mind that these factors may affect scan results. Therefore, each of the 60 scans comprises 170 intensities, with the temperature and humidity at the time of the scan recorded alongside.

Data Quality Assurance

Scanner Monitoring & Calibration

  • Though every scanner used is the same brand, make and model, each scanner has unique sensitivities. As a result, we account for any small misalignments between scanners using a standard piece of equipment called a “white plate.”

The plate is scanned by each scanner every day before data collection begins, then the first scan is compared to the last. The numeric difference between the scan values is considered “scanner decay,” and will alert us as to whether or not the value gap is large enough to warrant replacing the scanner to ensure scan value accuracy. This process is called “calibration” and is also useful for calculating absorbance.

Data Validation Process

  • Each biodata donation consists of 60 rapid-fire scans, and ensuring the consistency and quality of each of the 60 scans is our top priority. Any movement of the scanner or scan target may invalidate the scan data. For the most accurate results, we calculate the standard deviation across the 60 scans and, if the results are not consistent, exclude the scans from the data set.
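The validation step above can be sketched as a simple consistency check across the 60 scans of a donation. The threshold value and function name are illustrative assumptions, not the organisers' actual criterion:

```python
import numpy as np

# Sketch: flag a donation whose 60 scans disagree too much.
# The 0.05 threshold is an illustrative assumption.
def donation_is_consistent(scans, threshold=0.05):
    """scans: array of shape (60, 170), one row per scan.
    Returns True when the per-wavelength standard deviation stays small."""
    per_wavelength_std = scans.std(axis=0)  # spread across the 60 scans
    return bool(per_wavelength_std.max() < threshold)

rng = np.random.default_rng(1)
steady = rng.normal(loc=1.0, scale=0.01, size=(60, 170))  # stable target
shaky = rng.normal(loc=1.0, scale=0.5, size=(60, 170))    # moving target
print(donation_is_consistent(steady), donation_is_consistent(shaky))  # True False
```

A per-wavelength check is stricter than a single pooled standard deviation, since movement artefacts often affect only part of the spectrum.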

Data Organization

  • The scanner is thoroughly cleaned prior to each donation, and each scan donation is taken just moments before a blood collection. Data is stored in our system and referred to by a unique, donor-specific barcode. Once the laboratory finishes processing the blood test, the blood test results are paired to their corresponding biodata scans within our system. Our staff input the data, supported by an alert system that checks values fall within reasonable ranges. This process is double-checked manually by multiple people to avoid human error.

Data Extraction & Absorbance

Calculating absorbance is the ultimate goal of our data collection. The white plate’s main purpose is to provide a “full light reflection” reference, which gives us the numeric values needed to determine how much light is absorbed at each wavelength. We use this data to calculate absorbance using the Beer-Lambert law.
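The conversion described above follows directly from the Beer-Lambert law: absorbance A = -log10(I_sample / I_reference), with the white plate supplying the reference intensity. A minimal sketch (variable names are illustrative):

```python
import numpy as np

# Sketch: intensity-to-absorbance conversion against the white plate,
# per the Beer-Lambert law: A = -log10(I_sample / I_reference).
def absorbance(sample_intensity, white_plate_intensity):
    return -np.log10(sample_intensity / white_plate_intensity)

# A sample reflecting 10% of the white-plate intensity has absorbance 1.
print(absorbance(np.array([10.0]), np.array([100.0])))  # [1.]
```

Applied element-wise to a 170-value scan and its white-plate reference, this yields the absorbance spectrum the challenge files are built from.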

The absorbance spectrum gives us otherwise hidden insight into the matter the light reflected from. Each compound the light interacts with has a unique absorbance signature and wavelength graph, which will be available to you for the compounds we’d like to track. Each scan’s absorbance can be thought of as the sum of all compound signatures (keeping in mind that there are more than 4,000 compounds found in human blood).

Data Packaging and File Structure

The data sets we’ve created comprise high-quality, well-vetted bio-donation scans and their paired blood results. We’ve created 6 different files for each compound.

The file structures include several columns. The first column, “donation_id”, identifies the donation; each donation comprises 60 scans and may appear in the data sets of multiple separate compounds. This is not an error: it means the donation has been tested for both compounds.

The second column, “scan_id”, numbers the scans within each donation and runs from 1 to 60. This column is not used in the data sets containing the average absorbance.

The standard deviation, “std”, is calculated for each donation over the range of 950 nm to 1350 nm. The smaller the standard deviation of the absorbances for a donation, the more consistency there is between all 60 scans.

Temperature and humidity readings at the time of the scan are also recorded, as are the final scan-associated blood test results, given both as numeric values and as a human-readable label: “high,” “ok,” or “low,” depending on the acceptable human ranges for that compound.

The biological window

The NIR band covers the range of 750 nm to 2500 nm on the electromagnetic spectrum. It is considered the lower part of the infrared band, which extends to 5000 nm. Wavelengths between 650 nm and 1320 nm can penetrate human tissue to a depth of up to 4 mm. This range is known as the biological window, and it is the working range for this challenge. Within it we can obtain signatures of chemical compounds present in the blood running through vessels close to the skin surface. The following diagram describes the penetration depth in millimetres of each wavelength over the near-infrared band.

The Spectral Signature of the compounds

The Spectral Signature is like the “chemical fingerprint” of a compound and can be used to identify it in a sample. The compound in the blood behaves exactly like a pure compound, so the Spectral Signature graphs can help you build your model based on this information.

The Spectral Signatures of all relevant compounds are available in the Zindi Contest Spectra Google Sheet. For your convenience, the data has been arranged to the same 170 wavelength values the spectrometer used to collect our data.

The Spectral Signatures graph:

The Patent

The patent document describes an approach to solving the same problem.

Please try to find time to read it! It will provide you with a lot of ideas.


There have been several previous attempts to develop non-invasive blood analysis techniques using spectroscopy, specifically to measure glucose and cholesterol levels in the human body. This challenge offers the largest amount of data ever collected in this domain.

We have added here some documents that describe previous attempts to solve a similar problem.

For more information, you can check the white paper and the references therein.


Teams and collaboration

You may participate in competitions as an individual or in a team of up to four people. When creating a team, the team must have a total submission count less than or equal to the maximum allowable submissions as of the formation date. A team will be allowed the maximum number of submissions for the competition, minus the total number of submissions among team members at team formation. Prizes are transferred only to the individual players or to the team leader.

Multiple accounts per user are not permitted, and neither is collaboration or membership across multiple teams. Individuals and their submissions originating from multiple accounts will be immediately disqualified from the platform.

Code must not be shared privately outside of a team. Any code that is shared must be made available to all competition participants through the platform (i.e. on the discussion boards).

The Zindi user who sets up a team is the default Team Leader. The Team Leader can invite other data scientists to their team. Invited data scientists can accept or reject invitations. Until a second data scientist accepts an invitation to join a team, the data scientist who initiated a team remains an individual on the leaderboard. No additional members may be added to teams within the final 5 days of the competition or the last hour of a hackathon, unless otherwise stated in the competition rules.

A team can be disbanded if it has not yet made a submission. Once a submission is made individual members cannot leave the team.

All members in the team receive points associated with their ranking in the competition and there is no split or division of the points between team members.

Datasets and packages

The solution must use publicly-available, open-source packages only.

You may use only the datasets provided for this competition. Automated machine learning tools such as automl are not permitted.

You may use pretrained models as long as they are openly available to everyone.

The data used in this competition is the sole property of Zindi and the competition host. You may not transmit, duplicate, publish, redistribute or otherwise provide or make available any competition data to any party not participating in the Competition (this includes uploading the data to any public site such as Kaggle or GitHub). You may upload, store and work with the data on any cloud platform such as Google Colab, AWS or similar, as long as 1) the data remains private and 2) doing so does not contravene Zindi’s rules of use.

You must notify Zindi immediately upon learning of any unauthorised transmission of or unauthorised access to the competition data, and work with Zindi to rectify any unauthorised transmission or access.

Your solution must not infringe the rights of any third party and you must be legally entitled to assign ownership of all rights of copyright in and to the winning solution code to Zindi.

Submissions and winning

You may make a maximum of 10 submissions per day.

You may make a maximum of 300 submissions for this competition.

Before the end of the competition you need to choose 2 submissions to be judged on for the private leaderboard. If you do not make a selection your 2 best public leaderboard submissions will be used to score on the private leaderboard.

Zindi maintains a public leaderboard and a private leaderboard for each competition. The Public Leaderboard includes approximately 20% of the test dataset. While the competition is open, the Public Leaderboard will rank the submitted solutions by the accuracy score they achieve. Upon close of the competition, the Private Leaderboard, which covers the other 80% of the test dataset, will be made public and will constitute the final ranking for the competition.

Note that to count, your submission must first pass processing. If your submission fails during the processing step, it will not be counted and not receive a score; nor will it count against your daily submission limit. If you encounter problems with your submission file, your best course of action is to ask for advice on the Competition’s discussion forum.

If you are in the top 20 at the time the leaderboard closes, we will email you to request your code. On receipt of email, you will have 48 hours to respond and submit your code following the submission guidelines detailed below. Failure to respond will result in disqualification.

If your solution places 1st, 2nd, or 3rd on the final leaderboard, you will be required to submit your winning solution code to us for verification, and you thereby agree to assign all worldwide rights of copyright in and to such winning solution to Zindi.

If two solutions earn identical scores on the leaderboard, the tiebreaker will be the date and time in which the submission was made (the earlier solution will win).

If the error metric requires probabilities to be submitted, do not set thresholds (or round your probabilities) to improve your place on the leaderboard. In order to ensure that the client receives the best solution Zindi will need the raw probabilities. This will allow the clients to set thresholds to their own needs.

The winners will be paid via bank transfer, PayPal, or other international money transfer platform. International transfer fees will be deducted from the total prize amount, unless the prize money is under $500, in which case the international transfer fees will be covered by Zindi. In all cases, the winners are responsible for any other fees applied by their own bank or other institution for receiving the prize money. All taxes imposed on prizes are the sole responsibility of the winners. The top 3 winners or team leaders will be required to present Zindi with proof of identification, proof of residence and a letter from your bank confirming your banking details. Winners will be paid in USD or the currency of the competition. If your account cannot receive US Dollars or the currency of the competition then your bank will need to provide proof of this and Zindi will try to accommodate this.

You acknowledge and agree that Zindi may, without any obligation to do so, remove or disqualify an individual, team, or account if Zindi believes that such individual, team, or account is in violation of these rules. Entry into this competition constitutes your acceptance of these official competition rules.

Zindi is committed to providing solutions of value to our clients and partners. To this end, we reserve the right to disqualify your submission on the grounds of usability or value. This includes but is not limited to the use of data leaks or any other practices that we deem to compromise the inherent value of your solution.

Zindi also reserves the right to disqualify you and/or your submissions from any competition if we believe that you violated the rules or violated the spirit of the competition or the platform in any other way. The disqualifications are irrespective of your position on the leaderboard and completely at the discretion of Zindi.

Please refer to the FAQs and Terms of Use for additional rules that may apply to this competition. We reserve the right to update these rules at any time.

Reproducibility of submitted code

  • If your submitted code does not reproduce your score on the leaderboard, we reserve the right to adjust your rank to the score generated by the code you submitted.
  • If your code does not run you will be dropped from the top 10. Please make sure your code runs before submitting your solution.
  • Always set the seed. Rerunning your model should always place you at the same position on the leaderboard. When running your solution, if randomness shifts you down the leaderboard we reserve the right to adjust your rank to the closest score that your submission reproduces.
  • We expect full documentation. This includes:
  • All data used
  • Output data and where they are stored
  • Summary of approach
  • Explanation of features used
  • A requirements file with all packages and versions used
  • Your solution must include the original data provided by Zindi and validated external data (if allowed)
  • All editing of data must be done in a notebook (i.e. not manually in Excel)
  • Environment code to be run. (e.g. Google Colab or the specifications of your local machine)
  • Expected run time for each notebook. This will be useful to the review team for time and resource allocation.
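The seed-setting rule above can be sketched as a small helper run at the top of every notebook. The seed value and helper name are illustrative; seed any framework you actually use (PyTorch, TensorFlow, etc.) the same way:

```python
import os
import random

import numpy as np

SEED = 42  # any fixed value works; 42 is illustrative

def set_seed(seed=SEED):
    """Pin the common sources of randomness so reruns reproduce the score."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # If you use a framework, seed it too, e.g.:
    # torch.manual_seed(seed); tf.random.set_seed(seed)

# Demonstration: two runs from the same seed yield identical numbers.
set_seed()
a = np.random.rand(3)
set_seed()
b = np.random.rand(3)
print(np.array_equal(a, b))  # True
```

Calling the helper once at the start of the notebook (and passing the same seed to any model constructors that accept one) is usually enough to make a rerun land on the same leaderboard score.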

Data standards:

  • Your submitted code must run on the original train, test, and other datasets provided.
  • If external data is allowed, external data must be freely and publicly available, including pre-trained models with standard libraries. If external data is allowed, any data used should be shared with Zindi to be approved and then shared on the discussion forum. Zindi will also make note of the external data available on the data page.
  • Packages:
  • You must submit a requirements file with all packages and versions used.
  • If a requirements file is not provided, solutions will be run on the most recent packages available.
  • Custom packages in your submission notebook will not be accepted.
  • You may only use tools available to everyone i.e. no paid services or free trials that require a credit card.

Consequences of breaking any rules of the competition or submission guidelines:

  • First offence: No prizes for 6 months and 2000 points will be removed from your profile (probation period). If you are caught cheating, all individuals involved in cheating will be disqualified from the challenge(s) you were caught in and you will be disqualified from winning any competitions for the next six months and 2000 points will be removed from your profile. If you have less than 2000 points to your profile your points will be set to 0.
  • Second offence: Banned from the platform. If you are caught for a second time your Zindi account will be disabled and you will be disqualified from winning any competitions or Zindi points using any other account.
  • Teams with individuals who are caught cheating will not be eligible to win prizes or points in the competition in which the cheating occurred, regardless of the individuals’ knowledge of or participation in the offence.
  • Teams with individuals who have previously committed an offence will not be eligible for any prizes for any competitions during the 6-month probation period.

Monitoring of submissions

  • We will review the top 20 solutions of every competition when the competition ends.

We reserve the right to request code from any user at any time during a challenge. You will have 24 hours to submit your code following the rules for code review (see above). Zindi reserves the right not to explain our reasons for requesting code. If you do not submit your code within 24 hours you will be disqualified from winning any competitions or Zindi points for the next six months. If you fall under suspicion again and your code is requested and you fail to submit your code within 24 hours, your Zindi account will be disabled and you will be disqualified from winning any competitions or Zindi points with any other account.


The evaluation metric for this challenge is Accuracy.

Your submission file should look like this:

Reading ID                             target
ID_00902R9H_hdl_cholesterol_human      ok
ID_00902R9H_hemoglobin(hgb)_human      low
ID_00902R9H_cholesterol_ldl_human      ok
ID_00902HGH_hdl_cholesterol_human      ok 
ID_00902HGH_hemoglobin(hgb)_human      high 
ID_00902HGH_cholesterol_ldl_human      low
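A submission in the format above can be assembled with pandas (an assumption; any tool producing the same CSV works). The IDs and labels here are copied from the example; real IDs come from the provided test files:

```python
import pandas as pd

# Sketch: building a submission file in the format shown above.
# Predictions are illustrative placeholders.
predictions = {
    "ID_00902R9H_hdl_cholesterol_human": "ok",
    "ID_00902R9H_hemoglobin(hgb)_human": "low",
    "ID_00902R9H_cholesterol_ldl_human": "ok",
}
submission = pd.DataFrame(
    {"Reading ID": list(predictions), "target": list(predictions.values())}
)
submission.to_csv("submission.csv", index=False)
print(submission.shape)  # (3, 2)
```

Note that each Reading ID combines the donation ID with the compound name, so one donation contributes several rows, one per compound predicted.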

1st Place: $3 750 USD

2nd Place: $2 250 USD

3rd Place: $1 500 USD


Competition closes on 13 February 2022.

Final submissions must be received by 11:59 PM GMT.

We reserve the right to update the contest timeline if necessary.