Traditionally, blood analysis is done by collecting blood samples from patients and performing different tests on the samples. In this competition, you will help scientists from bloods.ai make progress towards non-invasive blood analysis.
For this purpose, you will build machine learning models that classify the level of specific chemical compounds in samples from their spectroscopic data. In simple terms, when we direct a beam of light towards a sample, the light is partially absorbed and/or reflected depending on the sample's molecular structure (the different chemical compounds present). The amount of light absorbed strongly depends on the wavelength of the light source used. Hence, if we use a beam of light containing a range of wavelengths, we can measure the amount of energy absorbed at each wavelength. Such a measurement over different wavelengths (or frequencies) is called a spectrum (or spectral data).
We will use spectral data from the Near Infra-Red (NIR) wavelength range (950 nm to 1350 nm) in this challenge. In contrast to other wavelengths, NIR has the highest penetration power and travels deep into tissue before it is attenuated. The difficulty with spectral data is that we generally have far too many features compared to the number of data points. Most chemists have yet to adopt machine learning techniques for spectral analysis because of the high risk of overfitting. One major challenge you will face is building a model that generalises well without overfitting.
If successful, you will lead the way in creating a new level of health awareness by making blood tests a commodity: a procedure that can be performed effortlessly many times a day, much like measuring our weight. Furthermore, we can incorporate your model into new devices that allow patients to do their blood analysis in less than a minute, even at home, and send the results to their doctor. NIR spectroscopy is a sleeping giant waiting to enter our everyday lives through our mobile phones, wearables and many other appliances. Imagine a kiosk in the supermarket that suggests a diet on the fly based on our cholesterol levels, measured just by shining a light. Imagine a weighing machine that tells us not only our weight but also the vitamin levels in our blood.
Diving into the challenge:
Before jumping into this challenge, we recommend that you build a basic knowledge of spectroscopy and material analysis techniques, and familiarise yourself with the datasets presented in this challenge.
Introduction video from Oded Daniel, the CEO of bloods.ai: https://youtu.be/vIUTA25h8Ss
About Bloods.ai (www.bloods.ai)
We are a team of ambitious, innovative professionals who are passionate about offering a better way to get in touch with and care for our bodies - all with the simple scan of our skin.
Data Overview
Each data set consists of scans made using an identical scanner model. Data is collected as raw data, as a result of the scanner pushing light into the target (in this case, a fingertip). The data is displayed as a function of wavelength, measured in nanometers (nm), and registers the quantity of light that is reflected off the target point. Our application registers the results and creates an array of values in order to account for all wavelength data. The light that reflects back is referred to as “intensity.”
Data Collection Specifications
Each scan cycle pushes light as a function of wavelength, and our intensity data covers wavelengths from 900 nm to 1700 nm. All reflected light is distinguished and categorized by wavelength (i.e., light returning at 900 nm is distinguished from light returning at 905 nm, and so on). We expect an intensity at 900, 904.71, 909.41, 914.12 nm and so on, resulting in an array of 170 intensities per scan.
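As a rough illustration, the wavelength values quoted above (900, 904.71, 909.41, 914.12) are consistent with 170 evenly spaced points that start at 900 nm and step by 800/170 nm. This spacing is an assumption inferred from those four values, not a documented scanner specification, and the function name `wavelength_grid` is hypothetical.

```python
def wavelength_grid(start_nm=900.0, stop_nm=1700.0, n_points=170):
    """Return n_points evenly spaced wavelengths in [start_nm, stop_nm).

    Assumption: the step is (stop_nm - start_nm) / n_points, which
    reproduces the values quoted in the text. The real grid may differ.
    """
    step = (stop_nm - start_nm) / n_points
    return [start_nm + i * step for i in range(n_points)]


grid = wavelength_grid()
print(len(grid))                         # number of intensities per scan
print([round(w, 2) for w in grid[:4]])   # first four wavelengths in nm
```

If the data files list the wavelengths explicitly (for example as column names), prefer those values over a reconstructed grid.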
We repeat this process 60 times in order to produce a reliable scan and see a holistic picture. We also record humidity and temperature, keeping in mind that these factors may affect scan results. Each of the 60 scans therefore comprises 170 intensities, along with the temperature and humidity at the time of the scan.
Data Quality Assurance
Scanner Monitoring & Calibration
The white plate is scanned by each scanner every day before data collection begins, and the first scan is then compared to the last. The numeric difference between the scan values is considered “scanner decay,” and alerts us when the gap is large enough to warrant replacing the scanner to keep scan values accurate. This process is called “calibration” and is also useful for calculating absorbance.
Data Validation Process
Data Organization
Data Extraction & Absorbance
Calculating absorbance is the ultimate goal of our data collection. The white plate’s main purpose is to provide a scan of “full light reflection,” which gives us the numeric values necessary to determine how much light is being absorbed at each wavelength. We use this data to calculate absorbance using the Beer-Lambert law.
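A minimal sketch of that calculation, assuming the common form of absorbance A = log10(I_reference / I_sample), where I_reference is the white-plate intensity and I_sample is the fingertip intensity at the same wavelength. The exact formula and any corrections used by bloods.ai are not given here, so treat the function name `absorbance` and the sample numbers as illustrative only.

```python
import math


def absorbance(sample_intensity, reference_intensity):
    """Per-wavelength absorbance A = log10(I_reference / I_sample).

    Assumption: both lists hold positive intensities measured on the
    same wavelength grid; no dark-current correction is applied.
    """
    if len(sample_intensity) != len(reference_intensity):
        raise ValueError("sample and reference must cover the same wavelengths")
    return [
        math.log10(ref / smp)
        for smp, ref in zip(sample_intensity, reference_intensity)
    ]


# Made-up intensities at three wavelengths.
white_plate = [1000.0, 1000.0, 1000.0]
fingertip = [1000.0, 100.0, 10.0]
result = absorbance(fingertip, white_plate)
print(result)
```

An absorbance of 0 means the sample reflected as much light as the white plate; each additional unit corresponds to a tenfold drop in reflected intensity.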
The absorbance spectrum gives us otherwise hidden insight into the matter the light is reflecting from. Each compound the light interacts with has a unique absorbance signature (a graph of absorbance against wavelength), which will be available to you for the compounds we’d like to track. The absorbance of each scan can be thought of as the sum of all compound signatures (keeping in mind that there are more than 4,000 compounds found in human blood).
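The idea of a scan being a sum of compound signatures can be sketched as a weighted sum, wavelength by wavelength. The weights below stand in for relative amounts of each compound; the linear weighting, the compound names and all numbers are assumptions made for illustration, not values from the competition data.

```python
def mix_signatures(signatures, weights):
    """Return the weighted sum of compound signatures per wavelength.

    signatures: dict mapping compound name -> list of absorbances
    weights:    dict mapping compound name -> relative amount (assumed linear)
    """
    n_wavelengths = len(next(iter(signatures.values())))
    total = [0.0] * n_wavelengths
    for name, signature in signatures.items():
        if len(signature) != n_wavelengths:
            raise ValueError("all signatures must share one wavelength grid")
        for i, value in enumerate(signature):
            total[i] += weights[name] * value
    return total


# Hypothetical signatures on a three-wavelength grid.
signatures = {
    "compound_a": [0.1, 0.4, 0.2],
    "compound_b": [0.3, 0.0, 0.5],
}
weights = {"compound_a": 2.0, "compound_b": 1.0}
scan = mix_signatures(signatures, weights)
print(scan)
```

In practice only a handful of the thousands of compounds have known signatures, so a real scan also contains contributions that such a sum cannot explain.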
Data Packaging and File Structure
The data sets we’ve created are comprised of high-quality, well-vetted bio donation scans and their paired blood results. We’ve created 6 different files for each compound.
The file structures include several columns. The first column, “donation_id”, identifies the donation. Each donation has 60 scans, and the same donation may appear in the data sets of several compounds. This is not an error; it means that the donation has been tested for each of those compounds.
The second column, “scan_id”, is the number of the scan within each donation and runs from 1 to 60. This column is not used in the data sets with the average absorbance.
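To show how per-scan rows relate to an averaged data set, here is a small sketch that groups scans by donation and averages the absorbance at each wavelength. The row layout `(donation_id, scan_id, absorbances)` and the function name `average_by_donation` are assumptions for illustration; the actual files may be organised differently.

```python
from collections import defaultdict


def average_by_donation(rows):
    """Average absorbance per wavelength over all scans of each donation.

    rows: iterable of (donation_id, scan_id, absorbances) tuples, where
    absorbances is a list on a shared wavelength grid (assumed layout).
    """
    grouped = defaultdict(list)
    for donation_id, _scan_id, absorbances in rows:
        grouped[donation_id].append(absorbances)
    return {
        donation_id: [sum(column) / len(column) for column in zip(*scans)]
        for donation_id, scans in grouped.items()
    }


# Two scans for one donation and one scan for another, two wavelengths each.
rows = [
    ("d1", 1, [1.0, 2.0]),
    ("d1", 2, [3.0, 4.0]),
    ("d2", 1, [5.0, 6.0]),
]
averaged = average_by_donation(rows)
print(averaged)
```

After averaging there is one row per donation, which is why a scan number no longer applies.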
The standard deviation, “std”, is calculated for each donation in the range of 950 nm to 1350 nm. The smaller the standard deviation of the absorbances for that donation, the more consistent all 60 scans are.
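The text does not say exactly how “std” is aggregated, so the sketch below makes one possible choice: take the standard deviation across scans at each wavelength inside 950 nm to 1350 nm, then average those values. Both this aggregation and the name `donation_std` are assumptions, and the result may not match the “std” column in the files.

```python
import statistics


def donation_std(scans, wavelengths, low_nm=950.0, high_nm=1350.0):
    """Mean, over in-range wavelengths, of the across-scan standard deviation.

    scans:       list of absorbance lists, one per scan of a single donation
    wavelengths: wavelength in nm for each position in a scan
    Assumption: population standard deviation, averaged over wavelengths.
    """
    columns = [i for i, w in enumerate(wavelengths) if low_nm <= w <= high_nm]
    per_wavelength = [
        statistics.pstdev([scan[i] for scan in scans]) for i in columns
    ]
    return statistics.mean(per_wavelength)


wavelengths = [900.0, 1000.0, 1200.0, 1500.0]
scans = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 4.0, 3.0, 8.0],
]
spread = donation_std(scans, wavelengths)
print(spread)
```

Only the 1000 nm and 1200 nm columns fall inside the range here, so the values at 900 nm and 1500 nm do not influence the result.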
Temperature and humidity readings at the time of the scan are also included, as well as the final blood result associated with the scan. The result is given both as a numeric value and as a human reading: “high,” “ok,” or “low,” depending on the acceptable human ranges for that compound.
The biological window
The NIR spectrum covers the range of 750-2500 nm of the electromagnetic spectrum. It is considered the lower part of the infrared band, which extends to 5000 nm. Wavelengths between 650 nm and 1320 nm can penetrate human tissue to a depth of up to 4 mm. This range is known as the biological window, and it is the range we will be working in for this challenge. In this way we can obtain signatures of chemical compounds present in the blood that runs through vessels close to the skin surface. The following diagram describes the penetration depth in millimeters of each wavelength over the near-infrared band.
The Spectral Signature of the compounds
The Spectral Signature is like the “chemical fingerprint” of a compound and can be used to identify it in a sample. The compound in the blood behaves exactly like a pure compound, and the Spectral Signature graphs can help you build your model based on this information.
The Spectral Signatures of all relevant compounds: Zindi Conest Spectra - Google Sheet. For your convenience, the data was arranged to the same 170 wavelength values the spectrometer used to collect our data.
The Spectral Signatures graph:
The Patent
The patent document (Patent) describes an approach to solving the same problem.
Please try to find time to read it! It will provide you with a lot of ideas.
Approaches
There have been several previous attempts to develop noninvasive blood analysis techniques using spectroscopy, specifically to measure glucose and cholesterol levels in the human body. This challenge offers the largest amount of data ever collected in this domain.
We have added here some documents that describe previous attempts to solve a similar problem.
For more information, you can check the bloods.ai white paper and the references therein.
Teams and collaboration
You may participate in competitions as an individual or in a team of up to four people. When creating a team, the team must have a total submission count less than or equal to the maximum allowable submissions as of the formation date. A team will be allowed the maximum number of submissions for the competition, minus the total number of submissions among team members at team formation. Prizes are transferred only to the individual players or to the team leader.
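The submission arithmetic for a newly formed team can be sketched as follows. How Zindi determines the maximum allowable submissions as of the formation date is not stated here, so it is passed in as the parameter `allowed_at_formation`; the function name and the example counts are hypothetical, and 300 is the competition-wide maximum stated in these rules.

```python
def team_submission_budget(member_counts, allowed_at_formation, competition_max=300):
    """Return the submissions a new team has left for the competition.

    member_counts:        submissions already made by each member
    allowed_at_formation: maximum allowable submissions as of the formation
                          date (how this is derived is an assumption left
                          to the caller)
    competition_max:      total submissions allowed in the competition
    """
    used = sum(member_counts)
    if used > allowed_at_formation:
        raise ValueError("combined submissions exceed the allowance at formation")
    return competition_max - used


# Three members who have already made 40, 25 and 10 submissions.
budget = team_submission_budget([40, 25, 10], allowed_at_formation=100)
print(budget)
```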
Multiple accounts per user are not permitted, and neither is collaboration or membership across multiple teams. Individuals and their submissions originating from multiple accounts will be immediately disqualified from the platform.
Code must not be shared privately outside of a team. Any code that is shared must be made available to all competition participants through the platform (i.e. on the discussion boards).
The Zindi user who sets up a team is the default Team Leader. The Team Leader can invite other data scientists to their team. Invited data scientists can accept or reject invitations. Until a second data scientist accepts an invitation to join a team, the data scientist who initiated a team remains an individual on the leaderboard. No additional members may be added to teams within the final 5 days of the competition or the last hour of a hackathon, unless otherwise stated in the competition rules.
A team can be disbanded if it has not yet made a submission. Once a submission is made, individual members cannot leave the team.
All members in the team receive points associated with their ranking in the competition and there is no split or division of the points between team members.
Datasets and packages
The solution must use publicly-available, open-source packages only.
You may use only the datasets provided for this competition. Automated machine learning tools such as automl are not permitted.
You may use pretrained models as long as they are openly available to everyone.
The data used in this competition is the sole property of Zindi and the competition host. You may not transmit, duplicate, publish, redistribute or otherwise provide or make available any competition data to any party not participating in the Competition (this includes uploading the data to any public site such as Kaggle or GitHub). You may upload, store and work with the data on any cloud platform such as Google Colab, AWS or similar, as long as 1) the data remains private and 2) doing so does not contravene Zindi’s rules of use.
You must notify Zindi immediately upon learning of any unauthorised transmission of or unauthorised access to the competition data, and work with Zindi to rectify any unauthorised transmission or access.
Your solution must not infringe the rights of any third party and you must be legally entitled to assign ownership of all rights of copyright in and to the winning solution code to Zindi.
Submissions and winning
You may make a maximum of 10 submissions per day.
You may make a maximum of 300 submissions for this competition.
Before the end of the competition, you need to choose 2 submissions to be judged on for the private leaderboard. If you do not make a selection, your 2 best public leaderboard submissions will be used to score on the private leaderboard.
Zindi maintains a public leaderboard and a private leaderboard for each competition. The Public Leaderboard includes approximately 20% of the test dataset. While the competition is open, the Public Leaderboard will rank the submitted solutions by the accuracy score they achieve. Upon close of the competition, the Private Leaderboard, which covers the other 80% of the test dataset, will be made public and will constitute the final ranking for the competition.
Note that to count, your submission must first pass processing. If your submission fails during the processing step, it will not be counted and not receive a score; nor will it count against your daily submission limit. If you encounter problems with your submission file, your best course of action is to ask for advice on the Competition’s discussion forum.
If you are in the top 20 at the time the leaderboard closes, we will email you to request your code. On receipt of the email, you will have 48 hours to respond and submit your code following the submission guidelines detailed below. Failure to respond will result in disqualification.
If your solution places 1st, 2nd, or 3rd on the final leaderboard, you will be required to submit your winning solution code to us for verification, and you thereby agree to assign all worldwide rights of copyright in and to such winning solution to Zindi.
If two solutions earn identical scores on the leaderboard, the tiebreaker will be the date and time in which the submission was made (the earlier solution will win).
If the error metric requires probabilities to be submitted, do not set thresholds (or round your probabilities) to improve your place on the leaderboard. In order to ensure that the client receives the best solution Zindi will need the raw probabilities. This will allow the clients to set thresholds to their own needs.
The winners will be paid via bank transfer, PayPal, or other international money transfer platform. International transfer fees will be deducted from the total prize amount, unless the prize money is under $500, in which case the international transfer fees will be covered by Zindi. In all cases, the winners are responsible for any other fees applied by their own bank or other institution for receiving the prize money. All taxes imposed on prizes are the sole responsibility of the winners. The top 3 winners or team leaders will be required to present Zindi with proof of identification, proof of residence and a letter from their bank confirming their banking details. Winners will be paid in USD or the currency of the competition. If your account cannot receive US Dollars or the currency of the competition, then your bank will need to provide proof of this and Zindi will try to accommodate this.
You acknowledge and agree that Zindi may, without any obligation to do so, remove or disqualify an individual, team, or account if Zindi believes that such individual, team, or account is in violation of these rules. Entry into this competition constitutes your acceptance of these official competition rules.
Zindi is committed to providing solutions of value to our clients and partners. To this end, we reserve the right to disqualify your submission on the grounds of usability or value. This includes but is not limited to the use of data leaks or any other practices that we deem to compromise the inherent value of your solution.
Zindi also reserves the right to disqualify you and/or your submissions from any competition if we believe that you violated the rules or violated the spirit of the competition or the platform in any other way. The disqualifications are irrespective of your position on the leaderboard and completely at the discretion of Zindi.
Please refer to the FAQs and Terms of Use for additional rules that may apply to this competition. We reserve the right to update these rules at any time.
Reproducibility of submitted code
Data standards:
Consequences of breaking any rules of the competition or submission guidelines:
Monitoring of submissions
We reserve the right to request code from any user at any time during a challenge. You will have 24 hours to submit your code following the rules for code review (see above). Zindi reserves the right not to explain our reasons for requesting code. If you do not submit your code within 24 hours you will be disqualified from winning any competitions or Zindi points for the next six months. If you fall under suspicion again and your code is requested and you fail to submit your code within 24 hours, your Zindi account will be disabled and you will be disqualified from winning any competitions or Zindi points with any other account.
The evaluation metric for this challenge is Accuracy.
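For reference, accuracy is the fraction of readings whose predicted label matches the true label. The sketch below is a plain-Python version for local validation; the function name `accuracy` and the example labels are illustrative, and the platform's own scoring code is not shown here.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


truth = ["ok", "low", "ok", "high"]
guess = ["ok", "low", "high", "high"]
score = accuracy(truth, guess)
print(score)
```

Because accuracy counts every reading equally, a model that always predicts the most common label can score deceptively well when the classes are imbalanced, so it is worth comparing against that baseline.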
Your submission file should look like this:
Reading ID                           target
ID_00902R9H_hdl_cholesterol_human    ok
ID_00902R9H_hemoglobin(hgb)_human    low
ID_00902R9H_cholesterol_ldl_human    ok
ID_00902HGH_hdl_cholesterol_human    ok
ID_00902HGH_hemoglobin(hgb)_human    high
ID_00902HGH_cholesterol_ldl_human    low
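A file of this shape could be written with a sketch like the one below. The comma delimiter and the exact header spellings (“Reading ID” and “target”, taken from the example above) are assumptions, so check them against the sample submission file provided with the competition. The function name `write_submission` is hypothetical.

```python
import csv
import io

ALLOWED_LABELS = {"low", "ok", "high"}


def write_submission(predictions, handle, id_column="Reading ID", target_column="target"):
    """Write one row per reading ID with its predicted label.

    predictions: dict mapping reading ID -> "low", "ok" or "high"
    handle:      any writable text file object
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow([id_column, target_column])
    for reading_id, label in predictions.items():
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label {label!r} for {reading_id}")
        writer.writerow([reading_id, label])


predictions = {
    "ID_00902R9H_hdl_cholesterol_human": "ok",
    "ID_00902R9H_hemoglobin(hgb)_human": "low",
}
buffer = io.StringIO()
write_submission(predictions, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` as `handle`.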
1st Place: $3 750 USD
2nd Place: $2 250 USD
3rd Place: $1 500 USD
Competition closes on 13 February 2022.
Final submissions must be received by 11:59 PM GMT.
We reserve the right to update the contest timeline if necessary.