Hello Zindians,
I hope all is well with everyone.
I have manually reviewed all 400 “Prompt–Clinician” examples and noticed some inconsistencies in the data that could affect model performance. I’m sharing these findings in case they are unintentional errors rather than part of the challenge design, and I would appreciate any guidance you can provide.
Summary of Findings
Out of the 400 examples, I identified the following types of issues:
1. Context Mismatches (9 examples)
Example indices: 26, 137, 148, 267, 318, 332, 352, 363, 366
In these cases, the “Prompt” describes one clinical condition, while the “Clinician” response addresses a different condition.
2. Age Mismatches / Misalignments (13 examples)
Example indices: 4, 14, 69, 72, 81, 104, 115, 117, 205, 303, 359, 392, 397
Here, the age mentioned in the “Prompt” (e.g., “a 5‑year‑old child”) does not match the age stated or implied in the “Clinician” response.
3. Spelling Errors (10 examples)
Example indices: 185, 189, 196, 217, 234, 280, 284, 286, 289, 301
These entries contain typos or misspelled medical terms that could affect tokenization or keyword matching, e.g. “poison” written as “prison”.
4. Day Mismatch (1 example)
Example index: 295
The “Prompt” refers to symptom onset “2 days ago”, but the “Clinician” response treats it as “2 weeks ago” (or a similarly longer time frame), creating a temporal inconsistency.
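If it helps anyone triage, here is a minimal sketch for pulling the flagged rows out for manual review. Note that the file name `Train.csv` and the `Prompt`/`Clinician` column names are my assumptions about the dataset layout; adjust them to match the actual files.

```python
import pandas as pd

# Indices flagged above, grouped by issue type.
flagged = {
    "context_mismatch": [26, 137, 148, 267, 318, 332, 352, 363, 366],
    "age_mismatch": [4, 14, 69, 72, 81, 104, 115, 117, 205, 303, 359, 392, 397],
    "spelling_error": [185, 189, 196, 217, 234, 280, 284, 286, 289, 301],
    "day_mismatch": [295],
}

# NOTE: file name and column names are assumptions; adjust to the real dataset.
df = pd.read_csv("Train.csv")

for issue, idxs in flagged.items():
    print(f"\n=== {issue}: {len(idxs)} example(s) ===")
    # Assumes the flagged numbers refer to the DataFrame's default integer index.
    print(df.loc[idxs, ["Prompt", "Clinician"]])
```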
Why This Matters
1. Model Training Quality: Mismatches between “Prompt” and “Clinician” labels can confuse supervised learning: models may learn incorrect associations (e.g., treating a 5‑year‑old case as if it were a 6‑year‑old).
2. Evaluation Impact: If a model’s predictions are scored against incorrect labels, leaderboard scores may not accurately reflect true performance.
3. Challenge Intent: I’m not certain whether these inconsistencies are intentional (to test robustness) or genuine data errors. Clarification would help me and other participants interpret results correctly.
Request for Guidance
1. Data Verification: Would it be possible for the Zindi Team (or data curators) to confirm whether these are expected anomalies or genuine typos/mismatches?
2. Recommended Approach: If some of these examples are indeed erroneous, should we exclude them from training, correct them ourselves, or use them as provided? (A minimal exclusion sketch follows after this list.)
3. Official Errata or Updates: If there is an errata sheet or plan to release an updated dataset with corrections, could you please let us know where to find it and when it might be available?
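In the meantime, one defensive option is to compare validation scores with and without the suspect rows. This is only a sketch; it reuses the `flagged` dict and the assumed `df` DataFrame from the snippet above:

```python
# Union of all flagged indices across the four issue types.
bad_idx = sorted({i for idxs in flagged.values() for i in idxs})

# Train on a "clean" copy; the flagged rows can still be reviewed or
# corrected manually rather than silently discarded.
df_clean = df.drop(index=bad_idx)
print(f"Dropped {len(bad_idx)} of {len(df)} rows; {len(df_clean)} remain.")
```

If models trained on `df` and `df_clean` score noticeably differently on validation, that would suggest the mismatches really do affect learning.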
Example Illustrations
Below are illustrative cases to show precisely what I mean:
Seeking guidance. Please let me know how best to proceed.
Thanks in advance
Good day,
Amy_Bray
Seeking guidance please.
This is amazing work @Zambia_Kuchalo!
I have also noticed the context mismatches - those seem particularly concerning since we're working with a small, expert-annotated dataset - but I haven't gotten around to quantifying them. Even if I had, there is no way my analysis would have been as detailed as yours! Thanks for sharing.
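One rough way to quantify it automatically would be to embed both sides of each pair and flag unusually low Prompt–Clinician similarity. Just a sketch (assuming `sentence-transformers` is installed and the same file/column names as in the original post; the 2-sigma threshold is arbitrary):

```python
import pandas as pd
from sentence_transformers import SentenceTransformer, util

df = pd.read_csv("Train.csv")  # assumed file and column names, as above

model = SentenceTransformer("all-MiniLM-L6-v2")
prompt_emb = model.encode(df["Prompt"].tolist(), convert_to_tensor=True)
resp_emb = model.encode(df["Clinician"].tolist(), convert_to_tensor=True)

# Cosine similarity between each Prompt and its own Clinician response.
sims = util.cos_sim(prompt_emb, resp_emb).diagonal()

# Pairs far below the typical similarity are mismatch candidates.
threshold = sims.mean() - 2 * sims.std()
suspects = (sims < threshold).nonzero().flatten().tolist()
print(f"{len(suspects)} candidate mismatches at indices: {suspects}")
```

Anything this flags would still need manual review, since low similarity can also just mean a terse clinician response.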
Thank you
For 4 of the cases, I used only the Prompt field.
Do you have an inference time of less than 100 ms per vignette?
Great analysis ba Zambia
Hmm, this is interesting. Could you please provide a list of the indices along with the IDs so I can review further?
Noted with thanks. Let me do so:
Please review the data here: link
Tip: use ready-made models that meet the conditions of the competition.
Like??
Really helpful thanks!!!