💰 This Week on Zindi: Dataset schema

Predictive Insights Youth Income Prediction Challenge

Anoetar

Dataset schema

Data · 11 Jul 2023, 15:48 · 8

Hello,

Is there a place to read the dataset schema, to understand what each column means in order to make educated hypotheses? (I may have missed it. Please direct me.)

Discussion 8 answers

neilr

We are working on producing one but in the meantime please ask about any variables or aspects you need clarity on.

12 Jul 2023, 05:16

Upvotes 1

Anoetar

Potentially all the columns that have null values:

- Tenure

- Matric

- diploma, degree (what is the difference?)

- schoolquintile

- Maths & MathLit (what does the percentage mean, and what is the difference?)

- Home lang and Add-lang (what do the values mean?)

- Science

Finally, is the status based on the first round of the survey the person took part in, i.e. their initial status regardless of how many rounds they participated in? Is that what you mean by baseline details?

replied to neilr · 12 Jul 2023, 08:31

Upvotes 0

neilr

Tenure: this is the length of time the person has been in that activity.

Matric: the South African school leaving certificate. https://en.wikipedia.org/wiki/Matriculation_in_South_Africa

Diploma/Degree: These correspond to different NQF (National Qualifications Framework) levels in the South African education system. https://en.wikipedia.org/wiki/South_African_Qualifications_Authority

Maths and Maths Lit (Mathematical Literacy) are two variations of Mathematics taken in matric. Students have to do one of them (and cannot do both).

Home language and first additional language are two variations of the language taught in matric depending on the fluency of the learner. https://caps123.co.za/what-is-the-difference-between-caps-english-fal-and-english-hl/

Science is the matric-level subject Physical Sciences (which covers both physics and chemistry).

The percentage in these subjects corresponds to the final school leaving mark the person got in that subject. There is more detail in the matric link.
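One practical consequence of the Maths / Maths Lit explanation: since a learner takes only one of the two, one of those columns will be null for every person by design, not because data is missing. A minimal sketch of folding them into a track label plus a single mark, assuming hypothetical column names `Maths` and `Mathslit` that hold a numeric percentage or `None` (the real file may name or encode these differently):

```python
from typing import Optional, Tuple


def maths_track_and_mark(row: dict) -> Tuple[str, Optional[float]]:
    """Collapse two mutually exclusive subject columns into (track, mark).

    The column names 'Maths' and 'Mathslit' are assumptions for illustration.
    """
    maths = row.get("Maths")
    maths_lit = row.get("Mathslit")
    if maths is not None:
        return "maths", float(maths)
    if maths_lit is not None:
        return "maths_lit", float(maths_lit)
    # Neither mark recorded, e.g. no matric or the mark was not captured.
    return "unknown", None


# Made-up rows, one per possible case.
rows = [
    {"Maths": 62.0, "Mathslit": None},
    {"Maths": None, "Mathslit": 71.0},
    {"Maths": None, "Mathslit": None},
]
collapsed = [maths_track_and_mark(r) for r in rows]
```

Keeping the track as its own feature preserves the information that the two marks are not on the same scale of difficulty.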

In South Africa schools are classified into 5 different quintiles which reflect the socio-economic status of the school and learners. Quintile 1 is the poorest and quintile 5 the richest. See https://wcedonline.westerncape.gov.za/comms/press/2013/74_14oct.html

replied to Anoetar · 12 Jul 2023, 08:41

Upvotes 0

neilr

The education details will (mostly) remain constant across rounds, since most people were surveyed after leaving school (although a very small number may have been in continuing education).

Something like tenure will change in each round as people switch labour market statuses (for example, by finding or losing a job).

replied to neilr · 12 Jul 2023, 08:44

Upvotes 0

Anoetar

Thank you so much.

I am still a bit unclear about which tenure or activity you are referring to. Let's say the row below is in the data:

person id - Id_5ch3zwpdef, survey date - 2022-03-16, round - 2, status - unemployed, tenure - 810.0

What does 810 mean? Does it relate to the person being unemployed? Is it in months or days? Is the status fixed based on the first-round survey info, or is it their current status at the time the second round was done?

replied to neilr · 12 Jul 2023, 11:42

Upvotes 0

neilr

This is a great question. So this would be the number of days the person had been unemployed on the 16th March 2022 when they were interviewed. This would have been the second round of the broader survey (but the person need not have participated in the first round).
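To make the arithmetic concrete, here is a small sketch that turns the example row into the implied start of the unemployment spell. It assumes tenure is measured in whole days up to the survey date, as described above; the function name is made up for illustration.

```python
from datetime import date, timedelta


def spell_start(survey_date: date, tenure_days: float) -> date:
    """Date the current labour market spell began, given tenure in days."""
    return survey_date - timedelta(days=int(tenure_days))


# Example row from the thread: surveyed 2022-03-16, tenure 810.0
start = spell_start(date(2022, 3, 16), 810.0)
years_in_spell = 810.0 / 365.25  # a little over two years
```

For this row the spell would have started on 27 December 2019, so the person had been unemployed for a bit more than two years at the time of the interview.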

A sample of people were surveyed in round 1 of the survey and then followed up in round 2. In round 2 a new cohort was added that was then followed up in round 3 etc.

The outcome or target variable is their labour market status when they were reinterviewed in the following round of the survey. The only variable coming from the 'follow-up' round is their status (since we want to predict what they will be doing the next time we interview them).
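The baseline/follow-up pairing described here can be sketched as follows. This is only an illustration of the idea, not the organisers' actual pipeline: the long-format input, its keys (`person_id`, `round`, `status`, `tenure`) and the status values are all assumptions.

```python
from typing import Dict, List, Tuple


def attach_followup_status(surveys: List[dict]) -> List[dict]:
    """Pair each baseline survey with the person's status in the next round.

    `surveys` is a hypothetical long-format list of dicts with keys
    'person_id', 'round', 'status' and 'tenure'.
    """
    by_key: Dict[Tuple[str, int], dict] = {
        (s["person_id"], s["round"]): s for s in surveys
    }
    rows = []
    for s in surveys:
        follow_up = by_key.get((s["person_id"], s["round"] + 1))
        if follow_up is None:
            continue  # not re-interviewed in the following round: no target
        row = dict(s)                        # every feature is from baseline
        row["target"] = follow_up["status"]  # only the status is from follow-up
        rows.append(row)
    return rows


# Made-up data: 'a' joins in round 1, 'b' joins as a new cohort in round 2.
surveys = [
    {"person_id": "a", "round": 1, "status": "unemployed", "tenure": 400.0},
    {"person_id": "a", "round": 2, "status": "employed", "tenure": 30.0},
    {"person_id": "b", "round": 2, "status": "studying", "tenure": 200.0},
    {"person_id": "b", "round": 3, "status": "unemployed", "tenure": 90.0},
]
pairs = attach_followup_status(surveys)
```

Note that the tenure kept in each output row is the baseline value; the follow-up tenure is never carried over.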

Tenure is thus 'right censored' - we know what it is at the time of the 'baseline' survey, but the completed spell is probably longer than that because we don't observe the 'closing' of that labour market spell.
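A tiny sketch of what right censoring means here, using made-up dates: the tenure seen at the interview is only a lower bound on the full length of the spell, because the end date lies after the interview and is not in the shared data.

```python
from datetime import date


def observed_tenure_days(spell_start: date, interview: date) -> int:
    """Tenure as seen at the baseline interview, while the spell is still open."""
    return (interview - spell_start).days


def completed_spell_days(spell_start: date, spell_end: date) -> int:
    """Full spell length, only knowable once the spell has ended."""
    return (spell_end - spell_start).days


# Made-up dates for illustration only.
start = date(2021, 1, 1)
interview = date(2022, 3, 16)
end = date(2022, 9, 1)  # happens after the interview, so it is not observed

observed = observed_tenure_days(start, interview)
completed = completed_spell_days(start, end)
```

In modelling terms, the tenure column is best read as "at least this many days", not as the final duration of the spell.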

replied to Anoetar · 12 Jul 2023, 11:53

Upvotes 0

testgorilla

I have a few questions.

First question: when you survey a person and re-interview them after a year, do you only add the target variable and the survey date, or do you change the values of other features as well (such as tenure, for example)?

My second question is about the rounds.

If I understood you correctly, you interviewed a set of people multiple times across the rounds. How come the person_ids are unique across the train and test sets, meaning that "technically" no person was interviewed twice?

Thank you.

replied to neilr · 26 Jul 2023, 15:14

Upvotes 0

neilr

Great questions:

1. We do collect the same set of information each time we survey someone but we have not shared this information in this data for the follow-up round except for the target variable. So, yes, we only add the target variable to the information we collected previously.

The reason is that we want to be able to predict the labour market outcomes of a person in the future - if we survey someone now, we want to know how likely they are to be in employment in 6 months to a year.

2. Yes. In the broader data from the survey (which we have not shared) there are people who have been surveyed multiple times. We decided that it was better for the competition to make the data 'rectangular' - i.e. to have only two observations per person (the initial survey and the follow-up).
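For anyone who wants to confirm this 'one row per person' structure on their own copy of the files, a quick check along these lines may help. It is a sketch with made-up IDs; in practice the two lists would be read from the person ID column of the train and test files.

```python
from collections import Counter
from typing import Iterable, List, Set


def repeated_ids(person_ids: Iterable[str]) -> List[str]:
    """IDs that appear more than once within a single file."""
    counts = Counter(person_ids)
    return sorted(pid for pid, n in counts.items() if n > 1)


def shared_ids(train_ids: Iterable[str], test_ids: Iterable[str]) -> Set[str]:
    """IDs that appear in both train and test."""
    return set(train_ids) & set(test_ids)


# Made-up IDs standing in for the person ID column of each file.
train_ids = ["Id_aaa", "Id_bbb", "Id_ccc"]
test_ids = ["Id_ddd", "Id_eee"]

duplicates_in_train = repeated_ids(train_ids)
overlap = shared_ids(train_ids, test_ids)
```

If both results are empty, each person contributes exactly one baseline row, with the follow-up appearing only as the target.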

replied to testgorilla · 27 Jul 2023, 04:18

Upvotes 0
