Is there a place to read the dataset schema to understand what each column means, in order to make educated hypotheses? (I may have missed it; please direct me.)
Potentially all the columns that have null values:
- Tenure
- Matric
- diploma, degree (what is the difference?)
- schoolquintile
- Maths & MathLit (what does the percentage mean, and what is the difference?)
- Home lang and Add-lang (what do the values mean?)
- Science
Finally, is the status based on the first round of the survey done by the person, i.e. the initial status regardless of how many rounds of the survey the person participated in? Is that what you mean by baseline details?
We are working on producing one but in the meantime please ask about any variables or aspects you need clarity on.
Tenure: this is the length of time the person has been in that activity.
Matric: the South African school leaving certificate. https://en.wikipedia.org/wiki/Matriculation_in_South_Africa
Diploma/Degree: These correspond to different NQF (National Qualifications Framework) levels in the South African education system. https://en.wikipedia.org/wiki/South_African_Qualifications_Authority
Maths and Maths Lit are two variations of Mathematics taken in matric. Students have to do one (and cannot do both).
Home language and first additional language are two variations of the language taught in matric, depending on the fluency of the learner. https://caps123.co.za/what-is-the-difference-between-caps-english-fal-and-english-hl/
Science is the subject physical science (which covers both physics and chemistry) at the matric level.
The percentage in these subjects corresponds to the final school leaving mark the person got in that subject. There is more detail in the matric link.
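Since a student takes either Maths or Maths Lit but not both, a null in one of the two columns is expected rather than an error. Here is a minimal sketch of how the two columns could be collapsed into one "track" plus one mark. The function name `maths_track` and the use of `None` for a missing value are my own assumptions for illustration, not something from the dataset documentation, and the mark is passed through untouched because the exact storage format of the percentage is not described here.

```python
from typing import Any, Optional, Tuple


def maths_track(maths: Optional[Any], mathlit: Optional[Any]) -> Tuple[str, Optional[Any]]:
    """Collapse the two mutually exclusive maths columns into (track, mark).

    None stands for a missing value (assumption: adapt this check if your
    loader represents missing values as NaN instead).
    """
    if maths is not None and mathlit is None:
        return "maths", maths
    if mathlit is not None and maths is None:
        return "mathlit", mathlit
    if maths is None and mathlit is None:
        # Neither mark recorded, e.g. no matric results available.
        return "unknown", None
    # Both present contradicts the "one or the other" rule, so flag it.
    return "conflict", None


print(maths_track(55.0, None))   # ('maths', 55.0)
print(maths_track(None, 62.0))   # ('mathlit', 62.0)
print(maths_track(None, None))   # ('unknown', None)
```

The same idea would apply to Home lang and Add-lang if those turn out to be mutually exclusive per learner, but that is not confirmed in this thread.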
In South Africa schools are classified into 5 different quintiles which reflect the socio-economic status of the school and learners. Quintile 1 is the poorest and quintile 5 the richest. See https://wcedonline.westerncape.gov.za/comms/press/2013/74_14oct.html
The education details will (mostly) remain constant across rounds, since most people were surveyed after school (although a very small number may have been in continuing education).
Something like tenure will change in each round as people switch labour market statuses (for example, by finding or losing a job).
Thank you so much.
I'm still a bit unclear about what tenure or activity you are referring to. Let's say below is a row in the data.
person id - Id_5ch3zwpdef, survey date - 2022-03-16, round- 2, status - unemployed, tenure - 810.0
What does 810 mean? Does it relate to the person being unemployed? Is it in months or days? Is the status fixed based on the first-round survey info, or is it their current status as of when the 2nd round was done?
This is a great question. This would be the number of days the person had been unemployed as of 16 March 2022, when they were interviewed. This would have been the second round of the broader survey (but the person need not have participated in the first round).
A sample of people was surveyed in round 1 of the survey and then followed up in round 2. In round 2 a new cohort was added that was then followed up in round 3, etc.
The outcome or target variable is their labour market status when they were reinterviewed in the following round of the survey. The only variable coming from the 'follow-up' round is their status (since we want to predict what they will be doing the next time we interview them).
Tenure is thus 'right censored' - we know what it is at the time of the 'baseline' survey, but the full spell is probably longer than that because we don't observe the 'closing' of that labour market spell.
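To make the tenure example concrete, here is a small sketch that works backwards from the survey date and the tenure in days to the approximate date the spell began. The function name `spell_start` is just for illustration. Because tenure is right censored, the value is only a lower bound on how long the spell eventually lasts.

```python
from datetime import date, timedelta


def spell_start(survey_date: date, tenure_days: float) -> date:
    """Approximate start of the current labour market spell.

    tenure_days is the time spent in the current status as of survey_date.
    It is right censored: the spell may continue after the survey.
    """
    return survey_date - timedelta(days=tenure_days)


# The row from the question: surveyed 2022-03-16, unemployed, tenure 810.0
start = spell_start(date(2022, 3, 16), 810.0)
print(start)  # 2019-12-27
```

So for that row, the person would have been unemployed since roughly late December 2019.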
I have a few questions:
First question: when you survey a person and re-interview them after a year, do you only add the target variable and the survey date, or do you change the values of other features as well (such as tenure, for example)?
My second question is regarding the rounds.
If I understood you correctly, you're saying that you interviewed a set of people multiple times across the rounds. My question is: how come the person_ids are unique across the train and test sets, meaning "technically" no person was interviewed twice?
Thank you.
Great questions:
1. We do collect the same set of information each time we survey someone but we have not shared this information in this data for the follow-up round except for the target variable. So, yes, we only add the target variable to the information we collected previously.
The reason is that we want to be able to predict the labour market outcomes of a person in the future - if we survey someone now, we want to know how likely they are to be in employment in 6 months to a year.
2. Yes. In the broader data from the survey (which we have not shared) there are people who have been surveyed multiple times. We decided that it was better for the competition to make the data 'rectangular' - i.e. only have two observations per person (the initial survey and the follow-up).
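For anyone trying to picture what 'rectangular' means here, below is a rough sketch of how a long panel (one record per person per round) could be reduced to one row per person: baseline features plus the follow-up status as the target. The records are made-up toy data, the field names (`person_id`, `round`, `status`, `tenure`, `target`) are assumptions, and taking the first two observations per person is my guess at one reasonable rule, not a description of how the organisers actually built the files.

```python
from typing import Any, Dict, List

# Toy long-format panel (made-up values): one record per person per round.
panel: List[Dict[str, Any]] = [
    {"person_id": "A", "round": 1, "status": "unemployed", "tenure": 200.0},
    {"person_id": "A", "round": 2, "status": "employed", "tenure": 30.0},
    {"person_id": "B", "round": 2, "status": "studying", "tenure": 400.0},
    {"person_id": "B", "round": 3, "status": "unemployed", "tenure": 10.0},
    {"person_id": "C", "round": 3, "status": "employed", "tenure": 90.0},
]


def to_rectangular(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep baseline features and attach only the follow-up status as target."""
    by_person: Dict[str, List[Dict[str, Any]]] = {}
    for rec in records:
        by_person.setdefault(rec["person_id"], []).append(rec)

    rows = []
    for recs in by_person.values():
        recs = sorted(recs, key=lambda r: r["round"])
        if len(recs) < 2:
            continue  # no follow-up observed, so no target is available
        baseline, follow_up = recs[0], recs[1]
        row = dict(baseline)                # features as of the baseline survey
        row["target"] = follow_up["status"]  # the only follow-up information kept
        rows.append(row)
    return rows


rows = to_rectangular(panel)
for row in rows:
    print(row)
```

Note that the baseline tenure (200.0 for person A) is kept as it was at the first interview; nothing else from the follow-up round is carried over, which matches answer 1 above.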