Sea Turtle Rescue: Error Detection Challenge
Cash and prizes worth $1,950 USD
Help Local Ocean Conservation clean their sea turtle rescues database
30 November 2018–28 April 2019 23:59
281 data scientists enrolled, 53 on the leaderboard
What is considered 'erroneous'? Clarification Needed around NA Values [+sample_code]
published 5 Mar 2019, 08:23
edited ~23 hours later

It is clear that to get target values one needs to compare the values in the 'dirty' dataset with those in the 'cleaned' dataset. However, it seems that not every difference is considered an 'error'. Hence the need for clarification about what is considered 'erroneous' and what is not, especially when it comes to NA values.

Take the case of the Sex field for example. There are 3139 NA values in the dirty dataset which were labeled as 'Unknown' in the cleaned dataset. Are those considered errors?

Below is my code to get target values:

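import numpy as np
import pandas as pd

# Fill blanks with the string "NA" so that two missing values compare as equal
cleaned = pd.read_csv("cleaned_data.csv", encoding="latin-1").fillna("NA")
dirty   = pd.read_csv("dirty_data.csv", encoding="latin-1").fillna("NA")
columns = [i for i in cleaned.columns if i not in ["Rescue_ID"]]
# 1 where the dirty value differs from the cleaned value, 0 otherwise
target = (dirty[columns][:len(cleaned)] != cleaned[columns]).values.astype("i")
# if you want to stack them up (pairs adjacent columns, so the column count must be even)
target = np.vstack([target[:,[i,i+1]] for i in range(0,target.shape[1],2)])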

Can anyone tell me what my mistake is?

Hi @mendrika, sorry for the delay! We treated NA (a blank field) as *NOT* the same as "Unknown", i.e. it would be coded as an error. We basically want to help Local Ocean quickly find fields that need some kind of attention, and we count even writing "unknown" into a blank field as a type of cleaning they need to do, even though the meaning is almost the same. Thank you for the question! Good luck!
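To make the rule above concrete, here is a minimal sketch on toy data. The column names and values are made up for illustration and are not taken from the competition files. It shows a blank that was cleaned to "Unknown" being flagged as an error, while a field that is blank in both datasets is not flagged once both sides are filled with the same sentinel.

```python
import numpy as np
import pandas as pd

# Toy frames; column names and values are illustrative assumptions
dirty = pd.DataFrame({"Sex": [np.nan, "Female", np.nan],
                      "Species": ["Green", np.nan, "Hawksbill"]})
cleaned = pd.DataFrame({"Sex": ["Unknown", "Female", np.nan],
                        "Species": ["Green", np.nan, "Hawksbill"]})

# NaN != NaN is True elementwise, so a naive comparison also flags
# fields that are blank in BOTH datasets
naive = (dirty != cleaned).astype(int)

# Filling blanks with one sentinel on both sides avoids that, while
# blank vs "Unknown" still counts as a difference
SENTINEL = "NA"
target = (dirty.fillna(SENTINEL) != cleaned.fillna(SENTINEL)).astype(int)

print(naive.values.sum(), target.values.sum())
```

The only cell left flagged in `target` is the first `Sex` value, where a blank became "Unknown".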

Hi @Zindi team, thank you for the reply!

That is very weird because I do get a score improvement when I consider ["Unknown", "not_recorded", "Not_Recorded"] the same as NAs.
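For anyone who wants to try the same experiment, this is a rough sketch of what "treating those strings as NA" could look like. The helper `normalise_missing` and the toy data are hypothetical, and this deliberately departs from the definition given by the Zindi team above, so treat it as an experiment rather than the official labelling.

```python
import numpy as np
import pandas as pd

# Placeholder strings listed in the post above
PLACEHOLDERS = ["Unknown", "not_recorded", "Not_Recorded"]

def normalise_missing(df, sentinel="NA"):
    """Map blanks and placeholder strings to one sentinel value (hypothetical helper)."""
    return df.fillna(sentinel).replace(PLACEHOLDERS, sentinel)

# Toy data; values are illustrative assumptions
dirty = pd.DataFrame({"Sex": [np.nan, "Female", "not_recorded"]})
cleaned = pd.DataFrame({"Sex": ["Unknown", "Male", "Not_Recorded"]})

# Strict: blank and placeholder strings are all different from each other
strict = (dirty.fillna("NA") != cleaned.fillna("NA")).astype(int)
# Lenient: blank and placeholder strings are treated as the same thing
lenient = (normalise_missing(dirty) != normalise_missing(cleaned)).astype(int)

print(strict["Sex"].tolist(), lenient["Sex"].tolist())
```

Comparing the leaderboard score of models trained on `strict` and `lenient` targets is one way to check which convention the reference labels actually follow.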

edited less than a minute later

@mendrika, it might seem weird, but it is not entirely so. Consider that humans cleaned the data and defined what the "clean" data should look like at the end of each year, so there may be patterns that do not seem consistent on the surface.

This is an unusual ML problem, but very real, and definitely going to be very useful to Local Ocean Conservation. Good luck!

Hi @mendrika,

Thanks for raising this. I had the same question and am glad to see it answered by Zindi.

:)

Just to be clear, NaN isn't the same as any of [unknown, not_recorded].

IndexError: index 25 is out of bounds for axis 1 with size 25
Can you help me get the target variable? I am getting the error above, @hakymulla.

Good morning, I don't understand your question. Please elaborate and add your lines of code, thanks.

I mean the code to get the target variable for the dataframe:

columns = [i for i in cleaned.columns if i not in ["Rescue_ID"]]
target = (dirty[columns][:len(cleaned)] != cleaned[columns]).values.astype("i")
# if you want to stack them up
target = np.vstack([target[:,[i,i+1]] for i in range(0,target.shape[1],2)])
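For anyone hitting the same IndexError: the stacking step pairs column i with column i+1, so it only works when the number of compared columns is even. With 25 columns the last step asks for column 25, which does not exist. Below is a small sketch that reproduces the error on a stand-in array and shows one possible alternative that works for any column count. Whether a flat, one-label-per-cell layout matches the submission format is an assumption on my part.

```python
import numpy as np

# Stand-in for the 0/1 error matrix: 4 rows, 25 columns (odd, as in the traceback)
target = np.zeros((4, 25), dtype="i")

# range(0, 25, 2) ends at i = 24, so i + 1 = 25 is past the last column (index 24)
try:
    np.vstack([target[:, [i, i + 1]] for i in range(0, target.shape[1], 2)])
    failed = False
except IndexError:
    failed = True

# One possible alternative: one label per cell, in column-major order
# (all rows of column 0, then all rows of column 1, and so on)
flat = target.ravel(order="F")

print(failed, flat.shape)
```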

OMG, sorry, do you still need help with this?