Hello all, I hope you are having fun with this awesome competition. I would like to raise a critical point that left me confused about whether to rely on Train.csv or not.
When I started on EDA to see how the values for the year 2021 had been collected, I noticed that the labeling might have some issues. For example, for the Impala company, the PDF is ESG-spreads.pdf. I started by selecting the Train.csv rows that have this Group to see which metrics it has (as the photo linked below shows).
Focusing on metric 128: Total Direct CO2
I went back to the PDFs and found that the values reported there for this metric differ from those given in Train.csv. Should we rely on Train.csv in that case?
Photo Link:
https://drive.google.com/file/d/1wG60luQtKMb_fZymvCLMBGH-wFy9OQt1/view?usp=sharing
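For anyone who wants to reproduce the check, here is a minimal sketch of the row selection described above, using only the Python standard library. The column names (`Group`, `metric_id`, `year`, `value`) are assumptions, since the real Train.csv header is not quoted in this thread, and the sample rows are made-up placeholders rather than actual competition data.

```python
import csv
import io

# Hypothetical column names: adjust these to match the real Train.csv header.
GROUP_COL = "Group"
METRIC_COL = "metric_id"
YEAR_COL = "year"
VALUE_COL = "value"


def select_rows(csv_file, group, metric_id=None, year=None):
    """Return the rows for one group, optionally narrowed by metric and year."""
    rows = []
    for row in csv.DictReader(csv_file):
        if row[GROUP_COL] != group:
            continue
        # csv.DictReader yields strings, so compare against str(...) values.
        if metric_id is not None and row[METRIC_COL] != str(metric_id):
            continue
        if year is not None and row[YEAR_COL] != str(year):
            continue
        rows.append(row)
    return rows


# Tiny made-up sample standing in for Train.csv (placeholder values only).
SAMPLE = """Group,metric_id,year,value
Impala,128,2021,1.0
Impala,128,2020,2.0
Impala,7,2021,3.0
Other,128,2021,4.0
"""

impala_128 = select_rows(io.StringIO(SAMPLE), "Impala", metric_id=128, year=2021)
for row in impala_128:
    print(row[GROUP_COL], row[METRIC_COL], row[YEAR_COL], row[VALUE_COL])
```

With the real file, the same function could be called on `open("Train.csv", newline="")` in place of the in-memory sample, and the printed values compared by hand against the figures in the PDF.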
Was about to start this thread... Yes, the train.csv is terrible, and it's not only for Impala. Data entry issues?? If the file we are scored against also has these issues, it's an even bigger issue. @Zindi ??
Sorry, there was an error with the picture; it is now uploaded properly.
We need clarification on this issue, as it might affect our approaches.
Actually, it's not an error. Dive into the data and understand how they got that value, because it's actually a correct value. I had the same assumption when I started, but after more analysis I found that it's not a data entry issue.
This is interesting, let me have a look
@Koleshjr, you're right!... It introduces a level of complexity that I cannot manage to deal with in these final few hours 😅.
😂😂 told you guys the data was clean