The problem description states that the training set consists of values from the years 2019-2021 and that the test year is 2022. However, the data folder `Data Sources.zip` seems to be missing test data (a 2022 annual report) for some of the companies.
The following companies have a 2022 annual report:
- Absa - `2022-Absa-Group-limited-Environmental-Social-and-Governance-Data-sheet`
- Clicks - `Clicks-Sustainability-Report-2022`
- Distell - `DISTELL ESG Appendix 2022`
- Oceana - `Oceana_ESG_Databook_FY2022` and `Oceana_Group_Sustainability_Report_2022`
- SSW - `ssw-IR22`
The following companies have a 2023 annual report, but not a 2022 report:
- Impala - `ESG-spreads`
- Pick n Pay - `picknpay-esg-report-spreads-2023`
- SASOL - `SASOL Sustainability Report 2023 20-09_0`
Should we use the 2023 annual report as the test report for these companies?
The following companies do not have a 2022 annual report (or a 2023 report):
- Tongaat - only pdf provided is for 2021 (`2021ESG`)
- UCT - only pdf provided is for 2020-2021 (`UCT_Carbon_Footprint_Report_2020-2021`)
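To make the gaps easy to check, here is a minimal sketch that encodes the lists above as a lookup table. The file names are copied from the lists; the `REPORTS` dictionary, the `pick_test_report` helper, and the rule of falling back to 2023 are my own assumptions (whether that fallback is acceptable is exactly the open question), and the UCT 2020-2021 report is filed under 2021 for simplicity.

```python
# File names come from the lists above; the structure and fallback rule are assumptions.
REPORTS = {
    "Absa": {2022: ["2022-Absa-Group-limited-Environmental-Social-and-Governance-Data-sheet"]},
    "Clicks": {2022: ["Clicks-Sustainability-Report-2022"]},
    "Distell": {2022: ["DISTELL ESG Appendix 2022"]},
    "Oceana": {2022: ["Oceana_ESG_Databook_FY2022", "Oceana_Group_Sustainability_Report_2022"]},
    "SSW": {2022: ["ssw-IR22"]},
    "Impala": {2023: ["ESG-spreads"]},
    "Pick n Pay": {2023: ["picknpay-esg-report-spreads-2023"]},
    "SASOL": {2023: ["SASOL Sustainability Report 2023 20-09_0"]},
    "Tongaat": {2021: ["2021ESG"]},
    # The UCT report covers 2020-2021; keyed by its last year here.
    "UCT": {2021: ["UCT_Carbon_Footprint_Report_2020-2021"]},
}


def pick_test_report(company, test_year=2022, fallback_years=(2023,)):
    """Return (year, files) for the test report, or None if no usable report exists.

    Tries the official test year first, then each fallback year in order.
    """
    by_year = REPORTS[company]
    for year in (test_year, *fallback_years):
        if year in by_year:
            return year, by_year[year]
    return None


if __name__ == "__main__":
    for company in REPORTS:
        print(company, "->", pick_test_report(company))
```

With this rule, Tongaat and UCT come back as `None`, which matches the last list: there is nothing to test on for them unless the organisers add the reports.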
This is very true... I think the whole dataset is a bit messed up.
Remember that you also have to ground the model so that it does not give wrong answers when the information is not in the report. The reason people can get perfect scores with a sample submission filled with zeros is that a high percentage of these values are missing.
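A rough sketch of what I mean, assuming your extraction step returns `None` when it cannot find a value in the report. The helper names and the toy numbers below are made up for illustration; they are not taken from the competition data.

```python
from typing import Optional, Sequence


def answer_or_zero(extracted: Optional[float]) -> float:
    """Return the extracted value, or 0.0 when nothing was found in the report."""
    return 0.0 if extracted is None else float(extracted)


def zero_share(targets: Sequence[float]) -> float:
    """Fraction of target values that are exactly zero."""
    if not targets:
        return 0.0
    return sum(1 for t in targets if t == 0) / len(targets)


if __name__ == "__main__":
    # Made-up targets, only to show how a zero-heavy target column looks.
    toy_targets = [0, 0, 0, 12.5, 0, 0, 3.0, 0]
    print(zero_share(toy_targets))  # 0.75
    print(answer_or_zero(None))     # 0.0
    print(answer_or_zero(41))       # 41.0
```

Checking the zero share on the training targets first gives you a feel for how strong the all-zeros baseline is before you judge your own model against it.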