@Zindi, kindly check that the reference file or evaluation metric is correct. The sample submission file filled with zeroes already gives >90% accuracy, while any slight deviation drops the accuracy to around 0%. (I could be mistaken.)
Hi Julius, I initially shared your confusion regarding why accuracy is the metric and why a simple submission can achieve such high accuracy. Here's my understanding, though I could be wrong:
This is a multiclass classification challenge where the classes represent the values extracted from the PDFs. The simple submission reveals that about 90% of the values are 0. This occurs because many AMKEYs are not mentioned in the PDFs, so their values are set to 0. Similarly, in the training data these values are set to null, which makes up about 91% of the data. So the imbalance is similar between the training and evaluation datasets.
The task involves extracting 511 AMKEY values from 12 companies for the year 2022. If an AMKEY is not found in a document, a 0 is assigned to its value.
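To see why the all-zeros baseline scores so high, here is a toy sketch (the 91% figure is the proportion of nulls mentioned above; the arrays are made up for illustration):

```python
# Toy illustration: if ~91% of targets are 0, an all-zeros
# submission is correct on exactly that fraction.
y_true = [0] * 91 + [1] * 9   # hypothetical split: 91% zeros, 9% non-zero
y_pred = [0] * 100            # all-zeros baseline submission

# Plain accuracy: fraction of exact matches.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# accuracy == 0.91
```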
Certainly, @Wajdi_Hajji... Nevertheless, when I manually modified just five values for five distinct companies associated with a specific AMKEY, values I was confident were accurate, the accuracy plummeted to 0%.
Hi @JuliusFx, please ensure the "2022_Value" column in your submission is of type float (the same as the original target variable's type), since in a challenge like this even "0" and "0.0" are treated as different "classes".
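A minimal sketch of that fix with pandas (the toy frame and filename are placeholders; the real column name "2022_Value" comes from the sample submission):

```python
import pandas as pd

# Hypothetical submission frame; in practice, load your generated submission.
sub = pd.DataFrame({"ID": ["A_1", "A_2"], "2022_Value": [0, 12]})

# Cast the target column to float so a 0 is written out as "0.0",
# matching the type expected by the scorer.
sub["2022_Value"] = sub["2022_Value"].astype(float)
sub.to_csv("submission.csv", index=False)
```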
Yea, it now scores, thanks @Nelly43. Maybe you could also check that when two submissions have the same score, only the earliest is considered. I noticed that my latest submission, which scores the same as an earlier one, is the one being counted, and I moved down the leaderboard.
I'd recommend reaching out to Zindi directly to double-check and clarify the accuracy concerns. They should be able to assist you and ensure everything aligns properly.
Hi Julius, We will look into this and get back to you by 3 January.
Yea, I think it was clarified.