Hi! I have to say that the competition is interesting and challenging. But maybe the metric can be improved.
Consider the case where there is a hard page that nobody's model knows where to place:
correct answer: page 1 - page 2 - page 3 - page 4 (hard page)
A fairly good model would still order the rest correctly (and in sequential order) but achieve an accuracy of 0:
page 4 - page 1 - page 2 - page 3 (solution a: acc 0)
A somewhat random model might get a non-zero accuracy, even if it has no ability to put the easy pages in sequential order:
page 2 - page 4 - page 3 - page 1 (solution b: acc 0.25)
I would say that solution a is the better one, considering that it has a longer valid run of pages. Solution b got no sequential order right (the reader would be so confused), yet it still manages to acquire a better score under the current metric. This suggests that the current metric is not robust.
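To make the two solutions concrete, here is a minimal sketch of the per-page accuracy metric as I understand it (the function name is hypothetical): the fraction of pages placed in their true position.

```python
def page_accuracy(truth, pred):
    # Fraction of positions where the predicted page matches the true page.
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

truth = [1, 2, 3, 4]
solution_a = [4, 1, 2, 3]  # hard page misplaced, rest in perfect order
solution_b = [2, 4, 3, 1]  # essentially random

print(page_accuracy(truth, solution_a))  # 0.0
print(page_accuracy(truth, solution_b))  # 0.25
```

So the "good" ordering scores 0.0 while the random one scores 0.25, exactly the inversion described above.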
This is irrelevant now (considering none of us is getting a good model), and since the metric has been announced, we should evaluate our work based on it.
But maybe we should consider using a different metric in future competitions with a similar task, or for evaluating late submissions.
I would propose something similar to "longest common substring", or a 'fuzzy' version of it (you are allowed a few mistakes in the substring, but for each one you make, your score, which is the length, gets multiplied by a decimal; the final score would be the highest one you can achieve).
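A sketch of the non-fuzzy version of this idea: the classic longest-common-substring DP, applied to page sequences rather than characters (the fuzzy variant with multiplicative decay would extend the same table). The function name is my own.

```python
def longest_common_run(truth, pred):
    # Length of the longest contiguous run of pages that appears,
    # in the same order, in both the true and the predicted sequence.
    n, m = len(truth), len(pred)
    best = 0
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if truth[i - 1] == pred[j - 1]:
                # Extend the common run ending at the previous pair.
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

truth = [1, 2, 3, 4]
print(longest_common_run(truth, [4, 1, 2, 3]))  # 3 (the run 1-2-3)
print(longest_common_run(truth, [2, 4, 3, 1]))  # 1 (no adjacent pair is right)
```

Under this metric solution a (score 3) beats solution b (score 1), matching the intuition above.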
Thanks again for hosting this interesting competition.
The current score is very strict.
If you use something like you describe, the score will go up a bit, even for the same random submission. So you get a perhaps unfair benefit.
Your point is valid, but perhaps also too late, given that the competition has already been running for a while on a different metric. Changing the score mid-competition would not be fair.
If you started from scratch, then maybe something like 10 * (length of longest correct stretch) + 9 * (length of second-longest stretch) + ... + 1 for each page in the correct spot, or something similar. Here I used weights 10, 9, ..., 1 for the ten longest stretches, but perhaps it is even better to just make score = (sum of lengths of correct stretches) + (number of pages in the correct spot). Then a perfect solution gets the top score.
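The simpler variant (stretch lengths plus correct spots) could look like this sketch. One assumption on my part: only stretches of length 2 or more count, since counting singletons would make that term a constant.

```python
def run_length_score(truth, pred):
    # Map each page to its true successor.
    nxt = {truth[i]: truth[i + 1] for i in range(len(truth) - 1)}

    # Sum the lengths of maximal correctly-ordered stretches (length >= 2,
    # an assumption -- singletons would contribute a constant).
    stretch_sum, run = 0, 1
    for a, b in zip(pred, pred[1:]):
        if nxt.get(a) == b:
            run += 1
        else:
            if run >= 2:
                stretch_sum += run
            run = 1
    if run >= 2:
        stretch_sum += run

    # One point per page sitting in its correct spot.
    in_spot = sum(t == p for t, p in zip(truth, pred))
    return stretch_sum + in_spot

truth = [1, 2, 3, 4]
print(run_length_score(truth, truth))         # 8 (perfect: 4 + 4)
print(run_length_score(truth, [4, 1, 2, 3]))  # 3 (stretch 1-2-3, no spots)
print(run_length_score(truth, [2, 4, 3, 1]))  # 1 (no stretch, one spot)
```

This ranks the "one hard page" solution above the random one, which is the behaviour we want.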
Of course, then you run into second (and third and ...) order effects. What if you have a really good solution, with every second page in the correct spot and every other page random? That would be marvellous, but how do you score it fairly? Then perhaps you need a penalty system, where you penalise by 1 for each page not in the correct spot, by 1 for each page not followed by the correct page, by 1 for each page where the page 2 positions on is not correct, etc. So 0 is the perfect score.
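The penalty idea above could be sketched as follows (function name and the choice to cap the look-ahead at 2 offsets are my assumptions; the cap could go up to n-1):

```python
def ordering_penalty(truth, pred, max_offset=2):
    # 1 penalty point for each page not in its correct spot.
    pen = sum(t != p for t, p in zip(truth, pred))

    pos_in_truth = {page: i for i, page in enumerate(truth)}
    for d in range(1, max_offset + 1):
        # 1 point for each page whose page d positions on is not the
        # page that sits d positions after it in the true order.
        for i, page in enumerate(pred[:-d]):
            j = pos_in_truth[page] + d
            expected = truth[j] if j < len(truth) else None
            if pred[i + d] != expected:
                pen += 1
    return pen

truth = [1, 2, 3, 4]
print(ordering_penalty(truth, truth))         # 0 (perfect)
print(ordering_penalty(truth, [4, 1, 2, 3]))  # 6
print(ordering_penalty(truth, [2, 4, 3, 1]))  # 8
```

A perfect ordering scores 0, and the mostly-sequential solution a is penalised less than the random solution b.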
And, as you see, you run into complexity problems :-) so perhaps the current score is the best ...
The fundamental problem with run-length scoring is that the scoring is then based on the typical solution algorithm, where you look at similarity between pages. The scoring should be natural: closely based on the problem itself, not too closely on the way in which you solve it.
Hey, it has been interesting reading this!
Typically, a sequential-type metric would be better for a challenge like this, but as this model won't be put into production (it is more of a research/fun problem), a simple, strict metric is what the host needs and requested ;)
Accuracy is easier to compare with results from the 4 previous winners.