First of all, thanks to zindi and unbound for a really interesting puzzle and fun challenge. Also, thanks to all the zindians for a number of robust debates about how to approach this and what is allowed and what not. Then finally, thanks to @Moto for collaborating with me and as always in providing guidance in this difficult journey.
I would love to share more details but for now, while the dust still settles, will just share a very basic outline of our approach. There are some of the details I will thus keep back for now, but which I will share and would love to discuss as soon as possible.
So - how to solve this?
I suspect most others proceeded as we did, by breaking up the text into sentences and then by passing consecutive sentences or groups of sentences to some NLP to learn which sentences belong together.
We used this to then construct a probability matrix where each entry i, j represents the probability that page i is followed by page j. This matrix can then be used to construct a naive solution by using some kind of an optimiser. We used a greedy approach, where at each step you select the page pair that has the highest probability of being a pair. We also tried a travelling saleman approach (thanks @Moto for coding this up) to optimise the overall probability of a complete ordering of pages rather than just making the best single step move.
The greedy approach proved to be more robust here (against expectation) and so we used that in the end.
However, none of these approaches made any real progress. Even after numerous refinements, it was clear that this would yield just a handful of pages in the correct positions. So how to make progress?
Next we started to add a number of constraints to the optimiser.
These constrains are based on a human's interpretation of the text, so not pure NLP, but this human calibration approach is essential in solving any difficult problem.
The text contains a number of references that clearly follow on one another. The author for example takes a first pill with tea and later a second pill and cleary the first pill page should come before the second pill page. This is one constraint that can be added to the optimiser. Other similar ones is when the author refers to Barbara and later informally to Babs or Babbie. Obviously, the Barbara page should be before the more informal ones.
This is how I suppose a human-only solution will be found - by scanning the text for these references and using them to painstakingly put the pages in the correct order. To do this seems to require lots of geographical and cultural knowledge that an NLP simply would not posess and so the optimiser requires even more guidance in order to make any progress.
Another way in which to assist the optimiser then is to seed it with any pages that are known to be in a given position. This is another constraint that the optimiser can then use to select a solution but also changes the approach a bit. Earlier, when no page positions were known, the way to solve it was to determine the optimal ordering in a greedy or global (travelling salesman) way. Now, one can start with a number of seeds and grow a solution from that.
The fact that these constrains are human-determined have been discussed in the forums and zindi still has to clarify its stance on this, so keep in mind our approach is somewhat controversial at the moment and our solution may even be disqualified for it. However, I am convinced that without this it would be more or less impossible to bite into this problem so, if all else fails, at least this provides a rough sense of what can be achieved with and without such human guided calibration.
Anyhow, that is a very broad outline of how we tackled this. I will post more details soon and look forward to any comments on this or any other approaches being outlined here.
Hello Skak... Thanks for this summary of your approach. Did you consider using any transformer model template that has been vastly trained with other data for a similar NLP task? Do you think approaching this problem using a transformer would have yielded any better results? Your approach doesn't seem to give much importance to the train set... Was the train data set totally ignored in your approach or just marginally considered?
I would like to get your opinions on these if it's not too much to ask. Thanks again.
What you describe was sort of our starting point. Initially we would use transformer models to try and determine page pairs. We tried many alternatives. I basically trawled hugging face for models and data to use and we tried 'em all. Towards the end, I used big bird quite a bit. This promises to be able to handle longer sentences and so this allowed us to e.g. use the last sentence of one page with the first sentence on the next, the last two sentences combined with the first two sentences on the next and so on. So towards the end we were getting pretty refined in how we constructed page pairs, but even so, this was a dismal failure.
The reason is that the text is cryptic and deceptive on purpose. I did use both books to train and was looking to add more books, but here you have a problem. There are only two pages in the whole book with unfinished sentences - again on purpose! Each page is a independent unit and it is often not clear which page follows on it. How will any transformer be able to solve that?
To properly train a transformer for this, you would need something like big bird that can handle very long sentences, so that you can feed it a whole page at a time. Also, you will need to find books that were written like this one, with pages all finishing on a full stop and with multiple possibilities on what to follow. This is nearly impossible to do imho.
The optimiser will need to have some logic to deal with this also. It will have to construct pages in a given way and, if it discovers that the logic is broken, it will have to backtrack and re-assemble the pages in a different way. This at least, is within reach.
So we did the latter. You can grep the text and discover that some pages need to follow on others. Then you can add this logic to the optimiser to force it to maintain that order. The trick is you have to discover it yourself. A transformer that can parse many pages may be able to see that the "second pill" page must come after the "first pill" page, but this is hypothetical, as I doubt such a model really exists at the moment. So you grep the text for such references and then you hard code that into the optimiser and you use the simple transformer to try and get page pairs out of the text.
I hope this answers your question? If this was a "normal" book with a coherent, flowing story and with pages that break mid-sentence, I think a transformer would have a fighting chance. Here we did do that and we used exactly that approach to get the probability matrix, but given the cryptic nature of the puzzle, we had to add more as I described in the original post. So to conclude, we did follow your approach initially and up to a point, but it really did not work here. We did not discard it, we had to augment it with human conclusions and the optimiser, not the transformer, became central to the solution.
I want to discuss in more detail the kind of constraints we used to tackle this, but will just wait a bit longer before diving into that.
However, I want to expand a bit more on our approach. One of the beautiful products of our approach is the probability matrix. It makes thinking about the problem and fitting NLP models to it really very simple and straightforward. Remember, the probability matrix uses some NLP model to estimate the probability that a given page follows on another one. With this in hand, you can build an ensemble of NLP models based on multiple ways in which you can segment the text, and just update the matrix. Nice and easy.
As with so many things, this is also not entirely on target - reality is a tad more complex.
The probabilities, for such a tough puzzle as Cain, are essentially random. Even if we generously assume relatively accurate probabilities, then an otherwise accurate solution have to be off by just 1 position to score 0!
The solution asks for the position of pages, not the order. So to move closer to the problem, you have to e.g. train an NLP to determine the first page, then the second page of the book and so on for example. I don't think NLP is really built for this.
If you use the probabilities, it will give you a type of cyclical ordering and you have to be somewhat - ahem - creative to find the head and tail of it.
So while the probabilities helped shape our thinking and implementation, and especially helped us get going with the optimiser, it later on became a bit of a dead weight as the optimiser grew in importance but still heavily relied on these probabilities.
I will share how this transpired and how we dealth with it in a follow up post to this thread in the near future.
The problem with the metric here is that it asks for precise positions. A page must be in the correct position.
Most of what I've described so far tries to get the page order right, but that is not really solving the problem. It is also thus difficult to know if one is making progress in getting a better order as you work on the optimiser when you are scored on correct positions.
This is an interesting discussion, how to build a model that somehow can estimate the position for a page rather than try to order it correctly? Any ideas? @victorkras2008 ?
It seems that, just like a human approaching this, you need to find all the hidden conditions in the text. It is not enough to know e.g. just that the first pill page will come before the second pill page. You need such a condition for every page. Thus to fully determine the solution, you need to figure out multiple conditions for each page and those conditions, not the probabilities, will dictate the order and those conditions need to be fed into the optimiser rather than the probabilities.
The problem here is that those conditions are mostly human-derived. Unless if you have a way to extract them using NLP? Any ideas? @NikitaChurkin ?