Congrats @Milind on winning this with an absolutely fantastic score! You must have some special features and a lucky seed somewhere.
This was a nice competition. A week ago I was home-alone and bored and to stay out of mischief, entered this comp. I looked at the really nice starter, but decided it probably overcomplicates things, and abandoned it in favour of something simpler - or so I hoped.
I had no idea what I was doing, but soon bumped into tesseract and combined that with LayoutLM through simpletransformers. How could I go wrong? I mean, even @wuuthraad is a simpletransformers contributor.
Except, as soon as I upgraded all my pip packages, simpletransformers stopped working!
So my initial model no longer worked, and after a day or two of trying to get LayoutLM working using some Hugging Face model directly, I gave up. Not enough time ... simpletransformers had taken care of this for me earlier, but I could never figure out for myself how to package the bboxes correctly for the model.
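For what it's worth, the part I couldn't get right is roughly this: LayoutLM expects each token's bounding box rescaled to a 0-1000 grid alongside the usual token ids. A minimal sketch of what I think that packaging should look like with the Hugging Face tokenizer (the words, boxes and image size below are just illustrative placeholders, not from my pipeline):

```python
from transformers import LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")

def normalize_box(box, width, height):
    # LayoutLM wants pixel coordinates rescaled to a 0-1000 grid
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

# words and pixel boxes as they might come out of an OCR step (illustrative)
words = ["Introduction", "to", "5G"]
boxes = [(40, 30, 260, 70), (270, 30, 310, 70), (320, 30, 380, 70)]
width, height = 1280, 720

tokens, token_boxes = [], []
for word, box in zip(words, boxes):
    # each word can split into several sub-tokens; repeat its box for each one
    sub_tokens = tokenizer.tokenize(word)
    tokens.extend(sub_tokens)
    token_boxes.extend([normalize_box(box, width, height)] * len(sub_tokens))
```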
So I replaced my NLP with just simple word counts and Naive Bayes, but this was very slow to fit, so eventually I replaced NB with a perceptron and a passive-aggressive classifier. This was followed by a nice ensemble that used the NLP outcome together with features such as box coordinates and the ratio of digits to letters to make the final decision.
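In case it helps anyone, the idea is nothing fancier than a two-stage setup along these lines. The toy data, the digit-ratio feature and the final-stage model here are illustrative; my actual ensemble was messier:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.ensemble import RandomForestClassifier

# toy OCR lines with labels (1 = title, 0 = not a title) -- illustrative only
texts = ["Introduction to 5G Networks", "slide 12 of 40",
         "Machine Learning Basics", "page footer 2021"]
labels = [1, 0, 1, 0]
# per-line layout features: x0, y0, width, height of the box (illustrative values)
boxes = np.array([[50, 20, 600, 60], [900, 700, 100, 20],
                  [60, 25, 550, 55], [80, 710, 200, 18]])

# stage 1: bag-of-words text model
vec = CountVectorizer()
X_text = vec.fit_transform(texts)
text_clf = PassiveAggressiveClassifier(max_iter=1000)
text_clf.fit(X_text, labels)

# extra hand-made features: box coordinates plus digit-to-letter ratio
def digit_ratio(s):
    letters = sum(c.isalpha() for c in s)
    digits = sum(c.isdigit() for c in s)
    return digits / max(letters + digits, 1)

extra = np.column_stack([boxes, [digit_ratio(t) for t in texts]])

# stage 2: combine the text model's decision score with the layout features
scores = text_clf.decision_function(X_text).reshape(-1, 1)
X_final = np.hstack([scores, extra])
final_clf = RandomForestClassifier(n_estimators=100).fit(X_final, labels)
```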
I now had a nice pipeline going, but the biggest stumbling block seemed to be the inaccurate tesseract OCR, so I started working on preparing the images for better tesseract results.
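The preparation was the usual kind of thing: upscale, greyscale and binarise before handing the image to tesseract. A rough sketch with OpenCV and pytesseract; the exact steps and parameters here are my illustrative guesses rather than the exact pipeline I ran:

```python
import cv2
import pytesseract

def ocr_slide(path):
    img = cv2.imread(path)
    # upscale a little; tesseract tends to do better on larger text
    img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    # greyscale and Otsu binarisation to clean up slide backgrounds
    grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # image_to_data returns words with their bounding boxes and confidences
    return pytesseract.image_to_data(binary, output_type=pytesseract.Output.DICT)
```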
I think your score here would correlate well with the accuracy of your OCR.
Anyhow, the preprocessing also took way too long. By now it was mid-week in a comp in which I only had a week to play, so I started looking for a better OCR and a spell checker. The spell checker only made things worse, it seems, but then I remembered: Apple stuff has a built-in OCR (called Vision) that I had always wanted to try. It took but 50 lines of nice Objective-C code to have a drop-in tesseract replacement, and the performance improvement was huge! I had a chance!
Soon I placed well: I was #3 based on Apple's OCR. Unlike tesseract, this OCR just gives you a line at a time, so you have to figure out whether two lines are, e.g., part of the same title or paragraph.
The rest of this comp for me really was about calibrating and tweaking a quick home-grown algo to turn lines into sentences or paragraphs. If I had more time and implementation skill and knowledge, perhaps I'd replace this with some clustering model. It seems quite non-linear and tricky, but again, no time for such niceties.
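The home-grown algo is essentially a greedy pass over the lines sorted by vertical position: if the gap to the previous line is small relative to the line height, treat them as the same block. Something in this spirit, where the threshold is exactly the sort of thing I kept tweaking and the data structure is illustrative:

```python
def group_lines(lines, gap_factor=0.6):
    """Greedily merge OCR lines into blocks based on vertical proximity.

    lines: list of dicts with 'text' and 'box' = (x0, y0, x1, y1), roughly
    as an OCR engine might return them (illustrative structure).
    """
    lines = sorted(lines, key=lambda l: l["box"][1])  # top-to-bottom
    blocks, current = [], []
    for line in lines:
        if current:
            prev = current[-1]["box"]
            height = prev[3] - prev[1]
            gap = line["box"][1] - prev[3]
            # start a new block if the vertical gap is large relative to the line height
            if gap > gap_factor * height:
                blocks.append(current)
                current = []
        current.append(line)
    if current:
        blocks.append(current)
    return [" ".join(l["text"] for l in block) for block in blocks]
```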
Then I realised this comp only allows open-source solutions. Initially I thought I'd just stick with the Apple OCR, but after a while I stumbled into PaddleOCR. Wow! How nice. Even if I wasn't able to get it to do everything I wanted in the limited time left, I could use it as a replacement for the Apple OCR.
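In the end I used PaddleOCR pretty much with its defaults, along these lines (the file name is a placeholder, and the exact nesting of the result varies a bit between PaddleOCR versions):

```python
from paddleocr import PaddleOCR

# default English detection + recognition models, with the angle classifier enabled
ocr = PaddleOCR(use_angle_cls=True, lang="en")
result = ocr.ocr("slide.png", cls=True)

# recent versions return a list per image; each entry is [box, (text, confidence)]
for box, (text, confidence) in result[0]:
    print(text, confidence, box)
```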
So for the remaining day or two I tweaked my line-combining algo, always wanting to replace it with something better but never finding the time. Likewise, I tried to customise Paddle but struggled too much and eventually just used the defaults.
But there you go, the bell rings and I'm #11 at the end. It comes with well-deserved humble pie, but it is also nice, because anything better and I'd have to clean up the mess I've created by now to submit for Zindi's code review.
What I do find truly amazing is how others started playing even after me and did so well (e.g. @CaptainData and @ASSAZZIN - congrats to you on such a sterling performance). My attempt was so rushed, I don't want to imagine what they did. Perhaps I should have stayed with the starter ...
Hi Skaak, I also enrolled in this comp but never made a submission. I guess my hardware is too small, but I went to Colab, encountered too many hiccups and decided not to continue with it. I saw Hugging Face launched a nice comp over the weekend, "movie genre prediction". Here is the link: https://huggingface.co/competitions. You must check it out and let me know if you have some ideas. Great week ahead my friend.
Thanks - your comment warms my heart. I hope you also have a great week.
That comp looks like pretty straightforward NLP actually. Nothing much to it, but that is a quick conclusion, the kind that gets you into trouble.
If Colab took ZAR I'd have been a customer, but I don't think you need too heavy computing power for that, if you stick to the road. You might not win it that way, but at least you'd have a nice, lightweight pipeline.
Been chewing on this all morning - it's very tempting ... I'm a bit stretched at the moment, but what the heck, if you want we can join forces on this one.
Hi Skaak, I will get back to you, I just want to work through the starter code from the YouTube vid. And I'm still busy with my CPI magnum opus ... lol
Hi @Skaak.
Thanks for sharing your approach.
I stuck with the starter notebook, trained different models and then ensembled them (a small sketch of what I mean by ensembling is below). Ensembling really helped with handling overfitting. I also tried data augmentation, but it didn't seem promising based on the public score.
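By ensembling I mean nothing more elaborate than averaging the predicted probabilities of the separately trained models and taking the argmax; the values below are just illustrative:

```python
import numpy as np

# predicted class probabilities from three models trained separately (illustrative values)
model_probs = [
    np.array([[0.7, 0.3], [0.4, 0.6]]),
    np.array([[0.6, 0.4], [0.5, 0.5]]),
    np.array([[0.8, 0.2], [0.3, 0.7]]),
]

# simple average of probabilities; the class with the highest mean wins
ensemble = np.mean(model_probs, axis=0)
predictions = ensemble.argmax(axis=1)
print(predictions)  # -> [0 1]
```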
I have also experienced hardware limitations and Colab hiccups like @Jaw22.
In Colab, it was really slow (batch size = 2).
So I turned to Kaggle GPUs, which have slightly more dedicated memory than the Colab ones, with a batch size of 4.
Congrats to everyone. We've definitely learnt from this contest.
Thanks to @Zindi and the organizing partners.
You can find my approach here:
https://github.com/SIMSON20/Title-Extraction-in-Lecture-Slides-Challenge-by-ITU-AI-ML-in-5G-Challenge
Hi @CaptainData,
Thank you for sharing your approach, much appreciated. I will look into it, and congratulations to you.
Thanks for sharing so freely!
Kaggle is quite generous; you actually get a lot, and seemingly much better than Colab's free offering.
Congrats again, you did so well in this one and in record time also.