Ever wondered what goes into launching a data science competition?
If so, this post is for you. I spent the last few days working on the Fowl Escapades: Southern African Bird Call Audio Identification Challenge on Zindi, and thought it would be fun to take you behind the scenes a little to show how it all came together.
Many competitions spring from an existing problem in need of a solution. For example, you may want a way to predict when your delivery will arrive based on weather, traffic conditions and the route your driver will take. In cases like this, an organization will reach out to Zindi with this problem statement, and move to stage 2 to see if it’s a viable competition idea. But this isn’t the only way competitions are born!
Sometimes, we find a cool dataset that naturally lends itself to answering an interesting problem. Sometimes we start with an interesting problem, and go looking for data that could help find answers. And occasionally, we start with nothing but a passing question at the end of a meeting: ‘does anyone have any other competition ideas?’. This was the case here.
I had been wanting to try my hand at something involving audio data. Since I happen to be an avid birder, I thought automatic birdsong identification would be an interesting topic. For this to work, we’d need bird calls – lot’s of them. Fortunately, after a bit of searching I found the star of this competition: https://www.xeno-canto.org/. Hundreds of thousands of calls from all over the world! A competition idea was born.
To run a competition, you need some data (unless you’re going to ask the participants to find it for themselves!). This must:
By the time a dataset is released as part of a competition, it’s usually been through several stages of preparation. Let’s use the birdsong example and look at a few of there steps.
Many of these steps are the same when approaching a data science project for your own work. It’s still important to clean and check the data before launching into the modelling process, and masking is useful if you’ll need to share results or experiments without necessarily sharing all your secret info.
Aside from getting the data ready, there are all sorts of extra little steps required to arrive at something you’re happy to share with the world. An incomplete list of TODOs for our latest launch:
A confusion matrix – one way to quickly see how well a classification algorithm is working. (from the starter notebook)
Fine-tuning the benchmark model
I could add lots more. I’ve worked on quite a few contests with the Zindi team, but usually I’m just part of the data cleaning and modelling steps. I’ve had such a ball moving this one from start to finish alongside the rest of the team, and I really appreciate all the hard work they do to keep us DS peeps entertained!
I hope this has been interesting. As I said, this whole process has been a blast. So if you’re sitting on some data, or know of a cool dataset, why not reach out and host a competition? You might even convince them to let you name it something almost as fun as ‘Fowl Escapades’. 🙂
Jonathan Whitaker (Johnowhitaker) enjoys playing with data, writing tutorials and sharing his findings as he trawls through the world of data science. Johno was the 2nd Zindian to join Zindi and we are grateful for his contributions to the platform.
You can read the original blog post here.
Johno has a ‘tutorials’ section and a datasets page cataloging useful sources from around the web and any new data he has generated.