Uber Movement SANRAL Cape Town Challenge
$5,500 USD
Predict when and where road incidents will occur next in Cape Town
11 October 2019–9 February 2020 23:59
738 data scientists enrolled, 112 on the leaderboard
A streamlined setup for using Colab for Zindi competitions
published 5 Feb 2020, 16:22

Hey guys,

With this competition coming to a close, I thought I'd share with you all something I learned during the process.

When Johno shared his Colab notebooks, I naturally wanted to run them and play with them, but I quickly ran into issues loading my data into Colab. With large datasets, it can take quite a while for the file to upload, and when the runtime disconnects you have to do it all over again. I'm sure I'm not the only one who thought, "hmmm, there must be a better way of dealing with this". So I want to share how I went about using Colab efficiently. I think it's something that could be really useful to many new Zindians and Data Scientists alike. Unfortunately I'll have to skim over some of the finer details, but if there's interest please indicate so and maybe I can do a little blog post for Zindi ^_^

Firstly, I want to quickly brush over what a proper repo structure looks like for any data science project. The people at DrivenData built a cookiecutter template for data science projects which is widely used in the industry, and they describe the details very well (see https://drivendata.github.io/cookiecutter-data-science/). In essence, all it does is give you a template for starting a new project. So, for example, if you start a new competition, you can create a new repo from the cookiecutter and voilà! The structure is set up in a way that makes it easier to deal with your data in a reproducible way.
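To make that concrete, here's a minimal sketch of spinning up a new project from that template. It assumes you've installed the cookiecutter package; the template will prompt you for a project name and so on, and you can just as well run the same thing from the command line.

--

# pip install cookiecutter
from cookiecutter.main import cookiecutter

# Generate a fresh project skeleton (data/raw, data/processed, notebooks/, src/, ...)
# from the DrivenData template; it prompts for a project name etc.
# CLI equivalent: cookiecutter https://github.com/drivendata/cookiecutter-data-science
cookiecutter('https://github.com/drivendata/cookiecutter-data-science')

--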

What's really important, though, is the bit about how notebooks should be treated as directed acyclic graphs (DAGs). All it comes down to is that you shouldn't mess with your raw data, and that a notebook (or analysis) should take an input and produce an output. For example, I've seen many Kagglers and Zindians do all their preprocessing steps in the same notebook where they run their models. That's not the DAG approach. The DAG approach would be to create a data_processing.ipynb notebook which takes in the raw data from data/raw/* and outputs a new dataset (let's say data_v1.csv) to data/processed/* (or data/modelling/*). Then you create a modelling.ipynb notebook which only ever takes in a prepared and processed dataset, does NO data processing, and outputs a model and predictions / submission. Typically, you'd have some feature engineering notebooks and the like in between pre-processing and modelling. Naturally, some feature engineering such as target encoding and oversampling needs to happen within each fold, so that's a hard constraint, but that's the exception. (Notice how I put a version on the dataset; something you might want to consider if you're trying out different feature engineering / processing techniques.)
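As a rough illustration of that split, here's a minimal sketch. The raw file name, the column names and the cleaning step are all just placeholders for whatever your actual pipeline does; the point is only that processing writes a versioned file and modelling only ever reads one.

--

import pandas as pd

# --- data_processing.ipynb: raw data in, versioned dataset out ---
raw = pd.read_csv('data/raw/train.csv')          # raw data is never modified in place
processed = raw.dropna(subset=['datetime'])      # placeholder cleaning step
processed['hour'] = pd.to_datetime(processed['datetime']).dt.hour
processed.to_csv('data/processed/data_v1.csv', index=False)

# --- modelling.ipynb: processed data in, model + submission out (no processing here) ---
data = pd.read_csv('data/processed/data_v1.csv')
# ... fit a model on `data`, predict, and write out a versioned submission file ...

--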

Once you have your repo set up in a way that produces datasets that are ready for modelling, the next step is to get those datasets to automatically sync to a place Colab can access. Naturally, the obvious choice is Google Drive, but you can do fancy rsync things with S3 or Google Cloud Storage buckets if you want; this post is about how to use Google Drive. The trick here, though, is that you DON'T want to sync your local output data folder straight to Google Drive. The reason is that if you sync a local folder that way, it goes to a place you can't access from Colab (or at least I had many pains trying to).

So to get around that, what you can do is this:

1. Create a folder on Google Drive (using the website or in the Google Drive folder on your PC) called "datasets" or anything you like.

2. In that folder, create a new folder for the project you're working on. For this competition, mine was called "Google Drive/datasets/zindi-uber-movement".

3. Now the sneaky bit: create a symlink (see https://www.shellhacks.com/symlink-create-symbolic-link-linux/) between that folder and the local folder that contains your modelling datasets, as sketched just after this list.
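Here's a minimal Python sketch of that symlink step (the linked article shows the plain ln -s version). The paths are placeholders for my layout, so adjust them to wherever your project and your Google Drive folder actually live. One option is to make the Drive-side project folder the symlink itself, so the processed files land directly under datasets/zindi-uber-movement, which is the path the Colab snippet below expects.

--

import os
from pathlib import Path

# Placeholder paths: adjust to your own machine and project layout.
local_processed = Path.home() / 'projects' / 'zindi-uber-movement' / 'data' / 'processed'
drive_link      = Path.home() / 'Google Drive' / 'datasets' / 'zindi-uber-movement'

# Make the Drive-side project folder a symlink to the local processed folder, so
# files written by the processing notebooks get picked up by the sync client.
# (If you already created an empty folder there in step 2, remove it first.)
# Shell equivalent:
#   ln -s ~/projects/zindi-uber-movement/data/processed ~/"Google Drive"/datasets/zindi-uber-movement
os.symlink(local_processed, drive_link)

--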

Now, due to the symlink, Google Drive will see that there's content in that folder and automatically sync it to a folder in your drive which IS accessible from Colab. Then you can mount your drive within Colab and load the dataset like this:

--

import os
import pandas as pd
from google.colab import drive

# Mount Google Drive inside the Colab runtime.
drive.mount('/content/drive')

drive_data_path = '/content/drive/My Drive/datasets/zindi-uber-movement'

data_version = 'v1'
f_data = os.path.join(drive_data_path, f'data_{data_version}.parquet.gzip')
f_sub = os.path.join(drive_data_path, f'submission_{data_version}.parquet.gzip')

def load_data(f, process=True):
    '''
    Load a file.

    If process is True, apply basic manipulation like sorting.
    '''
    if os.path.isfile(f):
        data = pd.read_parquet(f)
        if process:
            data = preprocess(data)  # preprocess() is defined elsewhere in the notebook
        print('Loaded dataset with shape:', data.shape)
        return data
    else:
        print('Could not find:', os.path.basename(f))

data = load_data(f_data)
sub_data = load_data(f_sub)

--

What all of this gives you is a system where, if you come up with a new idea for feature engineering, or merge a new dataset into your existing one, you run the applicable notebook, which then outputs a new dataset, let's say data_v2.csv. Once that notebook has finished saving the file to your disk, the new dataset will automatically be synced to Google Drive in a folder which you can easily access within Colab using the built-in mounting functionality.

I hope this explanation helped. When the competition is done I will be open-sourcing my solution repo and will post the link in the comments.

Happy hacking for the next few days and I hope the private leaderboard is kind to your local validation scheme ;-)

ugh, the code formatting is terrible, but you guys get the idea.

replying to RenierBotha

Renier, thank you for the idea. Indeed, it is a very useful one. It can be taken one step further, though: use DVC (https://dvc.org/).

Nice, thanks. I applied something similar, but I did mine manually.

This is very useful, thank you!

Hello @RenierBotha, can you please do a complete blog post on this for better understanding? Thank you.