How to download data files from Zindi to Colab
Getting started · 24 Feb 2022, 17:57

If you’ve participated in a Zindi competition or any other data science hackathon, there is a good chance you used Google Colab to explore the data and engineer new features. Did you know that there is a way to connect Google Colab directly with a Zindi competition so that you don’t have to download the data to your local machine? Join Zindi’s data science guru Brainiac as he takes you through this easy step-by-step method.

Most data science hackathon winners have one thing in common: they spend the majority of their time carrying out experiments, exploring the data, and engineering new features. The more experiments a data scientist does, the more likely they are to win a competition. One of the most popular cloud processing platforms for this type of work is Google Colab.

To reduce the time and bandwidth needed to download data files from Zindi and then upload them to Colab repeatedly, we can directly download files from Zindi to Colab. This is how 😊

Finding the authorisation token on Zindi

To download data from a Zindi competition, we need:

  • A link to the data
  • An authorisation token, which shows that you have logged in and joined the competition

First, we need to log in to the Zindi platform.

The next step is to navigate to the competition that we want to download data from. For illustration, we will download files for the Hulkshare Recommendation Algorithm Challenge.

Next, navigate to the data section that has the heading Files:

Right-click on the page and select Inspect.

In the panel that opens, press CTRL + F and search for “https://api.zindi”. All Zindi file links start with this prefix. The total number of files is also shown at the end of the search line.

Next, we need to get the link and auth_token from the site.

To copy the link, right-click on the highlighted data form and choose Edit.

Then copy the link and the auth_token. Note that the link will be the same for everyone, but the auth_token will be different for each user.

Now we have all the ingredients to download data from Zindi to Colab.
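
Because the auth_token is personal, it can be kept out of the notebook itself rather than pasted into a cell. A minimal sketch of one way to do that, assuming the token is either stored in an environment variable or typed in when prompted; the helper name load_zindi_token and the variable name ZINDI_AUTH_TOKEN are hypothetical and not part of Zindi or Colab:

```python
import os
from getpass import getpass


def load_zindi_token(env_var="ZINDI_AUTH_TOKEN"):
    """Return the token as a dict with an 'auth_token' key.

    env_var is a hypothetical variable name chosen for this sketch.
    """
    # Prefer a value already stored in the environment, if there is one
    raw = os.environ.get(env_var, "")
    if not raw:
        # Otherwise ask for it; getpass hides the characters as they are typed
        raw = getpass("Zindi auth_token: ")
    raw = raw.strip()
    if not raw:
        raise ValueError("An empty auth_token cannot authorise the download")
    return {"auth_token": raw}
```

Calling token = load_zindi_token() gives a dictionary in the same shape as the hand-written token used in the rest of this tutorial.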

In Colab, import the requests library, which will fetch the data from Zindi, and tqdm, which will display a progress bar.

# Import libraries
import requests
from tqdm.auto import tqdm

Next, set the data link and token:

# Data url and token
data_url = "https://api.zindi.africa/v1/competitions/hulkshare-recommendation-algorithm-challenge/files/test_frames1.zip" # url
token = {'auth_token': ''} # Use your own token
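
A competition usually has more than one file, and the link above has a regular shape: the API prefix, the competition name, then the file name. A small sketch that composes the link instead of copying it for each file, assuming other files follow the same pattern as the one shown here; the helper name build_zindi_file_url is hypothetical:

```python
from urllib.parse import quote

# Prefix taken from the link used in this tutorial
ZINDI_API_BASE = "https://api.zindi.africa/v1/competitions"


def build_zindi_file_url(competition_slug, file_name):
    """Compose a link of the form <prefix>/<competition>/files/<file>."""
    # quote() escapes characters, such as spaces, that are not valid in a URL path
    return f"{ZINDI_API_BASE}/{quote(competition_slug)}/files/{quote(file_name)}"


slug = "hulkshare-recommendation-algorithm-challenge"
data_url = build_zindi_file_url(slug, "test_frames1.zip")
print(data_url)
```

The file names themselves still have to be read from the competition's data page, as described earlier.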

Next, we define a function that will download the data for us.

The function takes the url, the auth_token and the name of the file, and saves the downloaded file under that name in Colab's working directory.

# Function to download data
def zindi_data_downloader(url, token, file_name):
    # Request the competition file, sending the auth token as form data
    competition_data = requests.post(url=url, data=token, stream=True)
    competition_data.raise_for_status()  # Stop early if the link or token is rejected

    # Progress bar to monitor the download
    total = int(competition_data.headers.get('content-length', 0))
    pbar = tqdm(desc=file_name, total=total, unit='B', unit_scale=True, unit_divisor=1024)
    # Create the file and write the data to the Colab disk in chunks
    with open(file_name, "wb") as handle:
        for chunk in competition_data.iter_content(chunk_size=512):  # Download the data in chunks
            if chunk:  # Filter out empty keep-alive chunks
                handle.write(chunk)
                pbar.update(len(chunk))
    pbar.close()

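Finally, we call the function to download the data:

# Download data
file_name = 'test_frames1.zip'
zindi_data_downloader(url = data_url, token = token, file_name = file_name)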

We can confirm that the data has been downloaded.
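
One way to make that confirmation in code rather than by eye is to check that the file exists, is not empty and, for a .zip file, is a readable archive. The sketch below uses only the Python standard library; the helper name check_download is hypothetical:

```python
import os
import zipfile


def check_download(file_name):
    """Return the file size in bytes, raising an error if the download looks wrong."""
    if not os.path.isfile(file_name):
        raise FileNotFoundError(f"{file_name} was not created")
    size = os.path.getsize(file_name)
    if size == 0:
        raise ValueError(f"{file_name} is empty")
    # For .zip files, also check that the archive structure can be read
    if file_name.endswith(".zip") and not zipfile.is_zipfile(file_name):
        raise ValueError(f"{file_name} is not a valid zip archive")
    return size
```

For the file in this tutorial, the call would be check_download('test_frames1.zip'). A "not a valid zip archive" error usually means something other than the data was saved, so the link and token are worth checking first.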

Next, we unzip the downloaded data:

# Unzip data
!unzip -q /content/test_frames1.zip
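
The !unzip line relies on the notebook's shell escape, so it only runs inside Colab or another IPython-based notebook. A rough equivalent in plain Python, using the standard-library zipfile module, is sketched below; the helper name unzip_file is hypothetical:

```python
import zipfile


def unzip_file(zip_path, target_dir="."):
    """Extract every member of zip_path into target_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

For example, unzip_file('/content/test_frames1.zip', '/content') would extract the archive next to the downloaded file and return the list of extracted names.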

Putting the code all together

# Import libraries
import requests
from tqdm.auto import tqdm
# Function to download data
def zindi_data_downloader(url, token, file_name):
    # Request the competition file, sending the auth token as form data
    competition_data = requests.post(url=url, data=token, stream=True)
    competition_data.raise_for_status()  # Stop early if the link or token is rejected

    # Progress bar to monitor the download
    total = int(competition_data.headers.get('content-length', 0))
    pbar = tqdm(desc=file_name, total=total, unit='B', unit_scale=True, unit_divisor=1024)
    # Create the file and write the data to the Colab disk in chunks
    with open(file_name, "wb") as handle:
        for chunk in competition_data.iter_content(chunk_size=512):  # Download the data in chunks
            if chunk:  # Filter out empty keep-alive chunks
                handle.write(chunk)
                pbar.update(len(chunk))
    pbar.close()

# Data url, token and file_name
data_url = "https://api.zindi.africa/v1/competitions/hulkshare-recommendation-algorithm-challenge/files/test_frames1.zip" # url
token = {'auth_token': ''} # Use your own token
file_name = 'test_frames1.zip'
# Download data
zindi_data_downloader(url = data_url, token = token, file_name = file_name)
# Unzip data
!unzip -q /content/test_frames1.zip