There is data all around us, but how do we collect it, clean it and present it in a way that makes it useful for building models and apps? Our Zindi Ambassador Davis David has created a notebook to walk you through scraping data from Twitter for a dataset collection challenge.
Zindi has hosted two dataset collection challenges: the AI4D and GIZ Indigenous African Language Collection Challenge and, most recently, the AFD Gender-Based Violence Dataset Collection Challenge. These challenges are difficult because there is little accessible data in these two fields, especially in the African context.
The AFD Gender-Based Violence Dataset Collection Challenge calls on the Zindi community to help create, curate and collate quality datasets on GBV. The objective of this challenge is to help shed light on this topic, to lay the groundwork for informed actions and to support data-driven solutions to contribute to the battle to end GBV.
When collecting data you need to think about:
Over and above this, your dataset needs to be complete and accurate; without addressing this, your data and potential insights can lead to incorrect conclusions and potential bias. Read this article by Florencia Mangini on how she categorises completeness and accuracy when building datasets.
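As a quick sanity check on completeness, you might, for example, inspect the fraction of missing values per column with pandas. This is only a minimal sketch; the sample records and column names below are hypothetical:

```python
import pandas as pd

# Hypothetical sample of collected records; some fields are missing
records = pd.DataFrame({
    "text": ["tweet one", "tweet two", None, "tweet four"],
    "place": [None, "Dar es Salaam", None, "Nairobi"],
})

# Fraction of missing values per column: a rough completeness signal
missing_ratio = records.isna().mean()
print(missing_ratio)
```

A high missing ratio in a column you care about (such as location) is a signal to rethink how, or from where, you are collecting the data.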
In this tutorial, we will walk you through the steps required to scrape data from Twitter using the tweepy Python library, for use in curating a GBV dataset to submit to the AFD Gender-Based Violence Dataset Collection Challenge. You can follow the steps on this blog or via Davis’ notebook here [link].
Trigger Warning: This notebook collects tweets that could contain sensitive information for some readers.
Install the following Python packages, which will help you collect data from twitter.com:
!pip install tweepy
!pip install unidecode
import tweepy
from unidecode import unidecode
from tqdm import tqdm
import pandas as pd
import numpy as np
You will need to apply for a developer account to access the API. The Standard APIs are sufficient for this tutorial. They’re free, but have some limitations that we’ll learn to work around in this tutorial. Once your developer account is set up, create an app that will make use of the API:
Now that you have created a developer account and an app, you should have a set of keys to connect to the Twitter API. Specifically, you’ll have a:

Consumer key
Consumer secret
Access token
Access secret
These could be inserted directly into your code to connect to the Twitter API, as shown below.
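Hard-coding keys is fine for a private notebook, but if you plan to share your code, one common alternative is to read them from environment variables instead. A sketch, with hypothetical environment variable names (they are a convention chosen here, not something tweepy requires):

```python
import os

def load_twitter_credentials():
    """Read Twitter API credentials from environment variables.

    The variable names used here are hypothetical; set them in your
    shell or notebook environment before running the scraper.
    """
    return {
        "consumer_key": os.environ.get("TWITTER_CONSUMER_KEY", ""),
        "consumer_secret": os.environ.get("TWITTER_CONSUMER_SECRET", ""),
        "access_token": os.environ.get("TWITTER_ACCESS_TOKEN", ""),
        "access_secret": os.environ.get("TWITTER_ACCESS_SECRET", ""),
    }
```

This keeps your secrets out of the notebook itself, so you can publish your code without publishing your keys.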
consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
# wait_on_rate_limit makes tweepy pause automatically when the rate limit is reached
api = tweepy.API(auth, wait_on_rate_limit=True)
def tweetSearch(query, limit):
    """
    Search Twitter for the given query and
    return a list of all tweets that match it.
    """
    # Create a blank list to hold the tweets
    tweets = []
    # Iterate through Twitter using Tweepy to find our query, up to our defined limit
    for page in tweepy.Cursor(
        api.search, q=query, count=limit, tweet_mode="extended"
    ).pages(limit):
        for tweet in page:
            tweets.append(tweet)
    # return tweets
    return tweets
def tweets_to_data_frame(tweets):
    """
    Receive a list of tweets and collect specific data from each one, such as
    the place, the tweet's text, likes and retweets, into a pandas DataFrame.
    Return a pandas DataFrame that contains the data from Twitter.
    """
    df = pd.DataFrame(data=[tweet.full_text.encode('utf-8') for tweet in tweets], columns=["Tweets"])
    df["id"] = np.array([tweet.id for tweet in tweets])
    df["lens"] = np.array([len(tweet.full_text) for tweet in tweets])
    df["date"] = np.array([tweet.created_at for tweet in tweets])
    df["place"] = np.array([tweet.place for tweet in tweets])
    df["coordinates"] = np.array([tweet.coordinates for tweet in tweets])
    df["lang"] = np.array([tweet.lang for tweet in tweets])
    df["source"] = np.array([tweet.source for tweet in tweets])
    df["likes"] = np.array([tweet.favorite_count for tweet in tweets])
    df["retweets"] = np.array([tweet.retweet_count for tweet in tweets])
    return df
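If you want to see the shape of the resulting DataFrame without calling the Twitter API, you can exercise the same pattern on mock tweet objects. This is a sketch: the mock mirrors only a small subset of the attributes a real tweepy Status object carries.

```python
from collections import namedtuple

import numpy as np
import pandas as pd

# Minimal stand-in for a tweepy Status object (hypothetical subset of fields)
MockTweet = namedtuple("MockTweet", ["full_text", "id", "favorite_count", "retweet_count"])

mock_tweets = [
    MockTweet("example tweet one", 1, 10, 2),
    MockTweet("example tweet two", 2, 5, 1),
]

# Same construction pattern as tweets_to_data_frame, on the mock data
demo_df = pd.DataFrame(data=[t.full_text for t in mock_tweets], columns=["Tweets"])
demo_df["id"] = np.array([t.id for t in mock_tweets])
demo_df["likes"] = np.array([t.favorite_count for t in mock_tweets])
demo_df["retweets"] = np.array([t.retweet_count for t in mock_tweets])

print(demo_df.shape)  # (2, 4)
```

Testing the transformation on mock objects like this also makes it easier to catch column typos before you spend your API quota on a real run.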
STEP 7: ADD TWITTER HASHTAGS RELATED TO GENDER-BASED VIOLENCE
# add hashtags in the following list
hashtags = ['#GBV', '#sexism', '#rape']
Here you might want to think about other relevant hashtags or search terms that could be useful in building a useful dataset, depending on what you want to achieve.
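One way to broaden a search without making extra requests is to combine several terms into a single query string using Twitter's `OR` search operator; `-filter:retweets` excludes retweets. A small sketch (the extra search terms here are only illustrative):

```python
# Combine several search terms into one query string using Twitter search syntax
terms = ['#GBV', '#sexism', '#rape', '"gender-based violence"']
query = " OR ".join(terms) + " -filter:retweets"
print(query)
```

A combined query like this could then be passed to tweetSearch in place of a single hashtag.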
total_tweets = 0
The following for loop collects all tweets that have the hashtags mentioned in the list and saves them into a CSV file.
for n in tqdm(hashtags):
    # first we fetch all tweets that have the specific hashtag
    hash_tweets = tweetSearch(query=n, limit=7000)
    total_tweets += int(len(hash_tweets))
    # second we convert our tweets into a dataframe
    df = tweets_to_data_frame(hash_tweets)
    # third we save the dataframe into a csv file
    # (the filename pattern below is one reasonable choice)
    df.to_csv("{}_tweets.csv".format(n.strip("#")), index=False)

# show total number of tweets collected
print("Total tweets collected: {}".format(total_tweets))
For more tweepy configuration options, please read the tweepy documentation here.
Davis David is Zindi Ambassador for Tanzania and a data scientist at ParrotAI. He is passionate about artificial intelligence, machine learning, deep learning and big data. He is a co-organizer and facilitator of the AI movement in Tanzania; conducting AI meetups, workshops and events with a passion to build a community of data scientists to solve local problems. He can be reached on Twitter @Davis_McDavid.