30 Apr 2021, 11:37

A beginner’s guide to scraping data from social media

There is data all around us, but how do we collect it, clean it and present it in a way that makes it useful for building models and apps? Our Zindi Ambassador Davis David has created a notebook to walk you through scraping data from Twitter for a dataset collection challenge.

Zindi has hosted two dataset collection challenges: AI4D and GIZ Indigenous African Language Collection Challenge and most recently the AFD Gender-Based Violence Dataset Collection Challenge. These challenges are difficult as there is little accessible data surrounding these two fields, especially in the African context.

The AFD Gender-Based Violence Dataset Collection Challenge calls on the Zindi community to help create, curate and collate quality datasets on GBV. The objective of this challenge is to help shed light on this topic, to lay the groundwork for informed actions and to support data-driven solutions to contribute to the battle to end GBV.

When collecting data you need to think about:

  • The potential impact of the insights: Does the dataset lend itself to meaningful applications (including machine learning applications) or analysis that would be likely to change thinking or even drive actions, mitigate risks or add impact?
  • Does it fill a gap?: Does the dataset cover a topic, a population, or another aspect of a target in Africa for which little data exists? Does the data create potential for new insights that don’t currently exist or are currently under-represented and under-researched?
  • The quality of the dataset: How large is the dataset? Is it robust, clean, complete, consistent and usable?
  • Documentation and presentation of the dataset: Is the dataset formatted in a logical and usable way? Are variables well-defined and assumptions and sources well-documented?

Over and above this, your dataset needs to be complete and accurate; without addressing this, your data and potential insights can lead to incorrect conclusions and potential bias. Read this article by Florencia Mangini on how she categorises completeness and accuracy when building datasets.

In this tutorial, we will walk you through the steps required to scrape data from Twitter using the tweepy Python library, for use in curating a GBV dataset to submit to the AFD Gender-Based Violence Dataset Collection Challenge. You can follow the steps on this blog or via Davis’ notebook here [link].

Trigger Warning: This notebook collects tweets that may contain content that is sensitive or distressing for some readers.

STEP 1: PYTHON PACKAGES INSTALLATION

Install the following Python packages, which will help you collect data from Twitter:
!pip install tweepy
!pip install unidecode

STEP 2: IMPORT IMPORTANT PACKAGES

#import dependencies
import tweepy
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import json
from unidecode import unidecode
import time
import datetime
from tqdm import tqdm
import pandas as pd
import numpy as np

STEP 3: AUTHENTICATING TO TWITTER'S API

You will need to apply for a developer account to access the API. The Standard APIs are sufficient for this tutorial. They’re free, but have some limitations that we’ll learn to work around in this tutorial. Once your developer account is set up, create an app that will make use of the API:

  • click on your username in the top right corner to open the drop down menu
  • click “Apps”
  • select “Create an app” and fill out the form

Now that you have created a developer account and an app, you should have a set of keys to connect to the Twitter API. Specifically, you’ll have:

  • API key
  • API secret key
  • Access token
  • Access token secret

These could be inserted directly into your code to connect to the Twitter API, as shown below.

consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'
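
Hardcoding credentials is fine for a quick experiment, but if you plan to share your notebook or code it is safer to load them from environment variables instead. A minimal sketch, assuming you have set environment variables with the (illustrative) names below:

import os

# These environment variable names are just examples; set them in your own environment
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']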

STEP 4: CONNECT TO TWITTER API USING THE SECRET KEY AND ACCESS TOKEN

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)
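
The Standard API enforces rate limits, which you will run into when collecting thousands of tweets. One way to handle this, assuming the tweepy 3.x API used in this tutorial, is to let tweepy pause automatically whenever a limit is hit, and to verify that authentication worked before you start scraping:

# Optional: wait automatically when a rate limit is reached instead of raising an error
api = tweepy.API(auth, wait_on_rate_limit=True)

# Quick sanity check that the credentials are valid
me = api.verify_credentials()
print("Authenticated as: {}".format(me.screen_name))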

STEP 5: DEFINE A FUNCTION THAT WILL TAKE OUR SEARCH QUERY

def tweetSearch(query, limit):
    """
    Search Twitter for the given query and
    return a list of all tweets that match it.
    """
    # Create an empty list to hold the tweets
    tweets = []
    # Iterate through Twitter using Tweepy to find our query, up to our defined limit
    # (the Standard Search API returns at most 100 tweets per page)
    for page in tweepy.Cursor(
        api.search, q=query, count=limit, tweet_mode="extended"
    ).pages(limit):
        for tweet in page:
            tweets.append(tweet)
    # return the collected tweets
    return tweets
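
Before launching a large collection run, you can test the function on a single hashtag with a small limit (the hashtag and limit below are only illustrative):

# Quick test of the search function on a small sample
sample_tweets = tweetSearch(query="#GBV", limit=10)
print("Tweets collected: {}".format(len(sample_tweets)))
if sample_tweets:
    print(sample_tweets[0].full_text)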

STEP 6: CREATE A FUNCTION TO SAVE TWEETS INTO A DATAFRAME

def tweets_to_data_frame(tweets):
    """
    This function receives a list of tweets and collects specific data from each one,
    such as the place, the tweet's text, likes and retweets,
    and saves them into a pandas data frame.
    It returns a pandas data frame containing the data from Twitter.
    """
    df = pd.DataFrame(data=[tweet.full_text.encode('utf-8') for tweet in tweets], columns=["Tweets"])
    df["id"] = np.array([tweet.id for tweet in tweets])
    df["lens"] = np.array([len(tweet.full_text) for tweet in tweets])
    df["date"] = np.array([tweet.created_at for tweet in tweets])
    df["place"] = np.array([tweet.place for tweet in tweets])
    df["coordinates"] = np.array([tweet.coordinates for tweet in tweets])
    df["lang"] = np.array([tweet.lang for tweet in tweets])
    df["source"] = np.array([tweet.source for tweet in tweets])
    df["likes"] = np.array([tweet.favorite_count for tweet in tweets])
    df["retweets"] = np.array([tweet.retweet_count for tweet in tweets])
    return df
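
Continuing the small test from Step 5, you can pass the sample tweets through this function and inspect the first few rows (sample_tweets is the illustrative variable from that earlier sketch):

# Convert the sample tweets into a dataframe and inspect it
sample_df = tweets_to_data_frame(sample_tweets)
print(sample_df.head())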

STEP 7: ADD TWITTER HASHTAGS RELATED TO GENDER-BASED VIOLENCE

# add hashtags in the following list
hashtags = ['#GBV', '#sexism', '#rape']

Here you might want to think about other relevant hashtags or search terms that could help you build a richer dataset, depending on what you want to achieve; one way to combine several terms in a single query is sketched below.
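
The Standard Search API also accepts query operators such as OR, quoted phrases and -filter:retweets, so a single query can cover several related terms. A sketch (the terms themselves are only illustrative):

# Example of a richer query using standard search operators
query = '"gender based violence" OR #GBV -filter:retweets'
example_tweets = tweetSearch(query=query, limit=100)
print("Tweets collected: {}".format(len(example_tweets)))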

STEP 8: RUN BOTH FUNCTIONS TO COLLECT DATA FROM TWITTER RELATED TO THE HASHTAGS LISTED ABOVE

total_tweets = 0
"""
The following for loop collects all tweets that contain the hashtags
listed above and saves them into a CSV file per hashtag.
"""
# make sure the output directory exists before saving the CSV files
import os
os.makedirs("data", exist_ok=True)

for n in tqdm(hashtags):
    # first we fetch all tweets that contain the specific hashtag
    hash_tweets = tweetSearch(query=n, limit=7000)
    total_tweets += int(len(hash_tweets))
    # second we convert our tweets into a dataframe
    df = tweets_to_data_frame(hash_tweets)
    # third we save the dataframe into a csv file
    df.to_csv("data/{}_tweets.csv".format(n))

# show the total number of tweets collected
print("total_tweets: {}".format(total_tweets))
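
Once the loop has finished, you will have one CSV file per hashtag in the data/ folder. If you would rather submit a single file, the per-hashtag files can be combined, for example (a sketch assuming the folder layout above):

# Combine the per-hashtag CSV files into a single dataframe
all_dfs = [pd.read_csv("data/{}_tweets.csv".format(n), index_col=0) for n in hashtags]
combined = pd.concat(all_dfs, ignore_index=True)
combined.to_csv("data/all_gbv_tweets.csv", index=False)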

For more tweepy configuration options, please read the tweepy documentation here.

About the author

Davis David is Zindi Ambassador for Tanzania and a data scientist at ParrotAI. He is passionate about artificial intelligence, machine learning, deep learning and big data. He is a co-organizer and facilitator of the AI movement in Tanzania; conducting AI meetups, workshops and events with a passion to build a community of data scientists to solve local problems. He can be reached on Twitter @Davis_McDavid.