The data has been split into a test and training set.
train.json (zipped) is the dataset that you will use to train your model. This dataset includes about 2,400 consecutive tweets from each of the companies listed below, for a total of 96,562 tweets.
test_questions.json (zipped) is the dataset to which you will apply your model to test how well it performs. Use your model and this dataset to predict the number of retweets each tweet will receive. The test set consists of the consecutive tweets that followed those provided in the training set, with a maximum of 800 tweets per company. This dataset includes the same fields as train.json except for the retweet_count and favorite_count variables.
sample_submission.csv is an example of what your submission file should look like.
Notes on the data: This data was downloaded from Twitter on 23 August 2018, so the retweet and favorite counts reflect that point in time.
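As a starting point, the files described above could be loaded along the following lines. This is only a sketch under stated assumptions: that each zip archive holds a single .json file, and that the file contains one JSON array of tweet objects. Neither is confirmed here, and the function names load_tweets and split_features_and_target are hypothetical.

```python
import json
import zipfile


def load_tweets(zip_path):
    """Read the first .json member of a zip archive and return the parsed content.

    Assumption: the archive holds one JSON file containing an array of tweets.
    """
    with zipfile.ZipFile(zip_path) as archive:
        json_names = [name for name in archive.namelist() if name.endswith(".json")]
        if not json_names:
            raise ValueError(f"no .json file found in {zip_path}")
        with archive.open(json_names[0]) as handle:
            return json.load(handle)


def split_features_and_target(tweets, target="retweet_count"):
    """Separate the prediction target from the fields also present in the test set."""
    # These two fields are missing from test_questions.json, so they are
    # dropped from the training features as well.
    hidden = {"retweet_count", "favorite_count"}
    features = [{k: v for k, v in tweet.items() if k not in hidden} for tweet in tweets]
    targets = [tweet[target] for tweet in tweets]
    return features, targets


# Hypothetical usage (the archive name is an assumption):
# train = load_tweets("train.zip")
# features, retweets = split_features_and_target(train)
```

Removing favorite_count from the training features keeps the model from relying on a field that is not available at prediction time.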
Variables in train.json and test_questions.json are as described in the Twitter documentation:
Tweet Object - https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
User Object - https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object
Entities Object - https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/entities-object
Geo Objects - https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/geo-objects
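Because the tweet, user, and entities objects are nested, a few flat features usually have to be pulled out before modelling. The sketch below shows one possible way to do this. The field names (user.followers_count, entities.hashtags, entities.user_mentions, entities.urls, place, text, full_text) come from the Twitter data dictionary, but whether every one of them is populated in this dataset is an assumption, and extract_features is a hypothetical helper.

```python
def extract_features(tweet):
    """Flatten one tweet dict into a handful of numeric and boolean features."""
    user = tweet.get("user") or {}
    entities = tweet.get("entities") or {}
    # Extended-mode tweets carry "full_text"; standard ones carry "text".
    text = tweet.get("full_text") or tweet.get("text") or ""
    return {
        "followers_count": user.get("followers_count", 0),
        "n_hashtags": len(entities.get("hashtags") or []),
        "n_mentions": len(entities.get("user_mentions") or []),
        "n_urls": len(entities.get("urls") or []),
        "text_length": len(text),
        "has_place": tweet.get("place") is not None,
    }
```

Missing or null fields fall back to zero or False, so the same function can be applied to both the training and test files.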
Companies included in this dataset:
Nigeria
Ghana
South Africa
Kenya
Uganda