I set aside 10% of the training dataset for validation, then applied train_test_split to the remainder. In other words, I have two datasets to validate my model: a validation set and a test set. Both give great results, around 92% macro F1-score and accuracy. The problem is that my submission score is only about 70%. Any ideas on what might be happening? I couldn't find any data leakage or overfitting; it seems like nothing is wrong. I also use sklearn pipelines to keep me from doing anything stupid.
My code is something like this:
# load training data
train_set = pd.read_csv('gender-based-violence-tweet-classification-challenge/Train.csv')
# leave 10% for validation
# leave 10% for validation (use .iloc: .loc slicing is inclusive on both ends,
# so .loc[:35685] and .loc[35685:] would both contain row 35685)
train = train_set.iloc[:35685][["Tweet_ID", "tweet", "type"]]
# keep "type" here too, otherwise the validation set has no labels to score against
validation = train_set.iloc[35685:][["Tweet_ID", "tweet", "type"]]
# load the test set
submission_set = pd.read_csv('gender-based-violence-tweet-classification-challenge/Test.csv')
# load submission file
submission_file = pd.read_csv('gender-based-violence-tweet-classification-challenge/SampleSubmission.csv')
import string
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))  # build once, not on every call

def preprocess_text(text):
    # Check characters to see if they are in punctuation, then rejoin the string
    nopunc = "".join(char for char in text if char not in string.punctuation)
    # Now just remove any stopwords
    return " ".join(word for word in nopunc.split() if word.lower() not in STOPWORDS)
X = train["tweet"]
y = train["type"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
pipe = Pipeline([ ])  # steps elided
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
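The pipeline's steps were left out above. A minimal, self-contained sketch of what such a pipeline might look like, assuming TF-IDF features and logistic regression (the actual steps and the toy data below are my assumptions, not from the original post):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Toy stand-in data (the real data comes from Train.csv)
texts = ["she was threatened online", "great weather today",
         "he harassed her at work", "lovely picnic in the park"] * 10
labels = ["abuse", "none", "abuse", "none"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Vectorize and classify inside one Pipeline so preprocessing is fit
# only on the training fold (which helps avoid leakage)
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
```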
===========================================================
UPDATE: PROBLEM SOLVED!
You need to balance the data. The dataset is heavily imbalanced (for example, one class has about 30,000 samples while another has only 200).
There are multiple techniques to balance the data.
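One of the simplest techniques is reweighting the loss rather than resampling. A minimal sketch using sklearn's `class_weight="balanced"` (the toy data and class names are illustrative assumptions):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data: 30 majority vs 3 minority samples (illustrative numbers)
texts = ["ordinary tweet number %d" % i for i in range(30)] + \
        ["violent threat example %d" % i for i in range(3)]
labels = ["none"] * 30 + ["abuse"] * 3

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # "balanced" reweights each class inversely to its frequency,
    # so the rare class contributes as much to the loss as the common one
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipe.fit(texts, labels)
```

Oversampling (e.g. imbalanced-learn's SMOTE) or undersampling the majority class are the other common options.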
I already tested SMOTE(), class_weight, and even undersampling the dataset. Nothing changed.
How did you solve the problem? I have the same issue.
I have the same problem, what did you do?
Try pseudo-labelling
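For reference, pseudo-labelling means training on the labelled data, predicting on unlabelled data, keeping only high-confidence predictions, and retraining on the union. A minimal sketch with toy data (the 0.9 confidence threshold and all sample text are my assumptions):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labelled_texts = ["she was threatened", "nice sunny day"] * 15
labelled_y = ["abuse", "none"] * 15
unlabelled_texts = ["he threatened her", "what a sunny afternoon"] * 5

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(labelled_texts, labelled_y)

# Keep only unlabelled samples the model is confident about
proba = pipe.predict_proba(unlabelled_texts)
confident = proba.max(axis=1) >= 0.9  # threshold is an arbitrary choice
pseudo_texts = [t for t, keep in zip(unlabelled_texts, confident) if keep]
pseudo_y = pipe.predict(unlabelled_texts)[confident].tolist()

# Retrain on labelled + pseudo-labelled data
pipe.fit(labelled_texts + pseudo_texts, labelled_y + pseudo_y)
```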