
Gender-Based Violence Tweet Classification Challenge

Helping Global · 2000 Points · Challenge completed almost 4 years ago
Natural Language Processing · Classification
634 joined · 140 active
Start: Aug 09, 21 · Close: Nov 14, 21 · Reveal: Nov 14, 21
Validation and Test results are great, but submission is terrible, what might be happening? (Update: Problem Solved!)
27 Sep 2021, 15:36 · edited 1 day later

I left 10% of the train dataset for validation, then applied train_test_split to the remainder. In other words, I have two datasets for evaluating my model: a validation set and a test set. Both give great results, around 92% macro F1-score and accuracy. The problem is that my submission scores only about 70%. Any ideas on what might be happening? I couldn't find any data leakage or overfitting; it seems like nothing is wrong. I also use sklearn pipelines to help me avoid doing anything stupid. Any ideas on what the problem is here?

My code is something like this:

import string

import pandas as pd
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# load training data
train_set = pd.read_csv('gender-based-violence-tweet-classification-challenge/Train.csv')

# leave roughly 10% for validation; iloc is end-exclusive, so the two
# slices do not overlap (.loc[:35685] and .loc[35685:] would both
# include row 35685)
train = train_set.iloc[:35685][["Tweet_ID", "tweet", "type"]]
# keep the "type" column so the validation set can be scored
validation = train_set.iloc[35685:][["Tweet_ID", "tweet", "type"]]

# load the test set
submission_set = pd.read_csv('gender-based-violence-tweet-classification-challenge/Test.csv')

# load submission file
submission_file = pd.read_csv('gender-based-violence-tweet-classification-challenge/SampleSubmission.csv')

# build the stopword set once instead of on every call
STOPWORDS = set(stopwords.words("english"))

def preprocess_text(text):
    # drop punctuation characters, then rejoin into a string
    nopunc = "".join(char for char in text if char not in string.punctuation)
    # remove any stopwords
    return " ".join(word for word in nopunc.split() if word.lower() not in STOPWORDS)

X = train["tweet"]
y = train["type"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

pipe = Pipeline([ ])  # pipeline steps elided in the original post
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
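One common cause of a local/leaderboard gap is a split whose class proportions differ from the full dataset. A minimal sketch of a stratified train/validation/test split, assuming the same Train.csv schema as above (the helper name `stratified_splits` is hypothetical, not from the post):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_splits(df, label_col="type", val_frac=0.10, test_frac=0.15, seed=42):
    """Split a labelled frame into train/val/test, preserving class ratios.

    label_col follows the Train.csv schema from the post; adjust if needed.
    """
    # Hold out the validation set first, stratifying on the label column.
    rest, val = train_test_split(
        df, test_size=val_frac, stratify=df[label_col], random_state=seed
    )
    # Split the remainder into train and test, again stratified.
    train, test = train_test_split(
        rest, test_size=test_frac, stratify=rest[label_col], random_state=seed
    )
    return train, val, test
```

With stratification, a rare class is represented in every split in the same proportion as in the full training data, so local metrics are less likely to be flattered by a lucky slice.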

===========================================================

UPDATE: PROBLEM SOLVED!

Discussion · 5 answers

You need to balance the data. The dataset is heavily imbalanced (for example, one class has 30 000 samples while another has only 200).

There are multiple techniques to balance the data.
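The original pipeline steps were elided, so as one illustration of the rebalancing techniques mentioned above, here is a sketch that assumes a TF-IDF + logistic regression setup (an assumption, not the poster's actual model):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# class_weight="balanced" reweights the loss inversely to class frequency,
# so rare classes are not drowned out by a 30 000-sample majority class.
balanced_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
```

Because the reweighting happens inside the estimator, no resampling step is needed and the pipeline can still be cross-validated normally.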

I already tested SMOTE, class_weight, and even undersampling the dataset. Nothing changed.
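When rebalancing does not move the leaderboard score, another diagnostic is to compare the class distribution of the submission predictions against the training distribution: a large gap can indicate a shifted hidden test set or a degenerate model. A sketch (the helper name `distribution_gap` is hypothetical):

```python
import pandas as pd

def distribution_gap(train_labels, predicted_labels):
    """Compare class frequencies in the training labels with those in
    the predictions made for the hidden test set."""
    train_freq = pd.Series(train_labels).value_counts(normalize=True)
    pred_freq = pd.Series(predicted_labels).value_counts(normalize=True)
    gap = pd.DataFrame({"train": train_freq, "predicted": pred_freq}).fillna(0.0)
    # largest per-class frequency differences first
    gap["abs_diff"] = (gap["train"] - gap["predicted"]).abs()
    return gap.sort_values("abs_diff", ascending=False)
```

If a class that is common in training barely appears in the predictions (or vice versa), that points to either an imbalance problem the model never overcame or a genuine train/test distribution shift.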

How did you solve the problem? I have the same issue.

I have the same problem, what did you do?

22 Oct 2021, 11:47
Upvotes 0